Whisper: Fix decoder inputs for static pipeline #1469
Conversation
@@ -25,6 +25,7 @@ class WhisperPipeline::StaticWhisperPipeline : public WhisperPipeline::WhisperPi

private:
    WhisperInitializedModels m_models;
    std::shared_ptr<ov::Model> m_decoder_model;
Why do we need to store the model? Once the model is compiled, we need to release the ov::Model to free the memory consumed by its weights.
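A minimal sketch of that idea, assuming the decoder is compiled the way the later diffs in this conversation show (the `core`, `decoder_model` and "NPU" names are taken from those diffs, not verified here):

// Compile once, then drop the ov::Model so its weights can be freed
auto compiled = core.compile_model(decoder_model, "NPU");
decoder_model.reset();  // release the shared_ptr<ov::Model>
auto request = compiled.create_infer_request();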
We can't compile this model until generate() is called.
It's a temporary solution, as we need to reshape and recompile the decoder model for the specific number of input tokens, but we only know that number at the generation stage.
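Roughly, that forces something like this inside generate() (a sketch; the input name, device string and m_models member follow the diffs in this conversation, and init_ids is assumed to already be known at this point):

// Reshape the retained ov::Model for the token count we only learn now, then recompile
m_decoder_model->reshape({{"input_ids", ov::Shape{1, init_ids.size()}}});
m_models.decoder = core.compile_model(m_decoder_model, "NPU").create_infer_request();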
Well, I'd assume we have something like this:
class DecoderCache {
public:
    ov::CompiledModel get_model(uint8_t input_ids_size) {
        // Get from hash table, otherwise compile and store...
    }
private:
    // [input_ids_size -> CompiledModel]
    std::unordered_map<uint8_t, ov::CompiledModel> m_cache;
    std::shared_ptr<ov::Model> m_decoder_model;  // <- this is dynamic, w/o transformation applied
};

// Whenever we need a model:
auto decoder = m_decoder_cache.get_model(input_ids_size);
@@ -654,7 +657,13 @@ WhisperDecodedResults WhisperPipeline::StaticWhisperPipeline::generate(

    // prepare init_ids just once for whole input
    if (init_ids.empty()) {
        OPENVINO_ASSERT(m_models.decoder.get_tensor("input_ids").get_shape().back() == 1);
What does this check do?
That we have the correct shape of input_ids for the decoder model.
In prepare_init_ids() the decoder infer request can be called for language detection (it runs with 1 token as input).
I assume it can't be incorrect, as we explicitly reshape the model for the input_ids size.
Besides, I'd rather check: get_tensor("input_ids").get_size().
Btw, do we still need this? openvino.genai/src/cpp/src/whisper_pipeline_static.cpp, lines 553 to 556 in 7e8bbfe
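For illustration, the suggested alternative check would look roughly like this (a sketch; the tensor name is taken from the diff above):

OPENVINO_ASSERT(m_models.decoder.get_tensor("input_ids").get_size() == 1);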
- What if language detection isn't needed? But we have already compiled the model for the [1,1] shape here: openvino.genai/src/cpp/src/whisper_pipeline_static.cpp, lines 553 to 556 in 7e8bbfe:
  // TODO: Support models produced by optimum-cli
  if (!check_decoder_model_compatibility(decoder_model)) {
      OPENVINO_THROW("StaticWhisperPipeline expects decoder model has \"attention_mask\" input!");
  }
- What if the user runs the same model, but on multiple audio inputs? Then we will compile it every time here:
  m_models.decoder = core.compile_model(m_decoder_model, "NPU").create_infer_request();
- What if every such run from example 2) requires language detection?
In order to encapsulate this logic, you can introduce something like DecoderCache, as I proposed previously.
At any point when you need a CompiledModel for a particular shape, you can ask DecoderCache, and it will check whether such a model already exists (compiled) or reshape and compile a new one.
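As a sketch of that flow (names are assumptions, mirroring the DecoderCache proposal above):

// Ask the cache whenever a particular input_ids size is needed;
// it reshapes and compiles only on a cache miss.
ov::CompiledModel compiled = m_decoder_cache.get_model(input_ids_size);
auto decoder = compiled.create_infer_request();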
Agree with the other points; it's a great idea to store already compiled models in a cache, will do that.
Force-pushed from 3c901d7 to ac16558.
@@ -654,7 +626,11 @@ WhisperDecodedResults WhisperPipeline::StaticWhisperPipeline::generate(

    // prepare init_ids just once for whole input
    if (init_ids.empty()) {
        m_models.decoder = m_decoder_cache.get_model(1).create_infer_request(); // for detect_language()
Here the model is compiled upfront again. What if it's not even needed and the language is already known?
I believe you need it somewhere here:
language_token_id = detect_language(encoder_hidden_state, decoder, config, raw_metrics);
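That is, something along these lines (a sketch; the guarding condition is an assumption, standing in for whatever the pipeline uses to decide that language detection is required, and the detect_language call mirrors the line above):

if (language_needs_detection) {  // hypothetical condition
    // Compile the 1-token decoder lazily, right before it is actually used
    auto decoder = m_decoder_cache.get_model(1).create_infer_request();
    language_token_id = detect_language(encoder_hidden_state, decoder, config, raw_metrics);
}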
@@ -541,6 +502,23 @@ std::shared_ptr<ov::Model> redirect_new_kv_to_output(const std::shared_ptr<ov::M

namespace ov {
namespace genai {

ov::CompiledModel DecoderCache::get_model(uint8_t input_ids_size) {
    if (m_cache.find(input_ids_size) == m_cache.cend()) {
        if (m_decoder_model->is_dynamic()) { // model is dynamic, reshaping it to static
Weird, I don't expect we need this check...
Let's discuss locally
DecoderCache() = default;
DecoderCache(std::shared_ptr<ov::Model> model, ov::PartialShape shape)
    : m_decoder_model(model)
    , m_lhs_shape(shape) {}
Well, the decoder model has two input layers:
- encoder_hidden_states - does it change? I believe it depends on the encoder model, and once we know it, it no longer changes.
- decoder_input_ids - this is what we track in the hash map; it changes depending on the case.
Can we set encoder_hidden_states in the ctor? If so, I believe we don't need to check whether the model is dynamic or not in the get() method.
I believe the size of encoder_hidden_states depends only on feature_size, and it's set in StaticWhisperPipeline and no longer changes after that:
reshape_to_static_encoder(encoder_model, m_feature_extractor.feature_size);
If so, I believe we definitely can set the shape for encoder_hidden_states once.
// Also, it should probably be `ov::Shape` rather than `ov::PartialShape`, as we don't have dynamism (the input is fixed)
DecoderCache(std::shared_ptr<ov::Model> model, ov::Shape encoder_hidden_shape)
    : m_model(model) {
    m_model->reshape({{"encoder_hidden_states", encoder_hidden_shape}});
}

ov::CompiledModel DecoderCache::get_model(uint8_t input_ids_size) {
    if (m_cache.count(input_ids_size) == 0) {
        m_model->reshape({{"decoder_input_ids", ov::Shape({1, input_ids_size})}});
        m_cache.emplace(input_ids_size, core.compile_model(m_model));  // `core` is an ov::Core instance
    }
    return m_cache.at(input_ids_size);
}
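A hypothetical construction site in the StaticWhisperPipeline ctor could then look like this (the shape values below are placeholders, not the real model's dimensions):

// encoder_hidden_states is fixed once the encoder is reshaped to static,
// so its shape can be handed to the cache up front
m_decoder_cache = DecoderCache(decoder_model, ov::Shape{1, 1500, 512});  // illustrative shape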
Tickets: