Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Whisper: Fix decoder inputs for static pipeline #1469

Merged
merged 5 commits into from
Jan 10, 2025

Conversation

eshiryae
Copy link
Contributor

@eshiryae eshiryae commented Jan 3, 2025

Tickets:

@github-actions github-actions bot added the category: whisper Whisper pipeline label Jan 3, 2025
@@ -25,6 +25,7 @@ class WhisperPipeline::StaticWhisperPipeline : public WhisperPipeline::WhisperPi

private:
WhisperInitializedModels m_models;
std::shared_ptr<ov::Model> m_decoder_model;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to store model? once model is compiled, we need to release ov::Model to free memory consumed by its weights

Copy link
Collaborator

@TolyaTalamanov TolyaTalamanov Jan 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't compile this model until generate() called

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a temporal solution as wee need to reshape and recompile decoder model for specific number of input tokens, but we know it only on generation stage.

Copy link
Collaborator

@TolyaTalamanov TolyaTalamanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I'd assume we have something like this:

class DecoderCache {
public:
    ov::CompiledModel get_model(uint8_t input_id_size) {
        // Get from hash table, otherwise compile and store...
    }
private:
    // [input_ids_size -> CompiledModel]
    std::unordered_map<uint8_t, ov::CompiledModel> m_cache;
    std::shared_ptr<ov::Model> decoder_model; // <- this is dynamic w/o transformation applied    
}


// Whenever we need a model:
auto decoder = m_decoder_cache.get(input_ids_size);

src/cpp/src/whisper_pipeline_static.cpp Outdated Show resolved Hide resolved
src/cpp/src/whisper_pipeline_static.cpp Show resolved Hide resolved
src/cpp/src/whisper_pipeline_static.cpp Show resolved Hide resolved
@@ -654,7 +657,13 @@ WhisperDecodedResults WhisperPipeline::StaticWhisperPipeline::generate(

// prepare init_ids just once for whole input
if (init_ids.empty()) {
OPENVINO_ASSERT(m_models.decoder.get_tensor("input_ids").get_shape().back() == 1);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this check do?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That we have correct shape for input_ids for decoder model.
In prepare_init_ids() infer request for decoder can be called for language detection (it runs with 1 token as input).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume it can't be incorrect, as we explicitly reshape model for input_ids size.

Besides, I'd rather check: get_tensor("input_ids").get_size()

@dmatveev dmatveev changed the title Fix decoder inputs for static pipeline Whisper: Fix decoder inputs for static pipeline Jan 3, 2025
@dmatveev dmatveev added this to the 2025.0 milestone Jan 3, 2025
@TolyaTalamanov
Copy link
Collaborator

Btw, do we still need this?

// TODO: Support models produced by optimum-cli
if (!check_decoder_model_compatibility(decoder_model)) {
OPENVINO_THROW("StaticWhisperPipeline expects decoder model has \"attention_mask\" input!");
}

Copy link
Collaborator

@TolyaTalamanov TolyaTalamanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. What if language detection isn't needed? But we already compiled model for [1,1] shape here:

    // TODO: Support models produced by optimum-cli
    if (!check_decoder_model_compatibility(decoder_model)) {
    OPENVINO_THROW("StaticWhisperPipeline expects decoder model has \"attention_mask\" input!");
    }

  2. What if user runs the same model, but on multiple audio? So, we will compile it every time here:

    m_models.decoder = core.compile_model(m_decoder_model, "NPU").create_infer_request();

  3. What if every of such runs from the 2) example, requires language detection?

In order to encapsulate this logic, you can introduce something like DecoderCache as I proposed previously.

At any point, when you need CompiledModel for particular shape, you can ask DecoderCache and it will check if such model already exists (compiled) or reshape and compile a new one.

@eshiryae
Copy link
Contributor Author

eshiryae commented Jan 8, 2025

  1. check_decoder_model_compatibility() checks only if decoder_model has attention_mask as an input, shape will not affects this check. Btw i suppose we don't need this check and we can remove it.

Agree with other points, it's a great idea to store already compiled models in cache, will do that.

@eshiryae eshiryae force-pushed the b_whisper_diff_issue branch from 3c901d7 to ac16558 Compare January 9, 2025 15:44
src/cpp/src/whisper_pipeline_static.cpp Outdated Show resolved Hide resolved
src/cpp/src/whisper_pipeline_static.cpp Outdated Show resolved Hide resolved
src/cpp/src/whisper_pipeline_static.hpp Outdated Show resolved Hide resolved
src/cpp/src/whisper_pipeline_static.cpp Outdated Show resolved Hide resolved
@@ -654,7 +626,11 @@ WhisperDecodedResults WhisperPipeline::StaticWhisperPipeline::generate(

// prepare init_ids just once for whole input
if (init_ids.empty()) {
m_models.decoder = m_decoder_cache.get_model(1).create_infer_request(); // for detect_language()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here the model compiled upfront again. What if it's not even needed and language already known?

I believe you need it somewhere here:

language_token_id = detect_language(encoder_hidden_state, decoder, config, raw_metrics);

@@ -541,6 +502,23 @@ std::shared_ptr<ov::Model> redirect_new_kv_to_output(const std::shared_ptr<ov::M
namespace ov {
namespace genai {

ov::CompiledModel DecoderCache::get_model(uint8_t input_ids_size) {
if (m_cache.find(input_ids_size) == m_cache.cend()) {
if (m_decoder_model->is_dynamic()) { // model is dynamic, reshaping it to static
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weird, I don't expect we need this check...

Let's discuss locally

DecoderCache() = default;
DecoderCache(std::shared_ptr<ov::Model> model, ov::PartialShape shape)
: m_decoder_model(model)
, m_lhs_shape(shape) {}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, decoder model has two input layers:

  1. encoder_hidden_states - does it change? I believe it depends on the encoder model and once we know it it's no longer change?
  2. decoder_input_ids - This is what we track in hash map. Changes depends on the case

Can we set encoder_hidden_states in the ctor? If so, I believe we don't need to change if model dynamic or not in the get() method

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the size of encoder_hiddne_states depends only on feature_size and it's set in StaticWhisperPipeline and no longer changes since then:

reshape_to_static_encoder(encoder_model, m_feature_extractor.feature_size);
. Can you confirm this?

If so, I believe we definitely can set shape for encoder_hidden_states once.

// Also, it should probably be `ov::Shape` rather than ov::PartialShape as we don't have dynamism (input is fixed)
DecoderCache(std::shared_ptr<ov::Model> model, ov::PartialShape encoder_hidden_shape) {
    model->reshape({{"encoder_hidden_states", encoder_hidden_shape}});
}
DecoderCache::get(ov::CompiledModel DecoderCache::get_model(uint8_t input_ids_size) {
    if (m_cache.counts(input_ids_size) == 0) {
        m_model->reshape({{"decoder_input_ids", ov::Shape({1, input_ids_size})}});
        m_cache.emplace(input_ids_size, core.compile_model(m_model));
    }
    return m_cache.at(input_ids_size);
}

@TolyaTalamanov TolyaTalamanov added this pull request to the merge queue Jan 10, 2025
Merged via the queue into openvinotoolkit:master with commit 77611da Jan 10, 2025
59 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants