Whisper: Fix decoder inputs for static pipeline #1469
Conversation
@@ -25,6 +25,7 @@ class WhisperPipeline::StaticWhisperPipeline : public WhisperPipeline::WhisperPi

private:
    WhisperInitializedModels m_models;
    std::shared_ptr<ov::Model> m_decoder_model;
Why do we need to store the model? Once the model is compiled, we need to release the ov::Model to free the memory consumed by its weights.
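A minimal sketch of that idea, assuming the decoder is compiled the way the later diffs in this conversation show (the `core`, `decoder_model` and "NPU" names are taken from those diffs, not verified here):

// Compile once, then drop the ov::Model so its weights can be freed
auto compiled = core.compile_model(decoder_model, "NPU");
decoder_model.reset();  // release the shared_ptr<ov::Model>
auto request = compiled.create_infer_request();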
We can't compile this model until generate() is called.
It's a temporary solution, as we need to reshape and recompile the decoder model for the specific number of input tokens, but we only know that number at the generation stage.
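Roughly, that forces something like this inside generate() (a sketch; the input name, device string and m_models member follow the diffs in this conversation, and init_ids is assumed to already be known at this point):

// Reshape the retained ov::Model for the token count we only learn now, then recompile
m_decoder_model->reshape({{"input_ids", ov::Shape{1, init_ids.size()}}});
m_models.decoder = core.compile_model(m_decoder_model, "NPU").create_infer_request();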
Well, I'd assume we have something like this:
class DecoderCache {
public:
    ov::CompiledModel get_model(uint8_t input_ids_size) {
        // Get from hash table, otherwise compile and store...
    }
private:
    // [input_ids_size -> CompiledModel]
    std::unordered_map<uint8_t, ov::CompiledModel> m_cache;
    std::shared_ptr<ov::Model> m_decoder_model;  // <- this is dynamic, w/o transformation applied
};

// Whenever we need a model:
auto decoder = m_decoder_cache.get_model(input_ids_size);
@@ -654,7 +657,13 @@ WhisperDecodedResults WhisperPipeline::StaticWhisperPipeline::generate(

    // prepare init_ids just once for whole input
    if (init_ids.empty()) {
        OPENVINO_ASSERT(m_models.decoder.get_tensor("input_ids").get_shape().back() == 1);
What does this check do?
That we have the correct shape of input_ids for the decoder model.
In prepare_init_ids() the decoder infer request can be called for language detection (it runs with 1 token as input).
I assume it can't be incorrect, as we explicitly reshape the model for the input_ids size.
Besides, I'd rather check: get_tensor("input_ids").get_size().
Btw, do we still need this? openvino.genai/src/cpp/src/whisper_pipeline_static.cpp, lines 553 to 556 in 7e8bbfe
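For illustration, the suggested alternative check would look roughly like this (a sketch; the tensor name is taken from the diff above):

OPENVINO_ASSERT(m_models.decoder.get_tensor("input_ids").get_size() == 1);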
- What if language detection isn't needed? But we have already compiled the model for the [1,1] shape here: openvino.genai/src/cpp/src/whisper_pipeline_static.cpp, lines 553 to 556 in 7e8bbfe:
  // TODO: Support models produced by optimum-cli
  if (!check_decoder_model_compatibility(decoder_model)) {
      OPENVINO_THROW("StaticWhisperPipeline expects decoder model has \"attention_mask\" input!");
  }
- What if the user runs the same model, but on multiple audio inputs? Then we will compile it every time here:
  m_models.decoder = core.compile_model(m_decoder_model, "NPU").create_infer_request();
- What if every such run from example 2) requires language detection?
In order to encapsulate this logic, you can introduce something like DecoderCache, as I proposed previously.
At any point when you need a CompiledModel for a particular shape, you can ask DecoderCache, and it will check whether such a model already exists (compiled) or reshape and compile a new one.
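As a sketch of that flow (names are assumptions, mirroring the DecoderCache proposal above):

// Ask the cache whenever a particular input_ids size is needed;
// it reshapes and compiles only on a cache miss.
ov::CompiledModel compiled = m_decoder_cache.get_model(input_ids_size);
auto decoder = compiled.create_infer_request();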
Agree with the other points; it's a great idea to store already compiled models in a cache, will do that.
Force-pushed from 3c901d7 to ac16558.
@@ -654,7 +626,11 @@ WhisperDecodedResults WhisperPipeline::StaticWhisperPipeline::generate(

    // prepare init_ids just once for whole input
    if (init_ids.empty()) {
        m_models.decoder = m_decoder_cache.get_model(1).create_infer_request(); // for detect_language()
Here the model is compiled upfront again. What if it's not even needed and the language is already known?
I believe you need it somewhere here:
language_token_id = detect_language(encoder_hidden_state, decoder, config, raw_metrics);
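That is, something along these lines (a sketch; the guarding condition is an assumption, standing in for whatever the pipeline uses to decide that language detection is required, and the detect_language call mirrors the line above):

if (language_needs_detection) {  // hypothetical condition
    // Compile the 1-token decoder lazily, right before it is actually used
    auto decoder = m_decoder_cache.get_model(1).create_infer_request();
    language_token_id = detect_language(encoder_hidden_state, decoder, config, raw_metrics);
}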
@@ -541,6 +502,23 @@ std::shared_ptr<ov::Model> redirect_new_kv_to_output(const std::shared_ptr<ov::M

namespace ov {
namespace genai {

ov::CompiledModel DecoderCache::get_model(uint8_t input_ids_size) {
    if (m_cache.find(input_ids_size) == m_cache.cend()) {
        if (m_decoder_model->is_dynamic()) { // model is dynamic, reshaping it to static
Weird, I don't expect we need this check...
Let's discuss locally
DecoderCache() = default;
DecoderCache(std::shared_ptr<ov::Model> model, ov::PartialShape shape)
    : m_decoder_model(model)
    , m_lhs_shape(shape) {}
Well, the decoder model has two input layers:
- encoder_hidden_states - does it change? I believe it depends on the encoder model, and once we know it, it no longer changes.
- decoder_input_ids - this is what we track in the hash map; it changes depending on the case.
Can we set encoder_hidden_states in the ctor? If so, I believe we don't need to check whether the model is dynamic or not in the get() method.
I believe the size of encoder_hidden_states depends only on feature_size, and it's set in StaticWhisperPipeline and no longer changes after that:
reshape_to_static_encoder(encoder_model, m_feature_extractor.feature_size);
If so, I believe we definitely can set the shape for encoder_hidden_states once.
// Also, it should probably be `ov::Shape` rather than `ov::PartialShape`, as we don't have dynamism (the input is fixed)
DecoderCache(std::shared_ptr<ov::Model> model, ov::Shape encoder_hidden_shape)
    : m_model(model) {
    m_model->reshape({{"encoder_hidden_states", encoder_hidden_shape}});
}

ov::CompiledModel DecoderCache::get_model(uint8_t input_ids_size) {
    if (m_cache.count(input_ids_size) == 0) {
        m_model->reshape({{"decoder_input_ids", ov::Shape({1, input_ids_size})}});
        m_cache.emplace(input_ids_size, core.compile_model(m_model));  // `core` is an ov::Core instance
    }
    return m_cache.at(input_ids_size);
}
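A hypothetical construction site in the StaticWhisperPipeline ctor could then look like this (the shape values below are placeholders, not the real model's dimensions):

// encoder_hidden_states is fixed once the encoder is reshaped to static,
// so its shape can be handed to the cache up front
m_decoder_cache = DecoderCache(decoder_model, ov::Shape{1, 1500, 512});  // illustrative shape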
Tickets: