add MiniCPM-o #37029
Comments
cc @eustlb, I remember you were working on it
I'll create a PR for this model and see if they accept it, fingers crossed for you!
I would love to review such a PR! 🤗 I haven't gotten to it yet, it's been stuck on my TODOs
@eustlb haha thank you! TODOs are a true prison. It will be just a quick and dirty stub, hope you find it of quality :)
First of all, sorry for the delayed response. Before starting, I analyzed the MiniCPM-o model network and the training pipeline code. It breaks down into three parts:
- Image section (can essentially be considered MiniCPM-V)
- Text2Speech section
- Streaming Speech2Text section (the most difficult)
Since MiniCPM-o is trained on image + audio to text, the biggest challenge is the inference pipeline. It's necessary to integrate the text decoder's generate utilities with the streaming STT pipeline so that they work together. Additionally, during inference, there needs to be an active pipeline capable of performing TTS on the predicted text. With these features, could it be integrated into the existing Transformers library? What are your thoughts as core maintainers?
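To make the three-stage flow above concrete, here is a minimal sketch of how the pieces would compose at inference time. Every class and method name here (`StreamingSpeechEncoder`, `TextDecoder`, `TTSHead`, `omni_generate`) is a hypothetical stand-in, not the real MiniCPM-o or transformers API; the stubs only illustrate the data flow.

```python
# Hypothetical sketch of MiniCPM-o-style any-to-any inference.
# All names are illustrative stand-ins, NOT the real model API.

class StreamingSpeechEncoder:
    """Stand-in for the streaming Speech2Text section: turns incoming
    audio chunks into prefix 'tokens' the text decoder can attend to."""
    def encode_chunk(self, audio_chunk):
        # The real model would return hidden states; we return a tagged token.
        return f"<audio:{audio_chunk}>"

class TextDecoder:
    """Stand-in for the LLM backbone's generate() utilities."""
    def generate(self, prefix_tokens):
        # The real model would decode autoregressively; we echo a response.
        return f"response to {' '.join(prefix_tokens)}"

class TTSHead:
    """Stand-in for the Text2Speech section that vocalizes predicted text."""
    def synthesize(self, text):
        return f"[waveform for: {text}]"

def omni_generate(audio_chunks, encoder, decoder, tts):
    # 1. Streaming STT: encode audio chunk by chunk as it arrives.
    prefix = [encoder.encode_chunk(c) for c in audio_chunks]
    # 2. Text generation conditioned on the audio prefix.
    text = decoder.generate(prefix)
    # 3. Active TTS on the predicted text.
    return text, tts.synthesize(text)

text, audio = omni_generate(
    ["chunk0", "chunk1"], StreamingSpeechEncoder(), TextDecoder(), TTSHead()
)
```

The point of the sketch is the coupling the comment thread describes: the generate call sits in the middle, so any streaming support has to thread audio state into the decoding loop rather than run as a separate pre/post step.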
@jp1924 really cool insights, thanks for laying it out!
Right, we don't support streaming for audio models yet. In terms of generation code, I don't think it's planned for the near future; generation work will focus more on making text streaming better, unless the audio team plans to add support (cc @eustlb). My personal opinion: not worth the hassle currently. We are also adding a similar model, Qwen-Omni, with no streaming generation.
Yeah, making generation work will be the hard part, and we will most probably override and add custom generation on top. I am not familiar with how MiniCPM generates, so cmiiw. We can call … Also, it's nice to have the ability to load the model without …
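The "override and add custom generation on top" idea mentioned above might be sketched like this. The base class here is a stub, and `MiniCPMOStub`, `tts_head`, and `return_audio` are hypothetical names; the assumption that the optionally omitted component is the TTS head is mine, not stated in the thread.

```python
# Hypothetical sketch of custom generation layered on the standard utilities.
# The base class stands in for transformers' generation mixin; all names
# here are illustrative assumptions, not the real API.

class GenerationMixinStub:
    """Stand-in for the library's standard text generation path."""
    def generate(self, inputs):
        return f"text for {inputs}"

class MiniCPMOStub(GenerationMixinStub):
    def __init__(self, tts_head=None):
        # tts_head=None would let the model load and run text-only,
        # without instantiating the speech components (an assumption).
        self.tts_head = tts_head

    def generate(self, inputs, return_audio=False):
        text = super().generate(inputs)        # standard decoding path
        if return_audio and self.tts_head is not None:
            return text, self.tts_head(text)   # custom step layered on top
        return text
```

The design choice this illustrates: the override delegates to the standard `generate` for text, so upstream improvements to decoding still apply, and the model-specific audio step is an opt-in wrapper rather than a fork of the generation loop.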
Model description
As discussed in #31836, I would like to add the MiniCPM-o model.
I believe the MiniCPM family has had a significant impact on the LMM and LLM fields.
Currently, the MiniCPM-o code is uploaded to the Hugging Face Hub as custom code, which makes maintenance very difficult.
Therefore, I want to add models like MiniCPM-o to Transformers so they can receive ongoing support and maintenance.
While there are many vision LMM models available on Hugging Face, a considerable number of any-to-any models,
such as Qwen2.5-Omni-7B, MiniCPM-o-2_6, and Janus-Pro-7B, are implemented in their own standalone repositories.
I want to implement these any-to-any models in Transformers so that they can leverage the various features of Hugging Face.
Additionally, adding an any-to-any pipeline could serve as a good template for future any-to-any models to be added.
If you have any good suggestions, let me know!
Open source status
Provide useful links for the implementation
https://github.com/OpenBMB/MiniCPM-o
https://huggingface.co/openbmb/MiniCPM-o-2_6