add MiniCPM-o #37029

Open
2 tasks done
jp1924 opened this issue Mar 27, 2025 · 6 comments · May be fixed by #37049

@jp1924 (Contributor) commented Mar 27, 2025

Model description

As discussed in #31836, I would like to add the MiniCPM-o model.

I believe the MiniCPM family has had a significant impact on the LMM and LLM fields.

Currently, the MiniCPM-o code is only available as remote code on the Hugging Face Hub, which makes maintenance very difficult.
Therefore, I want to add models like MiniCPM-o to Transformers so they can receive ongoing support and maintenance.

While many vision LMMs are available on Hugging Face, a considerable number of any-to-any models,
such as Qwen2.5-Omni-7B, MiniCPM-o-2_6, and Janus-Pro-7B, are implemented in their own standalone repositories.
I want to implement these any-to-any models in Transformers so that they can leverage the various features of Hugging Face.

Additionally, adding an any-to-any pipeline could serve as a good template for future any-to-any models.

If you have any good suggestions, let me know!

Open source status

  • The model implementation is available
  • The model weights are available

Provide useful links for the implementation

https://github.com/OpenBMB/MiniCPM-o
https://huggingface.co/openbmb/MiniCPM-o-2_6

@zucchini-nlp (Member)

cc @eustlb, I remember you were working on it.

@jecrs commented Mar 27, 2025

I'll create a PR for this model and see if they accept it; fingers crossed for you!

@eustlb (Contributor) commented Mar 27, 2025

I would love to review such a PR! 🤗 I haven't gotten to it yet; it's been stuck on my TODO list.

@jecrs commented Mar 27, 2025

@eustlb Haha, thank you! TODOs are a true prison. It will be just a quick and dirty stub; I hope you find it of acceptable quality :)

@jecrs linked a pull request (#37049) on Mar 27, 2025 that will close this issue.
@jp1924 (Contributor, Author) commented Apr 3, 2025

First of all, sorry for the delayed response.
@zucchini-nlp @eustlb Thank you for the positive responses. @jecrs Thank you for your enthusiasm.

Before starting, I analyzed the MiniCPM-o model architecture and the training pipeline code,
and I determined that the following tasks are necessary:

Image Section (can be considered MiniCPM-V):

  1. Development of an ImageProcessor for MiniCPM-o.
    • An ImageProcessor incorporating the High-Resolution Image Partition Strategy from LLaVA-UHD.
  2. Addition of a SigLIP-1 model supporting NaViT-style native resolution.
    • Considering that SigLIP-2, which supports dynamic resolution, was recently introduced, code is needed to load it via AutoModel.
  3. Implementation of the Token Compression module (Resampler) from LLaVA-UHD for MiniCPM-o (a rough sketch follows this list).
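
For concreteness, here is a minimal sketch of a perceiver-style Resampler: a fixed set of learnable queries cross-attends to the variable-length vision features, so every image slice costs the LLM a fixed token budget. All names and dimensions are illustrative rather than the actual MiniCPM-o configuration, and the real module also adds 2D positional embeddings for the sliced patches:

```python
import torch
import torch.nn as nn

class Resampler(nn.Module):
    """Compress a variable number of vision tokens down to `num_queries`
    tokens via cross-attention (perceiver-style), so each image slice
    produces a fixed number of tokens regardless of its resolution."""

    def __init__(self, num_queries=64, vision_dim=1152, hidden_dim=3584, num_heads=8):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(num_queries, hidden_dim))
        self.kv_proj = nn.Linear(vision_dim, hidden_dim, bias=False)
        self.ln_q = nn.LayerNorm(hidden_dim)
        self.ln_kv = nn.LayerNorm(hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, seq_len, vision_dim); seq_len varies per image
        kv = self.ln_kv(self.kv_proj(vision_feats))
        q = self.ln_q(self.query).unsqueeze(0).expand(vision_feats.size(0), -1, -1)
        out, _ = self.attn(q, kv, kv)  # (batch, num_queries, hidden_dim)
        return out
```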

Text2Speech Section:

  1. The TTS module should be trained separately from MiniCPM-o.
    • During inference, it's sufficient to load the TTS module separately and run inference with it.
  2. Addition of a pipeline for LMM + TTS inference (a rough sketch follows this list).
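
As a strawman for item 2, the pipeline would essentially wrap a two-stage flow like the one below; `tts.synthesize` is a placeholder for whatever interface the separately loaded TTS module ends up exposing:

```python
import torch

def generate_speech(lmm, tts, processor, text, images=None):
    """Two-stage any-to-any inference: the LMM produces text, the TTS vocalizes it."""
    inputs = processor(text=text, images=images, return_tensors="pt")
    # Stage 1: the multimodal LMM generates a text response.
    with torch.no_grad():
        output_ids = lmm.generate(**inputs, max_new_tokens=256)
    response = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
    # Stage 2: the separately loaded TTS module renders the response as audio.
    audio = tts.synthesize(response)  # placeholder API, not a real Transformers call
    return response, audio
```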

Streaming Speech2Text Section (the most difficult):

  1. Whisper with the use_cache feature is required.
    • This seems to require opening a separate PR.
    • This is just my personal opinion, but I think it's also necessary to add FlashAttention support or position_ids to Whisper.
    • This would allow audio to be processed in a padding-free manner, similar to LLM packing.
  2. Addition of a pipeline for streaming STT + LLM (a naive sketch follows this list).
    • I'm at a loss as to how to approach this. If streaming STT functionality has already been added and I'm unaware of it, please let me know.
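
To make the gap concrete, the naive fallback without encoder caching is to re-transcribe the growing audio buffer on every incoming chunk with stock Whisper, which is quadratic overall. The sketch below uses only the existing public Whisper classes and is what the use_cache work described above would replace:

```python
import numpy as np
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

audio_buffer = np.zeros(0, dtype=np.float32)

def on_audio_chunk(chunk_16khz: np.ndarray) -> str:
    """Naive streaming: re-encode the whole buffer every time a chunk arrives."""
    global audio_buffer
    audio_buffer = np.concatenate([audio_buffer, chunk_16khz])
    feats = processor(audio_buffer, sampling_rate=16000,
                      return_tensors="pt").input_features
    with torch.no_grad():
        ids = model.generate(feats)  # full re-decode; no cached encoder states
    return processor.batch_decode(ids, skip_special_tokens=True)[0]
```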

Since MiniCPM-o is trained with image + audio to text,
there doesn't seem to be significant difficulty in adding the model and processors.

The biggest challenge is the inference pipeline.

The text decoder's generate utilities need to be integrated with the streaming STT pipeline so that they work together. Additionally, during inference, there is a need for an active pipeline capable of performing TTS on the predicted text.

With these features, could it all be integrated into the existing Transformers library?
Or should we first add only the model architecture required for training and then open a separate PR later to create the pipeline?

What are your thoughts as core maintainers?

@zucchini-nlp (Member)

@jp1924 really cool insights, thanks for laying it out!

> Streaming Speech2Text Section

We don't support streaming for audio models yet, right. In terms of generation code, I don't think it is planned in the near future; generation work will focus more on making text streaming better, unless the audio team plans to add support (cc @eustlb). My personal opinion: not worth the hassle currently. We are also adding the similar Qwen-Omni model with no streaming generation.

> The biggest challenge is the inference pipeline.

Yeah, making generation work will be the hard part, and we will most probably override and add custom generation on top. I am not familiar with how MiniCPM generates, so correct me if I'm wrong. We can call super().generate(input_ids) to get text generation and then call self.tts(audio_inputs) in a custom loop to generate audio.

Also, it would be nice to have the ability to load the model without TTS for those who don't want to waste compute. This can be done by adding several classes, e.g. XXXForConditionalGeneration (a rough sketch follows).
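
A rough sketch of that idea, with entirely hypothetical names (MiniCPMOForConditionalGeneration, self.tts, and generate_audio are illustrative, not a committed API):

```python
import torch
from transformers import PreTrainedModel

class MiniCPMOForConditionalGeneration(PreTrainedModel):  # hypothetical name
    @torch.no_grad()
    def generate(self, input_ids=None, generate_audio=False, **kwargs):
        # Reuse the stock text-generation loop for the LLM backbone.
        text_ids = super().generate(input_ids=input_ids, **kwargs)
        if not generate_audio or getattr(self, "tts", None) is None:
            # Text-only path: the TTS weights are never loaded or touched.
            return text_ids
        # Custom loop: run the separately loaded TTS head on the generated text.
        audio = self.tts(text_ids)  # placeholder TTS call
        return text_ids, audio
```

This keeps the text-only classes usable on their own, with the TTS head as an opt-in add-on.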
