add MiniCPM-o #37029

Open
2 tasks done
jp1924 opened this issue Mar 27, 2025 · 6 comments · May be fixed by #37049

@jp1924 (Contributor) commented Mar 27, 2025

Model description

As discussed in #31836, I would like to add the MiniCPM-o model.

I believe the MiniCPM family has had a significant impact on the LMM and LLM fields.

Currently, the MiniCPM-o code is only available as remote code on the Hugging Face Hub, which makes maintenance very difficult.
Therefore, I want to add models like MiniCPM-o to Transformers so they can receive ongoing support and maintenance.

While many vision LMMs are available on Hugging Face, a considerable number of any-to-any models,
such as Qwen2.5-Omni-7B, MiniCPM-o-2_6, and Janus-Pro-7B, are implemented in their own standalone repositories.
I want to implement these any-to-any models in Transformers so that they can leverage the various features of Hugging Face.

Additionally, adding an any-to-any pipeline could serve as a good template for future any-to-any models.

If you have any good suggestions, let me know!

Open source status

  • The model implementation is available
  • The model weights are available

Provide useful links for the implementation

https://github.com/OpenBMB/MiniCPM-o
https://huggingface.co/openbmb/MiniCPM-o-2_6

@zucchini-nlp (Member)

cc @eustlb, I remember you were working on it.

@jecrs commented Mar 27, 2025

I'll create a PR for this model and see if they accept it; fingers crossed for you!

@eustlb (Contributor) commented Mar 27, 2025

I would love to review such a PR! 🤗 I haven't gotten to it yet; it's been stuck on my TODO list.

@jecrs commented Mar 27, 2025

@eustlb Haha, thank you! TODOs are a true prison. It will be just a quick and dirty stub; I hope you find it of acceptable quality :)

@jecrs linked a pull request (#37049) on Mar 27, 2025 that will close this issue.
@jp1924 (Contributor, Author) commented Apr 3, 2025

First of all, sorry for the delayed response.
@zucchini-nlp @eustlb Thank you for the positive responses. @jecrs Thank you for your enthusiasm.

Before starting, I analyzed the MiniCPM-o model architecture and the training pipeline code,
and I determined that the following tasks are necessary:

Image Section (can be considered MiniCPM-V):

  1. Development of an ImageProcessor for MiniCPM-o.
    • An ImageProcessor incorporating the High-Resolution Image Partition Strategy from LLaVA-UHD.
  2. Addition of a SigLIP-1 model supporting NaViT-style native resolution.
    • Considering that SigLIP-2, which supports dynamic resolution, was recently introduced, code is needed to load it via AutoModel.
  3. Implementation of the Token Compression module (Resampler) from LLaVA-UHD for MiniCPM-o (a rough sketch follows this list).
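
For concreteness, here is a minimal sketch of a perceiver-style Resampler: a fixed set of learnable queries cross-attends to the variable-length vision features, so every image slice costs the LLM a fixed token budget. All names and dimensions are illustrative rather than the actual MiniCPM-o configuration, and the real module also adds 2D positional embeddings for the sliced patches:

```python
import torch
import torch.nn as nn

class Resampler(nn.Module):
    """Compress a variable number of vision tokens down to `num_queries`
    tokens via cross-attention (perceiver-style), so each image slice
    produces a fixed number of tokens regardless of its resolution."""

    def __init__(self, num_queries=64, vision_dim=1152, hidden_dim=3584, num_heads=8):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(num_queries, hidden_dim))
        self.kv_proj = nn.Linear(vision_dim, hidden_dim, bias=False)
        self.ln_q = nn.LayerNorm(hidden_dim)
        self.ln_kv = nn.LayerNorm(hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, seq_len, vision_dim); seq_len varies per image
        kv = self.ln_kv(self.kv_proj(vision_feats))
        q = self.ln_q(self.query).unsqueeze(0).expand(vision_feats.size(0), -1, -1)
        out, _ = self.attn(q, kv, kv)  # (batch, num_queries, hidden_dim)
        return out
```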

Text2Speech Section:

  1. The TTS module should be trained separately from MiniCPM-o.
    • During inference, it's sufficient to load the TTS module separately and run inference with it.
  2. Addition of a pipeline for LMM + TTS inference (a rough sketch follows this list).
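
As a strawman for item 2, the pipeline would essentially wrap a two-stage flow like the one below; `tts.synthesize` is a placeholder for whatever interface the separately loaded TTS module ends up exposing:

```python
import torch

def generate_speech(lmm, tts, processor, text, images=None):
    """Two-stage any-to-any inference: the LMM produces text, the TTS vocalizes it."""
    inputs = processor(text=text, images=images, return_tensors="pt")
    # Stage 1: the multimodal LMM generates a text response.
    with torch.no_grad():
        output_ids = lmm.generate(**inputs, max_new_tokens=256)
    response = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
    # Stage 2: the separately loaded TTS module renders the response as audio.
    audio = tts.synthesize(response)  # placeholder API, not a real Transformers call
    return response, audio
```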

Streaming Speech2Text Section (the most difficult):

  1. Whisper with the use_cache feature is required.
    • This seems to require opening a separate PR.
    • This is just my personal opinion, but I think it's also necessary to add FlashAttention support or position_ids to Whisper.
    • This would allow audio to be processed in a padding-free manner, similar to LLM packing.
  2. Addition of a pipeline for streaming STT + LLM (a naive sketch follows this list).
    • I'm at a loss as to how to approach this. If streaming STT functionality has already been added and I'm unaware of it, please let me know.
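
To make the gap concrete, the naive fallback without encoder caching is to re-transcribe the growing audio buffer on every incoming chunk with stock Whisper, which is quadratic overall. The sketch below uses only the existing public Whisper classes and is what the use_cache work described above would replace:

```python
import numpy as np
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

audio_buffer = np.zeros(0, dtype=np.float32)

def on_audio_chunk(chunk_16khz: np.ndarray) -> str:
    """Naive streaming: re-encode the whole buffer every time a chunk arrives."""
    global audio_buffer
    audio_buffer = np.concatenate([audio_buffer, chunk_16khz])
    feats = processor(audio_buffer, sampling_rate=16000,
                      return_tensors="pt").input_features
    with torch.no_grad():
        ids = model.generate(feats)  # full re-decode; no cached encoder states
    return processor.batch_decode(ids, skip_special_tokens=True)[0]
```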

Since MiniCPM-o is trained with image + audio to text,
there doesn't seem to be significant difficulty in adding the model and processors.

The biggest challenge is the inference pipeline.

The text decoder's generate utilities need to be integrated with the streaming STT pipeline so that they work together. Additionally, during inference, there is a need for an active pipeline capable of performing TTS on the predicted text.

With these features, could it all be integrated into the existing Transformers library?
Or should we first add only the model architecture required for training and then open a separate PR later to create the pipeline?

What are your thoughts as core maintainers?

@zucchini-nlp (Member)

@jp1924 really cool insights, thanks for laying it out!

> Streaming Speech2Text Section

We don't support streaming for audio models yet, right. In terms of generation code, I don't think it is planned in the near future; generation work will focus more on making text streaming better, unless the audio team plans to add support (cc @eustlb). My personal opinion: not worth the hassle currently. We are also adding the similar Qwen-Omni model with no streaming generation.

> The biggest challenge is the inference pipeline.

Yeah, making generation work will be the hard part, and we will most probably override and add custom generation on top. I am not familiar with how MiniCPM generates, so correct me if I'm wrong. We can call super().generate(input_ids) to get text generation and then call self.tts(audio_inputs) in a custom loop to generate audio.

Also, it would be nice to have the ability to load the model without TTS for those who don't want to waste compute. This can be done by adding several classes, e.g. XXXForConditionalGeneration (a rough sketch follows).
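
A rough sketch of that idea, with entirely hypothetical names (MiniCPMOForConditionalGeneration, self.tts, and generate_audio are illustrative, not a committed API):

```python
import torch
from transformers import PreTrainedModel

class MiniCPMOForConditionalGeneration(PreTrainedModel):  # hypothetical name
    @torch.no_grad()
    def generate(self, input_ids=None, generate_audio=False, **kwargs):
        # Reuse the stock text-generation loop for the LLM backbone.
        text_ids = super().generate(input_ids=input_ids, **kwargs)
        if not generate_audio or getattr(self, "tts", None) is None:
            # Text-only path: the TTS weights are never loaded or touched.
            return text_ids
        # Custom loop: run the separately loaded TTS head on the generated text.
        audio = self.tts(text_ids)  # placeholder TTS call
        return text_ids, audio
```

This keeps the text-only classes usable on their own, with the TTS head as an opt-in add-on.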
