Multimodal support in llama-server for Gemma 3
#12885
andportnoy started this conversation in General
-
@ngxson Thank you so much for continuing to push multimodal support forward with PRs such as #12849. Is support in llama-server on your roadmap, in particular for models like Gemma 3? What would implementing it involve, given the new libmtmd library? Thank you again for your work.
-
Bringing mtmd to the server is easy; the hard part is managing the KV cache with non-text tokens across requests. Currently we use a common-prefix algorithm to determine how many tokens in the KV cache can be reused, but computing a common prefix over an image is obviously not that simple. I will have a look into it today and open a PR so everyone can discuss.
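
For readers following along, here is a minimal sketch of the problem being described, using hypothetical types and names (not llama-server's actual data structures): text tokens can be compared one by one to find a reusable prefix, but an image chunk occupies a block of KV cells that can only be reused all-or-nothing, for example keyed by a hash of the input image.

```cpp
// Hypothetical sketch, not llama.cpp's real API: why common-prefix KV reuse
// gets harder once a prompt contains image chunks.
#include <algorithm>
#include <cstdint>
#include <cstddef>
#include <vector>

// A prompt is a sequence of chunks: either a single text token or an
// opaque image embedding spanning several KV cells.
struct Chunk {
    bool     is_image;   // false = text token, true = image chunk
    int32_t  token;      // valid when is_image == false
    uint64_t image_hash; // hash of the raw image bytes, valid when is_image == true
    size_t   n_cells;    // KV cells this chunk occupies (1 for a text token)
};

// Count how many leading KV cells of the cached prompt can be reused for the
// incoming prompt. Text tokens are compared one by one; an image chunk is
// reusable only if the whole image matches (same hash, same size) -- there is
// no meaningful partial match inside an image embedding.
static size_t reusable_kv_cells(const std::vector<Chunk> & cached,
                                const std::vector<Chunk> & incoming) {
    size_t n_reuse = 0;
    const size_t n = std::min(cached.size(), incoming.size());
    for (size_t i = 0; i < n; ++i) {
        const Chunk & a = cached[i];
        const Chunk & b = incoming[i];
        if (a.is_image != b.is_image) break;
        if (a.is_image) {
            if (a.image_hash != b.image_hash || a.n_cells != b.n_cells) break; // all-or-nothing
        } else {
            if (a.token != b.token) break;
        }
        n_reuse += a.n_cells;
    }
    return n_reuse;
}
```

The awkward step is exactly that all-or-nothing comparison: the server has to decide whether two requests contain "the same image" (e.g. by hashing the uploaded bytes) rather than comparing token IDs, and any mismatch invalidates every cached cell that follows.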