Gemma3: when adding new tokens, `<image_soft_token>` is added accidentally #37011
Comments
cc @ArthurZucker @itazap for tokenizers
This is happening when using `mlx_lm.lora` as well, even without adding custom tokens. The resulting fine-tune outputs a vocab size of 262145, as opposed to the 262144 specified in `config.json`. When attempting to use the resulting adapter, ollama fails validation.
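One quick way to see this kind of mismatch is to compare the checkpoint's `config.json` against the saved tokenizer. A minimal sketch (the checkpoint path below is hypothetical, and gated repos may require an HF token):

```python
from transformers import AutoConfig, AutoTokenizer

# Hypothetical path to a fine-tuned Gemma 3 checkpoint
ckpt = "path/to/finetuned-gemma3"

config = AutoConfig.from_pretrained(ckpt)
tokenizer = AutoTokenizer.from_pretrained(ckpt)

# If these disagree (e.g. 262144 vs 262145), downstream tools that
# validate vocab sizes against config.json may reject the checkpoint.
print("config.vocab_size:", config.vocab_size)
print("len(tokenizer):   ", len(tokenizer))
```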
Hello @Serzhanov, took a deeper look and the `<image_soft_token>` token is already part of the tokenizer's vocabulary, so it is not being introduced by your code. You can verify this with:

```python
from transformers import AutoTokenizer, BitsAndBytesConfig, Gemma3ForCausalLM
import torch

model_id = "google/gemma-3-1b-it"
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = Gemma3ForCausalLM.from_pretrained(
    model_id, quantization_config=quantization_config, token='#your token'
)
tokenizer = AutoTokenizer.from_pretrained(model_id, token='#your token')

print(len(tokenizer))
print(model.vocab_size)
print(f"'<image_soft_token>' in tokenizer vocab: {'<image_soft_token>' in tokenizer.vocab}")
```

The behavior you're seeing in terms of the vocab size being +4 is because `<image_soft_token>` was already present in the tokenizer, not because your call added it. It's important to note that the model's `vocab_size` and `len(tokenizer)` are not guaranteed to match.
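A concrete way to confirm that the token is pre-existing rather than newly added (a small sketch, assuming access to the gated `google/gemma-3-1b-it` repo): `add_tokens` returns the number of tokens it actually added, and it returns 0 for a token that is already in the vocab.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")

# Adding a token that already exists is a no-op: add_tokens reports 0 new tokens.
added = tokenizer.add_tokens(["<image_soft_token>"])
print(added)  # 0
print(tokenizer.convert_tokens_to_ids("<image_soft_token>"))  # its pre-existing id
```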
@devdevgoat I'm not familiar with the validation issue you're describing, do you have a code snippet to reproduce the failure?
@itazap Hello, thank you for the clarification. I see your point; that makes sense. I'm wondering, though: do you think this behavior should raise a warning? In most models and previous versions of Gemma, like gemma-2b, `vocab_size` and `len(tokenizer)` are the same, so this discrepancy might catch some people off guard.
This is true for a lot of models (Bloom, Phi, Gemma, etc.), and so I agree with you that it's important to have a good explanation of how a model's `vocab_size` relates to `len(tokenizer)`.
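For anyone hitting this later, the discrepancy can be inspected without downloading any weights (a sketch; both repos are gated, so a HF token may be needed):

```python
from transformers import AutoConfig, AutoTokenizer

# Compare the config's declared vocab_size with the tokenizer length
# across Gemma generations.
for model_id in ["google/gemma-2b", "google/gemma-3-1b-it"]:
    config = AutoConfig.from_pretrained(model_id)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    print(model_id, "config.vocab_size:", config.vocab_size,
          "len(tokenizer):", len(tokenizer))
```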
@itazap Great, I can close the issue now.
System Info
Hello,
When adding custom tokens to the `gemma_3b_1_it` tokenizer, an unexpected token (`<image_soft_token>`) appears in the model's embedding matrix, even though it was not explicitly added.

Who can help?
No response
Information
Tasks
An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
Model: `google/gemma-3-1b-it`
To Reproduce:
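(The original snippet did not survive extraction; the following is a minimal sketch of this kind of reproduction, assuming tokens are added via `add_tokens` and the embedding matrix is resized via `resize_token_embeddings`. The custom token names are hypothetical.)

```python
from transformers import AutoTokenizer, Gemma3ForCausalLM

model_id = "google/gemma-3-1b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = Gemma3ForCausalLM.from_pretrained(model_id)

# Hypothetical custom tokens, for illustration only
new_tokens = ["<my_token_1>", "<my_token_2>", "<my_token_3>"]
num_added = tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

print(num_added)                                   # number of tokens actually added
print(len(tokenizer))                              # tokenizer length after adding
print(model.get_input_embeddings().weight.shape)   # embedding matrix size
print("<image_soft_token>" in tokenizer.vocab)     # True: already present
```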
Output:
Expected behavior
Only the explicitly added custom tokens should appear; `<image_soft_token>` should not show up in the tokenizer or the embedding matrix unless it is added deliberately.