
Low loss but highly disordered output #1

Open
KaiWU5 opened this issue Feb 19, 2025 · 2 comments

Comments

@KaiWU5

KaiWU5 commented Feb 19, 2025

After fine-tuning Qwen2.5-VL, I got very low train and validation losses (around 0.67), but the output is disordered and worse than before.

Could you share some experience with training the Qwen2.5-VL series?

@sandy1990418
Owner

sandy1990418 commented Feb 20, 2025

@KaiWU5 Hi, I’m not entirely sure what you mean by “disordered” output—do you mean that when you use similar data, the results turn out unexpectedly bad? Or are you saying that after fine-tuning, your model performs worse even on Qwen2.5-VL’s original examples? Either way, here are some of my thoughts—hopefully, they address your concerns.

LoRA fine-tuning can sometimes lead to overfitting, where the model becomes too specialized to the fine-tuning dataset and loses its ability to generalize. This often happens when the dataset is too narrow, the LoRA rank is too high (e.g., 64 or 128), or the learning rate is too aggressive. Additionally, training for too many epochs can overwrite the model’s pre-trained knowledge, making performance worse instead of better.

To improve stability, consider increasing data diversity, lowering the LoRA rank (e.g., 8 or 16), and reducing the learning rate (e.g., 1e-5 instead of 1e-4). Using early stopping can help prevent overfitting, and mixing some of the original pre-training data with fine-tuning data can help maintain the model’s generalization ability.
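
If it helps, here is a minimal sketch of that setup with peft and transformers; the target modules, exact values, and output paths are assumptions to adapt to your data, not this repo's actual training script:

```python
# Minimal sketch of the conservative settings suggested above (illustrative values).
from peft import LoraConfig
from transformers import EarlyStoppingCallback, TrainingArguments

lora_config = LoraConfig(
    r=16,                    # keep the rank low (8 or 16) to limit overfitting
    lora_alpha=32,
    lora_dropout=0.05,
    # Assumption: adapting only the attention projections.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="qwen2_5_vl_lora",
    learning_rate=1e-5,              # gentler than the common 1e-4
    num_train_epochs=2,              # few epochs so pre-trained knowledge survives
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    eval_strategy="steps",
    eval_steps=200,
    save_strategy="steps",
    save_steps=200,
    load_best_model_at_end=True,     # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
)

# Pass this callback to Trainer(..., callbacks=[early_stopping]) so training
# stops once eval loss stops improving.
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)
```

With `load_best_model_at_end=True`, the Trainer restores the checkpoint with the best eval loss rather than the last (possibly overfit) one.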

Hope this helps! Looking forward to hearing your thoughts.

@KaiWU5
Author

KaiWU5 commented Feb 21, 2025

@sandy1990418 Thanks for your reply. My training is not LoRA fine-tuning but full-parameter tuning for one epoch on a large dataset like LLaVA-NeXT. I even trained Qwen2-VL and Qwen2.5-VL simultaneously to compare them.

Observations:

  • The losses for Qwen2.5-VL and Qwen2-VL are similar.
  • After fine-tuning on a LLaVA-NeXT-like SFT dataset, Qwen2.5-VL is a lot worse than Qwen2-VL (about 40 vs. 50 on OpenCompass, respectively, for the 3B models).
  • When I further test on open-domain questions, the fine-tuned Qwen2.5-VL exhibits more repetition and disordered output than Qwen2-VL.

Thoughts:
Before fine-tuning Qwen2-VL with transformers, there were bugs such as the 3D-RoPE actually using 2D-RoPE in transformers. I checked the implementation, and the RoPE seems right now (roughly the config check sketched below).
Could you share some thoughts on the architectural differences between Qwen2-VL and Qwen2.5-VL so that I can check? (There is a 4D attention concept in Qwen2.5-VL that I couldn't find in the code.)
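
For reference, this is the kind of check I mean; the model IDs are just examples, and the attribute layout may differ across transformers versions:

```python
# Quick sanity check that both models actually configure the 3D "mrope"
# (multimodal RoPE) sections instead of silently falling back to 1D RoPE.
from transformers import AutoConfig

for model_id in ("Qwen/Qwen2-VL-2B-Instruct", "Qwen/Qwen2.5-VL-3B-Instruct"):
    cfg = AutoConfig.from_pretrained(model_id)
    # Newer transformers versions nest the text parameters under text_config.
    text_cfg = getattr(cfg, "text_config", cfg)
    print(model_id, getattr(text_cfg, "rope_scaling", None))
    # Expect something like {'type': 'mrope', 'mrope_section': [16, 24, 24]}
```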
