
Low loss but highly disordered output #1

Open
KaiWU5 opened this issue Feb 19, 2025 · 2 comments

Comments

@KaiWU5

KaiWU5 commented Feb 19, 2025

After fine-tuning Qwen2.5-VL, I got very low train and validation losses (around 0.67), but the output is disordered and worse than before.

Could you share some experience with training the Qwen2.5-VL series?

@sandy1990418
Owner

sandy1990418 commented Feb 20, 2025

@KaiWU5 Hi, I’m not entirely sure what you mean by “disordered” output—do you mean that when you use similar data, the results turn out unexpectedly bad? Or are you saying that after fine-tuning, your model performs worse even on Qwen2.5-VL’s original examples? Either way, here are some of my thoughts—hopefully, they address your concerns.

LoRA fine-tuning can sometimes lead to overfitting, where the model becomes too specialized to the fine-tuning dataset and loses its ability to generalize. This often happens when the dataset is too narrow, the LoRA rank is too high (e.g., 64 or 128), or the learning rate is too aggressive. Additionally, training for too many epochs can overwrite the model’s pre-trained knowledge, making performance worse instead of better.

To improve stability, consider increasing data diversity, lowering the LoRA rank (e.g., 8 or 16), and reducing the learning rate (e.g., 1e-5 instead of 1e-4). Using early stopping can help prevent overfitting, and mixing some of the original pre-training data with fine-tuning data can help maintain the model’s generalization ability.
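
If it helps, here is a minimal sketch of that setup with peft and transformers; the target modules, exact values, and output paths are assumptions to adapt to your data, not this repo's actual training script:

```python
# Minimal sketch of the conservative settings suggested above (illustrative values).
from peft import LoraConfig
from transformers import EarlyStoppingCallback, TrainingArguments

lora_config = LoraConfig(
    r=16,                    # keep the rank low (8 or 16) to limit overfitting
    lora_alpha=32,
    lora_dropout=0.05,
    # Assumption: adapting only the attention projections.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="qwen2_5_vl_lora",
    learning_rate=1e-5,              # gentler than the common 1e-4
    num_train_epochs=2,              # few epochs so pre-trained knowledge survives
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    eval_strategy="steps",
    eval_steps=200,
    save_strategy="steps",
    save_steps=200,
    load_best_model_at_end=True,     # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
)

# Pass this callback to Trainer(..., callbacks=[early_stopping]) so training
# stops once eval loss stops improving.
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)
```

With `load_best_model_at_end=True`, the Trainer restores the checkpoint with the best eval loss rather than the last (possibly overfit) one.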

Hope this helps! Looking forward to hearing your thoughts.

@KaiWU5
Author

KaiWU5 commented Feb 21, 2025

@sandy1990418 Thanks for your reply. My training is not LoRA fine-tuning but full-parameter tuning for one epoch on a large dataset like LLaVA-NeXT. I even trained Qwen2-VL and Qwen2.5-VL simultaneously to compare them.

Observations:

  • The losses for Qwen2.5-VL and Qwen2-VL are similar.
  • After fine-tuning on a LLaVA-NeXT-like SFT dataset, Qwen2.5-VL is a lot worse than Qwen2-VL (about 40 vs. 50 on OpenCompass, respectively, for the 3B models).
  • When I further test on open-domain questions, the fine-tuned Qwen2.5-VL exhibits more repetition and disordered output than Qwen2-VL.

Thoughts:
Before fine-tuning Qwen2-VL with transformers, there were bugs such as the 3D-RoPE actually using 2D-RoPE in transformers. I checked the implementation, and the RoPE seems right now (roughly the config check sketched below).
Could you share some thoughts on the architectural differences between Qwen2-VL and Qwen2.5-VL so that I can check? (There is a 4D attention concept in Qwen2.5-VL that I couldn't find in the code.)
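
For reference, this is the kind of check I mean; the model IDs are just examples, and the attribute layout may differ across transformers versions:

```python
# Quick sanity check that both models actually configure the 3D "mrope"
# (multimodal RoPE) sections instead of silently falling back to 1D RoPE.
from transformers import AutoConfig

for model_id in ("Qwen/Qwen2-VL-2B-Instruct", "Qwen/Qwen2.5-VL-3B-Instruct"):
    cfg = AutoConfig.from_pretrained(model_id)
    # Newer transformers versions nest the text parameters under text_config.
    text_cfg = getattr(cfg, "text_config", cfg)
    print(model_id, getattr(text_cfg, "rope_scaling", None))
    # Expect something like {'type': 'mrope', 'mrope_section': [16, 24, 24]}
```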
