overlong filtering #3229
base: main
Conversation
Looks good!!
A few comments/questions
- ideally we would like some benchmarks to validate the results presented in the DAPO paper, but IMO it's OK to merge before we have these
- can you add a unit test for this setting?
- what happens if all completions are truncated?
- should we exclude these samples from the reward calculation as well, so that they don't bias the reward normalization?
- in the docs, could you mention that this principle was introduced in the DAPO paper?
- could you add the new metrics to the "logged metrics" section of the GRPO doc?
Thanks for the implementation, I will launch some benchmarks with it this morning.
I agree it makes sense to exclude truncated sequences from the advantage calculation as well, but perhaps that can be split into another PR?
Another comment: I think the implementation could be simplified by directly masking the completion_mask in the _generate_and_score_completions method, rather than having an additional truncated_samples tensor. This would mean compute_loss stays exactly the same, and the implementation should be compatible with the liger PR.
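For illustration, here is a minimal sketch of that suggestion: zero out the completion mask for any completion that never produced an EOS token, so truncated samples drop out of the loss without touching compute_loss. The tensor names (`completion_ids`, `completion_mask`, `has_eos`) are assumptions based on the discussion, not the exact trainer code.

```python
import torch

def mask_truncated(completion_ids: torch.Tensor,
                   completion_mask: torch.Tensor,
                   eos_token_id: int) -> torch.Tensor:
    # completion_ids / completion_mask are assumed to be (batch, seq_len).
    # A completion counts as truncated if it never emitted an EOS token.
    has_eos = (completion_ids == eos_token_id).any(dim=1)
    # Zero the mask for truncated completions so they contribute no loss tokens;
    # compute_loss can stay unchanged because it already multiplies by this mask.
    return completion_mask * has_eos.unsqueeze(1).to(completion_mask.dtype)
```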
Thanks for adding this stability feature @shirinyamani! I left some nits and a suggestion to remove the logging, since I think we already have that metric via @edbeeching's prior PR.
Let's see what the conclusions from Ed's experiments are before merging.
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
@@ -126,7 +126,8 @@ $$
\text{clip}\left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,< t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,< t})}, 1 - \epsilon, 1 + \epsilon \right)
$$
A higher value means more tokens were affected by clipping, limiting how much the policy can change.

- `mask_truncated_completions`: Completions that were truncated for exceeding the maximum completion length are excluded from the loss calculation. This feature was introduced in the [DAPO paper](https://huggingface.co/papers/2503.14476) to improve training stability.
- `epsilon_high`: The upper bound of the clipping range. If set to a higher value (e.g. 0.28), the policy has more room for exploration.
is it logged somewhere?
@@ -126,7 +126,8 @@ $$
\text{clip}\left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,< t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,< t})}, 1 - \epsilon, 1 + \epsilon \right)
$$
A higher value means more tokens were affected by clipping, limiting how much the policy can change.

- `mask_truncated_completions`: Completions that were truncated for exceeding the maximum completion length are excluded from the loss calculation. This feature was introduced in the [DAPO paper](https://huggingface.co/papers/2503.14476) to improve training stability.
I think you don't log it anymore
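For readers following the doc snippet above, a minimal usage sketch of the new setting might look like the following, assuming `mask_truncated_completions` is exposed as a `GRPOConfig` argument as the diff suggests; the surrounding arguments are just illustrative.

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="grpo-overlong-filtering",
    max_completion_length=256,
    # Exclude completions truncated at max_completion_length from the loss,
    # following the overlong-filtering idea from the DAPO paper.
    mask_truncated_completions=True,
)
```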
What does this PR do?
In RL training, we typically set a maximum length for generation, and overlong samples are truncated accordingly. The DAPO paper found that improper reward shaping for truncated samples can introduce reward noise and significantly disrupt training. To investigate the impact of this reward noise, the DAPO authors first applied an Overlong Filtering strategy, which masks the loss of truncated samples. They found that this approach significantly stabilizes training and enhances performance.
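As a rough sketch of the idea (not the trainer's actual code), overlong filtering simply drops truncated completions from the per-token loss average. The names `per_token_loss`, `completion_mask`, and `truncated` are illustrative, and the sketch also guards against the edge case raised above where every completion in the batch is truncated.

```python
import torch

def loss_with_overlong_filtering(per_token_loss: torch.Tensor,
                                 completion_mask: torch.Tensor,
                                 truncated: torch.Tensor) -> torch.Tensor:
    """Average the per-token loss while skipping truncated completions.

    per_token_loss:  (batch, seq_len) policy-gradient loss per completion token
    completion_mask: (batch, seq_len) 1 for real completion tokens, 0 for padding
    truncated:       (batch,) bool, True if the completion hit max length without EOS
    """
    # Keep only tokens from completions that finished naturally
    keep = completion_mask * (~truncated).unsqueeze(1)
    # Guard against the case where every completion is truncated
    denom = keep.sum().clamp(min=1)
    return (per_token_loss * keep).sum() / denom
```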
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.