
overlong filtering #3229

Open · wants to merge 8 commits into main

Conversation

shirinyamani (Member)

What does this PR do?

In RL training, we typically set a maximum generation length and truncate overlong samples accordingly. The DAPO paper found that improper reward shaping for these truncated samples can introduce reward noise and significantly disrupt training. To address this, DAPO applies an Overlong Filtering strategy that masks the loss of truncated samples, which the authors find significantly stabilizes training and enhances performance.
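
To make the mechanism concrete, here is a toy sketch of the masking (illustrative tensors only; variable names mirror common GRPO trainer conventions, and this is not this PR's actual implementation):

```python
import torch

# Toy batch: per-token losses for three completions; the last one hit the
# maximum generation length without emitting EOS, i.e. it was truncated.
per_token_loss = torch.tensor([
    [0.8, 0.5, 0.3, 0.0],
    [0.6, 0.4, 0.2, 0.1],
    [0.9, 0.7, 0.6, 0.5],
])
completion_mask = torch.tensor([
    [1, 1, 1, 0],   # ended early with EOS
    [1, 1, 1, 1],
    [1, 1, 1, 1],   # truncated completion
])
is_truncated = torch.tensor([False, False, True])

# Overlong filtering: zero the mask of truncated completions so they
# contribute nothing to the loss.
completion_mask = completion_mask * (~is_truncated).unsqueeze(-1).int()

# Masked mean over the remaining tokens; the clamp guards against an
# all-truncated batch, where the mask would sum to zero.
loss = (per_token_loss * completion_mask).sum() / completion_mask.sum().clamp(min=1)
print(loss)  # averages only over the first two completions' tokens
```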

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@qgallouedec (Member) left a comment:

Looks good!!
A few comments/questions

  • ideally we would like some benchmarks to validate the results presented in the DAPO paper, but IMO it's ok to merge before we have these
  • can you add a unittest for this setting? (see the sketch after this list)
  • what happens if all completions are truncated?
  • should we exclude those samples from the reward calculation as well, so that they don't bias the reward normalization?
  • in the doc, could you mention that this principle was introduced in the DAPO paper?
  • could you add the new metrics to the "logged metrics" section of the GRPO doc?
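
As a starting point for the unittest question above, a rough sketch covering the all-truncated edge case (the tensor contents and the inline masking logic are illustrative assumptions, not the repository's actual test code):

```python
import unittest

import torch


class TestOverlongFiltering(unittest.TestCase):
    def test_all_completions_truncated(self):
        # Edge case raised above: no completion contains EOS, so every row is masked.
        eos_token_id = 2
        completion_ids = torch.tensor([[5, 6, 7], [8, 9, 10]])  # no EOS anywhere
        completion_mask = torch.ones_like(completion_ids)
        is_truncated = ~(completion_ids == eos_token_id).any(dim=-1)
        completion_mask = completion_mask * (~is_truncated).unsqueeze(-1).int()
        # With every completion truncated, the loss mask is empty; the trainer
        # should guard against dividing by zero when normalizing the loss.
        self.assertEqual(completion_mask.sum().item(), 0)


if __name__ == "__main__":
    unittest.main()
```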

@edbeeching (Collaborator) left a comment:

Thanks for the implementation, I will launch some benchmarks with it this morning.

I agree it makes sense to exclude truncated sequences from the advantage calculation as well. But perhaps that can be split into another PR?

Another comment: I think the implementation could be simplified by directly masking the completion_mask in the _generate_and_score_completions method, rather than keeping an additional truncated_samples tensor. This would mean compute_loss stays exactly the same, and the implementation should be compatible with the liger PR.
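
For illustration, a rough sketch of what that simplification might look like (names such as `completion_ids`, `completion_mask`, and the `mask_truncated_completions` flag are assumptions based on the trainer's conventions, not this PR's actual diff):

```python
# Hypothetical placement inside _generate_and_score_completions, right after
# completion_mask is built from the EOS positions:
if self.mask_truncated_completions:  # assumed config flag name
    # Completions containing no EOS token were cut off at the maximum length.
    is_truncated = ~(completion_ids == self.processing_class.eos_token_id).any(dim=-1)
    # Zero their rows in completion_mask so they drop out of the loss;
    # compute_loss itself then needs no changes.
    completion_mask = completion_mask * (~is_truncated).unsqueeze(-1).int()
```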

@lewtun (Member) left a comment:

Thanks for adding this stability feature @shirinyamani! I left some nits and a suggestion to remove the logging, since I think we already have the metric via @edbeeching's prior PR.

Let's see what the conclusions from Ed's experiments are before merging.

shirinyamani and others added 6 commits April 4, 2025 09:21
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
@@ -126,7 +126,8 @@ $$
\text{clip}\left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,< t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,< t})}, 1 - \epsilon, 1 + \epsilon \right)
$$
A higher value means more tokens were affected by clipping, limiting how much the policy can change.

- `mask_truncated_completions`: Completions that were truncated because they exceeded the maximum completion length; these are excluded from the loss calculation. This feature was introduced in the [DAPO paper](https://huggingface.co/papers/2503.14476) to improve training stability.
- `epsilon_high`: The upper bound of the clipping range. If set to a higher value (e.g. 0.28), the policy is given more room for exploration.

Review comment (Member):

is it logged somewhere?

@@ -126,7 +126,8 @@ $$
\text{clip}\left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,< t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,< t})}, 1 - \epsilon, 1 + \epsilon \right)
$$
A higher value means more tokens were affected by clipping, limiting how much the policy can change.

- `mask_truncated_completions`: Completions that were truncated because they exceeded the maximum completion length; these are excluded from the loss calculation. This feature was introduced in the [DAPO paper](https://huggingface.co/papers/2503.14476) to improve training stability.

Review comment (Member):

I think you don't log it anymore
