overlong filtering #3229
base: main
Conversation
Looks good!!
A few comments/questions
- ideally we would like some benchmarks to validate the results presented in the DAPO paper, but IMO it's OK to merge before we have these
- can you add a unit test for this setting?
- what happens if all completions are truncated?
- should we exclude these samples from the reward calculation as well, so that they don't bias the reward normalization?
- in the docs, could you mention that this principle was introduced in the DAPO paper?
- could you add the new metrics to the "logged metrics" section of the GRPO doc?
Thanks for the implementation, I will launch some benchmarks with it this morning.
I agree it makes sense to exclude truncated sequences from the advantage calculation as well, but perhaps that can be split into another PR?
Another comment: I think the implementation could be simplified by directly masking the completion_mask in the _generate_and_score_completions method, rather than having an additional truncated_samples tensor. This would mean compute_loss stays exactly the same, and the implementation should be compatible with the liger PR.
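For illustration, here is a minimal sketch of that suggestion: zero out the completion mask for any completion that never produced an EOS token, so truncated samples drop out of the loss without touching compute_loss. The tensor names (`completion_ids`, `completion_mask`, `has_eos`) are assumptions based on the discussion, not the exact trainer code.

```python
import torch

def mask_truncated(completion_ids: torch.Tensor,
                   completion_mask: torch.Tensor,
                   eos_token_id: int) -> torch.Tensor:
    # completion_ids / completion_mask are assumed to be (batch, seq_len).
    # A completion counts as truncated if it never emitted an EOS token.
    has_eos = (completion_ids == eos_token_id).any(dim=1)
    # Zero the mask for truncated completions so they contribute no loss tokens;
    # compute_loss can stay unchanged because it already multiplies by this mask.
    return completion_mask * has_eos.unsqueeze(1).to(completion_mask.dtype)
```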
Thanks for adding this stability feature @shirinyamani! I left some nits and a suggestion to remove the logging, since I think we already have that metric via @edbeeching's prior PR.
Let's see what the conclusions from Ed's experiments are before merging.
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
@@ -126,7 +126,8 @@ $$
\text{clip}\left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,< t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,< t})}, 1 - \epsilon, 1 + \epsilon \right)
$$
A higher value means more tokens were affected by clipping, limiting how much the policy can change.

- `mask_truncated_completions`: Completions that were truncated for exceeding the maximum completion length are excluded from the loss calculation. This feature was introduced in the [DAPO paper](https://huggingface.co/papers/2503.14476) to improve training stability.
- `epsilon_high`: The upper bound of the clipping range. If set to a higher value (e.g. 0.28), the policy has more room for exploration.
is it logged somewhere?
@@ -126,7 +126,8 @@ $$
\text{clip}\left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,< t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,< t})}, 1 - \epsilon, 1 + \epsilon \right)
$$
A higher value means more tokens were affected by clipping, limiting how much the policy can change.

- `mask_truncated_completions`: Completions that were truncated for exceeding the maximum completion length are excluded from the loss calculation. This feature was introduced in the [DAPO paper](https://huggingface.co/papers/2503.14476) to improve training stability.
I think you don't log it anymore
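For readers following the doc snippet above, a minimal usage sketch of the new setting might look like the following, assuming `mask_truncated_completions` is exposed as a `GRPOConfig` argument as the diff suggests; the surrounding arguments are just illustrative.

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="grpo-overlong-filtering",
    max_completion_length=256,
    # Exclude completions truncated at max_completion_length from the loss,
    # following the overlong-filtering idea from the DAPO paper.
    mask_truncated_completions=True,
)
```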
What does this PR do?
In RL training, we typically set a maximum length for generation, and overlong samples are truncated accordingly. The DAPO paper found that improper reward shaping for truncated samples can introduce reward noise and significantly disrupt training. To investigate the impact of this reward noise, the DAPO authors first applied an Overlong Filtering strategy, which masks the loss of truncated samples. They found that this approach significantly stabilizes training and enhances performance.
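As a rough sketch of the idea (not the trainer's actual code), overlong filtering simply drops truncated completions from the per-token loss average. The names `per_token_loss`, `completion_mask`, and `truncated` are illustrative, and the sketch also guards against the edge case raised above where every completion in the batch is truncated.

```python
import torch

def loss_with_overlong_filtering(per_token_loss: torch.Tensor,
                                 completion_mask: torch.Tensor,
                                 truncated: torch.Tensor) -> torch.Tensor:
    """Average the per-token loss while skipping truncated completions.

    per_token_loss:  (batch, seq_len) policy-gradient loss per completion token
    completion_mask: (batch, seq_len) 1 for real completion tokens, 0 for padding
    truncated:       (batch,) bool, True if the completion hit max length without EOS
    """
    # Keep only tokens from completions that finished naturally
    keep = completion_mask * (~truncated).unsqueeze(1)
    # Guard against the case where every completion is truncated
    denom = keep.sum().clamp(min=1)
    return (per_token_loss * keep).sum() / denom
```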
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.