
Commit 947e33a

Merge branch 'main' of github.com:NVIDIA/nemo-rl into ashors/cleanup
2 parents 7e53c72 + 7d8ce74 commit 947e33a

File tree: 54 files changed, +1096 -59 lines


README.md

Lines changed: 5 additions & 2 deletions

@@ -33,16 +33,19 @@ What you can expect:
 - **Flexibility** with a modular design that allows easy integration and customization.
 - **Comprehensive documentation** that is both detailed and user-friendly, with practical examples.
 
+## 📣 News
+* [5/14/2025] [Reproduce DeepscaleR with NeMo RL!](docs/guides/grpo-deepscaler.md)
+
 ## Features
 
 _Available now_ | 🔜 _Coming in v0.3_
 
 - **Fast Generation** - vLLM backend for optimized inference.
 - **HuggingFace Integration** - Works with 1-32B models (Qwen2.5, Llama).
-- **Distributed Training** - FSDP support and Ray-based infrastructure.
+- **Distributed Training** - Fully Sharded Data Parallel (FSDP) support and Ray-based infrastructure.
 - **Environment Support** - Support for multi-environment training.
 - **Learning Algorithms** - GRPO (Group Relative Policy Optimization), SFT (Supervised Fine-Tuning), and DPO (Direct Preference Optimization).
-- **Multi-Turn RL** - multi-turn generation and training for RL with tool use, games, etc.
+- **Multi-Turn RL** - Multi-turn generation and training for RL with tool use, games, etc.
 - **Large Model Support** - Native PyTorch support for models up to 32B parameters.
 - **Advanced Parallelism** - PyTorch native FSDP2, TP, and SP for efficient training.
 - **Worker Isolation** - Process isolation between RL Actors (no worries about global state).
(Binary image assets added: 205 KB, 377 KB, and 34.5 KB.)

docs/guides/grpo-deepscaler.md

Lines changed: 46 additions & 0 deletions

# GRPO on DeepScaler

This guide explains how to use NeMo RL to train long Chain of Thought (CoT) reasoning models with Group Relative Policy Optimization (GRPO). To do so, we train [DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) on the [DeepScaleR](https://huggingface.co/datasets/agentica-org/DeepScaleR-Preview-Dataset) dataset. We then show how to use NeMo RL's evaluation scripts to evaluate the trained model on the [AIME24](https://huggingface.co/datasets/HuggingFaceH4/aime_2024) benchmark.

## Train the Model

We follow the DeepScaleR recipe and train the model in three stages: with an 8K context window in the first stage, a 16K context window in the second, and a 24K context window in the third.

To train the first stage using NeMo RL, use the `examples/configs/grpo-deepscaler-1.5b-8K.yaml` config file. This file closely matches the experiment settings in the original DeepScaleR recipe. We then train with `examples/configs/grpo-deepscaler-1.5b-16K.yaml` and `examples/configs/grpo-deepscaler-1.5b-24K.yaml` for the second and third stages, respectively.

```sh
uv run examples/run_grpo_math.py --config=examples/configs/grpo-deepscaler-1.5b-8K.yaml
uv run examples/run_grpo_math.py --config=examples/configs/grpo-deepscaler-1.5b-16K.yaml policy.model_name=/path/to/8K/checkpoint/hf
uv run examples/run_grpo_math.py --config=examples/configs/grpo-deepscaler-1.5b-24K.yaml policy.model_name=/path/to/16K/checkpoint/hf
```

At the end of each stage, you need a Hugging Face checkpoint to continue training from. To get it, convert the PyTorch distributed checkpoint to Hugging Face format with the following command:

```sh
uv run examples/convert_dcp_to_hf.py --config=results/grpo-deepscaler-1.5b-8K/step_240/config.yaml --dcp-ckpt-path=results/grpo-deepscaler-1.5b-8K/step_240/policy/weights --hf-ckpt-path=results/grpo-deepscaler-1.5b-8K/step_240/hf
```

When running the next stage's command, we pass this Hugging Face checkpoint as the initial checkpoint. We train with an 8K context window for 240 steps, a 16K context window for 290 steps, and a 24K context window for 50 steps. We run all experiments on a single 8xH100 80GB node or a single 8xA100 80GB node.

## Training Curve

When using the above commands, we get the following training curve:

![Training Performance](../assets/deepscaler_training_progress.png)

Notably, we achieve an average training reward of 0.65 in just 400 training steps.

## Evaluate the Model

Throughout training, model checkpoints are saved to the `results` folder (specified by `checkpointing.checkpoint_dir`). To evaluate the model, first convert the PyTorch distributed checkpoint to Hugging Face format as before. Then, to evaluate on the [AIME24 benchmark](https://huggingface.co/datasets/HuggingFaceH4/aime_2024), use the following command:

```sh
uv run examples/run_eval.py \
    generation.model_name=results/grpo-deepscaler-1.5b-8K/step_240/hf
```

Use `generation.model_name` to specify the path to the Hugging Face checkpoint. In addition, we use AIME24 as the validation dataset and calculate pass@1 on it throughout training.

## Evaluation Results

Using the above instructions to train DeepSeek-R1-Distill-Qwen-1.5B on the DeepScaleR dataset, we can track the model's performance on the AIME24 benchmark throughout training. The following plot shows the evaluation metrics as training progresses:

![AIME24 Performance](../assets/aime_training_progress.png)

We surpass OpenAI O1's performance on the AIME24 benchmark after about 600 training steps.
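
As background on the GRPO objective used in this recipe: the accompanying config (`examples/configs/grpo-deepscaler-1.5b-8K.yaml`, shown later in this commit) enables `normalize_rewards` and `use_leave_one_out_baseline`. The following NumPy snippet is a minimal sketch of how a group-relative, leave-one-out advantage can be computed for one prompt's sampled responses; it illustrates the general technique and is not NeMo RL's actual implementation.

```python
import numpy as np

def leave_one_out_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages for one prompt's G sampled responses.

    Each response is baselined against the mean reward of the other G-1
    responses (leave-one-out), then normalized by the group's reward std.
    """
    G = rewards.shape[0]
    loo_mean = (rewards.sum() - rewards) / (G - 1)  # baseline excludes the sample itself
    advantages = rewards - loo_mean
    return advantages / (rewards.std() + eps)       # analogous to normalize_rewards: true

# Example: 8 generations for one math prompt, reward 1.0 if the answer is correct else 0.0.
rewards = np.array([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])
print(leave_one_out_advantages(rewards))
```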

docs/guides/sft-openmathinstruct2.md

Lines changed: 93 additions & 0 deletions

# SFT on OpenMathInstruct-2

This guide explains how to use NeMo RL to run SFT on the [nvidia/OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2) math instruction tuning dataset. We then show how to use NeMo RL's evaluation scripts to evaluate the trained model on the [MATH-500 benchmark](https://huggingface.co/datasets/HuggingFaceH4/MATH-500).

## Train the Model

To train the model using NeMo RL, use the `examples/configs/recipes/tutorials/sft/sft_openmathinstruct2.yaml` config file. This file closely matches the experiment settings in the [original OpenMathInstruct-2 paper](https://arxiv.org/abs/2410.01560).

```
uv run examples/run_sft.py --config=examples/configs/sft_openmathinstruct2.yaml
```

### Dataset Splits

The OpenMathInstruct-2 dataset comes in several versions of different sizes. Configure which version to use via the `data.split` config:

* `train`: the full 14M problem–solution pairs
* `train_1M`, `train_2M`, `train_5M`: fair-downsampled subsets of 1M, 2M, or 5M examples

By default, the config uses the 1M subset (`data.split=train_1M`).
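
If you want to sanity-check these splits outside of NeMo RL's own data pipeline, a minimal sketch using the Hugging Face `datasets` library (assumed to be installed separately) looks like this:

```python
from datasets import load_dataset

# Load the same fair-downsampled subset the config selects by default (data.split=train_1M).
# Swap in "train", "train_2M", or "train_5M" to inspect the other versions.
ds = load_dataset("nvidia/OpenMathInstruct-2", split="train_1M")

print(len(ds))          # number of problem-solution pairs in this subset
print(ds.column_names)  # field names available for prompt/response construction
```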

### Training Time

The default config uses 8 GPUs (`cluster.gpus_per_node`) on 1 node (`cluster.num_nodes`), which should complete 1 epoch of training on the `train_1M` dataset (1855 steps) in around 20 hours. Additional nodes can be used to speed up training; in our experiments, using 8 nodes completed 1 epoch on `train_1M` in less than 4 hours.

## Evaluate the Model

Throughout training, model checkpoints are saved to the `results/sft_openmathinstruct2` folder (specified by `checkpointing.checkpoint_dir`). To evaluate the model, we first need to convert the PyTorch distributed checkpoint to Hugging Face format:

```
uv run examples/convert_dcp_to_hf.py \
    --config=results/sft_openmathinstruct2/step_1855/config.yaml \
    --dcp-ckpt-path=results/sft_openmathinstruct2/step_1855/policy/weights \
    --hf-ckpt-path=results/sft_openmathinstruct2/step_1855/hf
```

Replace `results/sft_openmathinstruct2/step_1855` with the path to the checkpoint you are evaluating. The resulting Hugging Face checkpoint will be saved to `--hf-ckpt-path`.

To evaluate on the [MATH-500 benchmark](https://huggingface.co/datasets/HuggingFaceH4/MATH-500), use the following command:

```
uv run examples/run_eval.py \
    --config=examples/configs/eval.yaml \
    generation.model_name=results/sft_openmathinstruct2/step_1855/hf \
    tokenizer.name=meta-llama/Llama-3.1-8B-Instruct \
    data.dataset_name=HuggingFaceH4/MATH-500 \
    data.dataset_key=test
```

Use `generation.model_name` to specify the path to the Hugging Face checkpoint.

## Results

In this section, we present the results of several reference experiments for the `train_1M` and `train` versions of the dataset.

### train_1M

Using the above instructions to train a Llama-3.1-8B model for 1 epoch on the `train_1M` version of the OpenMathInstruct-2 dataset, we get the following loss curve:

![image](../assets/sft-openmathinstruct2-train1M-loss.png)

Evaluating the final checkpoint on MATH-500, we get the following result (with `num_tests_per_prompt=1`, pass@1 is simply the fraction of problems answered correctly, here 251/500 = 0.502):

```
============================================================
model_name='hf' dataset_name='MATH-500'
max_new_tokens=2048 temperature=0.0 top_p=1.0 top_k=-1

metric='pass@1' num_tests_per_prompt=1

score=0.5020 (251.0/500)
============================================================
```

As a reference, using NeMo-Aligner and NeMo-Skills (as is done in the [original OpenMathInstruct-2 paper](https://arxiv.org/abs/2410.01560)) to train and evaluate the same model on the same dataset achieves the same score of 0.5020 on MATH-500.

### train

We also trained a Llama-3.1-8B model for 10,000 steps on the full `train` version of the OpenMathInstruct-2 dataset, obtaining the following loss curve:

![image](../assets/sft-openmathinstruct2-train-loss.png)

Evaluating the checkpoint after 10,000 steps of training on MATH-500, we get the following result:

```
============================================================
model_name='hf' dataset_name='MATH-500'
max_new_tokens=2048 temperature=0.0 top_p=1.0 top_k=-1

metric='pass@1' num_tests_per_prompt=1

score=0.5800 (290.0/500)
============================================================
```

Using NeMo-Aligner and NeMo-Skills to train the model in the same settings achieves a score of 0.5740 (287/500).

docs/index.md

Lines changed: 8 additions & 0 deletions

@@ -11,6 +11,13 @@ cluster.md
 ```
 
+```{toctree}
+:caption: 🚀 E2E Examples
+:hidden:
+
+guides/sft-openmathinstruct2.md
+```
+
 ```{toctree}
 :caption: 📚 Guides
 :hidden:
@@ -19,6 +26,7 @@ adding-new-models.md
 guides/sft.md
 guides/dpo.md
 guides/grpo.md
+guides/grpo-deepscaler.md
 guides/eval.md
 model-quirks.md
 ```

examples/configs/dpo.yaml

Lines changed: 3 additions & 0 deletions

@@ -52,6 +52,9 @@ policy:
     activation_checkpointing: false
     tensor_parallel_size: 1
 
+  dynamic_batching:
+    enabled: false
+
   # makes the training sequence length divisible by the tensor parallel size
   # this is useful for sequence parallel training
   make_sequence_length_divisible_by: ${policy.dtensor_cfg.tensor_parallel_size}

examples/configs/grpo-deepscaler-1.5b-16K.yaml

Lines changed: 10 additions & 0 deletions

# GRPO Algorithm Configuration
defaults: "grpo-deepscaler-1.5b-8K.yaml"

loss_fn:
  reference_policy_kl_penalty: 0.001
  ratio_clip_max: 0.28

policy:
  max_total_sequence_length: 16384
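
This stage raises only the upper clip ratio (`ratio_clip_max: 0.28`), while `ratio_clip_min` presumably stays at the 0.2 inherited from the 8K config. As a rough sketch of what an asymmetric clip range means in a PPO/GRPO-style surrogate loss, under the standard clipped-objective formulation (not NeMo RL's actual loss code):

```python
import torch

def clipped_policy_loss(logprobs, old_logprobs, advantages,
                        ratio_clip_min=0.2, ratio_clip_max=0.28):
    """PPO/GRPO-style surrogate with an asymmetric clip range.

    The importance ratio is clamped to [1 - ratio_clip_min, 1 + ratio_clip_max],
    so positive-advantage tokens get a little more room to increase in probability.
    """
    ratio = torch.exp(logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1.0 - ratio_clip_min, 1.0 + ratio_clip_max)
    # Token-level loss: take the more pessimistic of the clipped/unclipped objectives.
    return -torch.minimum(ratio * advantages, clipped * advantages).mean()

# Example with two tokens: one favored by a positive advantage, one penalized.
logprobs = torch.tensor([-0.5, -1.0])
old_logprobs = torch.tensor([-0.8, -0.9])
advantages = torch.tensor([1.0, -1.0])
print(clipped_policy_loss(logprobs, old_logprobs, advantages))
```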

examples/configs/grpo-deepscaler-1.5b-8K.yaml

Lines changed: 126 additions & 0 deletions

# GRPO Algorithm Configuration
grpo:
  num_prompts_per_step: 128
  num_generations_per_prompt: 8
  max_rollout_turns: 1 # for multi-turn rollouts. Math Environments just have 1 turn (answering the question)
  max_num_steps: 1000000
  normalize_rewards: true
  use_leave_one_out_baseline: true
  val_period: 10
  val_at_start: false
  max_val_samples: 480
  val_batch_size: 32

loss_fn:
  reference_policy_kl_penalty: 0.0
  ratio_clip_min: 0.2
  ratio_clip_max: 0.2
  ratio_clip_c: null
  # (default off) loss formulation improvements (docs/guides/grpo.md#loss)
  use_on_policy_kl_approximation: false
  use_importance_sampling_correction: false
  token_level_loss: true

checkpointing:
  enabled: true
  checkpoint_dir: "results/grpo"
  metric_name: "val_reward"
  higher_is_better: true
  keep_top_k: 10
  save_period: 10

policy:
  # Qwen/Qwen2.5-1.5B has tied weights which are only supported with dtensor policy with tp size 1 (https://github.com/NVIDIA/NeMo-RL/issues/227)
  model_name: "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
  tokenizer:
    name: ${policy.model_name} ## specify if you'd like to use a tokenizer different from the model's default
  train_global_batch_size: 64
  train_micro_batch_size: 1
  generation_batch_size: 32 # Only used when generating using HF backend
  logprob_batch_size: 4
  max_total_sequence_length: 8192
  precision: "bfloat16"
  fsdp_offload_enabled: false
  activation_checkpointing_enabled: false
  refit_buffer_size_gb: 4 # used for refitting inference engine, the unit is GB

  dtensor_cfg:
    enabled: true
    cpu_offload: False
    sequence_parallel: false
    activation_checkpointing: false
    tensor_parallel_size: 1

  # makes the training sequence length divisible by the tensor parallel size
  # this is useful for sequence parallel training
  make_sequence_length_divisible_by: ${policy.dtensor_cfg.tensor_parallel_size}
  max_grad_norm: 1.0

  optimizer:
    name: "torch.optim.AdamW"
    kwargs:
      lr: 2.0e-6
      weight_decay: 0.01
      betas: [0.9, 0.999]
      eps: 1e-8
      # when using Dtensor, we need to set foreach
      # and fused to False
      foreach: False
      fused: False

  scheduler:
    - name: "torch.optim.lr_scheduler.LinearLR"
      kwargs:
        start_factor: 0.1
        end_factor: 1.0
        total_iters: 50
    - name: "torch.optim.lr_scheduler.ConstantLR"
      kwargs:
        factor: 1.0
        total_iters: 10000000000
    - milestones: [50]

  generation:
    backend: "vllm"
    max_new_tokens: ${policy.max_total_sequence_length}
    temperature: 1.0
    top_p: 1.0
    top_k: null
    stop_token_ids: null
    stop_strings: null
    vllm_cfg:
      precision: ${policy.precision}
      tensor_parallel_size: 1
      gpu_memory_utilization: 0.6
      max_model_len: ${policy.max_total_sequence_length}
      # For most cases, use "dummy" to load the initial weights, since they will be overwritten during refit
      # For Gemma models, we need to use "auto" due to a vllm bug
      load_format: dummy

data:
  max_input_seq_length: ${policy.max_total_sequence_length} # upper bound, real truncation occurs at vllm.max_model_len
  prompt_file: "examples/prompts/cot.txt"
  system_prompt_file: null
  dataset_name: "DeepScaler"

env:
  math:
    num_workers: 16

logger:
  log_dir: "logs" # Base directory for all logs
  num_val_samples_to_print: 0 # Number of validation samples to pretty print on terminal
  wandb_enabled: false
  tensorboard_enabled: false
  monitor_gpus: false # If true, will monitor GPU usage and log to wandb and/or tensorboard
  wandb:
    project: "grpo-dev"
    name: "grpo-dev-logger"
  tensorboard: {}
  gpu_monitoring:
    collection_interval: 10 # How often to collect GPU usage metrics (in seconds)
    flush_interval: 10 # How often to flush GPU usage metrics to the loggers (in seconds)

cluster:
  gpus_per_node: 8
  num_nodes: 1
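
The `scheduler` entry above pairs a `LinearLR` warmup with a `ConstantLR` and a `milestones: [50]` entry. A plausible plain-PyTorch reading of that configuration is a `SequentialLR` that warms the learning rate up over the first 50 steps and then holds it constant; the sketch below illustrates that schedule and is not the code path NeMo RL itself runs.

```python
import torch
from torch.optim.lr_scheduler import ConstantLR, LinearLR, SequentialLR

# A toy parameter stands in for the policy model; optimizer kwargs mirror the config above.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.AdamW(params, lr=2.0e-6, weight_decay=0.01,
                              betas=(0.9, 0.999), eps=1e-8, foreach=False, fused=False)

warmup = LinearLR(optimizer, start_factor=0.1, end_factor=1.0, total_iters=50)
hold = ConstantLR(optimizer, factor=1.0, total_iters=10_000_000_000)
# milestones=[50]: hand off from the warmup scheduler to the constant one at step 50.
scheduler = SequentialLR(optimizer, schedulers=[warmup, hold], milestones=[50])

for _ in range(60):
    optimizer.step()
    scheduler.step()
print(scheduler.get_last_lr())  # ~2.0e-6 once the 50-step warmup has finished
```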

examples/configs/grpo-deepscaler-1.5b-24K.yaml

Lines changed: 39 additions & 0 deletions

# GRPO Algorithm Configuration
defaults: "grpo-deepscaler-1.5b-8K.yaml"

loss_fn:
  reference_policy_kl_penalty: 0.0001
  ratio_clip_min: 0.2
  ratio_clip_max: 0.28

policy:
  max_total_sequence_length: 24576

  dtensor_cfg:
    enabled: true
    cpu_offload: true
    sequence_parallel: true
    activation_checkpointing: true
    tensor_parallel_size: 4

  optimizer:
    name: "torch.optim.AdamW"
    kwargs:
      lr: 5.0e-7

  generation:
    backend: "vllm"
    max_new_tokens: ${policy.max_total_sequence_length}
    temperature: 1.0
    top_p: 1.0
    top_k: null
    stop_token_ids: null
    stop_strings: null
    vllm_cfg:
      precision: ${policy.precision}
      tensor_parallel_size: 1
      gpu_memory_utilization: 0.8
      max_model_len: ${policy.max_total_sequence_length}
      # For most cases, use "dummy" to load the initial weights, since they will be overwritten during refit
      # For Gemma models, we need to use "auto" due to a vllm bug
      load_format: dummy

examples/configs/grpo_math_1B.yaml

Lines changed: 12 additions & 1 deletion

@@ -50,7 +50,18 @@ policy:
     sequence_parallel: false
     activation_checkpointing: false
     tensor_parallel_size: 1
-
+
+  # dynamic_batching improves performance by ensuring logprob and training microbatches
+  # have a sufficient number of tokens to maximize GPU utilization. Specifically, variable-length
+  # responses are sorted by sequence length and bucketed into microbatches with a total
+  # token count approximately equal to 'train_mb_tokens' and 'logprob_mb_tokens' for the
+  # training and logprob stages, respectively.
+  dynamic_batching:
+    enabled: True
+    train_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.train_micro_batch_size}}
+    logprob_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.logprob_batch_size}}
+    sequence_length_round: 64
+
   # makes the training sequence length divisible by the tensor parallel size
   # this is useful for sequence parallel training
   make_sequence_length_divisible_by: ${policy.dtensor_cfg.tensor_parallel_size}
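
The `dynamic_batching` comment above describes sorting variable-length responses and packing them into token-budgeted microbatches. The following is a schematic sketch of that idea (a hypothetical helper, not NeMo RL's implementation), with `mb_tokens` playing the role of `train_mb_tokens`/`logprob_mb_tokens` and lengths rounded up as with `sequence_length_round`:

```python
import math

def bucket_by_token_budget(seq_lengths, mb_tokens, sequence_length_round=64):
    """Greedily pack sequence indices into microbatches under a padded-token budget.

    Sequences are sorted by length so each microbatch holds similarly sized
    responses; the cost of a microbatch is its (rounded) max length times its
    size, i.e. the padded token count, kept at or below mb_tokens.
    """
    order = sorted(range(len(seq_lengths)), key=lambda i: seq_lengths[i])
    microbatches, current = [], []
    for i in order:
        rounded = math.ceil(seq_lengths[i] / sequence_length_round) * sequence_length_round
        # Padded cost if we add sequence i: everything pads to the longest (the new one).
        if current and rounded * (len(current) + 1) > mb_tokens:
            microbatches.append(current)
            current = []
        current.append(i)
    if current:
        microbatches.append(current)
    return microbatches

# Example: response lengths from one GRPO step, packed under an 8192-token budget.
print(bucket_by_token_budget([120, 4096, 310, 2900, 64, 7800, 1500], mb_tokens=8192))
```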
