
Commit 947e33a

Merge branch 'main' of github.com:NVIDIA/nemo-rl into ashors/cleanup
2 parents 7e53c72 + 7d8ce74 commit 947e33a

File tree: 54 files changed, +1096 -59 lines


README.md

Lines changed: 5 additions & 2 deletions

@@ -33,16 +33,19 @@ What you can expect:
 - **Flexibility** with a modular design that allows easy integration and customization.
 - **Comprehensive documentation** that is both detailed and user-friendly, with practical examples.
 
+## 📣 News
+* [5/14/2025] [Reproduce DeepscaleR with NeMo RL!](docs/guides/grpo-deepscaler.md)
+
 ## Features
 
 _Available now_ | 🔜 _Coming in v0.3_
 
 - **Fast Generation** - vLLM backend for optimized inference.
 - **HuggingFace Integration** - Works with 1-32B models (Qwen2.5, Llama).
-- **Distributed Training** - FSDP support and Ray-based infrastructure.
+- **Distributed Training** - Fully Sharded Data Parallel (FSDP) support and Ray-based infrastructure.
 - **Environment Support** - Support for multi-environment training.
 - **Learning Algorithms** - GRPO (Group Relative Policy Optimization), SFT (Supervised Fine-Tuning), and DPO (Direct Preference Optimization).
-- **Multi-Turn RL** - multi-turn generation and training for RL with tool use, games, etc.
+- **Multi-Turn RL** - Multi-turn generation and training for RL with tool use, games, etc.
 - **Large Model Support** - Native PyTorch support for models up to 32B parameters.
 - **Advanced Parallelism** - PyTorch native FSDP2, TP, and SP for efficient training.
 - **Worker Isolation** - Process isolation between RL Actors (no worries about global state).
(Binary image assets added: 205 KB, 377 KB, and 34.5 KB.)

docs/guides/grpo-deepscaler.md

Lines changed: 46 additions & 0 deletions

# GRPO on DeepScaler

This guide explains how to use NeMo RL to train long Chain of Thought (CoT) reasoning models with Group Relative Policy Optimization (GRPO). To do so, we train [DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) on the [DeepScaleR](https://huggingface.co/datasets/agentica-org/DeepScaleR-Preview-Dataset) dataset. We then show how to use NeMo RL's evaluation scripts to evaluate the trained model on the [AIME24](https://huggingface.co/datasets/HuggingFaceH4/aime_2024) benchmark.

## Train the Model

We follow the DeepScaleR recipe and train the model in three stages: with an 8K context window in the first stage, a 16K context window in the second, and a 24K context window in the third.

To train the first stage using NeMo RL, use the `examples/configs/grpo-deepscaler-1.5b-8K.yaml` config file. This file closely matches the experiment settings in the original DeepScaleR recipe. We then train with `examples/configs/grpo-deepscaler-1.5b-16K.yaml` and `examples/configs/grpo-deepscaler-1.5b-24K.yaml` for the second and third stages, respectively.

```sh
uv run examples/run_grpo_math.py --config=examples/configs/grpo-deepscaler-1.5b-8K.yaml
uv run examples/run_grpo_math.py --config=examples/configs/grpo-deepscaler-1.5b-16K.yaml policy.model_name=/path/to/8K/checkpoint/hf
uv run examples/run_grpo_math.py --config=examples/configs/grpo-deepscaler-1.5b-24K.yaml policy.model_name=/path/to/16K/checkpoint/hf
```

At the end of each stage, you need a Hugging Face checkpoint to continue training from. To get it, convert the PyTorch distributed checkpoint to Hugging Face format with the following command:

```sh
uv run examples/convert_dcp_to_hf.py --config=results/grpo-deepscaler-1.5b-8K/step_240/config.yaml --dcp-ckpt-path=results/grpo-deepscaler-1.5b-8K/step_240/policy/weights --hf-ckpt-path=results/grpo-deepscaler-1.5b-8K/step_240/hf
```

When running the next stage's command, we pass this Hugging Face checkpoint as the initial checkpoint. We train with an 8K context window for 240 steps, a 16K context window for 290 steps, and a 24K context window for 50 steps. We run all experiments on a single 8xH100 80GB node or a single 8xA100 80GB node.

## Training Curve

When using the above commands, we get the following training curve:

![Training Performance](../assets/deepscaler_training_progress.png)

Notably, we achieve an average training reward of 0.65 in just 400 training steps.

## Evaluate the Model

Throughout training, model checkpoints are saved to the `results` folder (specified by `checkpointing.checkpoint_dir`). To evaluate the model, first convert the PyTorch distributed checkpoint to Hugging Face format as before. Then, to evaluate on the [AIME24 benchmark](https://huggingface.co/datasets/HuggingFaceH4/aime_2024), use the following command:

```sh
uv run examples/run_eval.py \
    generation.model_name=results/grpo-deepscaler-1.5b-8K/step_240/hf
```

Use `generation.model_name` to specify the path to the Hugging Face checkpoint. In addition, we use AIME24 as the validation dataset and calculate pass@1 on it throughout training.

## Evaluation Results

Using the above instructions to train DeepSeek-R1-Distill-Qwen-1.5B on the DeepScaleR dataset, we can track the model's performance on the AIME24 benchmark throughout training. The following plot shows the evaluation metrics as training progresses:

![AIME24 Performance](../assets/aime_training_progress.png)

We surpass OpenAI O1's performance on the AIME24 benchmark after about 600 training steps.
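
As background on the GRPO objective used in this recipe: the accompanying config (`examples/configs/grpo-deepscaler-1.5b-8K.yaml`, shown later in this commit) enables `normalize_rewards` and `use_leave_one_out_baseline`. The following NumPy snippet is a minimal sketch of how a group-relative, leave-one-out advantage can be computed for one prompt's sampled responses; it illustrates the general technique and is not NeMo RL's actual implementation.

```python
import numpy as np

def leave_one_out_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages for one prompt's G sampled responses.

    Each response is baselined against the mean reward of the other G-1
    responses (leave-one-out), then normalized by the group's reward std.
    """
    G = rewards.shape[0]
    loo_mean = (rewards.sum() - rewards) / (G - 1)  # baseline excludes the sample itself
    advantages = rewards - loo_mean
    return advantages / (rewards.std() + eps)       # analogous to normalize_rewards: true

# Example: 8 generations for one math prompt, reward 1.0 if the answer is correct else 0.0.
rewards = np.array([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])
print(leave_one_out_advantages(rewards))
```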

docs/guides/sft-openmathinstruct2.md

Lines changed: 93 additions & 0 deletions

# SFT on OpenMathInstruct-2

This guide explains how to use NeMo RL to run SFT on the [nvidia/OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2) math instruction tuning dataset. We then show how to use NeMo RL's evaluation scripts to evaluate the trained model on the [MATH-500 benchmark](https://huggingface.co/datasets/HuggingFaceH4/MATH-500).

## Train the Model

To train the model using NeMo RL, use the `examples/configs/recipes/tutorials/sft/sft_openmathinstruct2.yaml` config file. This file closely matches the experiment settings in the [original OpenMathInstruct-2 paper](https://arxiv.org/abs/2410.01560).

```
uv run examples/run_sft.py --config=examples/configs/sft_openmathinstruct2.yaml
```

### Dataset Splits

The OpenMathInstruct-2 dataset comes in several versions of different sizes. Configure which version to use via the `data.split` config:

* `train`: the full 14M problem–solution pairs
* `train_1M`, `train_2M`, `train_5M`: fair-downsampled subsets of 1M, 2M, or 5M examples

By default, the config uses the 1M subset (`data.split=train_1M`).
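
If you want to sanity-check these splits outside of NeMo RL's own data pipeline, a minimal sketch using the Hugging Face `datasets` library (assumed to be installed separately) looks like this:

```python
from datasets import load_dataset

# Load the same fair-downsampled subset the config selects by default (data.split=train_1M).
# Swap in "train", "train_2M", or "train_5M" to inspect the other versions.
ds = load_dataset("nvidia/OpenMathInstruct-2", split="train_1M")

print(len(ds))          # number of problem-solution pairs in this subset
print(ds.column_names)  # field names available for prompt/response construction
```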

### Training Time

The default config uses 8 GPUs (`cluster.gpus_per_node`) on 1 node (`cluster.num_nodes`), which should complete 1 epoch of training on the `train_1M` dataset (1855 steps) in around 20 hours. Additional nodes can be used to speed up training; in our experiments, using 8 nodes completed 1 epoch on `train_1M` in less than 4 hours.

## Evaluate the Model

Throughout training, model checkpoints are saved to the `results/sft_openmathinstruct2` folder (specified by `checkpointing.checkpoint_dir`). To evaluate the model, we first need to convert the PyTorch distributed checkpoint to Hugging Face format:

```
uv run examples/convert_dcp_to_hf.py \
    --config=results/sft_openmathinstruct2/step_1855/config.yaml \
    --dcp-ckpt-path=results/sft_openmathinstruct2/step_1855/policy/weights \
    --hf-ckpt-path=results/sft_openmathinstruct2/step_1855/hf
```

Replace `results/sft_openmathinstruct2/step_1855` with the path to the checkpoint you are evaluating. The resulting Hugging Face checkpoint will be saved to `--hf-ckpt-path`.

To evaluate on the [MATH-500 benchmark](https://huggingface.co/datasets/HuggingFaceH4/MATH-500), use the following command:

```
uv run examples/run_eval.py \
    --config=examples/configs/eval.yaml \
    generation.model_name=results/sft_openmathinstruct2/step_1855/hf \
    tokenizer.name=meta-llama/Llama-3.1-8B-Instruct \
    data.dataset_name=HuggingFaceH4/MATH-500 \
    data.dataset_key=test
```

Use `generation.model_name` to specify the path to the Hugging Face checkpoint.

## Results

In this section, we present the results of several reference experiments for the `train_1M` and `train` versions of the dataset.

### train_1M

Using the above instructions to train a Llama-3.1-8B model for 1 epoch on the `train_1M` version of the OpenMathInstruct-2 dataset, we get the following loss curve:

![image](../assets/sft-openmathinstruct2-train1M-loss.png)

Evaluating the final checkpoint on MATH-500, we get the following result (with `num_tests_per_prompt=1`, pass@1 is simply the fraction of problems answered correctly, here 251/500 = 0.502):

```
============================================================
model_name='hf' dataset_name='MATH-500'
max_new_tokens=2048 temperature=0.0 top_p=1.0 top_k=-1

metric='pass@1' num_tests_per_prompt=1

score=0.5020 (251.0/500)
============================================================
```

As a reference, using NeMo-Aligner and NeMo-Skills (as is done in the [original OpenMathInstruct-2 paper](https://arxiv.org/abs/2410.01560)) to train and evaluate the same model on the same dataset achieves the same score of 0.5020 on MATH-500.

### train

We also trained a Llama-3.1-8B model for 10,000 steps on the full `train` version of the OpenMathInstruct-2 dataset, obtaining the following loss curve:

![image](../assets/sft-openmathinstruct2-train-loss.png)

Evaluating the checkpoint after 10,000 steps of training on MATH-500, we get the following result:

```
============================================================
model_name='hf' dataset_name='MATH-500'
max_new_tokens=2048 temperature=0.0 top_p=1.0 top_k=-1

metric='pass@1' num_tests_per_prompt=1

score=0.5800 (290.0/500)
============================================================
```

Using NeMo-Aligner and NeMo-Skills to train the model in the same settings achieves a score of 0.5740 (287/500).

docs/index.md

Lines changed: 8 additions & 0 deletions

@@ -11,6 +11,13 @@ cluster.md
 ```
 
+```{toctree}
+:caption: 🚀 E2E Examples
+:hidden:
+
+guides/sft-openmathinstruct2.md
+```
+
 ```{toctree}
 :caption: 📚 Guides
 :hidden:
@@ -19,6 +26,7 @@ adding-new-models.md
 guides/sft.md
 guides/dpo.md
 guides/grpo.md
+guides/grpo-deepscaler.md
 guides/eval.md
 model-quirks.md
 ```

examples/configs/dpo.yaml

Lines changed: 3 additions & 0 deletions

@@ -52,6 +52,9 @@ policy:
     activation_checkpointing: false
     tensor_parallel_size: 1
 
+  dynamic_batching:
+    enabled: false
+
   # makes the training sequence length divisible by the tensor parallel size
   # this is useful for sequence parallel training
   make_sequence_length_divisible_by: ${policy.dtensor_cfg.tensor_parallel_size}

examples/configs/grpo-deepscaler-1.5b-16K.yaml

Lines changed: 10 additions & 0 deletions

# GRPO Algorithm Configuration
defaults: "grpo-deepscaler-1.5b-8K.yaml"

loss_fn:
  reference_policy_kl_penalty: 0.001
  ratio_clip_max: 0.28

policy:
  max_total_sequence_length: 16384
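
This stage raises only the upper clip ratio (`ratio_clip_max: 0.28`), while `ratio_clip_min` presumably stays at the 0.2 inherited from the 8K config. As a rough sketch of what an asymmetric clip range means in a PPO/GRPO-style surrogate loss, under the standard clipped-objective formulation (not NeMo RL's actual loss code):

```python
import torch

def clipped_policy_loss(logprobs, old_logprobs, advantages,
                        ratio_clip_min=0.2, ratio_clip_max=0.28):
    """PPO/GRPO-style surrogate with an asymmetric clip range.

    The importance ratio is clamped to [1 - ratio_clip_min, 1 + ratio_clip_max],
    so positive-advantage tokens get a little more room to increase in probability.
    """
    ratio = torch.exp(logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1.0 - ratio_clip_min, 1.0 + ratio_clip_max)
    # Token-level loss: take the more pessimistic of the clipped/unclipped objectives.
    return -torch.minimum(ratio * advantages, clipped * advantages).mean()

# Example with two tokens: one favored by a positive advantage, one penalized.
logprobs = torch.tensor([-0.5, -1.0])
old_logprobs = torch.tensor([-0.8, -0.9])
advantages = torch.tensor([1.0, -1.0])
print(clipped_policy_loss(logprobs, old_logprobs, advantages))
```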

examples/configs/grpo-deepscaler-1.5b-8K.yaml

Lines changed: 126 additions & 0 deletions

# GRPO Algorithm Configuration
grpo:
  num_prompts_per_step: 128
  num_generations_per_prompt: 8
  max_rollout_turns: 1 # for multi-turn rollouts. Math Environments just have 1 turn (answering the question)
  max_num_steps: 1000000
  normalize_rewards: true
  use_leave_one_out_baseline: true
  val_period: 10
  val_at_start: false
  max_val_samples: 480
  val_batch_size: 32

loss_fn:
  reference_policy_kl_penalty: 0.0
  ratio_clip_min: 0.2
  ratio_clip_max: 0.2
  ratio_clip_c: null
  # (default off) loss formulation improvements (docs/guides/grpo.md#loss)
  use_on_policy_kl_approximation: false
  use_importance_sampling_correction: false
  token_level_loss: true

checkpointing:
  enabled: true
  checkpoint_dir: "results/grpo"
  metric_name: "val_reward"
  higher_is_better: true
  keep_top_k: 10
  save_period: 10

policy:
  # Qwen/Qwen2.5-1.5B has tied weights which are only supported with dtensor policy with tp size 1 (https://github.com/NVIDIA/NeMo-RL/issues/227)
  model_name: "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
  tokenizer:
    name: ${policy.model_name} ## specify if you'd like to use a tokenizer different from the model's default
  train_global_batch_size: 64
  train_micro_batch_size: 1
  generation_batch_size: 32 # Only used when generating using HF backend
  logprob_batch_size: 4
  max_total_sequence_length: 8192
  precision: "bfloat16"
  fsdp_offload_enabled: false
  activation_checkpointing_enabled: false
  refit_buffer_size_gb: 4 # used for refitting inference engine, the unit is GB

  dtensor_cfg:
    enabled: true
    cpu_offload: False
    sequence_parallel: false
    activation_checkpointing: false
    tensor_parallel_size: 1

  # makes the training sequence length divisible by the tensor parallel size
  # this is useful for sequence parallel training
  make_sequence_length_divisible_by: ${policy.dtensor_cfg.tensor_parallel_size}
  max_grad_norm: 1.0

  optimizer:
    name: "torch.optim.AdamW"
    kwargs:
      lr: 2.0e-6
      weight_decay: 0.01
      betas: [0.9, 0.999]
      eps: 1e-8
      # when using Dtensor, we need to set foreach
      # and fused to False
      foreach: False
      fused: False

  scheduler:
    - name: "torch.optim.lr_scheduler.LinearLR"
      kwargs:
        start_factor: 0.1
        end_factor: 1.0
        total_iters: 50
    - name: "torch.optim.lr_scheduler.ConstantLR"
      kwargs:
        factor: 1.0
        total_iters: 10000000000
    - milestones: [50]

  generation:
    backend: "vllm"
    max_new_tokens: ${policy.max_total_sequence_length}
    temperature: 1.0
    top_p: 1.0
    top_k: null
    stop_token_ids: null
    stop_strings: null
    vllm_cfg:
      precision: ${policy.precision}
      tensor_parallel_size: 1
      gpu_memory_utilization: 0.6
      max_model_len: ${policy.max_total_sequence_length}
      # For most cases, use "dummy" to load the initial weights, since they will be overwritten during refit
      # For Gemma models, we need to use "auto" due to a vllm bug
      load_format: dummy

data:
  max_input_seq_length: ${policy.max_total_sequence_length} # upper bound, real truncation occurs at vllm.max_model_len
  prompt_file: "examples/prompts/cot.txt"
  system_prompt_file: null
  dataset_name: "DeepScaler"

env:
  math:
    num_workers: 16

logger:
  log_dir: "logs" # Base directory for all logs
  num_val_samples_to_print: 0 # Number of validation samples to pretty print on terminal
  wandb_enabled: false
  tensorboard_enabled: false
  monitor_gpus: false # If true, will monitor GPU usage and log to wandb and/or tensorboard
  wandb:
    project: "grpo-dev"
    name: "grpo-dev-logger"
  tensorboard: {}
  gpu_monitoring:
    collection_interval: 10 # How often to collect GPU usage metrics (in seconds)
    flush_interval: 10 # How often to flush GPU usage metrics to the loggers (in seconds)

cluster:
  gpus_per_node: 8
  num_nodes: 1
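
The `scheduler` entry above pairs a `LinearLR` warmup with a `ConstantLR` and a `milestones: [50]` entry. A plausible plain-PyTorch reading of that configuration is a `SequentialLR` that warms the learning rate up over the first 50 steps and then holds it constant; the sketch below illustrates that schedule and is not the code path NeMo RL itself runs.

```python
import torch
from torch.optim.lr_scheduler import ConstantLR, LinearLR, SequentialLR

# A toy parameter stands in for the policy model; optimizer kwargs mirror the config above.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.AdamW(params, lr=2.0e-6, weight_decay=0.01,
                              betas=(0.9, 0.999), eps=1e-8, foreach=False, fused=False)

warmup = LinearLR(optimizer, start_factor=0.1, end_factor=1.0, total_iters=50)
hold = ConstantLR(optimizer, factor=1.0, total_iters=10_000_000_000)
# milestones=[50]: hand off from the warmup scheduler to the constant one at step 50.
scheduler = SequentialLR(optimizer, schedulers=[warmup, hold], milestones=[50])

for _ in range(60):
    optimizer.step()
    scheduler.step()
print(scheduler.get_last_lr())  # ~2.0e-6 once the 50-step warmup has finished
```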

examples/configs/grpo-deepscaler-1.5b-24K.yaml

Lines changed: 39 additions & 0 deletions

# GRPO Algorithm Configuration
defaults: "grpo-deepscaler-1.5b-8K.yaml"

loss_fn:
  reference_policy_kl_penalty: 0.0001
  ratio_clip_min: 0.2
  ratio_clip_max: 0.28

policy:
  max_total_sequence_length: 24576

  dtensor_cfg:
    enabled: true
    cpu_offload: true
    sequence_parallel: true
    activation_checkpointing: true
    tensor_parallel_size: 4

  optimizer:
    name: "torch.optim.AdamW"
    kwargs:
      lr: 5.0e-7

  generation:
    backend: "vllm"
    max_new_tokens: ${policy.max_total_sequence_length}
    temperature: 1.0
    top_p: 1.0
    top_k: null
    stop_token_ids: null
    stop_strings: null
    vllm_cfg:
      precision: ${policy.precision}
      tensor_parallel_size: 1
      gpu_memory_utilization: 0.8
      max_model_len: ${policy.max_total_sequence_length}
      # For most cases, use "dummy" to load the initial weights, since they will be overwritten during refit
      # For Gemma models, we need to use "auto" due to a vllm bug
      load_format: dummy

examples/configs/grpo_math_1B.yaml

Lines changed: 12 additions & 1 deletion

@@ -50,7 +50,18 @@ policy:
     sequence_parallel: false
     activation_checkpointing: false
     tensor_parallel_size: 1
-
+
+  # dynamic_batching improves performance by ensuring logprob and training microbatches
+  # have a sufficient number of tokens to maximize GPU utilization. Specifically, variable-length
+  # responses are sorted by sequence length and bucketed into microbatches with a total
+  # token count approximately equal to 'train_mb_tokens' and 'logprob_mb_tokens' for the
+  # training and logprob stages, respectively.
+  dynamic_batching:
+    enabled: True
+    train_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.train_micro_batch_size}}
+    logprob_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.logprob_batch_size}}
+    sequence_length_round: 64
+
   # makes the training sequence length divisible by the tensor parallel size
   # this is useful for sequence parallel training
   make_sequence_length_divisible_by: ${policy.dtensor_cfg.tensor_parallel_size}
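
The `dynamic_batching` comment above describes sorting variable-length responses and packing them into token-budgeted microbatches. The following is a schematic sketch of that idea (a hypothetical helper, not NeMo RL's implementation), with `mb_tokens` playing the role of `train_mb_tokens`/`logprob_mb_tokens` and lengths rounded up as with `sequence_length_round`:

```python
import math

def bucket_by_token_budget(seq_lengths, mb_tokens, sequence_length_round=64):
    """Greedily pack sequence indices into microbatches under a padded-token budget.

    Sequences are sorted by length so each microbatch holds similarly sized
    responses; the cost of a microbatch is its (rounded) max length times its
    size, i.e. the padded token count, kept at or below mb_tokens.
    """
    order = sorted(range(len(seq_lengths)), key=lambda i: seq_lengths[i])
    microbatches, current = [], []
    for i in order:
        rounded = math.ceil(seq_lengths[i] / sequence_length_round) * sequence_length_round
        # Padded cost if we add sequence i: everything pads to the longest (the new one).
        if current and rounded * (len(current) + 1) > mb_tokens:
            microbatches.append(current)
            current = []
        current.append(i)
    if current:
        microbatches.append(current)
    return microbatches

# Example: response lengths from one GRPO step, packed under an 8192-token budget.
print(bucket_by_token_budget([120, 4096, 310, 2900, 64, 7800, 1500], mb_tokens=8192))
```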
