feat: general fsdp2 on non-MoE models + HF TP plan #352

Merged: 9 commits merged into main from yukih/fsdp2-general on May 23, 2025

Conversation

@yuki-666 (Collaborator) commented May 12, 2025

What does this PR do ?

  1. Support FSDP2 on non-MoE models.
  2. Support Hugging Face TP plan.
  3. Parallel-plan selection priority: custom-parallel-plan > opt-parallel-plan (the optimized plans we implement for certain models in FSDP2) > hf-tp-plan (HF's _tp_plan); see the sketch below.
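
A minimal sketch of this selection order, assuming hypothetical helper and registry names (the actual nemo_rl code may structure this differently):

OPTIMIZED_PARALLEL_PLANS: dict[str, dict] = {}  # hypothetical registry of per-model optimized plans

def choose_parallel_plan(model, custom_parallel_plan=None):
    # 1. A user-supplied custom plan always wins.
    if custom_parallel_plan is not None:
        return custom_parallel_plan
    # 2. Otherwise fall back to an optimized plan shipped for known model classes.
    optimized = OPTIMIZED_PARALLEL_PLANS.get(type(model).__name__)
    if optimized is not None:
        return optimized
    # 3. Finally, use the TP plan Hugging Face attaches to the model (_tp_plan), if any.
    return getattr(model, "_tp_plan", None)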

Convergence tests on LlamaForCausalLM, Qwen2ForCausalLM, Qwen3ForCausalLM, Gemma2ForCausalLM, Gemma3ForCausalLM, and Phi3ForCausalLM all run well.

Convergence Test Details

Llama-3.1-8B-Instruct (LlamaForCausalLM): FSDP2-tp8-opt_plan vs FSDP2-tp8-hf_tp_plan
(convergence plot)

Qwen2ForCausalLM / Qwen3ForCausalLM

Qwen2.5-7B-Instruct (Qwen2ForCausalLM): FSDP2-tp4-opt_plan vs FSDP2-tp4-hf_tp_plan
Qwen3-0.6B (Qwen3ForCausalLM): FSDP1 vs FSDP2-tp1
(convergence plots)

Gemma2ForCausalLM / Gemma3ForCausalLM

gemma-2-9b-it (Gemma2ForCausalLM): FSDP1 vs FSDP2-tp1 vs FSDP2-tp4-hf_tp_plan
gemma-3-1b-it (Gemma3ForCausalLM): FSDP1 vs FSDP2-tp1
(convergence plots)

Issues

Closes #156 (support FSDP2 generally).

Usage

  • Example launch command below (a hedged sketch; config key names may differ).
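
A sketch of how a run might be launched, based on the command used later in this thread. The policy.dtensor_cfg.* overrides shown here are assumptions for illustration and may not match the final config key names:

# Run the GRPO math example on a non-MoE model with FSDP2/DTensor.
# The +policy.dtensor_cfg.* keys below are illustrative assumptions.
uv run examples/run_grpo_math.py \
    policy.model_name=meta-llama/Llama-3.1-8B-Instruct \
    cluster.gpus_per_node=8 \
    logger.wandb_enabled=True \
    +policy.dtensor_cfg.enabled=True \
    +policy.dtensor_cfg.tensor_parallel_size=8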

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

@yuki-666 yuki-666 force-pushed the yukih/fsdp2-general branch 14 times, most recently from d5ad00d to 8e2e6f4 Compare May 20, 2025 09:59
@github-actions github-actions bot added the documentation Improvements or additions to documentation label May 20, 2025
@yuki-666 yuki-666 added the CI:L1 Run doctests, unit tests, and functional tests label May 20, 2025
@yuki-666 yuki-666 added the CI:docs Run doctest label May 20, 2025
@yuki-666 (Collaborator, Author) commented:

Filed a separate issue, #413, to track FSDP2 for MoE models:

  1. Qwen3-30B-A3B is noticeably slower than Qwen3-32B, especially during the refit process or when using hf-tp-plan with dtensor tp > 1.
  2. DeepseekV2ForCausalLM under FSDP2 fails with a shape mismatch in self.use_reference_model() on model.layers.0.self_attn.rotary_emb.cos_cached: v.shape=torch.Size([2048, 64]) while self.reference_model_buffers[k].shape=torch.Size([163840, 64]); see the illustrative sketch below.
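
A hypothetical illustration of the kind of buffer swap where such a mismatch would surface; the function and variable names are illustrative, not the actual nemo_rl implementation:

# Illustrative only: copying saved reference-model buffers back onto the live model.
# A cached rotary-embedding buffer captured at a different sequence length would fail here.
def restore_reference_buffers(model, reference_model_buffers):
    for k, v in model.named_buffers():
        ref = reference_model_buffers[k]
        if v.shape != ref.shape:
            raise ValueError(f"shape mismatch for {k}: {v.shape} vs {ref.shape}")
        v.copy_(ref)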

@yuki-666 yuki-666 changed the title feat: general fsdp2 feat: general fsdp2 on non-MoE models + HF TP plan May 20, 2025
@yuki-666 yuki-666 marked this pull request as ready for review May 20, 2025 13:59
@yuki-666 yuki-666 force-pushed the yukih/fsdp2-general branch from fc6cc49 to 05d8cfe Compare May 21, 2025 02:53
@yuki-666 yuki-666 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels May 21, 2025
@yuki-666 yuki-666 force-pushed the yukih/fsdp2-general branch 3 times, most recently from 08cce8c to 0dc55cc Compare May 22, 2025 06:59
@yuki-666 yuki-666 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels May 22, 2025
@yuki-666 (Collaborator, Author) commented:

(quoting @terrykong) @yuki-666 what were the parameters of your gemma2 run? I can't seem to get it to run correctly:

uv run examples/run_grpo_math.py policy.model_name=google/gemma-2-2b-it logger.wandb_enabled=True cluster.gpus_per_node=8 +policy.generation.vllm_cfg.load_format=auto

(error screenshot)

@terrykong Thanks very much for pointing this out!

I tested with almost the same script as yours before commit fdb565c. Since that commit, vLLM's load_format during training defaults to dummy, and only specific models change it through nemo_rl/models/huggingface/common.py. The policy.generation.vllm_cfg.load_format parameter was removed from the yaml and has no effect even if passed.
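
For context, a hypothetical sketch of the kind of per-model load_format override described above; the names are illustrative and do not reflect the actual contents of nemo_rl/models/huggingface/common.py:

# Illustrative only: models that cannot use vLLM's dummy-weight loading during
# training fall back to "auto"; everything else keeps the default "dummy".
MODELS_NEEDING_AUTO_LOAD_FORMAT = {"Gemma2ForCausalLM"}  # hypothetical registry

def resolve_load_format(model_class_name: str) -> str:
    if model_class_name in MODELS_NEEDING_AUTO_LOAD_FORMAT:
        return "auto"
    return "dummy"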

It is fixed now, and other models won't be affected since they don't need special handling of load_format.
(screenshot of the fixed run)

@yuki-666 (Collaborator, Author) commented:

Thanks @jgerh, I have updated based on your suggestions.

yuki-666 added 9 commits May 23, 2025 17:46
Signed-off-by: Yuki Huang <yukih@nvidia.com> (all 9 commits)
@yuki-666 yuki-666 force-pushed the yukih/fsdp2-general branch from 0dc55cc to 72a8f35 Compare May 23, 2025 09:46
@yuki-666 yuki-666 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels May 23, 2025
@terrykong (Collaborator) commented:

Thanks for the quick fix @yuki-666. Gemma2 seems to be okay now from a quick run:

(screenshot of the run)

@parthchadha parthchadha added this pull request to the merge queue May 23, 2025
Merged via the queue into main with commit 3db05c1 May 23, 2025
21 of 23 checks passed
@parthchadha parthchadha deleted the yukih/fsdp2-general branch May 23, 2025 22:10