
fix: Changes to support ray job submit #432


Draft
hemildesai wants to merge 2 commits into main from hemil/k8s-changes

Conversation

@hemildesai (Collaborator) commented May 21, 2025

What does this PR do?

  • Create a venv on each node for the worker group using a Ray task (see the sketch after this list)
  • Set CUDA_VISIBLE_DEVICES based on the bundle index
  • Always reuse external clusters
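
A minimal sketch of the per-node setup described above, assuming a hypothetical setup_venv task and a /tmp/worker_venv path (the PR's actual task lives in the repository and may differ): a STRICT_SPREAD placement group pins one lightweight Ray task to each node so the venv is created exactly once per node.

import subprocess

import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init()

@ray.remote(num_cpus=1)
def setup_venv(venv_path: str) -> str:
    # Runs once per node because the STRICT_SPREAD placement group below
    # reserves exactly one bundle per node.
    subprocess.run(["python3", "-m", "venv", venv_path], check=True)
    return venv_path

nodes = [n for n in ray.nodes() if n["Alive"]]
bundles = [{"CPU": 1} for _ in nodes]
pg = placement_group(bundles=bundles, strategy="STRICT_SPREAD")
ray.get(pg.ready())

# One setup task per bundle, i.e. per node.
ray.get([
    setup_venv.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=pg, placement_group_bundle_index=i
        )
    ).remote("/tmp/worker_venv")
    for i in range(len(bundles))
])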

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

num_nodes = len(nodes)
# Reserve one CPU on each node using a STRICT_SPREAD placement group
bundles = [{"CPU": 1} for _ in range(num_nodes)]
pg = placement_group(bundles=bundles, strategy="STRICT_SPREAD")
Collaborator

should these PGs be cleaned up to release the resources?
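
For reference, a hedged sketch of what explicit cleanup could look like with Ray's public API, assuming the pg handle from the snippet above:

from ray.util.placement_group import remove_placement_group

# Release the reserved CPUs once the per-node setup tasks have finished;
# otherwise the STRICT_SPREAD placement group holds them for the lifetime
# of the driver.
remove_placement_group(pg)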

@@ -295,6 +307,8 @@ def _create_workers_from_bundle_indices(
"MASTER_ADDR": self.master_address,
"MASTER_PORT": str(self.master_port),
"NODE_RANK": str(node_idx),
"RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES": "1",
"CUDA_VISIBLE_DEVICES": str(bundle_idx),
Collaborator

could you leave a comment as to why we need to manually manage the GPUs now?

else 0
)
# Set this to 0 to manually control placement group allotment
num_gpus = 0
@terrykong (Collaborator) commented May 21, 2025

IIUC, setting this non-zero tells Ray to set CUDA_VISIBLE_DEVICES (which you're trying to work around above), but wasn't it also used to enforce the maximum number of worker types that could be colocated in a single placement group? So it looks like we can now schedule an unlimited number of worker types (in theory). Is that correct?

Could this be an issue if someone were to ambitiously want to colocate:

  • worker_type 1: all gpus (1/4 gpu each worker)
  • worker_type 2: all gpus (1/4 gpu each worker)
  • worker_type 3: half gpus (1/2 gpu each worker)
  • worker_type 4: half gpus (1/2 gpu each worker)

So before with fractional resources, we'd fully fill out the cluster, but with num_gpus=0, could we end up with some uneven scheduling?
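
An illustrative sketch (not code from this PR) of the concern: with fractional num_gpus, Ray tracks bundle GPU usage and holds back anything that would over-subscribe it, while num_gpus=0 actors are invisible to that accounting, so any limit on colocated worker types has to be enforced by the caller.

import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init()

@ray.remote
class Worker:
    def ping(self):
        return True

# One bundle with a single GPU.
pg = placement_group([{"GPU": 1, "CPU": 1}])
ray.get(pg.ready())
strategy = PlacementGroupSchedulingStrategy(
    placement_group=pg, placement_group_bundle_index=0
)

# Fractional requests: four 0.25-GPU workers fill the bundle; a fifth would
# stay pending rather than over-subscribe the GPU.
fractional = [
    Worker.options(num_cpus=0, num_gpus=0.25, scheduling_strategy=strategy).remote()
    for _ in range(4)
]

# num_gpus=0: these actors request no GPU at all, so Ray will place as many
# of them (and as many worker types) as you ask for on the same bundle.
unbounded = [
    Worker.options(num_cpus=0, num_gpus=0, scheduling_strategy=strategy).remote()
    for _ in range(8)
]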

Collaborator Author

Ah valid point, let me try again to see if I can revert this change and still get the jobs to run in all scenarios.

Collaborator Author

For the ray job submit scenario, setting num_gpus doesn't seem to set CUDA_VISIBLE_DEVICES in k8s.

@@ -295,6 +307,8 @@ def _create_workers_from_bundle_indices(
"MASTER_ADDR": self.master_address,
"MASTER_PORT": str(self.master_port),
"NODE_RANK": str(node_idx),
"RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES": "1",
Collaborator

Collaborator Author

I don't think that's needed, as the driver has nothing to do with GPUs and can also run on non-GPU nodes.

Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: Hemil Desai <hemild@nvidia.com>
@hemildesai force-pushed the hemil/k8s-changes branch from b249973 to 8e155f9 on May 22, 2025 at 03:05
@gwarmstrong (Collaborator) left a comment

Tested with NeMo-Run updated and it works well
