
fix: Changes to support ray job submit #432


Draft
hemildesai wants to merge 2 commits into main from hemil/k8s-changes

Conversation

@hemildesai (Collaborator) commented May 21, 2025

What does this PR do?

  • Create a venv on each node for the worker group using a Ray task (see the sketch after this list)
  • Set CUDA_VISIBLE_DEVICES based on the bundle index
  • Always reuse external clusters
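
A minimal sketch of the per-node setup described above, assuming a hypothetical setup_venv task and a /tmp/worker_venv path (the PR's actual task lives in the repository and may differ): a STRICT_SPREAD placement group pins one lightweight Ray task to each node so the venv is created exactly once per node.

import subprocess

import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init()

@ray.remote(num_cpus=1)
def setup_venv(venv_path: str) -> str:
    # Runs once per node because the STRICT_SPREAD placement group below
    # reserves exactly one bundle per node.
    subprocess.run(["python3", "-m", "venv", venv_path], check=True)
    return venv_path

nodes = [n for n in ray.nodes() if n["Alive"]]
bundles = [{"CPU": 1} for _ in nodes]
pg = placement_group(bundles=bundles, strategy="STRICT_SPREAD")
ray.get(pg.ready())

# One setup task per bundle, i.e. per node.
ray.get([
    setup_venv.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=pg, placement_group_bundle_index=i
        )
    ).remote("/tmp/worker_venv")
    for i in range(len(bundles))
])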

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

num_nodes = len(nodes)
# Reserve one CPU on each node using a STRICT_SPREAD placement group
bundles = [{"CPU": 1} for _ in range(num_nodes)]
pg = placement_group(bundles=bundles, strategy="STRICT_SPREAD")
Collaborator

should these PGs be cleaned up to release the resources?
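
For reference, a hedged sketch of what explicit cleanup could look like with Ray's public API, assuming the pg handle from the snippet above:

from ray.util.placement_group import remove_placement_group

# Release the reserved CPUs once the per-node setup tasks have finished;
# otherwise the STRICT_SPREAD placement group holds them for the lifetime
# of the driver.
remove_placement_group(pg)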

@@ -295,6 +307,8 @@ def _create_workers_from_bundle_indices(
"MASTER_ADDR": self.master_address,
"MASTER_PORT": str(self.master_port),
"NODE_RANK": str(node_idx),
"RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES": "1",
"CUDA_VISIBLE_DEVICES": str(bundle_idx),
Collaborator

could you leave a comment as to why we need to manually manage the GPUs now?

else 0
)
# Set this to 0 to manually control placement group allotment
num_gpus = 0
@terrykong (Collaborator) commented May 21, 2025

IIUC, setting this non-zero tells Ray to set CUDA_VISIBLE_DEVICES (which you're trying to work around above), but wasn't it also used to enforce the maximum number of worker types that could be colocated in a single placement group? So it looks like we can now schedule an unlimited number of worker types (in theory). Is that correct?

Could this be an issue if someone were to ambitiously want to colocate:

  • worker_type 1: all gpus (1/4 gpu each worker)
  • worker_type 2: all gpus (1/4 gpu each worker)
  • worker_type 3: half gpus (1/2 gpu each worker)
  • worker_type 4: half gpus (1/2 gpu each worker)

So before with fractional resources, we'd fully fill out the cluster, but with num_gpus=0, could we end up with some uneven scheduling?
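
An illustrative sketch (not code from this PR) of the concern: with fractional num_gpus, Ray tracks bundle GPU usage and holds back anything that would over-subscribe it, while num_gpus=0 actors are invisible to that accounting, so any limit on colocated worker types has to be enforced by the caller.

import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init()

@ray.remote
class Worker:
    def ping(self):
        return True

# One bundle with a single GPU.
pg = placement_group([{"GPU": 1, "CPU": 1}])
ray.get(pg.ready())
strategy = PlacementGroupSchedulingStrategy(
    placement_group=pg, placement_group_bundle_index=0
)

# Fractional requests: four 0.25-GPU workers fill the bundle; a fifth would
# stay pending rather than over-subscribe the GPU.
fractional = [
    Worker.options(num_cpus=0, num_gpus=0.25, scheduling_strategy=strategy).remote()
    for _ in range(4)
]

# num_gpus=0: these actors request no GPU at all, so Ray will place as many
# of them (and as many worker types) as you ask for on the same bundle.
unbounded = [
    Worker.options(num_cpus=0, num_gpus=0, scheduling_strategy=strategy).remote()
    for _ in range(8)
]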

Collaborator Author

Ah valid point, let me try again to see if I can revert this change and still get the jobs to run in all scenarios.

Collaborator Author

For the ray job submit scenario, setting num_gpus doesn't seem to set CUDA_VISIBLE_DEVICES in k8s.

@@ -295,6 +307,8 @@ def _create_workers_from_bundle_indices(
"MASTER_ADDR": self.master_address,
"MASTER_PORT": str(self.master_port),
"NODE_RANK": str(node_idx),
"RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES": "1",
Collaborator

Collaborator Author

I don't think that's needed, as the driver has nothing to do with GPUs and can also run on non-GPU nodes.

Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: Hemil Desai <hemild@nvidia.com>
@hemildesai force-pushed the hemil/k8s-changes branch from b249973 to 8e155f9 on May 22, 2025 at 03:05
@gwarmstrong (Collaborator) left a comment

Tested with NeMo-Run updated and it works well
