fix: Changes to support ray job submit #432
Conversation
num_nodes = len(nodes)
# Reserve one CPU on each node using a STRICT_SPREAD placement group
bundles = [{"CPU": 1} for _ in range(num_nodes)]
pg = placement_group(bundles=bundles, strategy="STRICT_SPREAD")
should these PGs be cleaned up to release the resources?
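For reference, Ray does expose remove_placement_group for explicit cleanup; below is a minimal, self-contained sketch (node count and the work done inside the group are illustrative, not from this PR) of how the reservation above could be released:

# Minimal sketch, assuming the same STRICT_SPREAD reservation as above;
# the node count and workload are placeholders.
import ray
from ray.util.placement_group import placement_group, remove_placement_group

ray.init()
num_nodes = 2  # illustrative
bundles = [{"CPU": 1} for _ in range(num_nodes)]
pg = placement_group(bundles=bundles, strategy="STRICT_SPREAD")
ray.get(pg.ready())  # block until the CPUs are actually reserved
try:
    pass  # ... schedule tasks/actors against `pg` here ...
finally:
    remove_placement_group(pg)  # release the reserved CPUs back to the cluster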
@@ -295,6 +307,8 @@ def _create_workers_from_bundle_indices(
    "MASTER_ADDR": self.master_address,
    "MASTER_PORT": str(self.master_port),
    "NODE_RANK": str(node_idx),
+   "RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES": "1",
+   "CUDA_VISIBLE_DEVICES": str(bundle_idx),
could you leave a comment as to why we need to manually manage the GPUs now?
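For context, a rough sketch of the behavior this diff relies on (the actor and indices are illustrative, not from the PR): with RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1, Ray no longer rewrites CUDA_VISIBLE_DEVICES for the worker process, so the caller has to pin one GPU per worker by hand:

# Illustrative sketch: disable Ray's automatic CUDA_VISIBLE_DEVICES handling
# and pin one GPU per worker explicitly via runtime_env env vars.
import os
import ray

@ray.remote
class Worker:
    def visible_gpus(self):
        # With NOSET enabled, this reflects what the caller set,
        # not Ray's automatic per-actor GPU assignment.
        return os.environ.get("CUDA_VISIBLE_DEVICES")

ray.init()
workers = [
    Worker.options(
        runtime_env={
            "env_vars": {
                "RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES": "1",
                "CUDA_VISIBLE_DEVICES": str(bundle_idx),
            }
        }
    ).remote()
    for bundle_idx in range(2)
]
print(ray.get([w.visible_gpus.remote() for w in workers]))  # e.g. ['0', '1']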
    else 0
)
# Set this to 0 to manually control placement group allotment
num_gpus = 0
IIUC, setting this non-zero tells Ray to set CUDA_VISIBLE_DEVICES (which you're trying to work around above), but wasn't it also used to enforce the maximum number of worker types that could be colocated in a single placement group? So it looks like we can now schedule an unlimited number of worker types (in theory). Is that correct?
Could this be an issue if someone were to ambitiously want to colocate:
- worker_type 1: all gpus (1/4 gpu each worker)
- worker_type 2: all gpus (1/4 gpu each worker)
- worker_type 3: half gpus (1/2 gpu each worker)
- worker_type 4: half gpus (1/2 gpu each worker)
So before with fractional resources, we'd fully fill out the cluster, but with num_gpus=0, could we end up with some uneven scheduling?
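For illustration only (not from the PR): with fractional num_gpus Ray's scheduler enforces per-GPU capacity, whereas num_gpus=0 removes that accounting entirely, so colocation is bounded only by the placement-group bundles:

# Illustrative only: contrast Ray's GPU accounting with and without num_gpus.
import ray

@ray.remote(num_gpus=0.25)
def quarter_gpu_task():
    # At most four of these fit on one GPU; Ray enforces the limit.
    ...

@ray.remote(num_gpus=0)
def unaccounted_gpu_task():
    # Ray tracks no GPU usage here, so any number can land on one GPU
    # unless the caller constrains placement some other way.
    ...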
Ah valid point, let me try again to see if I can revert this change and still get the jobs to run in all scenarios.
For the ray job submit scenario, setting num_gpus doesn't seem to set CUDA_VISIBLE_DEVICES in k8s.
@@ -295,6 +307,8 @@ def _create_workers_from_bundle_indices(
    "MASTER_ADDR": self.master_address,
    "MASTER_PORT": str(self.master_port),
    "NODE_RANK": str(node_idx),
+   "RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES": "1",
does this need to be set for the driver too? https://discuss.ray.io/t/how-to-stop-ray-from-managing-cuda-visible-devices/8767/3
I don't think that's needed, as the driver has nothing to do with GPUs and can also run on non-GPU nodes.
Signed-off-by: Hemil Desai <hemild@nvidia.com>
Force-pushed from b249973 to 8e155f9
Tested with the updated NeMo-Run and it works well.
What does this PR do?
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use this
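A hypothetical usage sketch (the dashboard address, entrypoint script, and working dir are placeholders, not part of this PR), showing the programmatic equivalent of the ray job submit flow this change targets:

# Hypothetical example; entrypoint and address are illustrative.
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")  # Ray dashboard / jobs API
job_id = client.submit_job(
    entrypoint="python examples/train.py",
    runtime_env={"working_dir": "."},
)
print(f"Submitted job: {job_id}")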
Before your PR is "Ready for review"
Pre checks:
Additional Information