@win5923 win5923 commented Aug 18, 2025

Why are these changes needed?

  • RayJob Volcano support: Adds Volcano scheduler support for RayJob CRD.
  • Gang scheduling: Ensures Ray pods are scheduled together as a unit, preventing partial scheduling issues.
  • Unified interface: Consolidates all batch schedulers to use AddMetadataToChildResource() for consistency.
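To illustrate the gang-scheduling bullet, here is a minimal sketch of the all-or-nothing idea compared with per-pod placement. It is a simplified model against a single pool of free resources, not Volcano's algorithm or KubeRay's code; `PodRequest`, `admit_gang` and `admit_one_by_one` are hypothetical names, and the resource figures mirror the head and worker sizes used in the E2E steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PodRequest:
    name: str
    cpu_millis: int   # CPU request in millicores
    memory_mib: int   # memory request in MiB


def admit_gang(pods, free_cpu_millis, free_memory_mib):
    """All-or-nothing: return every pod if the whole gang fits, else none."""
    need_cpu = sum(p.cpu_millis for p in pods)
    need_mem = sum(p.memory_mib for p in pods)
    if need_cpu <= free_cpu_millis and need_mem <= free_memory_mib:
        return list(pods)
    return []


def admit_one_by_one(pods, free_cpu_millis, free_memory_mib):
    """Greedy per-pod placement, shown only for contrast."""
    started = []
    for p in pods:
        if p.cpu_millis <= free_cpu_millis and p.memory_mib <= free_memory_mib:
            started.append(p)
            free_cpu_millis -= p.cpu_millis
            free_memory_mib -= p.memory_mib
    return started


# Hypothetical Ray cluster: one head and two workers.
ray_cluster = [
    PodRequest("head", 1000, 2048),
    PodRequest("worker-0", 1000, 1024),
    PodRequest("worker-1", 1000, 1024),
]

# Only 1 CPU / 2Gi free: per-pod placement starts the head alone
# (a partially scheduled cluster), while the gang check starts nothing.
partial = admit_one_by_one(ray_cluster, 1000, 2048)
gang = admit_gang(ray_cluster, 1000, 2048)
print([p.name for p in partial], [p.name for p in gang])
```

With per-pod placement the head would hold resources while its workers wait, which is the partial-scheduling situation the bullet describes; the gang check avoids it by refusing to start anything until the whole group fits.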

E2E

  1. Deploy the KubeRay operator with the Volcano batch scheduler enabled:
./ray-operator/bin/manager -leader-election-namespace default -use-kubernetes-proxy -batch-scheduler=volcano
  2. Create a RayJob with a head node (1 CPU, 2Gi of RAM), two workers (1 CPU, 1Gi of RAM each), and one submitter pod (0.5 CPU, ~0.2Gi of RAM), for a total of 3.5 CPU and 4.2Gi of RAM:
kubectl apply -f ray-operator/config/samples/ray-job.volcano-scheduler-queue.yaml
  3. Create a second RayJob with the same configuration but a different name:
sed 's/rayjob-sample-0/rayjob-sample-1/' ray-operator/config/samples/ray-job.volcano-scheduler-queue.yaml | kubectl apply -f-
  4. All pods of the second RayJob stay Pending. The queue's capability is 4 CPU and 6Gi, and the first RayJob already holds 3 CPU and 4Gi of it, so the second gang cannot fit:
$ kubectl get podgroup ray-rayjob-sample-1-pg -o yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  creationTimestamp: "2025-09-11T16:32:51Z"
  generation: 2
  name: ray-rayjob-sample-1-pg
  namespace: default
  ownerReferences:
  - apiVersion: ray.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: RayJob
    name: rayjob-sample-1
    uid: 3e5d1ad3-0d35-401b-8d50-3d9f850add38
  resourceVersion: "12126"
  uid: 927dc6e4-0b8b-4d7a-8838-19496cc171b7
spec:
  minMember: 3
  minResources:
    cpu: "3"
    memory: 4Gi
  queue: kuberay-test-queue
status:
  conditions:
  - lastTransitionTime: "2025-09-11T16:32:52Z"
    message: '3/3 tasks in gang unschedulable: pod group is not ready, 3 Pending,
      3 minAvailable; Pending: 3 Unschedulable'
    reason: NotEnoughResources
    status: "True"
    transitionID: 7ae91b28-f1ed-4599-8147-f3ca7520f00d
    type: Unschedulable
  phase: Pending
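The `minMember` and `minResources` values in the PodGroup above are consistent with summing the head and the two workers from the E2E steps (3 pods, 3 CPU, 4Gi); the submitter pod's 0.5 CPU and ~0.2Gi do not appear in them. The sketch below reproduces that sum. It is an illustration only, not how KubeRay computes the PodGroup: the helper names are hypothetical, and the quantity parsers handle only the few Kubernetes suffixes used here.

```python
_MEMORY_FACTORS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def parse_cpu_millis(quantity):
    """Parse a CPU quantity such as "1", "0.5" or "500m" into millicores."""
    q = str(quantity)
    if q.endswith("m"):
        return int(q[:-1])
    return round(float(q) * 1000)


def parse_memory_bytes(quantity):
    """Parse a memory quantity such as "4Gi" or "512Mi" into bytes."""
    q = str(quantity)
    for suffix, factor in _MEMORY_FACTORS.items():
        if q.endswith(suffix):
            return round(float(q[: -len(suffix)]) * factor)
    return int(q)  # plain byte count


def gang_minimums(pod_requests):
    """Return (member count, total CPU millicores, total memory bytes)."""
    cpu = sum(parse_cpu_millis(r["cpu"]) for r in pod_requests)
    mem = sum(parse_memory_bytes(r["memory"]) for r in pod_requests)
    return len(pod_requests), cpu, mem


# Head and two workers, sized as in the E2E steps.
cluster_pods = [
    {"cpu": "1", "memory": "2Gi"},  # head
    {"cpu": "1", "memory": "1Gi"},  # worker
    {"cpu": "1", "memory": "1Gi"},  # worker
]

min_member, min_cpu, min_mem = gang_minimums(cluster_pods)
print(min_member, min_cpu, min_mem // 1024**3)
```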
$ kubectl get queue kuberay-test-queue -o yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"scheduling.volcano.sh/v1beta1","kind":"Queue","metadata":{"annotations":{},"name":"kuberay-test-queue"},"spec":{"capability":{"cpu":4,"memory":"6Gi"},"weight":1}}
  creationTimestamp: "2025-09-11T16:29:27Z"
  generation: 2
  name: kuberay-test-queue
  resourceVersion: "11903"
  uid: 43d0c1d3-fb1b-447a-a4db-3b52007d37c6
spec:
  capability:
    cpu: 4
    memory: 6Gi
  parent: root
  reclaimable: true
  weight: 1
status:
  allocated:
    cpu: "3"
    memory: 4Gi
    pods: "3"
  reservation: {}
  state: Open
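The two outputs above explain the `NotEnoughResources` condition: the queue's `capability` is 4 CPU and 6Gi, its `status.allocated` is already 3 CPU and 4Gi, and the pending PodGroup asks for another 3 CPU and 4Gi. A minimal sketch of that arithmetic follows. It models only the capability cap with plain integers (millicores and MiB); Volcano's real decision also involves its scheduler plugins and per-node fit, and `headroom` and `shortfalls` are hypothetical helper names.

```python
def headroom(capability, allocated):
    """Resources still available in the queue, per resource name."""
    return {name: capability[name] - allocated.get(name, 0) for name in capability}


def shortfalls(request, free):
    """Names of resources for which the request exceeds what is free."""
    return sorted(name for name, amount in request.items() if amount > free.get(name, 0))


# Values taken from the Queue and PodGroup shown above.
queue_capability = {"cpu": 4000, "memory": 6 * 1024}   # 4 CPU, 6Gi
queue_allocated = {"cpu": 3000, "memory": 4 * 1024}    # 3 CPU, 4Gi (first RayJob)
podgroup_min = {"cpu": 3000, "memory": 4 * 1024}       # 3 CPU, 4Gi (second RayJob)

free = headroom(queue_capability, queue_allocated)
missing = shortfalls(podgroup_min, free)
print(free, missing)
```

Only 1 CPU and 2Gi remain in the queue, so both CPU and memory fall short and the whole gang of the second RayJob stays Pending until the first one releases its resources.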

Related issue number

Closes #1580

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@win5923 win5923 changed the title from "[POC] RayJob Volcano Integration" to "RayJob Volcano Integration" Aug 18, 2025
Signed-off-by: Troy Chiu <y.troychiu@gmail.com>
@win5923 win5923 force-pushed the rayjob-volcano branch 7 times, most recently from 0b855e1 to bc3811c, September 10, 2025 17:04
Signed-off-by: win5923 <ken89@kimo.com>
@win5923 win5923 marked this pull request as ready for review September 11, 2025 16:23
Signed-off-by: win5923 <ken89@kimo.com>
Development

Successfully merging this pull request may close these issues.

[Bug] RayJob Volcano integration