Skip to content

Conversation

machichima
Copy link
Collaborator

@machichima machichima commented Aug 23, 2025

  • Set RayJob status to Failed if validation error
  • Not requeue if validation error

Why are these changes needed?

When trying to create rayjob with name that's too long, we get the validation error. However, the RayJob just keep retrying and does not fail, which cause RayJob to stuck without any status update

Result after my modification, RayJob will directly fail when validation error:

image

Related issue number

Closes #3980

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: machichima <nary12321@gmail.com>
@machichima machichima marked this pull request as ready for review August 24, 2025 12:58
Signed-off-by: machichima <nary12321@gmail.com>
Comment on lines 136 to 167
// Please do NOT modify `originalRayJobInstance` in the following code.
originalRayJobInstance := rayJobInstance.DeepCopy()

if err := utils.ValidateRayJobSpec(rayJobInstance); err != nil {
logger.Error(err, "The RayJob spec is invalid")
r.Recorder.Eventf(rayJobInstance, corev1.EventTypeWarning, string(utils.InvalidRayJobSpec),
"The RayJob spec is invalid %s/%s: %v", rayJobInstance.Namespace, rayJobInstance.Name, err)
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
}
// Perform all validations and directly fail the RayJob if any of the validation fails
validationErrors := []struct {
err error
errType utils.K8sEventType
message string
}{
{utils.ValidateRayJobMetadata(rayJobInstance.ObjectMeta), utils.InvalidRayJobMetadata, "The RayJob metadata is invalid"},
{utils.ValidateRayJobSpec(rayJobInstance), utils.InvalidRayJobSpec, "The RayJob spec is invalid"},
{utils.ValidateRayJobStatus(rayJobInstance), utils.InvalidRayJobStatus, "The RayJob status is invalid"},
}

for _, validation := range validationErrors {
if validation.err != nil {
logger.Error(validation.err, validation.message)
r.Recorder.Eventf(rayJobInstance, corev1.EventTypeWarning, string(validation.errType),
"%s %s/%s: %v", validation.message, rayJobInstance.Namespace, rayJobInstance.Name, validation.err)

rayJobInstance.Status.JobStatus = rayv1.JobStatusFailed
rayJobInstance.Status.JobDeploymentStatus = rayv1.JobDeploymentStatusFailed
rayJobInstance.Status.Reason = rayv1.ValidationFailed
rayJobInstance.Status.Message = fmt.Sprintf("%s: %v", validation.message, validation.err)

if err = r.updateRayJobStatus(ctx, originalRayJobInstance, rayJobInstance); err != nil {
logger.Info("Failed to update RayJob status", "error", err)
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
}

if err := utils.ValidateRayJobStatus(rayJobInstance); err != nil {
logger.Error(err, "The RayJob status is invalid")
r.Recorder.Eventf(rayJobInstance, corev1.EventTypeWarning, string(utils.InvalidRayJobStatus),
"The RayJob status is invalid %s/%s: %v", rayJobInstance.Namespace, rayJobInstance.Name, err)
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
return ctrl.Result{}, nil
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking that should we wrap this whole part to a function?
maybe the name can be checkRayJobSpecAndUpdateStatusIfNeeded?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SG! Updated in 3685d4d

Comment on lines 140 to 141
validationErrors := []struct {
err error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe validationRule is more accurate than validationErrors?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified in 771f198, thanks!

Comment on lines 144 to 148
}{
{utils.ValidateRayJobMetadata(rayJobInstance.ObjectMeta), utils.InvalidRayJobMetadata, "The RayJob metadata is invalid"},
{utils.ValidateRayJobSpec(rayJobInstance), utils.InvalidRayJobSpec, "The RayJob spec is invalid"},
{utils.ValidateRayJobStatus(rayJobInstance), utils.InvalidRayJobStatus, "The RayJob status is invalid"},
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This slice might cause some unnecessary execution. If the former validation return an error, the latter still gets executed but it only update the status according the first error.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 5066d66 to prevent executing the next validation if the previous one failed, thank you!

},
},
},
shouldContain: "The RayJob metadata is invalid",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be difficult to include the message of error in shouldContain? That could be more specific that it did cause a certain error.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem! Added in bd6c07d

Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: machichima <nary12321@gmail.com>
@machichima machichima force-pushed the 3980-rayjob-metadata-invalid branch from 3685d4d to 53cb27d Compare August 26, 2025 13:29
Signed-off-by: machichima <nary12321@gmail.com>
return false, ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
}

return false, ctrl.Result{}, nil
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason not to return the error?

Copy link
Collaborator Author

@machichima machichima Aug 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry that's a mistake, updated here: 4de789e

rayJobInstance.Status.Reason = rayv1.ValidationFailed
rayJobInstance.Status.Message = fmt.Sprintf("%s: %v", validation.message, err)

if err = r.updateRayJobStatus(ctx, originalRayJobInstance, rayJobInstance); err != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we update rayJob with Kubernetes here?
We should only update rayJob here.

// This is the only place where we update the RayJob status. Please do NOT add any code
// between `checkBackoffLimitAndUpdateStatusIfNeeded` and the following code.
if err = r.updateRayJobStatus(ctx, originalRayJobInstance, rayJobInstance); err != nil {
logger.Info("Failed to update RayJob status", "error", err)
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried not to update the ray job status here. However, it will then directly goes into case rayv1.JobDeploymentStatusComplete, rayv1.JobDeploymentStatusFailed and return here:

, which will not update the status

I think it's safe to update the status here, as I believe the code you are pointing to is trying to prevent us updating the rayjob status within the switch rayJobInstance.Status.JobDeploymentStatus block. If the rayjob status is updated, here, it will directly return and won't affect anything

// Please do NOT modify `originalRayJobInstance` in the following code.
originalRayJobInstance := rayJobInstance.DeepCopy()

// Perform all validations and directly fail the RayJob if any of the validation fails
if passed, result, err := r.validateRayJobAndUpdateStatusIfNeeded(ctx, rayJobInstance, originalRayJobInstance); !passed || err != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use "or " here?
In what scenario would ‎passed be ‎true and ‎err be ‎nil?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a mistake I made on not returning the error here return false, ctrl.Result{}, nil. Fixed in 4de789e

Signed-off-by: machichima <nary12321@gmail.com>
@machichima machichima force-pushed the 3980-rayjob-metadata-invalid branch from f5b5576 to 4de789e Compare August 27, 2025 01:07
Comment on lines 475 to 478
if updateErr := r.updateRayJobStatus(ctx, originalRayJobInstance, rayJobInstance); updateErr != nil {
logger.Info("Failed to update RayJob status", "error", updateErr)
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, updateErr
}
Copy link
Collaborator

@rueian rueian Aug 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this out? Ideally, we should have only one updateRayJobStatus usage, the one at the bottom of the reconciliation loop, and we may need to distinguish whether the result is a validation error or a status update error. For validation errors, I think we'd better even stop requeueing reconciliations.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this one, this will keep the RayJob reconciler's logic clean.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! I added a new status JobDeploymentStatusValidationFailed and ensure there's only one updateRayJobStatus in the very end. Also, if we got validation error, it will not be requeue.

commit: 3002c05


rayJobInstance.Status.JobStatus = rayv1.JobStatusFailed
rayJobInstance.Status.JobDeploymentStatus = rayv1.JobDeploymentStatusFailed
rayJobInstance.Status.Reason = rayv1.ValidationFailed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider emitting a Condition (type=Ready/Valid) alongside Status.Reason=ValidationFailed so kubectl describe surfaces a machine-readable state.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a great idea and maybe more closer to Kubernetes's best practice, however, I think it will be too much change, we should wait for more users give us feedback if this really bother them.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Although not exactly your suggestion, I added a new status ValidationFailed in 3002c05

We can see the ValidationFailed as the DEPLOYMENT STATUS when we do ``kubectl get`

image

Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: machichima <nary12321@gmail.com>
@machichima machichima force-pushed the 3980-rayjob-metadata-invalid branch from a220ac1 to 83d4e2c Compare August 28, 2025 12:57
Comment on lines 449 to 452
// Do not requeue if validation error occurs
if rayJobInstance.Status.JobDeploymentStatus == rayv1.JobDeploymentStatusValidationFailed {
return ctrl.Result{}, validateErr
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be really weird to add this statement here, we should consider other implementation in my opinion, this makes the code harder to read.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be more specific, I want to keep the logic of reconciliation clean

@@ -462,9 +446,42 @@ func (r *RayJobReconciler) Reconcile(ctx context.Context, request ctrl.Request)
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
}
emitRayJobMetrics(r.options.RayJobMetricsManager, rayJobInstance.Name, rayJobInstance.Namespace, originalRayJobInstance.Status, rayJobInstance.Status)
// Do not requeue if validation error occurs
if rayJobInstance.Status.JobDeploymentStatus == rayv1.JobDeploymentStatusValidationFailed {
return ctrl.Result{}, validateErr
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return ctrl.Result{}, validateErr
return ctrl.Result{}, nil

If I remember correctly, returning an error will requeue reconciliation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for pointing this out! Fixed in e98978b

Comment on lines 434 to 435
case rayv1.JobDeploymentStatusValidationFailed:
// Fall through to status update
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer change the deployment status to failed here, so r.updateRayJobStatus(ctx, originalRayJobInstance, rayJobInstance); can help use update the rayJobInstace (CR).
and the next reconciliation, we can remove the CR in the queue.

Suggested change
case rayv1.JobDeploymentStatusValidationFailed:
// Fall through to status update
case rayv1.JobDeploymentStatusValidationFailed:
rayJobInstance.Status.JobDeploymentStatus = rayv1.JobDeploymentStatusFailed

Copy link
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The validation error will appear in the CRD events, right? If so, I don’t think we need to update the status.

[update] On second thought, we still need to update the CR status so that platform teams can program against it more easily.

// Please do NOT modify `originalRayJobInstance` in the following code.
originalRayJobInstance := rayJobInstance.DeepCopy()

// Perform all validations and directly fail the RayJob if any of the validation fails
validateErr := r.validateRayJobAndUpdateStatusIfNeeded(ctx, rayJobInstance)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about calling updateRayJobStatus here directly so that we don't need to go into the switch statement? This makes the implementation simpler.

The comment "This is the only place where we update the RayJob status." is a good suggestion, as it helps avoid calling Status().Update() inside the switch statement, where complexity typically increases significantly. However, since validation is not in the switch statement, updating the status immediately after validation is acceptable and actually reduces complexity.

// This is the only place where we update the RayJob status. Please do NOT add any code
// between `checkBackoffLimitAndUpdateStatusIfNeeded` and the following code.
if err = r.updateRayJobStatus(ctx, originalRayJobInstance, rayJobInstance); err != nil {

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in e98978b, thank you!

@@ -466,6 +456,34 @@ func (r *RayJobReconciler) Reconcile(ctx context.Context, request ctrl.Request)
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, nil
}

func (r *RayJobReconciler) validateRayJobAndUpdateStatusIfNeeded(ctx context.Context, rayJobInstance *rayv1.RayJob) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Can you move the logic for updating rayJobInstance.Status outside of this function? This function is only responsible to return an error if any.
  2. Rename the function from validateRayJobAndUpdateStatusIfNeeded to validateRayJob.
  3. This no longer needs to be a member function of RayJobReconciler.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 70c3e12, thank you!

r.Recorder.Eventf(rayJobInstance, corev1.EventTypeWarning, string(validation.errType),
"%s/%s: %v", rayJobInstance.Namespace, rayJobInstance.Name, err)

rayJobInstance.Status.JobStatus = rayv1.JobStatusFailed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting JobStatus to JobStatusFailed is misleading. JobStatus should always reflect the Ray job status retrieved from the Ray dashboard API.

Suggested change
rayJobInstance.Status.JobStatus = rayv1.JobStatusFailed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this one, since we never run ray job submit.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! Thanks

@Future-Outlier
Copy link
Member

Future-Outlier commented Sep 2, 2025

I am thinking that if we do this change for RayJob, should we also do it for RayService?

@machichima machichima force-pushed the 3980-rayjob-metadata-invalid branch from a1059ba to 70c3e12 Compare September 2, 2025 05:55
@machichima
Copy link
Collaborator Author

Hi @Future-Outlier ,
The test passed, PTAL.
Thank you!

@machichima machichima force-pushed the 3980-rayjob-metadata-invalid branch from 5bf5706 to 920ae1d Compare September 2, 2025 13:14
Comment on lines +151 to +154
if err = r.updateRayJobStatus(ctx, originalRayJobInstance, rayJobInstance); err != nil {
logger.Info("Failed to update RayJob status", "error", err)
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
}
Copy link
Member

@Future-Outlier Future-Outlier Sep 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if err = r.updateRayJobStatus(ctx, originalRayJobInstance, rayJobInstance); err != nil {
logger.Info("Failed to update RayJob status", "error", err)
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
}
// This is the only 2 places where we update the RayJob status.
if err = r.updateRayJobStatus(ctx, originalRayJobInstance, rayJobInstance); err != nil {
logger.Info("Failed to update RayJob status", "error", err)
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
}

Can you also help me add comments to the code below?

	// This is the only place where we update the RayJob status. Please do NOT add any code
	// between `checkBackoffLimitAndUpdateStatusIfNeeded` and the following code.
	if err = r.updateRayJobStatus(ctx, originalRayJobInstance, rayJobInstance); err != nil {
		logger.Info("Failed to update RayJob status", "error", err)
		return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
	}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in 0663f3e

Comment on lines 629 to 748
Template: corev1.PodTemplateSpec{
Spec: corev1.PodSpec{
Containers: []corev1.Container{{Image: "rayproject/ray"}},
},
},
},
},
},
},
expectedMessage: "The RayJob metadata is invalid: RayJob name should be no more than 47 characters",
},
{
name: "spec validation failure - empty containers",
rayJob: &rayv1.RayJob{
ObjectMeta: metav1.ObjectMeta{
Name: "valid-name",
Namespace: "default",
},
Spec: rayv1.RayJobSpec{
RayClusterSpec: &rayv1.RayClusterSpec{
HeadGroupSpec: rayv1.HeadGroupSpec{
Template: corev1.PodTemplateSpec{
Spec: corev1.PodSpec{
Containers: []corev1.Container{},
},
},
},
},
},
},
expectedMessage: "The RayJob spec is invalid: headGroupSpec should have at least one container",
},
{
name: "status validation failure - invalid status combination",
rayJob: &rayv1.RayJob{
ObjectMeta: metav1.ObjectMeta{
Name: "valid-name",
Namespace: "default",
},
Spec: rayv1.RayJobSpec{
SubmissionMode: rayv1.K8sJobMode, // not interactive
RayClusterSpec: &rayv1.RayClusterSpec{
HeadGroupSpec: rayv1.HeadGroupSpec{
Template: corev1.PodTemplateSpec{
Spec: corev1.PodSpec{
Containers: []corev1.Container{{Image: "rayproject/ray"}},
},
},
},
},
},
Status: rayv1.RayJobStatus{
JobDeploymentStatus: rayv1.JobDeploymentStatusWaiting,
},
},
expectedMessage: "The RayJob status is invalid: JobDeploymentStatus cannot be `Waiting` when SubmissionMode is not InteractiveMode",
},
}

for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
// Create a fake client with the RayJob
fakeClient := clientFake.NewClientBuilder().
WithScheme(scheme).
WithRuntimeObjects(tt.rayJob).
WithStatusSubresource(&rayv1.RayJob{}).
Build()

// Create recorder for events
recorder := record.NewFakeRecorder(10)
reconciler := &RayJobReconciler{
Client: fakeClient,
Recorder: recorder,
Scheme: scheme,
}

// Call Reconcile
ctx := context.Background()
req := ctrl.Request{
NamespacedName: types.NamespacedName{
Name: tt.rayJob.Name,
Namespace: tt.rayJob.Namespace,
},
}
result, err := reconciler.Reconcile(ctx, req)
require.NoError(t, err)

// Make sure no requeue happening
assert.Equal(t, time.Duration(0), result.RequeueAfter)

// Verify the status was properly set
updatedRayJob := &rayv1.RayJob{}
err = fakeClient.Get(ctx, req.NamespacedName, updatedRayJob)
require.NoError(t, err)
assert.Equal(t, rayv1.JobDeploymentStatusValidationFailed, updatedRayJob.Status.JobDeploymentStatus)
assert.Equal(t, rayv1.ValidationFailed, updatedRayJob.Status.Reason)
assert.Equal(t, tt.expectedMessage, updatedRayJob.Status.Message)
})
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add at least an integration test for this PR?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! Added in 987b4f0

Copy link
Member

@Future-Outlier Future-Outlier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you help me merge master and solve the conflict? I can find time review this next week, thank you!

@machichima
Copy link
Collaborator Author

Can you help me merge master and solve the conflict? I can find time review this next week, thank you!

Sure! I am adding the e2e test you mentioned in the review. Will address all your review this week. Thank you!

@machichima
Copy link
Collaborator Author

After discussed with @Future-Outlier , we decided to remove the unit test as it is somehow covered already by test cases in validation_test.go and rayjob_test.go. Will add tests in the future PR if needed

Removed test in: 6a4c5dd

Copy link
Member

@Future-Outlier Future-Outlier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you!

@Future-Outlier
Copy link
Member

cc @rueian for review

@rueian rueian merged commit 6e70fd2 into ray-project:master Sep 16, 2025
27 checks passed
@Future-Outlier Future-Outlier changed the title [Fix] Directly fail if RayJob metadata is invalid [RayJob] Directly fail if metadata is invalid Sep 16, 2025
@Future-Outlier Future-Outlier changed the title [RayJob] Directly fail if metadata is invalid [RayJob] Directly fail if spec is invalid Sep 16, 2025
@Future-Outlier Future-Outlier changed the title [RayJob] Directly fail if spec is invalid [RayJob] Directly fail CR is invalid Sep 16, 2025
@Future-Outlier Future-Outlier changed the title [RayJob] Directly fail CR is invalid [RayJob] Directly fail CR if is invalid Sep 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] RayJob not failed when metadata is invalid
7 participants