-
Notifications
You must be signed in to change notification settings - Fork 619
[RayJob] Directly fail CR if is invalid #3981
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RayJob] Directly fail CR if is invalid #3981
Conversation
Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: machichima <nary12321@gmail.com>
// Please do NOT modify `originalRayJobInstance` in the following code. | ||
originalRayJobInstance := rayJobInstance.DeepCopy() | ||
|
||
if err := utils.ValidateRayJobSpec(rayJobInstance); err != nil { | ||
logger.Error(err, "The RayJob spec is invalid") | ||
r.Recorder.Eventf(rayJobInstance, corev1.EventTypeWarning, string(utils.InvalidRayJobSpec), | ||
"The RayJob spec is invalid %s/%s: %v", rayJobInstance.Namespace, rayJobInstance.Name, err) | ||
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err | ||
} | ||
// Perform all validations and directly fail the RayJob if any of the validation fails | ||
validationErrors := []struct { | ||
err error | ||
errType utils.K8sEventType | ||
message string | ||
}{ | ||
{utils.ValidateRayJobMetadata(rayJobInstance.ObjectMeta), utils.InvalidRayJobMetadata, "The RayJob metadata is invalid"}, | ||
{utils.ValidateRayJobSpec(rayJobInstance), utils.InvalidRayJobSpec, "The RayJob spec is invalid"}, | ||
{utils.ValidateRayJobStatus(rayJobInstance), utils.InvalidRayJobStatus, "The RayJob status is invalid"}, | ||
} | ||
|
||
for _, validation := range validationErrors { | ||
if validation.err != nil { | ||
logger.Error(validation.err, validation.message) | ||
r.Recorder.Eventf(rayJobInstance, corev1.EventTypeWarning, string(validation.errType), | ||
"%s %s/%s: %v", validation.message, rayJobInstance.Namespace, rayJobInstance.Name, validation.err) | ||
|
||
rayJobInstance.Status.JobStatus = rayv1.JobStatusFailed | ||
rayJobInstance.Status.JobDeploymentStatus = rayv1.JobDeploymentStatusFailed | ||
rayJobInstance.Status.Reason = rayv1.ValidationFailed | ||
rayJobInstance.Status.Message = fmt.Sprintf("%s: %v", validation.message, validation.err) | ||
|
||
if err = r.updateRayJobStatus(ctx, originalRayJobInstance, rayJobInstance); err != nil { | ||
logger.Info("Failed to update RayJob status", "error", err) | ||
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err | ||
} | ||
|
||
if err := utils.ValidateRayJobStatus(rayJobInstance); err != nil { | ||
logger.Error(err, "The RayJob status is invalid") | ||
r.Recorder.Eventf(rayJobInstance, corev1.EventTypeWarning, string(utils.InvalidRayJobStatus), | ||
"The RayJob status is invalid %s/%s: %v", rayJobInstance.Namespace, rayJobInstance.Name, err) | ||
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err | ||
return ctrl.Result{}, nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am thinking that should we wrap this whole part to a function?
maybe the name can be checkRayJobSpecAndUpdateStatusIfNeeded
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SG! Updated in 3685d4d
validationErrors := []struct { | ||
err error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe validationRule
is more accurate than validationErrors
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modified in 771f198, thanks!
}{ | ||
{utils.ValidateRayJobMetadata(rayJobInstance.ObjectMeta), utils.InvalidRayJobMetadata, "The RayJob metadata is invalid"}, | ||
{utils.ValidateRayJobSpec(rayJobInstance), utils.InvalidRayJobSpec, "The RayJob spec is invalid"}, | ||
{utils.ValidateRayJobStatus(rayJobInstance), utils.InvalidRayJobStatus, "The RayJob status is invalid"}, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This slice might cause some unnecessary execution. If the former validation return an error, the latter still gets executed but it only update the status according the first error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 5066d66 to prevent executing the next validation if the previous one failed, thank you!
}, | ||
}, | ||
}, | ||
shouldContain: "The RayJob metadata is invalid", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be difficult to include the message of error in shouldContain
? That could be more specific that it did cause a certain error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem! Added in bd6c07d
Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: machichima <nary12321@gmail.com>
3685d4d
to
53cb27d
Compare
Signed-off-by: machichima <nary12321@gmail.com>
return false, ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err | ||
} | ||
|
||
return false, ctrl.Result{}, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any reason not to return the error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry that's a mistake, updated here: 4de789e
rayJobInstance.Status.Reason = rayv1.ValidationFailed | ||
rayJobInstance.Status.Message = fmt.Sprintf("%s: %v", validation.message, err) | ||
|
||
if err = r.updateRayJobStatus(ctx, originalRayJobInstance, rayJobInstance); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we update rayJob with Kubernetes here?
We should only update rayJob here.
kuberay/ray-operator/controllers/ray/rayjob_controller.go
Lines 458 to 463 in edf7de8
// This is the only place where we update the RayJob status. Please do NOT add any code | |
// between `checkBackoffLimitAndUpdateStatusIfNeeded` and the following code. | |
if err = r.updateRayJobStatus(ctx, originalRayJobInstance, rayJobInstance); err != nil { | |
logger.Info("Failed to update RayJob status", "error", err) | |
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried not to update the ray job status here. However, it will then directly goes into case rayv1.JobDeploymentStatusComplete, rayv1.JobDeploymentStatusFailed
and return here:
return ctrl.Result{}, nil |
I think it's safe to update the status here, as I believe the code you are pointing to is trying to prevent us updating the rayjob status within the switch rayJobInstance.Status.JobDeploymentStatus
block. If the rayjob status is updated, here, it will directly return and won't affect anything
// Please do NOT modify `originalRayJobInstance` in the following code. | ||
originalRayJobInstance := rayJobInstance.DeepCopy() | ||
|
||
// Perform all validations and directly fail the RayJob if any of the validation fails | ||
if passed, result, err := r.validateRayJobAndUpdateStatusIfNeeded(ctx, rayJobInstance, originalRayJobInstance); !passed || err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why use "or " here?
In what scenario would passed
be true
and err
be nil
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a mistake I made on not returning the error here return false, ctrl.Result{}, nil
. Fixed in 4de789e
Signed-off-by: machichima <nary12321@gmail.com>
f5b5576
to
4de789e
Compare
if updateErr := r.updateRayJobStatus(ctx, originalRayJobInstance, rayJobInstance); updateErr != nil { | ||
logger.Info("Failed to update RayJob status", "error", updateErr) | ||
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, updateErr | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move this out? Ideally, we should have only one updateRayJobStatus
usage, the one at the bottom of the reconciliation loop, and we may need to distinguish whether the result is a validation error or a status update error. For validation errors, I think we'd better even stop requeueing reconciliations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this one, this will keep the RayJob reconciler's logic clean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! I added a new status JobDeploymentStatusValidationFailed
and ensure there's only one updateRayJobStatus
in the very end. Also, if we got validation error, it will not be requeue.
commit: 3002c05
|
||
rayJobInstance.Status.JobStatus = rayv1.JobStatusFailed | ||
rayJobInstance.Status.JobDeploymentStatus = rayv1.JobDeploymentStatusFailed | ||
rayJobInstance.Status.Reason = rayv1.ValidationFailed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider emitting a Condition (type=Ready/Valid) alongside Status.Reason=ValidationFailed
so kubectl describe
surfaces a machine-readable state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a great idea and maybe more closer to Kubernetes's best practice, however, I think it will be too much change, we should wait for more users give us feedback if this really bother them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! Although not exactly your suggestion, I added a new status ValidationFailed
in 3002c05
We can see the ValidationFailed
as the DEPLOYMENT STATUS
when we do ``kubectl get`

Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: machichima <nary12321@gmail.com>
a220ac1
to
83d4e2c
Compare
// Do not requeue if validation error occurs | ||
if rayJobInstance.Status.JobDeploymentStatus == rayv1.JobDeploymentStatusValidationFailed { | ||
return ctrl.Result{}, validateErr | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be really weird to add this statement here, we should consider other implementation in my opinion, this makes the code harder to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be more specific, I want to keep the logic of reconciliation clean
@@ -462,9 +446,42 @@ func (r *RayJobReconciler) Reconcile(ctx context.Context, request ctrl.Request) | |||
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err | |||
} | |||
emitRayJobMetrics(r.options.RayJobMetricsManager, rayJobInstance.Name, rayJobInstance.Namespace, originalRayJobInstance.Status, rayJobInstance.Status) | |||
// Do not requeue if validation error occurs | |||
if rayJobInstance.Status.JobDeploymentStatus == rayv1.JobDeploymentStatusValidationFailed { | |||
return ctrl.Result{}, validateErr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return ctrl.Result{}, validateErr | |
return ctrl.Result{}, nil |
If I remember correctly, returning an error will requeue reconciliation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for pointing this out! Fixed in e98978b
case rayv1.JobDeploymentStatusValidationFailed: | ||
// Fall through to status update |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer change the deployment status to failed here, so r.updateRayJobStatus(ctx, originalRayJobInstance, rayJobInstance);
can help use update the rayJobInstace (CR).
and the next reconciliation, we can remove the CR in the queue.
case rayv1.JobDeploymentStatusValidationFailed: | |
// Fall through to status update | |
case rayv1.JobDeploymentStatusValidationFailed: | |
rayJobInstance.Status.JobDeploymentStatus = rayv1.JobDeploymentStatusFailed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The validation error will appear in the CRD events, right? If so, I don’t think we need to update the status.
[update] On second thought, we still need to update the CR status so that platform teams can program against it more easily.
// Please do NOT modify `originalRayJobInstance` in the following code. | ||
originalRayJobInstance := rayJobInstance.DeepCopy() | ||
|
||
// Perform all validations and directly fail the RayJob if any of the validation fails | ||
validateErr := r.validateRayJobAndUpdateStatusIfNeeded(ctx, rayJobInstance) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about calling updateRayJobStatus
here directly so that we don't need to go into the switch
statement? This makes the implementation simpler.
The comment "This is the only place where we update the RayJob status." is a good suggestion, as it helps avoid calling Status().Update()
inside the switch
statement, where complexity typically increases significantly. However, since validation is not in the switch
statement, updating the status immediately after validation is acceptable and actually reduces complexity.
kuberay/ray-operator/controllers/ray/rayjob_controller.go
Lines 459 to 461 in f18b1a4
// This is the only place where we update the RayJob status. Please do NOT add any code | |
// between `checkBackoffLimitAndUpdateStatusIfNeeded` and the following code. | |
if err = r.updateRayJobStatus(ctx, originalRayJobInstance, rayJobInstance); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in e98978b, thank you!
@@ -466,6 +456,34 @@ func (r *RayJobReconciler) Reconcile(ctx context.Context, request ctrl.Request) | |||
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, nil | |||
} | |||
|
|||
func (r *RayJobReconciler) validateRayJobAndUpdateStatusIfNeeded(ctx context.Context, rayJobInstance *rayv1.RayJob) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Can you move the logic for updating
rayJobInstance.Status
outside of this function? This function is only responsible to return an error if any. - Rename the function from
validateRayJobAndUpdateStatusIfNeeded
tovalidateRayJob
. - This no longer needs to be a member function of
RayJobReconciler
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 70c3e12, thank you!
r.Recorder.Eventf(rayJobInstance, corev1.EventTypeWarning, string(validation.errType), | ||
"%s/%s: %v", rayJobInstance.Namespace, rayJobInstance.Name, err) | ||
|
||
rayJobInstance.Status.JobStatus = rayv1.JobStatusFailed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Setting JobStatus
to JobStatusFailed
is misleading. JobStatus
should always reflect the Ray job status retrieved from the Ray dashboard API.
rayJobInstance.Status.JobStatus = rayv1.JobStatusFailed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this one, since we never run ray job submit
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done! Thanks
I am thinking that if we do this change for RayJob, should we also do it for RayService? |
a1059ba
to
70c3e12
Compare
Hi @Future-Outlier , |
Signed-off-by: Nary <nary@Narys-MacBook-Pro.local>
5bf5706
to
920ae1d
Compare
if err = r.updateRayJobStatus(ctx, originalRayJobInstance, rayJobInstance); err != nil { | ||
logger.Info("Failed to update RayJob status", "error", err) | ||
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if err = r.updateRayJobStatus(ctx, originalRayJobInstance, rayJobInstance); err != nil { | |
logger.Info("Failed to update RayJob status", "error", err) | |
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err | |
} | |
// This is the only 2 places where we update the RayJob status. | |
if err = r.updateRayJobStatus(ctx, originalRayJobInstance, rayJobInstance); err != nil { | |
logger.Info("Failed to update RayJob status", "error", err) | |
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err | |
} |
Can you also help me add comments to the code below?
// This is the only place where we update the RayJob status. Please do NOT add any code
// between `checkBackoffLimitAndUpdateStatusIfNeeded` and the following code.
if err = r.updateRayJobStatus(ctx, originalRayJobInstance, rayJobInstance); err != nil {
logger.Info("Failed to update RayJob status", "error", err)
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added in 0663f3e
Template: corev1.PodTemplateSpec{ | ||
Spec: corev1.PodSpec{ | ||
Containers: []corev1.Container{{Image: "rayproject/ray"}}, | ||
}, | ||
}, | ||
}, | ||
}, | ||
}, | ||
}, | ||
expectedMessage: "The RayJob metadata is invalid: RayJob name should be no more than 47 characters", | ||
}, | ||
{ | ||
name: "spec validation failure - empty containers", | ||
rayJob: &rayv1.RayJob{ | ||
ObjectMeta: metav1.ObjectMeta{ | ||
Name: "valid-name", | ||
Namespace: "default", | ||
}, | ||
Spec: rayv1.RayJobSpec{ | ||
RayClusterSpec: &rayv1.RayClusterSpec{ | ||
HeadGroupSpec: rayv1.HeadGroupSpec{ | ||
Template: corev1.PodTemplateSpec{ | ||
Spec: corev1.PodSpec{ | ||
Containers: []corev1.Container{}, | ||
}, | ||
}, | ||
}, | ||
}, | ||
}, | ||
}, | ||
expectedMessage: "The RayJob spec is invalid: headGroupSpec should have at least one container", | ||
}, | ||
{ | ||
name: "status validation failure - invalid status combination", | ||
rayJob: &rayv1.RayJob{ | ||
ObjectMeta: metav1.ObjectMeta{ | ||
Name: "valid-name", | ||
Namespace: "default", | ||
}, | ||
Spec: rayv1.RayJobSpec{ | ||
SubmissionMode: rayv1.K8sJobMode, // not interactive | ||
RayClusterSpec: &rayv1.RayClusterSpec{ | ||
HeadGroupSpec: rayv1.HeadGroupSpec{ | ||
Template: corev1.PodTemplateSpec{ | ||
Spec: corev1.PodSpec{ | ||
Containers: []corev1.Container{{Image: "rayproject/ray"}}, | ||
}, | ||
}, | ||
}, | ||
}, | ||
}, | ||
Status: rayv1.RayJobStatus{ | ||
JobDeploymentStatus: rayv1.JobDeploymentStatusWaiting, | ||
}, | ||
}, | ||
expectedMessage: "The RayJob status is invalid: JobDeploymentStatus cannot be `Waiting` when SubmissionMode is not InteractiveMode", | ||
}, | ||
} | ||
|
||
for _, tt := range tests { | ||
t.Run(tt.name, func(t *testing.T) { | ||
// Create a fake client with the RayJob | ||
fakeClient := clientFake.NewClientBuilder(). | ||
WithScheme(scheme). | ||
WithRuntimeObjects(tt.rayJob). | ||
WithStatusSubresource(&rayv1.RayJob{}). | ||
Build() | ||
|
||
// Create recorder for events | ||
recorder := record.NewFakeRecorder(10) | ||
reconciler := &RayJobReconciler{ | ||
Client: fakeClient, | ||
Recorder: recorder, | ||
Scheme: scheme, | ||
} | ||
|
||
// Call Reconcile | ||
ctx := context.Background() | ||
req := ctrl.Request{ | ||
NamespacedName: types.NamespacedName{ | ||
Name: tt.rayJob.Name, | ||
Namespace: tt.rayJob.Namespace, | ||
}, | ||
} | ||
result, err := reconciler.Reconcile(ctx, req) | ||
require.NoError(t, err) | ||
|
||
// Make sure no requeue happening | ||
assert.Equal(t, time.Duration(0), result.RequeueAfter) | ||
|
||
// Verify the status was properly set | ||
updatedRayJob := &rayv1.RayJob{} | ||
err = fakeClient.Get(ctx, req.NamespacedName, updatedRayJob) | ||
require.NoError(t, err) | ||
assert.Equal(t, rayv1.JobDeploymentStatusValidationFailed, updatedRayJob.Status.JobDeploymentStatus) | ||
assert.Equal(t, rayv1.ValidationFailed, updatedRayJob.Status.Reason) | ||
assert.Equal(t, tt.expectedMessage, updatedRayJob.Status.Message) | ||
}) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add at least an integration test for this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure! Added in 987b4f0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you help me merge master and solve the conflict? I can find time review this next week, thank you!
Sure! I am adding the e2e test you mentioned in the review. Will address all your review this week. Thank you! |
…job-metadata-invalid
Signed-off-by: machichima <nary12321@gmail.com>
After discussed with @Future-Outlier , we decided to remove the unit test as it is somehow covered already by test cases in Removed test in: 6a4c5dd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thank you!
cc @rueian for review |
RayJob
status to Failed if validation errorWhy are these changes needed?
When trying to create rayjob with name that's too long, we get the validation error. However, the RayJob just keep retrying and does not fail, which cause RayJob to stuck without any status update
Result after my modification, RayJob will directly fail when validation error:
Related issue number
Closes #3980
Checks