Reconcile Ray Workers when VolumeMounts change #945
Conversation
Similar to #527
@DmitriGekhtman This isn't 100% ready for review just yet, but I think it would be good for you to be one of the reviewers, if that's okay.
Sure, keep @kevin85421 in the loop as well!
Thank you for the contribution! Would you mind adding some unit tests for the function? In addition, a detailed script for manual testing would be good. I will also test this PR myself.
@kevin85421 @DmitriGekhtman This is ready for review. @kevin85421 Feel free to resolve any threads you created.
Thank you for the update! Would you mind adding some tests in util_test.go?
@@ -346,6 +347,10 @@ func PodNotMatchingTemplate(pod corev1.Pod, template corev1.PodTemplateSpec) bool
			// resource entries do not match
			return true
		}
		if !equality.Semantic.DeepDerivative(container1.VolumeMounts, container2.VolumeMounts) {
Why do we use DeepDerivative rather than DeepEqual?
DeepEqual causes issues when default volumeMounts are added by the Kubernetes cluster. When using DeepEqual, I noticed pods would continuously reconcile.
"I also have the problem using reflect.DeepEqual or equality.Semantic.DeepEqual because some fields are set with default non-zero values by some controller (like ImagePullPolicy, RestartPolicy, and so on) if these fields are not set by the operator" - kubernetes-sigs/kubebuilder#592
kubernetes-sigs/kubebuilder#592 (comment)
kubernetes-sigs/kubebuilder#592 (comment)
DeepDerivative seems to be risky. Would you mind explaining the behavior of this function and why it works in our case?
Hey @kevin85421, I think those comments describe exactly the behavior we want. Kubernetes generates empty or nil values, and using DeepEqual will cause problems since it will detect drift when it shouldn't.
DeepDerivative is similar to DeepEqual except that unset fields in a1 are ignored (not compared). This allows us to focus on the fields that matter to the semantic comparison.
Hope this clears things up.
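To make the difference concrete, here is a minimal standalone sketch of the two comparisons. This is not code from the PR: the mount names are invented, and it assumes DeepDerivative's slice comparison tolerates extra trailing entries in the second argument, which is the behavior the fix relies on.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/equality"
)

func main() {
	// VolumeMounts as declared in the RayCluster manifest.
	desired := []corev1.VolumeMount{
		{Name: "data", MountPath: "/data"},
	}

	// VolumeMounts on the live pod: the API server has appended the
	// default projected service-account token mount.
	live := []corev1.VolumeMount{
		{Name: "data", MountPath: "/data"},
		{
			Name:      "kube-api-access-abcde",
			MountPath: "/var/run/secrets/kubernetes.io/serviceaccount",
			ReadOnly:  true,
		},
	}

	// DeepEqual sees the injected mount as drift, so every reconcile
	// pass would recreate the pod.
	fmt.Println(equality.Semantic.DeepEqual(desired, live)) // false

	// DeepDerivative ignores entries and fields unset in its first
	// argument (here: the extra trailing mount), so no drift is reported.
	fmt.Println(equality.Semantic.DeepDerivative(desired, live)) // true
}
```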
"DeepDerivative is similar to DeepEqual except that unset fields in a1 are ignored (not compared). This allows us to focus on the fields that matter to the semantic comparison."
I knew this. My point is: are there any situations where the worker update could be triggered unintentionally? I have this concern because:
(1) DeepDerivative is much more complex than what we want to compare. (code)
(2) Which fields will be updated by the Kubernetes APIServer? Will the changes trigger the worker update unintentionally?
Hi Kevin. I believe we have the same concern. Using DeepDerivative for comparing volumeMounts does not trigger updates unintentionally.
- It is more complex, but the complexity comes from ensuring fields are properly compared based on their data type.
- The common example is the empty and kubeApiAccess volumeMounts added by the APIServer, which would cause workers to update unintentionally under DeepEqual (see the sketch below).
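For the other direction, a hedged sketch (mount names again invented): an entry that is present in the desired spec but missing from the live pod is still compared, so intentional manifest changes should trigger reconciliation.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/equality"
)

func main() {
	// VolumeMounts on the live pod.
	live := []corev1.VolumeMount{
		{Name: "data", MountPath: "/data"},
	}

	// The manifest now declares an additional mount: the intentional
	// change this PR wants to detect.
	desired := []corev1.VolumeMount{
		{Name: "data", MountPath: "/data"},
		{Name: "models", MountPath: "/models"},
	}

	// Every entry set in the first argument is compared, so the new
	// mount is reported as a difference and the worker is reconciled.
	fmt.Println(equality.Semantic.DeepDerivative(desired, live)) // false
}
```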
(force-pushed from dcce201 to 7de70f3)
@kevin85421 Thanks for taking a look! Added a test specifically aimed at testing the volumeMounts, and changed the envVar to volumeMounts.
Have you run the unit tests in your local environment? You can check this doc for more details.
@kevin85421 Thanks for all the reviews! The tests have been passing locally, and I thought of a couple of other cases to add. Let me know if you have anything else.
@kevin85421 Wanted to politely follow up.
Hi, I am busy with the release v0.5.0 these weeks. I will get back to you when the release process finishes. I have not reviewed the PR again, but some things can be improved:
(1) Please provide a detailed test script in the PR description outlining how you tested this PR. Currently, KubeRay CI does not have any integration tests for -forced-cluster-upgrade, so I will review the related PRs carefully and test them myself. Please ensure that you thoroughly understand the behavior of both the Kubernetes API server and the DeepDerivative function. Both of these components make me hesitant to merge this PR. (https://github.com/ray-project/kuberay/pull/945/files#r1139782405)
(2) Can we make the test in util_test.go a separate unit test?
Thanks!
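For reference, such a separate unit test might look roughly like the sketch below. The package name and fixture values are assumptions; PodNotMatchingTemplate and its signature come from the diff above, and the expected outcomes assume the volumeMount comparison behaves as described in this thread.

```go
package common

import (
	"testing"

	corev1 "k8s.io/api/core/v1"
)

// buildPod returns a pod fixture; the values are invented for illustration.
func buildPod() corev1.Pod {
	return corev1.Pod{
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "ray-worker",
				Image: "rayproject/ray:2.3.0",
				VolumeMounts: []corev1.VolumeMount{
					{Name: "data", MountPath: "/data"},
				},
			}},
		},
	}
}

func TestPodNotMatchingTemplateVolumeMounts(t *testing.T) {
	pod := buildPod()
	// DeepCopy so mutating the template does not mutate the pod fixture.
	template := corev1.PodTemplateSpec{Spec: *pod.Spec.DeepCopy()}

	// Identical specs: no update should be required.
	if PodNotMatchingTemplate(pod, template) {
		t.Errorf("expected pod to match its own template")
	}

	// Declare a new mount in the template: an update should be detected.
	template.Spec.Containers[0].VolumeMounts = append(
		template.Spec.Containers[0].VolumeMounts,
		corev1.VolumeMount{Name: "models", MountPath: "/models"},
	)
	if !PodNotMatchingTemplate(pod, template) {
		t.Errorf("expected volumeMount change to be detected")
	}
}
```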
@kevin85421 I addressed the above. Thank you for taking another look!
Hey @kevin85421, wanted to follow up on this. I also noticed there are no labels on this PR; is that expected at this stage? I appreciate your time!
Why are these changes needed?
Ensures Ray Pods are up-to-date when volumeMounts change in the cluster manifest. This needs to be used together with the -forced-cluster-upgrade flag.
Related issue number
#944
Checks
When using the -forced-cluster-upgrade argument with the Ray Operator, the following manual steps have taken place to ensure the new functionality works as intended:
- Edit a live Ray Cluster manifest and attach a new VolumeMount to the head pod. Ensure this causes reconciliation of the head node.
- Edit a live Ray Cluster manifest and attach a new VolumeMount to the worker pod. Ensure this causes reconciliation of the worker node.