Allow users to suspend Elasticsearch Pods for debugging purposes #4946
Conversation
This reverts commit e1ea60c.
Left a few minor nitpicks, overall it looks great and clean!
func reconcileSuspendedPods(c k8s.Client, es esv1.Elasticsearch, e *expectations.Expectations) error {
	// let's make sure we observe any deletions in the cache to avoid redundant deletion
	deletionsSatisfied, err := e.DeletionsSatisfied()
Considering we don't care about the change budget here, I feel like we may not need to double-check the Pod deletion expectations here (a small optimization) and instead just rely on optimistic concurrency control for the deletion?
(this simplifies the mental model required to understand this function)
Ah I guess we need that because otherwise we would trigger a rolling upgrade whose change budget would ignore that pod we just deleted.
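For context, a minimal sketch of how the guard discussed here could continue; this is an illustration under the assumption of a requeue-style reconciliation, not the actual PR code:

deletionsSatisfied, err := e.DeletionsSatisfied()
if err != nil {
	return err
}
if !deletionsSatisfied {
	// wait until the cache reflects previously requested deletions so the same
	// Pod is not deleted twice; the next reconciliation will retry
	return nil
}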
)
if err != nil {
	return corev1.PodTemplateSpec{}, err
}
Should init containers inherit the main container resources the same way they inherit volumes etc. through WithInitContainerDefaults?
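For illustration, a minimal sketch of what such defaulting could look like; the helper below is hypothetical and the actual WithInitContainerDefaults signature may differ:

import corev1 "k8s.io/api/core/v1"

// defaultInitContainerResources is a hypothetical helper: init containers that do
// not specify their own resources inherit the main container's requirements.
func defaultInitContainerResources(podSpec *corev1.PodSpec, mainContainerName string) {
	var main *corev1.Container
	for i := range podSpec.Containers {
		if podSpec.Containers[i].Name == mainContainerName {
			main = &podSpec.Containers[i]
			break
		}
	}
	if main == nil {
		return
	}
	for i := range podSpec.InitContainers {
		ic := &podSpec.InitContainers[i]
		// only inherit when the user did not set anything themselves
		if len(ic.Resources.Requests) == 0 && len(ic.Resources.Limits) == 0 {
			ic.Resources = *main.Resources.DeepCopy()
		}
	}
}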
I can look into that. I think that would be nicer than the special case that I constructed. I originally shied away from it because I thought that we do not want that for all initContainers. But given that they already have specific resource requirements this should be fine.
I guess one downside of defaulting in the way you proposed would be that all user-specified initContainers would now also inherit the main container's resource requirements (if they don't specify their own), which would constitute a breaking change compared to current behaviour.
I think I would be fine with that; it also simplifies the question of "how much memory should we give to init container X" if they all share the main container spec. Considering they run in order, I don't think there would be any significant impact on resource usage.
@elastic/cloud-k8s team any thoughts or different opinion on this?
I have implemented it in 32a24fc which we can back out if there are concerns.
I guess one downside of defaulting in the way you proposed would be that all user-specified initContainers would now also inherit the main container's resource requirements (if they don't specify their own), which would constitute a breaking change compared to current behaviour.
👍, if we want to implement a new behaviour I would make it in a dedicated PR, so it appears clearly in the release notes with a correct label on the PR.
I gave it a try locally and everything seems to work as expected :)
Let's not forget to write documentation in https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-troubleshooting-methods.html.
LGTM 🚢
I was planning to do that in a separate PR once we are happy with the approach, and for easier reviewing.
LGTM!
One thing I noticed is that, depending on the state of the cluster (I did some tests on "broken" clusters, which were returning 5xx), it can take a bit more than one minute for the ConfigMap to be updated and/or Pods to be restarted. There is nothing new here; this is because of the "rate limited" queue used in the controller. But I was wondering if in a "debug/support situation" we might expect less "latency" between the expected state and the actual one, and if it could be worth having a dedicated controller, not tied to the health of Elasticsearch. On second thought, it sounds like a very small improvement which might increase the complexity of the actual code, and "premature optimization is the root of all evil".
So 👍 with the proposed approach.
…stic#4946) Adds support for a new annotation, eck.k8s.elastic.co/suspend, that allows users to suspend the Pods listed in the annotation. Implementation does not treat suspended Pods in any special way. That means most cluster operations like upscale or downscale of master nodes and rolling upgrades of all nodes are not able to make progress while a node in the cluster is suspended.
Fixes #4546
Adds support for a new annotation, eck.k8s.elastic.co/suspend, that allows users to suspend the Pods listed in the annotation.
Implementation does not treat suspended Pods in any special way (e1ea60c explored this option but I reverted it). That means most cluster operations like upscale or downscale of master nodes and rolling upgrades of all nodes are not able to make progress while a node in the cluster is suspended.
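For illustration, a minimal sketch of how the operator could derive the set of suspended Pods from the annotation; the comma-separated value format and the helper below are assumptions rather than the actual implementation:

import "strings"

// SuspendAnnotation is the annotation added by this PR.
const SuspendAnnotation = "eck.k8s.elastic.co/suspend"

// suspendedPodNames returns the Pod names listed in the suspend annotation
// (assumed here to be a comma-separated list).
func suspendedPodNames(annotations map[string]string) map[string]struct{} {
	suspended := map[string]struct{}{}
	for _, name := range strings.Split(annotations[SuspendAnnotation], ",") {
		if trimmed := strings.TrimSpace(name); trimmed != "" {
			suspended[trimmed] = struct{}{}
		}
	}
	return suspended
}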
Unit test coverage is still a bit light but I added an e2e test.