
Allow users to suspend Elasticsearch Pods for debugging purposes #4946

Merged · 20 commits merged into elastic:master on Oct 18, 2021

Conversation

@pebrc (Collaborator) commented Oct 12, 2021

Fixes #4546

Adds support for a new annotation that allows users to suspend the Pods listed in the annotation:

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch-sample
  annotations:
    eck.k8s.elastic.co/suspend: "pod-1,pod-2"
spec:
  version: 7.15.0
```
The implementation does not treat suspended Pods in any special way (e1ea60c explored that option, but I reverted it). That means most cluster operations, like upscaling or downscaling of master nodes and rolling upgrades of all nodes, cannot make progress while a node in the cluster is suspended.
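
To make the mechanics concrete, here is a minimal sketch of how the comma-separated annotation value could be parsed into a set of Pod names (the helper name is hypothetical and not necessarily what the PR's code does):

```go
package main

import "strings"

// suspendedPodNames is a hypothetical helper: it turns the value of the
// eck.k8s.elastic.co/suspend annotation into a set of Pod names.
func suspendedPodNames(annotations map[string]string) map[string]struct{} {
	names := map[string]struct{}{}
	value, ok := annotations["eck.k8s.elastic.co/suspend"]
	if !ok {
		return names
	}
	for _, name := range strings.Split(value, ",") {
		if trimmed := strings.TrimSpace(name); trimmed != "" {
			names[trimmed] = struct{}{}
		}
	}
	return names
}
```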

Unit test coverage is still a bit light, but I added an e2e test.

@botelastic botelastic bot added the triage label Oct 12, 2021
@pebrc pebrc added >enhancement Enhancement of existing functionality >feature Adds or discusses adding a feature to the product labels Oct 13, 2021
@botelastic botelastic bot removed the triage label Oct 13, 2021
@pebrc pebrc added v1.9.0 and removed >enhancement Enhancement of existing functionality labels Oct 13, 2021
@pebrc pebrc marked this pull request as ready for review October 14, 2021 13:27
@sebgl (Contributor) left a comment


Left a few minor nitpicks, overall it looks great and clean!

Resolved review threads on pkg/apis/elasticsearch/v1/elasticsearch_types.go and pkg/controller/elasticsearch/configmap/configmap.go (two threads).

```go
func reconcileSuspendedPods(c k8s.Client, es esv1.Elasticsearch, e *expectations.Expectations) error {
	// let's make sure we observe any deletions in the cache to avoid redundant deletion
	deletionsSatisfied, err := e.DeletionsSatisfied()
```
Contributor:

Considering we don't care about the change budget here, I feel like we may not need to double-check the pod deletion expectations here (a small optimization) and instead just rely on optimistic concurrency control for the deletion?
(this simplifies the mental model required to understand this function)

Contributor:

Ah I guess we need that because otherwise we would trigger a rolling upgrade whose change budget would ignore that pod we just deleted.
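
To spell out the guard being discussed, a sketch (an assumed shape, not the PR's exact code) of how the expectations check prevents that scenario:

```go
// ensureDeletionsObserved is a hypothetical helper: bail out of suspending
// Pods until previously requested deletions are visible in the client cache.
func ensureDeletionsObserved(e *expectations.Expectations) (done bool, err error) {
	deletionsSatisfied, err := e.DeletionsSatisfied()
	if err != nil {
		return false, err
	}
	if !deletionsSatisfied {
		// A delete we issued earlier is not yet reflected in the cache:
		// requeue instead of risking a duplicate delete, or a rolling
		// upgrade whose change budget misses the just-deleted Pod.
		return false, nil
	}
	return true, nil
}
```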

Resolved review threads on pkg/controller/elasticsearch/driver/suspend.go (two threads) and pkg/controller/elasticsearch/initcontainer/suspend.go.
```go
	)
	if err != nil {
		return corev1.PodTemplateSpec{}, err
	}
```
Contributor:

Should init containers inherit the main container resources the same way they inherit volumes etc. through WithInitContainerDefaults?

Collaborator Author:

I can look into that. I think that would be nicer than the special case that I constructed. I originally shied away from it because I thought we would not want that for all initContainers. But given that they already have specific resource requirements, this should be fine.

@pebrc (Collaborator Author) commented Oct 15, 2021:

I guess one downside of defaulting in the way you proposed would be that all user-specified initContainers would now also inherit the main container's resource requirements (if they don't specify their own), which would constitute a breaking change compared to current behaviour.

Contributor:

I think I would be fine with that; it also simplifies the question of "how much memory should we give to init container X" if they all share the main container spec. Considering they run in order, I don't think there would be any significant impact on resource usage.
@elastic/cloud-k8s team, any thoughts or a different opinion on this?

Collaborator Author:

I have implemented it in 32a24fc which we can back out if there are concerns.
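
For illustration, roughly what such defaulting could look like (a hypothetical sketch under the behaviour discussed above, not the actual code in 32a24fc):

```go
package main

import corev1 "k8s.io/api/core/v1"

// applyInitContainerResourceDefaults is a hypothetical illustration: init
// containers that specify no resources of their own inherit the resource
// requirements of the named main container.
func applyInitContainerResourceDefaults(spec *corev1.PodSpec, mainContainer string) {
	var main *corev1.Container
	for i := range spec.Containers {
		if spec.Containers[i].Name == mainContainer {
			main = &spec.Containers[i]
			break
		}
	}
	if main == nil {
		return
	}
	for i := range spec.InitContainers {
		ic := &spec.InitContainers[i]
		// Only default when the user set nothing, so explicit
		// requirements are never overridden.
		if len(ic.Resources.Requests) == 0 && len(ic.Resources.Limits) == 0 {
			ic.Resources = *main.Resources.DeepCopy()
		}
	}
}
```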

Contributor:

> I guess one downside of defaulting in the way you proposed would be that all user-specified initContainers would now also inherit the main container's resource requirements (if they don't specify their own), which would constitute a breaking change compared to current behaviour.

👍, if we want to implement a new behaviour I would do it in a dedicated PR, so it appears clearly in the release notes with the correct label on the PR.

@sebgl (Contributor) left a comment

I gave it a try locally and everything seems to work as expected :)
Let's not forget to write documentation in https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-troubleshooting-methods.html.

LGTM 🚢

@pebrc (Collaborator Author) commented Oct 18, 2021

> Let's not forget to write documentation in https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-troubleshooting-methods.html.

I was planning to do that in a separate PR, once we are happy with the approach, and for easier reviewing.

@barkbay (Contributor) left a comment

LGTM!

One thing I noticed is that, depending on the state of the cluster (I did some tests on "broken" clusters, which were returning 5xx), it can take a bit more than one minute for the ConfigMap to be updated and/or the Pods to be restarted. There is nothing new here; this is because of the rate-limited queue used in the controller. But I was wondering whether, in a debug/support situation, we might expect less latency between the expected state and the actual one, and whether it could be worth having a dedicated controller, not tied to the health of Elasticsearch. On second thought, it sounds like a very small improvement that might increase the complexity of the actual code, and "premature optimization is the root of all evil".

So 👍 with the proposed approach.
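
For context on that latency, a small sketch assuming the default client-go workqueue rate limiter that controller-runtime uses: the per-item backoff starts at 5ms and doubles on every consecutive failure, so around the fifteenth failed reconciliation the delay alone passes the one-minute mark (5ms · 2^14 ≈ 82s).

```go
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Per-item exponential backoff (5ms base, 1000s cap) combined with an
	// overall token bucket — the defaults behind controller requeue delays.
	limiter := workqueue.DefaultControllerRateLimiter()
	for i := 0; i < 15; i++ {
		// Each call counts another failure for the same item and returns
		// the delay before its next retry.
		fmt.Printf("retry %2d requeued after %v\n", i+1, limiter.When("es-reconcile"))
	}
}
```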

@pebrc pebrc merged commit c15a94e into elastic:master Oct 18, 2021
fantapsody pushed a commit to fantapsody/cloud-on-k8s that referenced this pull request Feb 7, 2023
Allow users to suspend Elasticsearch Pods for debugging purposes (elastic#4946)

Adds support for a new annotation, eck.k8s.elastic.co/suspend, that allows users to suspend the Pods listed in the annotation.

The implementation does not treat suspended Pods in any special way. That means most cluster operations, like upscaling or downscaling of master nodes and rolling upgrades of all nodes, cannot make progress while a node in the cluster is suspended.
Labels: >feature Adds or discusses adding a feature to the product · v1.9.0
Linked issue (may be closed by merging this pull request): Allow users to suspend the Elasticsearch process for debugging purposes
3 participants