
Allow users to suspend Elasticsearch Pods for debugging purposes #4946

Merged · 20 commits merged into elastic:master on Oct 18, 2021

Conversation

@pebrc (Collaborator) commented Oct 12, 2021

Fixes #4546

Adds support for a new annotation that allows users to suspend the Pods listed in the annotation:

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch-sample
  annotations:
    eck.k8s.elastic.co/suspend: "pod-1,pod-2"
spec:
  version: 7.15.0
```
The implementation does not treat suspended Pods in any special way (e1ea60c explored that option, but I reverted it). That means most cluster operations, like upscaling or downscaling of master nodes and rolling upgrades of all nodes, cannot make progress while a node in the cluster is suspended.
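
To make the mechanics concrete, here is a minimal sketch of how the comma-separated annotation value could be parsed into a set of Pod names (the helper name is hypothetical and not necessarily what the PR's code does):

```go
package main

import "strings"

// suspendedPodNames is a hypothetical helper: it turns the value of the
// eck.k8s.elastic.co/suspend annotation into a set of Pod names.
func suspendedPodNames(annotations map[string]string) map[string]struct{} {
	names := map[string]struct{}{}
	value, ok := annotations["eck.k8s.elastic.co/suspend"]
	if !ok {
		return names
	}
	for _, name := range strings.Split(value, ",") {
		if trimmed := strings.TrimSpace(name); trimmed != "" {
			names[trimmed] = struct{}{}
		}
	}
	return names
}
```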

Unit test coverage is still a bit light, but I added an e2e test.

@botelastic botelastic bot added the triage label Oct 12, 2021
@pebrc pebrc added >enhancement Enhancement of existing functionality >feature Adds or discusses adding a feature to the product labels Oct 13, 2021
@botelastic botelastic bot removed the triage label Oct 13, 2021
@pebrc pebrc added v1.9.0 and removed >enhancement Enhancement of existing functionality labels Oct 13, 2021
@pebrc pebrc marked this pull request as ready for review October 14, 2021 13:27
@sebgl (Contributor) left a comment


Left a few minor nitpicks, overall it looks great and clean!

Resolved review threads on pkg/apis/elasticsearch/v1/elasticsearch_types.go and pkg/controller/elasticsearch/configmap/configmap.go (two threads).

```go
func reconcileSuspendedPods(c k8s.Client, es esv1.Elasticsearch, e *expectations.Expectations) error {
	// let's make sure we observe any deletions in the cache to avoid redundant deletion
	deletionsSatisfied, err := e.DeletionsSatisfied()
```
Contributor:

Considering we don't care about the change budget here, I feel like we may not need to double-check the pod deletion expectations here (a small optimization) and instead just rely on optimistic concurrency control for the deletion?
(this simplifies the mental model required to understand this function)

Contributor:

Ah I guess we need that because otherwise we would trigger a rolling upgrade whose change budget would ignore that pod we just deleted.
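
To spell out the guard being discussed, a sketch (an assumed shape, not the PR's exact code) of how the expectations check prevents that scenario:

```go
// ensureDeletionsObserved is a hypothetical helper: bail out of suspending
// Pods until previously requested deletions are visible in the client cache.
func ensureDeletionsObserved(e *expectations.Expectations) (done bool, err error) {
	deletionsSatisfied, err := e.DeletionsSatisfied()
	if err != nil {
		return false, err
	}
	if !deletionsSatisfied {
		// A delete we issued earlier is not yet reflected in the cache:
		// requeue instead of risking a duplicate delete, or a rolling
		// upgrade whose change budget misses the just-deleted Pod.
		return false, nil
	}
	return true, nil
}
```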

Resolved review threads on pkg/controller/elasticsearch/driver/suspend.go (two threads) and pkg/controller/elasticsearch/initcontainer/suspend.go.
```go
	)
	if err != nil {
		return corev1.PodTemplateSpec{}, err
	}
```
Contributor:

Should init containers inherit the main container resources the same way they inherit volumes etc. through WithInitContainerDefaults?

Collaborator Author:

I can look into that. I think that would be nicer than the special case that I constructed. I originally shied away from it because I thought we would not want that for all initContainers. But given that they already have specific resource requirements, this should be fine.

@pebrc (Collaborator Author) commented Oct 15, 2021:

I guess one downside of defaulting in the way you proposed would be that all user-specified initContainers would now also inherit the main container's resource requirements (if they don't specify their own), which would constitute a breaking change compared to current behaviour.

Contributor:

I think I would be fine with that; it also simplifies the question of "how much memory should we give to init container X" if they all share the main container spec. Considering they run in order, I don't think there would be any significant impact on resource usage.
@elastic/cloud-k8s team, any thoughts or a different opinion on this?

Collaborator Author:

I have implemented it in 32a24fc which we can back out if there are concerns.
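
For illustration, roughly what such defaulting could look like (a hypothetical sketch under the behaviour discussed above, not the actual code in 32a24fc):

```go
package main

import corev1 "k8s.io/api/core/v1"

// applyInitContainerResourceDefaults is a hypothetical illustration: init
// containers that specify no resources of their own inherit the resource
// requirements of the named main container.
func applyInitContainerResourceDefaults(spec *corev1.PodSpec, mainContainer string) {
	var main *corev1.Container
	for i := range spec.Containers {
		if spec.Containers[i].Name == mainContainer {
			main = &spec.Containers[i]
			break
		}
	}
	if main == nil {
		return
	}
	for i := range spec.InitContainers {
		ic := &spec.InitContainers[i]
		// Only default when the user set nothing, so explicit
		// requirements are never overridden.
		if len(ic.Resources.Requests) == 0 && len(ic.Resources.Limits) == 0 {
			ic.Resources = *main.Resources.DeepCopy()
		}
	}
}
```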

Contributor:

> I guess one downside of defaulting in the way you proposed would be that all user-specified initContainers would now also inherit the main container's resource requirements (if they don't specify their own), which would constitute a breaking change compared to current behaviour.

👍, if we want to implement a new behaviour I would do it in a dedicated PR, so it appears clearly in the release notes with the correct label on the PR.

@sebgl (Contributor) left a comment

I gave it a try locally and everything seems to work as expected :)
Let's not forget to write documentation in https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-troubleshooting-methods.html.

LGTM 🚢

@pebrc (Collaborator Author) commented Oct 18, 2021

> Let's not forget to write documentation in https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-troubleshooting-methods.html.

I was planning to do that in a separate PR, once we are happy with the approach, and for easier reviewing.

@barkbay (Contributor) left a comment

LGTM!

One thing I noticed is that, depending on the state of the cluster (I did some tests on "broken" clusters, which were returning 5xx), it can take a bit more than one minute for the ConfigMap to be updated and/or the Pods to be restarted. There is nothing new here; this is because of the rate-limited queue used in the controller. But I was wondering whether, in a debug/support situation, we might expect less latency between the expected state and the actual one, and whether it could be worth having a dedicated controller, not tied to the health of Elasticsearch. On second thought, it sounds like a very small improvement that might increase the complexity of the actual code, and "premature optimization is the root of all evil".

So 👍 with the proposed approach.
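
For context on that latency, a small sketch assuming the default client-go workqueue rate limiter that controller-runtime uses: the per-item backoff starts at 5ms and doubles on every consecutive failure, so around the fifteenth failed reconciliation the delay alone passes the one-minute mark (5ms · 2^14 ≈ 82s).

```go
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Per-item exponential backoff (5ms base, 1000s cap) combined with an
	// overall token bucket — the defaults behind controller requeue delays.
	limiter := workqueue.DefaultControllerRateLimiter()
	for i := 0; i < 15; i++ {
		// Each call counts another failure for the same item and returns
		// the delay before its next retry.
		fmt.Printf("retry %2d requeued after %v\n", i+1, limiter.When("es-reconcile"))
	}
}
```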

@pebrc pebrc merged commit c15a94e into elastic:master Oct 18, 2021
fantapsody pushed a commit to fantapsody/cloud-on-k8s that referenced this pull request Feb 7, 2023
Allow users to suspend Elasticsearch Pods for debugging purposes (elastic#4946)

Adds support for a new annotation, eck.k8s.elastic.co/suspend, that allows users to suspend the Pods listed in the annotation.

The implementation does not treat suspended Pods in any special way. That means most cluster operations, like upscaling or downscaling of master nodes and rolling upgrades of all nodes, cannot make progress while a node in the cluster is suspended.
Labels: >feature Adds or discusses adding a feature to the product · v1.9.0
Linked issue (may be closed by merging this pull request): Allow users to suspend the Elasticsearch process for debugging purposes
3 participants