
Allow users to suspend the Elasticsearch process for debugging purposes #4546

Closed
pebrc opened this issue Jun 2, 2021 · 5 comments · Fixed by #4946
Labels
>feature Adds or discusses adding a feature to the product

Comments

@pebrc
Collaborator

pebrc commented Jun 2, 2021

Currently it is not possible to debug or to run disaster recovery operations, like the elasticsearch-node tool, on clusters managed by ECK.

We should create a mechanism by which users can suspend the Elasticsearch process, so that users/admins can then exec into the container to run the necessary operations.

A possible implementation approach:

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch-sample
  annotations:
    eck.elastic.co/suspend: "pod-1,pod-2"
spec:
  version: 7.13.0

This could be implemented by mounting a ConfigMap into an init container of the Elasticsearch Pods, listing the Pods to be suspended; the init container would simply not terminate as long as the Pod's name is in that file (or as long as a file with the Pod's name exists). A sketch of what this could look like follows below.
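Purely as a sketch of that idea (the container name, mount path, and ConfigMap name are hypothetical, and it assumes the operator writes the annotation's comma-separated list as one Pod name per line), the relevant fragment of the Pod spec could look like:

initContainers:
  - name: suspend
    image: docker.elastic.co/elasticsearch/elasticsearch:7.13.0
    command:
      - bash
      - -c
      - |
        # Block Elasticsearch startup for as long as this Pod's name
        # is listed in the mounted ConfigMap file.
        while grep -qFx "$POD_NAME" /mnt/elastic-internal/suspend/suspended-pods 2>/dev/null; do
          sleep 10
        done
    env:
      - name: POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
    volumeMounts:
      - name: suspend-config
        mountPath: /mnt/elastic-internal/suspend
volumes:
  - name: suspend-config
    configMap:
      name: elasticsearch-sample-es-suspend
      optional: true   # Pods start normally when the ConfigMap is absent

With optional: true, grep fails immediately for Pods that are not listed (or when the file does not exist), so unsuspended Pods pay almost no startup cost.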

@pebrc pebrc added the discuss (We need to figure this out) and >feature (Adds or discusses adding a feature to the product) labels Jun 2, 2021
@pebrc pebrc removed the discuss (We need to figure this out) label Oct 5, 2021
@sebgl
Contributor

sebgl commented Oct 5, 2021

I think the approach @pebrc outlined above is probably the best thing we can do; I don't see a better way.

The only other alternative I see is to rely on a different container entrypoint where we can suspend the Elasticsearch process without stopping the container, but this feels very much against the spirit of containers.

Some "implementation details" to consider:

  • configmap granularity: per cluster vs. per StatefulSet? I'd argue per-cluster is simpler and leads to fewer extra K8s resources (a sketch of such a ConfigMap follows this list).
  • always run the init container vs. tweak the StatefulSet spec to add the init container only when requested? I think it's simpler to always have this init container in place in the StatefulSet spec: it adds a small overhead to startup time, but it completely decouples the stop/start decision from the StatefulSet spec, which then doesn't change.
  • I think we want the operator to reconcile the configmap according to the annotation (rather than users manipulating the configmap directly), but also have the operator delete the corresponding Pods immediately (rather than users having to edit the annotation and then delete the Pod). This feature is there for exceptional cases, where I think we don't mind a "brutal" Pod deletion much. From a reconciliation loop perspective, we need to handle the case where the configmap has been changed but the Pod has not yet been deleted (or has already been deleted :)). I think the latter can be simplified to: if the ES Pod status reports it has passed the initContainer, then delete the Pod (which will then be stuck in its init container); otherwise do nothing (either we're in the initContainer already, or it has not run yet).
  • What should the init container do? I guess run a bash infinite loop that mostly sleeps, but also checks the content of the mounted ConfigMap file every X seconds (something like 10 seconds seems reasonable, as we'd like this to be reactive enough but not necessarily super fast). If the configmap says the Pod should run, exit the init container process.
  • Annotation name: I like the eck.elastic.co/suspend suggestion.
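For illustration, the per-cluster ConfigMap reconciled from the eck.elastic.co/suspend annotation could then look like this (the name and key are the same hypothetical ones used in the init container sketch above):

apiVersion: v1
kind: ConfigMap
metadata:
  name: elasticsearch-sample-es-suspend
  namespace: default
data:
  # One Pod name per line; each init container greps for its own name.
  suspended-pods: |
    pod-1
    pod-2

When mounted as a volume, the suspended-pods key becomes a file under the mount path, which is what the init container loop would poll every 10 seconds.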

@pebrc
Collaborator Author

pebrc commented Oct 5, 2021

always run the init container vs. tweak the StatefulSet spec to add the init container only when requested? I think it's simpler to always have this init container in place in the StatefulSet spec: it adds a small overhead to startup time, but it completely decouples the stop/start decision from the StatefulSet spec, which then doesn't change.

Always running this initContainer would mean a rolling restart of all managed workloads on the ECK upgrade that ships that feature. I wonder if that is worth avoiding.

@sebgl
Contributor

sebgl commented Oct 5, 2021

Always running this initContainer would mean a rolling restart of all managed workloads on the ECK upgrade that ships that feature. I wonder if that is worth avoiding.

If we tweak the StatefulSet spec on demand, it means other Pods we're not interested in restarting will be upgraded/restarted with the new spec once to account for the other Pod stopping, and then a second time to account for that Pod no longer being stopped?

@pebrc pebrc self-assigned this Oct 12, 2021
@pebrc
Collaborator Author

pebrc commented Oct 12, 2021

One additional aspect is what should happen with regular reconciliation while a node is suspended. I see two ways:

  1. either we accept that a suspended node will stop almost all regular reconciliation attempts
  2. or we want to have it so that the suspended node is effectively ignored during regular reconciliation

The first option is pretty straightforward and will happen automatically: once a node is suspended, our predicate system will register it as "missing" from the cluster and will not make progress with most rolling upgrades or anything that is guarded by change budgets.

The second option would be quite intrusive: we would have to exclude the suspended nodes from all calculations, namely the expectations mechanism (StatefulSet reconciliation would never reach the expected state), the cached Elasticsearch membership state, Pod readiness checks during downscales, etc.

@sebgl
Contributor

sebgl commented Oct 13, 2021

I think we should treat the suspended node exactly as we already treat a regular Pod that is in its init container phase, no difference (I think that's your option 1).
