Allow users to suspend the Elasticsearch process for debugging purposes #4546
Comments
I think the approach @pebrc outlined above is probably the best thing we can do; I don't see a better way. The only other alternative I see is to rely on a different container entrypoint where we can suspend the Elasticsearch process without stopping the container, but this feels very much against the spirit of containers. Some "implementation details" to consider:
Always running this initContainer would mean a rolling restart of all managed workloads on the ECK upgrade that ships that feature. I wonder if that is worth avoiding.
If we tweak the StatefulSet spec on demand, does that mean the other Pods, which we are not interested in restarting, get restarted with the new spec twice: once when the target Pod is suspended, and a second time when it is no longer suspended?
One additional aspect is what should happen with regular reconciliation while a node is suspended. I see two ways:
The first option is pretty straightforward and will happen automatically: once a node is suspended, our predicate system will register it as "missing" from the cluster and will not make progress with most rolling upgrades or anything else that is guarded by change budgets. The second option would be quite intrusive: we would have to exclude the suspended nodes from all calculations, including the expectations mechanism (StatefulSet reconciliation will never reach the expected state), the cached Elasticsearch membership state, the Pod readiness check during downscales, etc.
I think we should treat the suspended node exactly as we already treat a regular Pod that is in its init container phase, no difference (I think that's your option 1).
Currently it is not possible to debug or run disaster recovery operations, such as the elasticsearch-node tool, on clusters managed by ECK.
We should create a mechanism by which users can request that the Elasticsearch process be suspended, so that users/admins can then exec into the container to run the necessary operations.
A possible implementation approach:
This could be implemented by mounting a ConfigMap into the init container of the Elasticsearch Pods with a list of Pods to be suspended. The init container would simply not terminate as long as the Pod's name is in that file, or as long as a file with the Pod's name exists.
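To make the idea concrete, here is a minimal sketch of the waiting loop such an init container could run. It is only an illustration, not the ECK implementation: the file format (one Pod name per line) and the names `suspended_pods` and `wait_while_suspended` are assumptions made for this example.

```python
import time
from pathlib import Path
from typing import Callable


def suspended_pods(list_file: Path) -> set[str]:
    """Return the Pod names listed in the mounted file.

    Assumed format (hypothetical): one Pod name per line, blank lines ignored.
    A missing file is treated as "no Pod is suspended".
    """
    try:
        text = list_file.read_text()
    except FileNotFoundError:
        return set()
    return {line.strip() for line in text.splitlines() if line.strip()}


def wait_while_suspended(
    pod_name: str,
    list_file: Path,
    poll_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Block for as long as pod_name is listed in list_file.

    Returns the number of polls that found the Pod still suspended.
    The file is re-read on every iteration so that an update to the
    mounted ConfigMap is eventually picked up.
    """
    polls = 0
    while pod_name in suspended_pods(list_file):
        polls += 1
        sleep(poll_seconds)
    return polls
```

An init container could call `wait_while_suspended` with its own Pod name (for example, injected through the downward API) and the path of the mounted file, then exit so that the Elasticsearch container starts. Updates to a mounted ConfigMap reach the Pod with some delay, so resuming would not be instantaneous.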