Handle empty NODE_ID in Elasticsearch PreStop hook #7892
Co-authored-by: Thibault Richard <thbkrkr@users.noreply.github.com>
You got me. Co-authored-by: Thibault Richard <thbkrkr@users.noreply.github.com>
buildkite test this
buildkite test this -f p=gke,E2E_TAGS=es -m s=7.17.8,s=8.14.1,s=8.15.0-SNAPSHOT
This breaks two e2e tests:
I don't know exactly what's going on yet.
The pre-stop hook incorrectly extracted the node ID, which created 2 shutdown records with different IDs (Q3SwszElnHxaJg and eUcnfdK-Q3SwszElnHxaJg).
Allow the '-' character.
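A minimal sketch of how a character class without '-' could truncate an ID such as eUcnfdK-Q3SwszElnHxaJg to Q3SwszElnHxaJg. The grep patterns, the pod names, and the second node ID are hypothetical illustrations, not the patterns used in the actual hook script.

```shell
#!/usr/bin/env bash
# Hypothetical sample of "_cat/nodes?full_id=true&h=id,name" output (id, then name).
resp='eUcnfdK-Q3SwszElnHxaJg es-default-0
aBcDeFgHiJkLmNoPqRsTuV es-default-1'
POD_NAME='es-default-0'

# Character class without '-': the match can only start after the dash,
# so just the tail of the ID is captured.
without_dash=$(printf '%s\n' "$resp" | grep -oE "[A-Za-z0-9_]+ ${POD_NAME}\$" | cut -f 1 -d ' ')

# Character class with '-' (placed last so it is literal): the full ID is captured.
with_dash=$(printf '%s\n' "$resp" | grep -oE "[A-Za-z0-9_-]+ ${POD_NAME}\$" | cut -f 1 -d ' ')

echo "without '-': $without_dash"
echo "with '-':    $with_dash"
```

If a truncated ID and a full ID are both sent to the shutdown API, two different shutdown records would be the expected outcome.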
buildkite test this -f p=gke,E2E_TAGS=es -m s=7.17.8,s=8.14.1,s=8.15.0-SNAPSHOT
Thank you @BobVanB!
buildkite test this -f p=gke,E2E_TAGS=es -m s=7.17.8,s=8.14.1,s=8.15.0-SNAPSHOT
buildkite test this
Thank you very much for the contribution and for your patience @BobVanB.
The main problem
When upgrading an image with some other plugin, the operator will terminate each pod and try to remove it from the ES cluster.
This piece of code can leave NODE_ID empty:
NODE_ID=$(grep "$POD_NAME" "$resp_body" | cut -f 1 -d ' ')
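A small sketch of how the line above ends up empty when the pod name is not in the response. The response content and pod names are hypothetical sample data.

```shell
#!/usr/bin/env bash
# Hypothetical response file that lists another node but not the terminating pod.
resp_body=$(mktemp)
printf '%s\n' 'eUcnfdK-Q3SwszElnHxaJg es-default-1' > "$resp_body"
POD_NAME='es-default-0'

# Same extraction as in the hook script: grep finds no line, cut prints nothing.
NODE_ID=$(grep "$POD_NAME" "$resp_body" | cut -f 1 -d ' ')
rm -f "$resp_body"

if [ -z "$NODE_ID" ]; then
  echo "NODE_ID is empty" >&2
fi
```

Without pipefail, the pipeline's exit status is the one from cut, so the assignment itself does not signal the failure.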
Result
With an empty NODE_ID, the _nodes and shutdown call fails with error_exit "failed to call node shutdown API" and a shutdown is never called. This results in the same pod being recreated and the whole process starting all over from the top.
What I still want to know
Is the node removed before calling _cat/nodes?
When the node is terminated and pre-stop-hook-script.sh is called, is it possible that the node has already been removed from the _cat/nodes query result? Or is it possible that the query ends up on the terminating node and doesn't return a result?
This piece of code returns the list of nodes, and I wonder whether, once the pod is terminating, the node is already absent from this list of active nodes. I have no basis for this claim yet: I have not confirmed whether NODE_ID is empty because the other nodes in the cluster no longer see the node that is being terminated.
request -X GET "${ES_URL}/_cat/nodes?full_id=true&h=id,name"
Why is terminationGracePeriodSeconds way less than the possible script run time?
The default terminationGracePeriodSeconds is 180 seconds.
The script also has 2 retry 10 calls, which use 2 ** count as the wait.
This can result in a lot of wait time:
round 1: 1 second
round 2: 1 second of previous round + 1 + 2 = 4 seconds
round 3: 4 seconds of previous rounds + 1 + 2 + 4 = 11 seconds
...
round 9: 502 seconds of the previous rounds + 1 + 2 + 4 + 8 + 16 + 32 + 64 + 128 + 256 seconds = 1013 seconds, roughly 17 minutes
Should retry 10 be way less, something like retry 8, and get "retry 8/8 exited 1, no more retries left"?
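As a rough cross-check of the wait times, here is a sketch that sums the sleeps of a single retry call. It assumes each failed attempt sleeps 2 ** count seconds once (count starting at 0) and that no sleep follows the last attempt; total_wait is a hypothetical helper, not part of the hook script.

```shell
#!/usr/bin/env bash
# total_wait RETRIES: upper bound on the seconds slept by one retry call,
# assuming waits of 1, 2, 4, ... seconds and no sleep after the final attempt.
total_wait() {
  local retries=$1 total=0 count=0
  while [ $((count + 1)) -lt "$retries" ]; do
    total=$((total + 2 ** count))
    count=$((count + 1))
  done
  echo "$total"
}

GRACE=180  # default terminationGracePeriodSeconds mentioned above
for n in 8 10; do
  echo "retry $n: up to $(total_wait "$n") seconds of sleep (grace period: $GRACE seconds)"
done
```

Under these assumptions one retry 10 call can sleep up to 511 seconds and one retry 8 call up to 127 seconds, so two retry 10 calls in sequence would exceed the grace period several times over.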
What has been done
I tried not to rewrite it all.
I want to know if this should use a retry 3 or just error_exit "failed to retrieve node ID".
After cleanup, it looks like this was not needed.
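For reference, a minimal sketch of the error_exit variant of that guard. The error_exit function here is a hypothetical stand-in for the script's helper, get_node_id is a hypothetical function, and the exact-match awk lookup is a swapped-in technique rather than what the hook script does.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the script's error_exit helper.
error_exit() {
  echo "$1" >&2
  exit 1
}

# get_node_id POD_NAME FILE: print the first column of the line whose
# second column equals POD_NAME exactly (prints nothing if there is none).
get_node_id() {
  awk -v pod="$1" '$2 == pod { print $1; exit }' "$2"
}

# Hypothetical sample response: id, then name.
resp_body=$(mktemp)
printf '%s\n' 'eUcnfdK-Q3SwszElnHxaJg es-default-0' > "$resp_body"

NODE_ID=$(get_node_id "es-default-0" "$resp_body")
rm -f "$resp_body"

# Fail early instead of calling the shutdown API with an empty ID.
if [ -z "$NODE_ID" ]; then
  error_exit "failed to retrieve node ID"
fi
echo "node ID: $NODE_ID"
```

Failing early keeps the error message specific to the lookup step instead of surfacing later as a failed shutdown call.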
PoC Result
Added some debug information to prove that the script is working.
Will add that it is not fun to debug the bash script without 'set -x'.
What has not been done
...