
Fix readiness script in case of operator upgrade #2208

Merged — 1 commit into elastic:master from fix-readiness-script on Dec 4, 2019

Conversation

@barkbay (Contributor) commented on Dec 3, 2019

If some Elasticsearch clusters have been deployed with 1.0.0-beta1 and the operator is then upgraded, all Pods suddenly become not ready:

NAME                             READY   STATUS    RESTARTS   AGE   IP           NODE                                          NOMINATED NODE   READINESS GATES
pod/es-apm-sample-es-default-0   0/1     Running   0          19m   10.32.1.11   gke-michael-dev3-default-pool-4eb026fe-dt6p   <none>           <none>
pod/es-apm-sample-es-default-1   0/1     Running   0          19m   10.32.0.12   gke-michael-dev3-default-pool-84228e30-qgpl   <none>           <none>
pod/es-apm-sample-es-default-2   0/1     Running   0          19m   10.32.2.12   gke-michael-dev3-default-pool-de22c45a-wsx9  

The reason is that the readiness script was updated in #2180 and is propagated dynamically to all the Pods through a ConfigMap, while PROBE_PASSWORD_PATH is not set on most of them because they were created before the upgrade.
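
For context, the defensive pattern that avoids this looks roughly like the sketch below. This is not the literal patch from this PR: the probe user, endpoint, and fallback behavior are illustrative assumptions.

#!/usr/bin/env bash
# Sketch: tolerate PROBE_PASSWORD_PATH being unset or pointing at a missing
# file, as on Pods created before the env var was added to the pod template.
AUTH_ARGS=()
if [[ -n "${PROBE_PASSWORD_PATH:-}" && -f "${PROBE_PASSWORD_PATH}" ]]; then
  # New Pods: authenticate with the probe password read from the file.
  # "probe-user" is a placeholder name, not taken from the PR.
  AUTH_ARGS=(-u "probe-user:$(<"${PROBE_PASSWORD_PATH}")")
fi
# Old Pods fall through with no auth args, i.e. the pre-#2180 behavior.
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" "${AUTH_ARGS[@]}" "http://127.0.0.1:9200/")
if [[ "${HTTP_CODE}" -lt 200 || "${HTTP_CODE}" -ge 300 ]]; then
  exit 1
fi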

@barkbay added the >bug (Something isn't working) and v1.0.0 labels on Dec 3, 2019
@anyasabo (Contributor) commented on Dec 3, 2019

This change LGTM. Just to make sure I understand the current behavior:

  • the referenced PR updates the pod template to include the PROBE_PASSWORD_PATH env var
  • the readiness script is also updated to use the new env var

The update to the pod template begins the rolling upgrade process. But as soon as the readiness script ConfigMap is updated, all of the pods get the updated script, even the ones that have not been restarted yet and therefore do not have the new env var. As such they fail the readiness check until they are restarted, but the rolling upgrade probably also does not proceed because too many pods are not ready. Because we only define a readiness probe and not a liveness probe, k8s never restarts the pods either. Is that correct?
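
For illustration, this asymmetry can be observed directly on a pre-upgrade Pod. The mount path below is an assumption for the sketch, not taken from this PR:

# The readiness script is mounted from a ConfigMap, so the file changes
# in place inside running containers shortly after the ConfigMap update:
kubectl exec es-apm-sample-es-default-0 -- \
  cat /mnt/elastic-internal/scripts/readiness-probe-script.sh
# Environment variables are only set at container start, so a Pod created
# before the upgrade still has no PROBE_PASSWORD_PATH:
kubectl exec es-apm-sample-es-default-0 -- \
  sh -c 'echo "PROBE_PASSWORD_PATH=${PROBE_PASSWORD_PATH:-<unset>}"'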

@barkbay (Author) commented on Dec 4, 2019

Is that correct?

👍 on your analysis. The Pods are eventually restarted as part of the upgrade, so the situation is fixed in the end. Nevertheless, there is a disruption from the user's point of view.

@barkbay self-assigned this on Dec 4, 2019
@sebgl (Contributor) left a comment


Good catch!

@pebrc (Collaborator) left a comment


Thanks for cleaning up after me :-)

@barkbay merged commit 774c58b into elastic:master on Dec 4, 2019
@barkbay deleted the fix-readiness-script branch on December 4, 2019 at 08:16