Fix Kibana to terminate all Pods before restarting during version change #2137
Conversation
I've realized that the approach to avoid the stale cache issue causes a problem in the following sequence of events:
The state after those steps is:
We'd have multiple versions and multiple ReplicaSets, so the operator will keep requeuing, but it will do so indefinitely because the bad replicaset2 won't ever finish its rollout. In such a case the user will need to first revert back to the good config at v1 and then update to the good config at v2. There is a trade-off between the above and the potential unnecessary downtime when we hit a stale cache at just the right time. Thoughts?
LGTM. Regarding the question around stale caches and unnecessary downtime vs. stuck upgrades: maybe you could somehow surface this condition in an event + log entry? Otherwise it will be very hard to figure out how to make progress in such a situation.
Another option would be to validate version upgrades by checking if there are any bootlooping pods in the Kibana deployment and rejecting the upgrade if there are any, but that of course is also susceptible to cache staleness itself and might not catch all cases. It also makes the validation webhook quite heavyweight.
replicaset2 won't ever finish its rollout
It may eventually finish because the config in the secret is updated independently of the Deployment.
That said, imho it is a little bit odd from a user PoV to wait for such a "timeout" or to not be able to upgrade and update the config at the same time.
My feeling is that we are trying to make Kibana "rolling upgrade friendly" while it inherently is not. It is an optimization; having some unavailability in some edge cases should be fine.
I think it's difficult: you could do it every time, but then some of the logs would be false positives (i.e. we detected it being stuck, but it's not). Or we could track how long this has been happening, but that is also susceptible to errors, needs state, etc.
I think I agree this is hard. Especially for cases where we have config updates coming quickly, there might not be enough time to validate. Given the impact/time needed for this work, I think I'd do what @barkbay is proposing, i.e. allow Kibana downtime in some cases when we have a stale cache. WDYT?
@david-kow sure. Let's iterate on that and see how often this actually becomes a problem in practice. But maybe it's worth exploring an e2e test that executes such a scenario?
instead of risking getting stuck
@pebrc Will do. As to the e2e tests, I don't think it's easily reproducible as it depends on cache timing.
@david-kow true, but I guess we could have an e2e test that exercises regular updates and version upgrades in succession, which might or might not run into the suspected cache staleness, but it is a good test to have anyway.
Overall LGTM.
I agree with the others here: I think it is safe for us to ignore the corner case of users applying 2 consecutive mutations of Kibana, one being a version upgrade. In such a case, downtime is caused and expected by the version upgrade, and may overlap with other configuration changes made around the same time. I don't think we should optimize more than that, for the sake of code simplicity.
Can we add an E2E test that does a Kibana version upgrade? We have some E2E tests covering ES version upgrades already. It would be nice for the test to:
- create Kibana with 3 replicas in version A
- upgrade it to 3 replicas in version B
- continuously check during the mutation that we never have 2 different versions of Kibana running at the same time, using the non-cached client
- ensure the deployment strategy is correctly set back to RollingUpdate at the end
I think there are a few details to improve. Happy to discuss them. We're getting there! 💪
LGTM, nice work!
LGTM, great work!
I left a few small nit-picks.
Updated the description to reflect changes we agreed on in the comments.
Description
This PR fixes #2049.
During a version upgrade, all Kibana instances running the old version have to be stopped before starting any instance with the new version. To do that, the operator sets the Recreate strategy on the Kibana Deployment when it detects that a version upgrade is in progress. To avoid Kibana downtime for all other spec changes, the RollingUpdate strategy is used when the version doesn't change.
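For illustration only, a minimal Go sketch of how such a strategy switch could look (function and argument names are hypothetical, not the actual ECK code):

```go
package kibana

import appsv1 "k8s.io/api/apps/v1"

// deploymentStrategy picks Recreate while a version upgrade is in progress
// and RollingUpdate otherwise.
func deploymentStrategy(versionUpgradeInProgress bool) appsv1.DeploymentStrategy {
	if versionUpgradeInProgress {
		// all old-version pods must terminate before any new-version pod starts
		return appsv1.DeploymentStrategy{Type: appsv1.RecreateDeploymentStrategyType}
	}
	// regular spec changes roll over without downtime
	return appsv1.DeploymentStrategy{Type: appsv1.RollingUpdateDeploymentStrategyType}
}
```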
Detection
The operator needs to be consistent throughout the entire upgrade process even if the reconciliation loop is run multiple times. To do that, the operator looks at the current spec and at the versions of all existing Kibana pods. If the version is the same everywhere, the RollingUpdate strategy is used; otherwise the Recreate strategy is used.
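A rough sketch of that detection, assuming the version label described in the next section (function name is hypothetical):

```go
package kibana

import corev1 "k8s.io/api/core/v1"

// versionLabelName is the pod label carrying the running Kibana version.
const versionLabelName = "kibana.k8s.elastic.co/version"

// isVersionUpgradeInProgress returns true if any existing pod reports a
// version different from the spec, or has no version label at all.
func isVersionUpgradeInProgress(specVersion string, pods []corev1.Pod) bool {
	for _, pod := range pods {
		podVersion, ok := pod.Labels[versionLabelName]
		if !ok || podVersion != specVersion {
			return true
		}
	}
	return false
}
```

Treating a missing label as a potential version change also gives the "safety first" default discussed in the Upgrade case section below.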
Pod version
A new label is introduced on Kibana pods. It contains the Kibana version that the pod is running. The need for a new label comes from the fact that there is no other source of truth available:
Given the above, the kibana.k8s.elastic.co/version label was introduced. Both ReplicaSets and Pods have it, but it's read from the Pods.
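As an illustration (not the actual implementation), stamping the label onto the pod template could look like this:

```go
package kibana

import corev1 "k8s.io/api/core/v1"

// withVersionLabel stamps the Kibana version onto the pod template, so every
// Pod (and the ReplicaSet generated from the template) carries the label.
func withVersionLabel(tpl corev1.PodTemplateSpec, version string) corev1.PodTemplateSpec {
	if tpl.Labels == nil {
		tpl.Labels = map[string]string{}
	}
	tpl.Labels["kibana.k8s.elastic.co/version"] = version
	return tpl
}
```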
Upgrade case
When the operator version containing the new label is deployed in a k8s cluster with an already existing Kibana deployment, the version label will not be present. This means that the operator can't reason about which deployment strategy to use. Given that, we have two options for the default behavior:
- default to RollingUpdate (risking a rolling restart across two versions if an upgrade is in fact in progress)
- default to Recreate (risking unnecessary downtime for a non-upgrade change)
It seems to me we should aim for safety first, hence I went with the Recreate strategy by default.
Stale cache considerations
If our client cache is stale, the operator might not see the actual state, but an outdated one. I think there is just one case where we won't get the expected behavior.
Actual version (version in cache):
To avoid the issue above, we inspect the number of ReplicaSets for a given Deployment. If there is more than one and they carry multiple versions, we requeue and wait until the state stabilizes, i.e. all pods are owned by a single ReplicaSet.
EDIT: We decided that this potential additional downtime is acceptable, as it has a low probability of happening and the proposed solution could get stuck.
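For completeness, a rough sketch of the ReplicaSet check described above (names are hypothetical, and per the EDIT this extra safeguard was ultimately not kept):

```go
package kibana

import appsv1 "k8s.io/api/apps/v1"

// shouldRequeue reports whether the Deployment still owns more than one
// ReplicaSet with differing version labels, in which case we would requeue
// and wait until all pods are owned by a single ReplicaSet.
func shouldRequeue(replicaSets []appsv1.ReplicaSet) bool {
	if len(replicaSets) <= 1 {
		return false
	}
	versions := map[string]struct{}{}
	for _, rs := range replicaSets {
		versions[rs.Labels["kibana.k8s.elastic.co/version"]] = struct{}{}
	}
	return len(versions) > 1
}
```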