Fix Kibana to terminate all Pods before restarting during version change #2137
Conversation
I've realized that the approach to avoid the stale cache issue causes a problem in the following sequence of events:
The state after those steps is:
We'd have multiple versions and multiple ReplicaSets, so the operator will keep requeuing, but it will do so indefinitely because the bad replicaset2 won't ever finish its rollout. In such a case the user will need to first revert back to the good config at v1 and then update to the good config at v2. There is a trade-off between the above and the potential unnecessary downtime when we hit a stale cache at just the right time. Thoughts?
LGTM. Regarding the question around stale caches and unnecessary downtime vs. stuck upgrades: maybe you could somehow surface this condition in an event + log entry? Otherwise it will be very hard to figure out how to make progress in such a situation.
Another option would be to validate version upgrades by checking if there are any bootlooping pods in the Kibana deployment and rejecting the upgrade if there are any, but that of course is also susceptible to cache staleness itself and might not catch all cases. It also makes the validation webhook quite heavyweight.
replicaset2 won't ever finish its rollout
It may eventually finish because the config in the secret is updated independently of the Deployment.
That said, imho it is a little bit odd from a user PoV to wait for such a "timeout" or to not be able to upgrade and update the config at the same time.
My feeling is that we are trying to make Kibana "rolling upgrade friendly" while it inherently is not. It is an optimization; having some unavailability in some edge cases should be fine.
I think it's difficult: you could do it every time, but then some of the logs would be false positives (i.e. we detected it being stuck, but it's not). Or we could track how long this has been happening, but that is also susceptible to errors, needs state, etc.
I think I agree this is hard. Especially for cases where we have config updates coming quickly, there might not be enough time to validate. Given the impact/time needed for this work, I think I'd do what @barkbay is proposing, i.e. allow Kibana downtime in some cases when we have a stale cache. WDYT?
@david-kow sure. Let's iterate on that and see how often this actually becomes a problem in practice. But maybe it's worth exploring an e2e test that executes such a scenario?
instead of risking getting stuck
@pebrc Will do. As to the e2e tests, I don't think it's easily reproducible as it depends on cache timing.
@david-kow true, but I guess we could have an e2e test that exercises regular updates and version upgrades in succession, which might or might not run into the suspected cache staleness, but it is a good test to have anyway.
Overall LGTM.
I agree with the others here: I think it is safe for us to ignore the corner case of users applying 2 consecutive mutations of Kibana, one being a version upgrade. In such a case, downtime is caused and expected by the version upgrade, and may overlap with other configuration changes made around the same time. I don't think we should optimize more than that, for the sake of code simplicity.
Can we add an E2E test that does a Kibana version upgrade? We have some E2E tests covering ES version upgrades already. It would be nice for the test to:
- create Kibana with 3 replicas in version A
- upgrade it to 3 replicas in version B
- continuously check during the mutation that we never have 2 different versions of Kibana running at the same time, using the non-cached client
- ensure the deployment strategy is correctly set back to RollingUpdate at the end
I think there are a few details to improve. Happy to discuss them. We're getting there! 💪
LGTM, nice work!
LGTM, great work!
I left a few small nit-picks.
Updated the description to reflect changes we agreed on in the comments.
Description
This PR fixes #2049.
During a version upgrade, all Kibana instances running the old version have to be stopped before starting any instance with the new version. To do that, the operator sets the Recreate strategy on the Kibana Deployment when it detects that a version upgrade is in progress. To avoid Kibana downtime for all other spec changes, the RollingUpdate strategy is used when the version doesn't change.
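For illustration only, a minimal Go sketch of how such a strategy switch could look (function and argument names are hypothetical, not the actual ECK code):

```go
package kibana

import appsv1 "k8s.io/api/apps/v1"

// deploymentStrategy picks Recreate while a version upgrade is in progress
// and RollingUpdate otherwise.
func deploymentStrategy(versionUpgradeInProgress bool) appsv1.DeploymentStrategy {
	if versionUpgradeInProgress {
		// all old-version pods must terminate before any new-version pod starts
		return appsv1.DeploymentStrategy{Type: appsv1.RecreateDeploymentStrategyType}
	}
	// regular spec changes roll over without downtime
	return appsv1.DeploymentStrategy{Type: appsv1.RollingUpdateDeploymentStrategyType}
}
```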
Detection
The operator needs to be consistent throughout the entire upgrade process even if the reconciliation loop is run multiple times. To do that, the operator looks at the current spec and at the versions of all existing Kibana pods. If the version is the same everywhere, the RollingUpdate strategy is used; otherwise the Recreate strategy is used.
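A rough sketch of that detection, assuming the version label described in the next section (function name is hypothetical):

```go
package kibana

import corev1 "k8s.io/api/core/v1"

// versionLabelName is the pod label carrying the running Kibana version.
const versionLabelName = "kibana.k8s.elastic.co/version"

// isVersionUpgradeInProgress returns true if any existing pod reports a
// version different from the spec, or has no version label at all.
func isVersionUpgradeInProgress(specVersion string, pods []corev1.Pod) bool {
	for _, pod := range pods {
		podVersion, ok := pod.Labels[versionLabelName]
		if !ok || podVersion != specVersion {
			return true
		}
	}
	return false
}
```

Treating a missing label as a potential version change also gives the "safety first" default discussed in the Upgrade case section below.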
Pod version
A new label is introduced on Kibana pods. It contains the Kibana version that the pod is running. The need for a new label comes from the fact that there is no other source of truth available:
Given the above, the kibana.k8s.elastic.co/version label was introduced. Both ReplicaSets and Pods have it, but it's read from the Pods.
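As an illustration (not the actual implementation), stamping the label onto the pod template could look like this:

```go
package kibana

import corev1 "k8s.io/api/core/v1"

// withVersionLabel stamps the Kibana version onto the pod template, so every
// Pod (and the ReplicaSet generated from the template) carries the label.
func withVersionLabel(tpl corev1.PodTemplateSpec, version string) corev1.PodTemplateSpec {
	if tpl.Labels == nil {
		tpl.Labels = map[string]string{}
	}
	tpl.Labels["kibana.k8s.elastic.co/version"] = version
	return tpl
}
```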
Upgrade case
When the operator version containing the new label is deployed in a k8s cluster with an already existing Kibana deployment, the version label will not be present. This means that the operator can't reason about which deployment strategy to use. Given that, we have two options for the default behavior:
- default to RollingUpdate (risking a rolling restart across two versions if an upgrade is in fact in progress)
- default to Recreate (risking unnecessary downtime for a non-upgrade change)
It seems to me we should aim for safety first, hence I went with the Recreate strategy by default.
Stale cache considerations
If our client cache is stale, the operator might not see the actual state, but an outdated one. I think there is just one case where we won't get the expected behavior.
Actual version (version in cache):
To avoid the issue above, we inspect the number of ReplicaSets for a given Deployment. If there is more than one and they carry multiple versions, we requeue and wait until the state stabilizes, i.e. all pods are owned by a single ReplicaSet.
EDIT: We decided that this potential additional downtime is acceptable, as it has a low probability of happening and the proposed solution could get stuck.
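For completeness, a rough sketch of the ReplicaSet check described above (names are hypothetical, and per the EDIT this extra safeguard was ultimately not kept):

```go
package kibana

import appsv1 "k8s.io/api/apps/v1"

// shouldRequeue reports whether the Deployment still owns more than one
// ReplicaSet with differing version labels, in which case we would requeue
// and wait until all pods are owned by a single ReplicaSet.
func shouldRequeue(replicaSets []appsv1.ReplicaSet) bool {
	if len(replicaSets) <= 1 {
		return false
	}
	versions := map[string]struct{}{}
	for _, rs := range replicaSets {
		versions[rs.Labels["kibana.k8s.elastic.co/version"]] = struct{}{}
	}
	return len(versions) > 1
}
```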