
Fix Kibana to terminate all Pods before restarting during version change #2137

Merged: 7 commits merged into elastic:master from the kibana_upgrades branch on Nov 27, 2019

Conversation

@david-kow (Contributor) commented Nov 20, 2019:

Description

This PR fixes #2049.
During a version upgrade, all Kibana instances running the old version have to be stopped before any instance with the new version is started. To do that, the operator sets the Recreate strategy on the Kibana Deployment when it detects that a version upgrade is in progress. To avoid Kibana downtime for all other spec changes, the RollingUpdate strategy is used when the version doesn't change.
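For illustration, a minimal sketch of this strategy selection (identifiers are illustrative, not the PR's actual code; the real logic is likely in pkg/controller/kibana/driver.go, which the review threads reference):

```go
package kibana

import (
	appsv1 "k8s.io/api/apps/v1"
)

// deploymentStrategy picks the update strategy for the Kibana Deployment.
// Recreate guarantees all old-version Pods are terminated before any
// new-version Pod starts; RollingUpdate avoids downtime for every other
// spec change.
func deploymentStrategy(versionUpgradeInProgress bool) appsv1.DeploymentStrategy {
	if versionUpgradeInProgress {
		return appsv1.DeploymentStrategy{Type: appsv1.RecreateDeploymentStrategyType}
	}
	return appsv1.DeploymentStrategy{Type: appsv1.RollingUpdateDeploymentStrategyType}
}
```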

Detection

The operator needs to be consistent throughout the entire upgrade process, even if the reconciliation loop is run multiple times. To do that, the operator looks at the current spec and the versions of all existing Kibana Pods. If the version is the same everywhere, the RollingUpdate strategy is used; otherwise the Recreate strategy is used.
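A minimal sketch of that detection, assuming the spec version and the Pod label described in the next section (identifiers are illustrative, not the actual implementation):

```go
package kibana

import (
	corev1 "k8s.io/api/core/v1"
)

// versionLabel is the Pod label introduced by this PR; the constant name is illustrative.
const versionLabel = "kibana.k8s.elastic.co/version"

// versionUpgradeInProgress returns true if any existing Pod does not carry the
// version from the current Kibana spec, in which case the Recreate strategy is
// used until the rollout has converged.
func versionUpgradeInProgress(specVersion string, pods []corev1.Pod) bool {
	for _, pod := range pods {
		if pod.Labels[versionLabel] != specVersion {
			return true
		}
	}
	return false
}
```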

Pod version

A new label is introduced on Kibana Pods. It contains the Kibana version that the Pod is running. The need for a new label comes from the fact that there is no other source of truth available:

  • Kibana spec has a version, but Pods will outlive any particular version of that spec, so it's not possible to trace back whether the spec was updated
  • Deployment has the same "outliving" issue as the Kibana spec
  • Both ReplicaSet and Pod currently have a config checksum and a container image. The former changes on every config change, and since we only care about version changes it's not useful. The latter can be user-provided and might not indicate the version in its name.

Given the above, the kibana.k8s.elastic.co/version label was introduced. Both ReplicaSets and Pods carry it, but it is read from the Pods.
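A hedged sketch of how such a label could be stamped onto the Pod template labels (the helper name is hypothetical):

```go
package kibana

// versionLabel is the new label described above; the constant name is illustrative.
const versionLabel = "kibana.k8s.elastic.co/version"

// withVersionLabel stamps the Kibana version onto the labels used for the
// Deployment's Pod template, so the ReplicaSets and Pods created from it
// carry the version they run.
func withVersionLabel(labels map[string]string, version string) map[string]string {
	out := make(map[string]string, len(labels)+1)
	for k, v := range labels {
		out[k] = v
	}
	out[versionLabel] = version
	return out
}
```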

Upgrade case

When the operator version that introduces the new label is deployed in a Kubernetes cluster with an already existing Kibana deployment, the version label will not be present on the existing Pods. This means the operator can't reason about which deployment strategy to use. Given that, we have two options for the default behavior:

  • use the RollingUpdate strategy - better experience for users, but a slight risk of data corruption/loss (if the Kibana spec is updated while the operator is not running),
  • use the Recreate strategy - Kibana will have downtime on operator upgrade, but there is no risk of data corruption/loss.

It seems to me we should aim for safety first, hence I went with the Recreate strategy by default.

Stale cache considerations

If our client cache is stale, the operator might not see the actual state but an outdated one. I think there is just one case where we won't get the expected behavior.

Actual version (version in cache):

spec  | pods
A (A) | A (A), A (A), A (A)
-> version upgrade from A to B
B (A) | A (A), A (A), A (A)
B (B) | A (A), A (A), A (A)
B (B) | B (A), B (A), B (A)
B (B) | B (B), B (B), B (A)
-> non-version config change
  • spec version gets updated from A to B
  • all Pods get updated to B, but the cache still contains a stale entry with one Pod at version A
  • another change to the spec comes in; it's not related to the version, but since we read the stale cache we consider that a version upgrade is still in progress and use the Recreate strategy
  • Kibana has downtime even though it wasn't necessary

To avoid the issue above, we inspect the number of ReplicaSets for the given Deployment. If there is more than one and they carry multiple versions, we requeue and wait until it stabilizes, i.e. all Pods are owned by a single ReplicaSet.
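A rough sketch of this originally proposed guard (see the EDIT below: the approach was ultimately not kept); identifiers are illustrative:

```go
package kibana

import (
	appsv1 "k8s.io/api/apps/v1"
)

const versionLabel = "kibana.k8s.elastic.co/version"

// stillRollingOut reports whether the Deployment currently owns ReplicaSets
// that disagree on the version label, in which case the reconciliation would
// be requeued rather than trusting possibly stale Pod data from the cache.
func stillRollingOut(replicaSets []appsv1.ReplicaSet) bool {
	versions := map[string]struct{}{}
	for _, rs := range replicaSets {
		versions[rs.Labels[versionLabel]] = struct{}{}
	}
	return len(replicaSets) > 1 && len(versions) > 1
}
```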

EDIT: We decided that this potential additional downtime is acceptable, as it has a low probability of happening, and the proposed solution can get stuck.

@david-kow david-kow added the >bug Something isn't working label Nov 20, 2019
@david-kow (Contributor, Author) commented:

I've realized that the approach to avoid the stale cache issue causes a problem in the following sequence of events:

  1. good Kibana spec running
  2. spec gets updated to bad (bootlooping) config without changing version (v1)
  3. spec gets updated to good config with version change (v2)

The state after those steps is:

  • spec at v2
  • good replicaset1 at v1
  • bad replicaset2 at v1

We'd have multiple versions and multiple ReplicaSets, so the operator would keep requeuing, and it would do so indefinitely because the bad replicaset2 will never finish its rollout. In such a case the user would need to first revert to the good config at v1 and then update to the good config at v2.

There is a trade-off between the above and the potential unnecessary downtime when we hit a stale cache at the wrong moment. Thoughts?

@pebrc (Collaborator) left a comment:

LGTM. Regarding the question around stale caches, unnecessary downtime vs. stuck upgrades: maybe you could somehow surface this condition in an event + log entry? Otherwise it will be very hard to figure out how to make progress in such a situation.

Another option would be to validate version upgrades by checking if there are any bootlooping Pods in the Kibana deployment and rejecting the upgrade if there are any, but that of course is also susceptible to cache staleness itself and might not catch all cases. It also makes the validation webhook quite heavyweight.

@barkbay (Contributor) left a comment:

> replicaset2 won't ever finish its rollout

It may eventually finish because the config in the secret is updated independently of the Deployment.

That said, imho it is a little bit odd from a user pov to wait for such a "timeout", or to not be able to upgrade and update the config at the same time.

My feeling is that we are trying to make Kibana "rolling upgrade friendly" while it inherently is not. This is an optimization; having some unavailability in some edge cases should be fine.

@david-kow (Contributor, Author) commented Nov 21, 2019:

@pebrc

> Maybe you could somehow surface this condition in an event+log entry?

I think it's difficult - you could do it every time, but then some of the logs would be false positives (i.e. we detect it as stuck, but it isn't). Or we could track how long this has been happening, but that is also susceptible to errors, needs state, etc.

> Another option would be to validate version upgrades by checking if there are any bootlooping pods in the Kibana and rejecting the upgrade if there are any

I think I agree this is hard. Especially for cases where config updates come in quickly, there might not be enough time to validate.

Given the impact/time needed for this work, I think I'd do what @barkbay is proposing, i.e. allow Kibana downtime in some cases when we have a stale cache. WDYT?

@pebrc (Collaborator) left a comment:

@david-kow sure. Let's iterate on that and see how often this actually becomes a problem in practice. But maybe it's worth exploring an e2e test that executes such a scenario?

@david-kow (Contributor, Author) commented:

@pebrc Will do. As to the e2e tests, I don't think it's easily reproducible as it depends on cache timing.

@pebrc (Collaborator) commented Nov 21, 2019:

> As to the e2e tests, I don't think it's easily reproducible as it depends on cache timing.

@david-kow true, but I guess we could have an e2e test that exercises regular updates and version upgrades in succession, which might or might not run into the suspected cache staleness, but is a good test to have anyway.

@sebgl (Contributor) left a comment:

Overall LGTM.

I agree with the others here: I think it is safe for us to ignore the corner case of users applying two consecutive mutations of Kibana, one of them being a version upgrade. In such a case, downtime is caused and expected by the version upgrade, and may overlap with other configuration changes made around the same time. I don't think we should optimize more than that, for the sake of code simplicity.

Can we add an E2E test that does a Kibana version upgrade? We already have some E2E tests covering ES version upgrades. It would be nice for the test to:

  • create Kibana with 3 replicas in version A
  • upgrade it to 3 replicas in version B
  • continuously check during the mutation that we never have 2 different versions of Kibana running at the same time, using the non-cached client (a rough sketch of such a check follows this list)
  • ensure the deployment strategy is correctly set back to RollingUpdate at the end
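A rough sketch of that continuous check, assuming a non-cached controller-runtime client and the labels from this PR; the helper is hypothetical, not the actual e2e test code:

```go
package kibana

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// assertSingleKibanaVersion lists the Kibana Pods with a non-cached client and
// fails if Pods with two different version labels exist at the same time.
func assertSingleKibanaVersion(ctx context.Context, c client.Client, namespace, kibanaName string) error {
	var pods corev1.PodList
	if err := c.List(ctx, &pods,
		client.InNamespace(namespace),
		client.MatchingLabels{"kibana.k8s.elastic.co/name": kibanaName},
	); err != nil {
		return err
	}
	versions := map[string]struct{}{}
	for _, pod := range pods.Items {
		versions[pod.Labels["kibana.k8s.elastic.co/version"]] = struct{}{}
	}
	if len(versions) > 1 {
		return fmt.Errorf("pods with multiple Kibana versions running at the same time: %v", versions)
	}
	return nil
}
```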

@david-kow david-kow requested review from pebrc, anyasabo, sebgl and barkbay and removed request for anyasabo November 26, 2019 09:18
@sebgl (Contributor) left a comment:

I think there are a few details to improve. Happy to discuss them. We're getting there! 💪

@pebrc (Collaborator) left a comment:

LGTM, nice work!

@sebgl (Contributor) left a comment:

LGTM, great work!
I left a few small nit-picks.

@david-kow david-kow merged commit f440db7 into elastic:master Nov 27, 2019
@david-kow david-kow deleted the kibana_upgrades branch November 27, 2019 21:02
@david-kow (Contributor, Author) commented:

Updated the description to reflect changes we agreed on in the comments.

@thbkrkr thbkrkr changed the title Fix Kibana to terminate all pods before restarting during version change Fix Kibana to terminate all Pods before restarting during version change Jan 9, 2020
Labels: >bug (Something isn't working), v1.0.0

Successfully merging this pull request may close: Kibana does not support rolling upgrades

5 participants