Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enforce Elastic stack upgrade order #2600

Closed
pebrc opened this issue Feb 24, 2020 · 3 comments · Fixed by #3537
Closed

Enforce Elastic stack upgrade order #2600

pebrc opened this issue Feb 24, 2020 · 3 comments · Fixed by #3537
Assignees
Labels
discuss We need to figure this out >enhancement Enhancement of existing functionality

Comments

@pebrc
Copy link
Collaborator

pebrc commented Feb 24, 2020

Supersedes #2426 and #2353 (see there for more context and discussion)

Currently each stack application is reconciled independently of each other. When a user upgrades the version of a group of linked resources (Kibana, ES, APMServer) this is violating our own documented stack upgrade procedure which clearly defines an upgrade order. The problems described in #2426 and #2353 would be avoided if the upgrade order would be enforced by ECK:

  1. upgrade Elasticsearch
  2. upgrade Kibana
  3. upgrade APMServer

For that the Kibana and APM controllers would have to

  • inspect the linked Elasticsearch resource, and compare the specified version with their own versions
  • find a way of determining whether or not the version/generation specified is also the version/generation currently rolled out
  • delay their own rollout of the version upgrade until ES and all predecessors in the upgrade graph have completed their upgrade

Problems/Questions:

  • how do we express the upgrade order? APMServer does not hold a reference to the related Kibana instance. How does it know the state of any related Kibana node?
  • how do we know that a given spec is rolled out without looking at individual pods. Can we rely on an extended status subresource that somehow expresses the reconciliation state?
@pebrc pebrc added >bug Something isn't working discuss We need to figure this out labels Feb 24, 2020
@barkbay barkbay self-assigned this Mar 26, 2020
@barkbay barkbay removed their assignment Apr 11, 2020
@pebrc pebrc assigned pebrc and anyasabo and unassigned pebrc Apr 27, 2020
@anyasabo anyasabo removed their assignment Jun 9, 2020
@sebgl sebgl self-assigned this Jul 8, 2020
@sebgl
Copy link
Contributor

sebgl commented Jul 8, 2020

One thing that was mentioned on #2353 (comment) is to reject (as opposed to delay) the version upgrade at the validation webhook level. I think we should not go with that approach since it makes the user experience a bit painful. As a user I would expect I can update both Elasticsearch and Kibana version at the same time in the YAML manifest, then let ECK deal with it.


A while ago we tried to design the association controllers in such a way that they may run with different managed namespaces contexts. For example the Kibana controller may not have access to the Elasticsearch namespace, so it should not deal with Elasticsearch resources at all. I'm not sure where we stand now regarding this "constraint", but I think we should aim at keeping the various controllers responsibilities decoupled so that, for example:

  • the Kibana-ES association controller does not have to deal with Elasticsearch Pods
  • the APM controller does not have to inspect a Kibana resource (nor Kibana Pods)

Here is what I have in mind, I think it's pretty close to ideas @pebrc expressed in the first post:

Currently running version reflected in the status

Add an additional version field in the status all CRDs, set by each resource controller. It represents the lowest currently running version. For example, it would still indicate 7.7.0 if at least one ES node is running 7.7.0, even though other nodes are already running 7.8.0. This can be done by simply looking at the current StatefulSet|Deployment and Pods at the end of the Elasticsearch (Kibana, etc.) reconciliation. I think it can be useful beyond the scope of this PR, just to know what's the state of the current version upgrade.

{
  "availableNodes": 1,
  "version": "7.7.0",
  "health": "green",
  "phase": "ApplyingChanges"
}

Annotation set by the association controller in associated resources

In each association controller (eg. the Kibana-Elasticsearch association controller), annotate each resource where the association is specified (eg. Kibana) with the version of the resource it is associated to (eg. Elasticsearch), retrieved from its status. For example, if an Elasticsearch resource status reports version: 7.7.0, the association controller sets the following annotation in the Kibana resource:

kibanaassociation.k8s.elastic.co/elasticsearch-version: 7.7.0

This would be done generically by the association controller for any association:

The APM-Kibana association controller would set the following annotation on the APM resource:

apmassociation.k8s.elastic.co/kibana-version: 7.7.0

The Beat-ES association controller would set the following annotation on the Beat resource:

beatassociation.k8s.elastic.co/elasticsearch-version: 7.7.0

The Beat-Kibana association controller would set the following annotation on the beat resource:

beatassociation.k8s.elastic.co/kibana-version: 7.7.0

The association controller also ensures the annotation is not set if there is no association specified.

Delaying resources version upgrade

As part of each resource reconciliation (eg. before updating a Deployment in the Kibana controller), inspect the annotations that make sense (eg. kibanaassociation.k8s.elastic.co/elasticsearch-version):

  • if not set, and no Elasticsearch is referenced in the Kibana resource, the controller can move on with any version upgrade
  • if not set, but an Elasticsearch is referenced in the Kibana resource, the association controller will likely set an annotation soon: requeue
  • if version expressed in the annotation is lower than the expected resource version, do not perform any action that would lead to creating an instance of the resource in the newer version. Update the resource status to phase: DelayingChanges (as opposed to: ApplyingChanges).
{
  "availableNodes": 1,
  "version": "7.7.0",
  "health": "green",
  "phase": "DelayingChanges"
}
  • if version expressed in the annotation is higher or equal to the expected resource version: move on with any change

This mechanism does not enforce a global stack upgrade order, but rather an implicit dependency graph between associated resources:

  • Kibana can only be upgraded after Elasticsearch
  • APM can only be upgraded after Elasticsearch, and after Kibana
  • Beats can only be upgraded after Elasticsearch, and after Kibana
  • Enterprise Search can only be upgraded after Elasticsearch

@kfox1111
Copy link

kfox1111 commented Jul 8, 2020

+1 to delaying rather then blocking. But, if delay is going to be significantly harder to implement then just blocking for now, maybe implement blocking first, then remove the blocking once delaying is ready? I didn't expect the breakage I saw when I updated both at once and kibana upgraded significantly faster then elastticsearch, causing an issue.

sebgl added a commit to sebgl/cloud-on-k8s that referenced this issue Jul 20, 2020
Report the currently running version in the `status.version` field of
each resource. In case a version upgrade is in progress for that
resource, report the lowest running version.

For Elasticsearch, we check both the Pods and the StatefulSets to report
the lowest version. A StatefulSet might report a higher version whereas
the Pods haven't been upgraded yet. It may also report a lower version
and no Pods are running any more with that version.
For other resources, we just check the Pods.

Parts of the Status is refactored into a common `DeploymentStatus` for
Kibana, ApmServer and EnterpriseSearch. Elasticsearch and Beat have
different requirements which cannot fit this common place.

This is a pre-req for strict stack version upgrade ordering
(elastic#2600).
sebgl added a commit that referenced this issue Jul 22, 2020
* Report the lowest running version in the status of each resource

Report the currently running version in the `status.version` field of
each resource. In case a version upgrade is in progress for that
resource, report the lowest running version.

For Elasticsearch, we check both the Pods and the StatefulSets to report
the lowest version. A StatefulSet might report a higher version whereas
the Pods haven't been upgraded yet. It may also report a lower version
and no Pods are running any more with that version.
For other resources, we just check the Pods.

Parts of the Status is refactored into a common `DeploymentStatus` for
Kibana, ApmServer and EnterpriseSearch. Elasticsearch and Beat have
different requirements which cannot fit this common place.

This is a pre-req for strict stack version upgrade ordering
(#2600).
@sebgl sebgl added >enhancement Enhancement of existing functionality and removed >bug Something isn't working labels Jul 22, 2020
@sebgl
Copy link
Contributor

sebgl commented Jul 27, 2020

Differences between the design proposal above and the actual implementation:

  • we reuse the existing associationConf annotations instead of introducing a new set of annotations
  • phase: DelayingChanges reported in the status could be implemented in a follow-up

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discuss We need to figure this out >enhancement Enhancement of existing functionality
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants