Add support for MaxSurge and MaxUnavailable during scaling #1812

Merged: 14 commits merged into elastic:master on Oct 2, 2019

Conversation

david-kow
Contributor

@david-kow david-kow commented Sep 27, 2019

This change will cause the operator to prevent creating or removing nodes when doing so would cause the pod counts to violate the MaxSurge/MaxUnavailable settings in the change budget.

Some notes:

  • MaxSurge (ms) is only checked during scaling up,
  • MaxUnavailable (mu) is only checked during scaling down,
  • allowed counts are calculated based on the desired, not the current, state,
  • ms defaults to math.MaxInt32; mu defaults to 1 (to align with the default during rolling upgrades),
  • only ready pods are counted towards available pods for the maxUnavailable calculation,
  • all existing pods are counted towards surge.

Fixes #1292.
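The counting rules above can be sketched as follows. This is an illustrative reading of the described behavior, not the operator's actual code; the function names are made up:

```go
package main

import "fmt"

// creationsAllowed: maxSurge is checked against the desired node count,
// and all existing pods count towards surge.
func creationsAllowed(maxSurge, desiredNodes, existingPods int32) int32 {
	allowed := desiredNodes + maxSurge - existingPods
	if allowed < 0 {
		return 0
	}
	return allowed
}

// removalsAllowed: maxUnavailable is checked against the desired node count,
// and only ready pods count as available.
func removalsAllowed(maxUnavailable, desiredNodes, readyPods int32) int32 {
	minAvailable := desiredNodes - maxUnavailable
	allowed := readyPods - minAvailable
	if allowed < 0 {
		return 0
	}
	return allowed
}

func main() {
	// Scaling up from 3 to 5 with maxSurge=1: up to 3 pods may be created.
	fmt.Println(creationsAllowed(1, 5, 3)) // 3
	// Scaling down to 3 with maxUnavailable=1 and 3 ready pods: 1 removal allowed.
	fmt.Println(removalsAllowed(1, 3, 3)) // 1
}
```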

@david-kow david-kow requested review from sebgl and barkbay September 27, 2019 09:36
@david-kow
Contributor Author

jenkins test this please

@david-kow david-kow force-pushed the max_surge_and_max_unavailability branch from 370f055 to 38a1765 Compare September 27, 2019 13:43
Contributor

@sebgl sebgl left a comment


Overall looks good. I left a few comments.

Some thoughts about the changeBudget schema:

  • Right now, it's a pointer: if unspecified (nil), we use our defaults.
  • Default values are hardcoded in several places (see my comment): maxUnavailable=1, maxSurge=math.MaxInt32
  • math.MaxInt32 seems pretty hard to write in the spec, and would look weird in the docs: "the default value is 2 147 483 647"
  • I'm 👍 with having a default unbounded value for MaxSurge so far. Users can lower it to something acceptable if they want to, as long as they also understand, thanks to the docs, that this comes with some limitations (i.e. the reconciliation can get stuck in some scenarios)
  • In the long term, I can imagine we improve the logic to, optionally, automatically raise maxSurge to allow stuck reconciliations to move on. Something like changeBudget.allowMaxSurgeBreak. As a user, what I would like is probably something like "please perform that mutation using the smallest possible count of extra pods". But I don't really care what that number is, as long as it's the smallest possible one. And it's OK for ECK to make that vary if required. A feature to be kept for later though.

Because of the reasons above, I'm wondering if we should change the schema to:

changeBudget: # <-- not a pointer
    maxUnavailable: 1  # <---- an int32 pointer, optional, defaults to 1 if nil
    maxSurge: 1 # <---- an int32 pointer, optional, defaults to -1 if nil

Then, I would maybe represent the unbounded value of maxSurge as -1 (instead of 2 147 483 647). We may still internally use math.MaxInt32 in the code if that helps. Or not: maybe checking for -1 in the correct upscaleState function is also pretty simple and cleaner to represent "unbounded" (always accept replica increase).
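A minimal sketch of that proposed shape; the type and method names here are illustrative assumptions, not the actual ECK API:

```go
package main

import "fmt"

// ChangeBudget sketches the proposed schema: the budget itself is a value
// (not a pointer), with optional int32 pointer fields and -1 meaning
// "unbounded" for maxSurge.
type ChangeBudget struct {
	MaxUnavailable *int32 // optional, defaults to 1 if nil
	MaxSurge       *int32 // optional, defaults to -1 (unbounded) if nil
}

// GetMaxSurgeOrDefault applies the nil default described above.
func (cb ChangeBudget) GetMaxSurgeOrDefault() int32 {
	if cb.MaxSurge == nil {
		return -1 // unbounded
	}
	return *cb.MaxSurge
}

// IsSurgeUnbounded lets callers (e.g. the upscale logic) accept any replica
// increase without ever comparing against math.MaxInt32.
func (cb ChangeBudget) IsSurgeUnbounded() bool {
	return cb.GetMaxSurgeOrDefault() < 0
}

func main() {
	var cb ChangeBudget // nothing specified: defaults apply
	fmt.Println(cb.IsSurgeUnbounded()) // true
}
```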

Outdated review threads (resolved):
  • pkg/controller/elasticsearch/driver/upscale.go
  • pkg/controller/elasticsearch/driver/upscale_state_test.go
  • test/e2e/test/elasticsearch/checks_es.go
  • test/e2e/test/elasticsearch/checks_k8s.go
  • test/e2e/test/elasticsearch/checks_limiting.go
Contributor

@barkbay barkbay left a comment


It would be great to lift part of the PR description into comments in the code; I think it would help to understand the algorithms.

I have some concerns about the way the nodes are excluded.
Let's take this example:

spec:                                       spec:
  version: 7.2.0                               version: 7.2.0
  updateStrategy:                              updateStrategy:
    changeBudget:                                changeBudget:
      maxUnavailable: 1                            maxUnavailable: 1
      maxSurge: 1                                  maxSurge: 1
  nodes:                                        nodes:
  - name: masters                   =>          - name: masters 
    config:                                       config:
        node.data: false                            node.data: false
  - name: nodes                                 - name: nodes-2
    config:                                       config:
      node.master: false                            node.master: false
    nodeCount: 5                                  nodeCount: 3

It might not be obvious, but in this situation both createdAllowed and expectedReplicas are set to 0: all the data nodes are excluded, so none can be created. Even though removalsAllowed is set to 3, no node can be removed because shards can't be moved.
It's impossible to progress, and since all nodes are excluded it is also impossible to create new indices. It might take some time for the user to figure out that they have shot themselves in the foot.

A solution would be to slightly change the way leavingNodes is computed:

  1. Compute the leaving nodes by immediately taking the budget into account, including the constraints on the masters
  2. Iterate through the leaving nodes to check which nodes still hold some shards

Playing around with the code, it could be something like this: https://gist.github.com/barkbay/bddb531a94088e68049cee7bbcc8c2bc. No node is unnecessarily excluded, and it allows the first 3 nodes to be removed, followed by the creation of the new nodes.

Edit: as pointed out by @pebrc, there must be some data in the nodes to reproduce this situation
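The two steps above could look roughly like this; it is a toy sketch with made-up names (leavingNodes, holdsShards), not the code from the gist:

```go
package main

import "fmt"

// leavingNodes sketches the suggested two-step computation:
// 1. cap the candidate nodes by the change budget first (removalsAllowed),
//    so that at most that many nodes are ever excluded from allocation,
// 2. only then check which of those nodes still hold shards and therefore
//    need data migration before they can actually be removed.
func leavingNodes(candidates []string, removalsAllowed int, holdsShards func(string) bool) []string {
	// Step 1: never take more nodes into account than the budget allows.
	if len(candidates) > removalsAllowed {
		candidates = candidates[:removalsAllowed]
	}
	// Step 2: keep only nodes still awaiting shard migration.
	var leaving []string
	for _, node := range candidates {
		if holdsShards(node) {
			leaving = append(leaving, node)
		}
	}
	return leaving
}

func main() {
	holds := func(n string) bool { return n != "nodes-0" } // nodes-0 is already empty
	fmt.Println(leavingNodes([]string{"nodes-0", "nodes-1", "nodes-2", "nodes-3"}, 3, holds))
}
```

With this shape, only a budget-sized batch of nodes is excluded at a time, so the rest of the cluster stays available for shard allocation.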

@pebrc
Collaborator

pebrc commented Sep 30, 2019

It might not be obvious, but in this situation both createdAllowed and expectedReplicas are set to 0: all the data nodes are excluded, so none can be created. Even though removalsAllowed is set to 3, no node can be removed because shards can't be moved.

@barkbay maybe I am missing something in your example, but I was not able to reproduce the livelock situation you mentioned with the example you gave.

@barkbay
Contributor

barkbay commented Sep 30, 2019

@barkbay maybe I am missing something in your example, but I was not able to reproduce the livelock situation you mentioned with the example you gave.

I have copy/pasted an example here: https://gist.github.com/barkbay/d1ee547d4f79bd9e76c5718ba269778f

You should end up in the following situation:

statefulset.apps/elasticsearch-sample-es-masters   1/1     4m42s   elasticsearch   docker.elastic.co/elasticsearch/elasticsearch:7.2.0
statefulset.apps/elasticsearch-sample-es-nodes     5/5     4m41s   elasticsearch   docker.elastic.co/elasticsearch/elasticsearch:7.2.0
statefulset.apps/elasticsearch-sample-es-nodes-2   0/0     115s    elasticsearch   docker.elastic.co/elasticsearch/elasticsearch:7.2.0

GET /_cluster/settings:

{
  "persistent" : { },
  "transient" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "exclude" : {
            "_name" : "elasticsearch-sample-es-nodes-4,elasticsearch-sample-es-nodes-3,elasticsearch-sample-es-nodes-2,elasticsearch-sample-es-nodes-1,elasticsearch-sample-es-nodes-0"
          }
        }
      }
    }
  }
}

@pebrc
Collaborator

pebrc commented Sep 30, 2019

You should end up in the following situation:

It helps if one uses a cluster with actual data in it ...

@barkbay
Contributor

barkbay commented Sep 30, 2019

It helps if one uses a cluster with actual data in it ...

🤦‍♂ My bad, sorry; I forgot to mention that there should be some data in the nodes

if ctx.es.Spec.UpdateStrategy.ChangeBudget != nil {
createsAllowed = int32(ctx.es.Spec.UpdateStrategy.ChangeBudget.MaxSurge)
createsAllowed += expectedResources.StatefulSets().ExpectedNodeCount()
createsAllowed -= actualStatefulSets.ExpectedNodeCount()
Collaborator


I am a bit confused. Should this not be max(actualStatefulSets.ExpectedNodeCount(), expectedResources.StatefulSets().ExpectedNodeCount()) + maxSurge? How could we otherwise make progress if we are going to replace a StatefulSet?

Collaborator

@pebrc pebrc Sep 30, 2019


🤦‍♂ My bad, sorry; I forgot to mention that there should be some data in the nodes

@barkbay no, that's on me being a bit slow on the uptake :-) Am I right in assuming that your suggestion of excluding nodes step by step as we remove them (which is of course correct) could be complemented by actually creating nodes at the same time? I am not sure if my interpretation of a surge budget of 1 is OK, which is to say we can have one additional node on top of whatever number of nodes currently exist in the cluster.

Otherwise we would just needlessly migrate data to nodes that are about to be removed in the next round. With non-trivial amounts of data there would also be no guarantee that the remaining old nodes could host the data.

Contributor Author


I am a bit confused. Should this not be max(actualStatefulSets.ExpectedNodeCount(), expectedResources.StatefulSets().ExpectedNodeCount()) + maxSurge? How could we otherwise make progress if we are going to replace a StatefulSet?

I'd think that a user applying a spec downscaling from 150 to 100 nodes with a surge of 20 expects to see at most 120 nodes, not 170. The user can make progress by raising the surge so it allows creating nodes despite the actual state. Also, in any case, we can't be sure to make progress.
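For the quoted snippet, the arithmetic in this example works out as follows; a sketch with an illustrative helper name, assuming negative results are clamped to zero:

```go
package main

import "fmt"

// createsAllowed mirrors the quoted calculation: the surge budget plus the
// desired node count, minus the nodes that currently exist. With
// desired=100, actual=150, maxSurge=20 this gives 20 + 100 - 150 = -30,
// clamped to 0, so no new pods are created until the cluster shrinks and
// the total is capped at desired + surge = 120 rather than actual + surge.
func createsAllowed(maxSurge, desiredNodes, actualNodes int32) int32 {
	allowed := maxSurge + desiredNodes - actualNodes
	if allowed < 0 {
		return 0
	}
	return allowed
}

func main() {
	fmt.Println(createsAllowed(20, 100, 150)) // 0
	fmt.Println(createsAllowed(20, 100, 100)) // 20
}
```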

Collaborator


I am willing to accept that there is no optimal solution that covers all cases. The question is how contrived the rename + scale-down case is. Also, I am a bit worried that we don't surface this stuck state visibly enough to the user. A log message might not be enough.

David Kowalski added 11 commits October 2, 2019 07:19
In the spec, maxUnavailable and maxSurge can have three types of values:
- nil - uses the default value,
- negative - means value is unbounded,
- non-negative - means value is used directly.

In code, nil means unbounded, non-negative values are used directly, and
negative values are not expected.

Adjusted tests.
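The spec-to-internal mapping described in this commit message could be sketched like this; specToInternal is an illustrative name, not the actual function:

```go
package main

import "fmt"

// specToInternal converts a spec value into the internal representation:
// in the spec, nil means "use the default" and a negative value means
// "unbounded"; internally, unbounded is represented as nil and any
// returned value is non-negative.
func specToInternal(specValue *int32, defaultValue *int32) *int32 {
	v := defaultValue
	if specValue != nil {
		v = specValue // an explicit spec value overrides the default
	}
	if v == nil || *v < 0 {
		return nil // unbounded
	}
	return v
}

func main() {
	one := int32(1)
	unbounded := int32(-1)
	fmt.Println(*specToInternal(nil, &one))       // default applies: 1
	fmt.Println(specToInternal(&unbounded, &one)) // unbounded: <nil>
}
```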
There seems to be an ongoing issue with CRD generation, so fixing the CRD manually for now
@david-kow david-kow force-pushed the max_surge_and_max_unavailability branch from 4c1ae66 to fa47a4a Compare October 2, 2019 08:30
@david-kow david-kow force-pushed the max_surge_and_max_unavailability branch from 8a42450 to b185dd0 Compare October 2, 2019 11:30
@david-kow david-kow requested a review from sebgl October 2, 2019 14:22
Contributor

@barkbay barkbay left a comment


LGTM 👍

Contributor

@sebgl sebgl left a comment


LGTM

@david-kow david-kow merged commit 0f5a59f into elastic:master Oct 2, 2019
@david-kow david-kow deleted the max_surge_and_max_unavailability branch October 2, 2019 15:20
@pebrc pebrc added >enhancement Enhancement of existing functionality v1.0.0-beta1 labels Oct 8, 2019
Successfully merging this pull request may close these issues.

Account for maxSurge during StatefulSet replacements
4 participants