Add support for MaxSurge and MaxUnavailable during scaling #1812

Merged: 14 commits merged into elastic:master on Oct 2, 2019

Conversation

david-kow
Contributor

@david-kow david-kow commented Sep 27, 2019

This change will cause the operator to prevent creating or removing nodes when doing so would cause the pod counts to violate the MaxSurge/MaxUnavailable settings in the change budget.

Some notes:

  • MaxSurge (ms) is only checked during scaling up,
  • MaxUnavailable (mu) is only checked during scaling down,
  • allowed counts are calculated based on the desired, not the current, state,
  • ms defaults to math.MaxInt32; mu defaults to 1 (to align with the default during rolling upgrades),
  • only ready pods are counted towards available pods for the maxUnavailable calculation,
  • all existing pods are counted towards surge.

Fixes #1292.
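The counting rules above can be sketched as follows. This is an illustrative reading of the described behavior, not the operator's actual code; the function names are made up:

```go
package main

import "fmt"

// creationsAllowed: maxSurge is checked against the desired node count,
// and all existing pods count towards surge.
func creationsAllowed(maxSurge, desiredNodes, existingPods int32) int32 {
	allowed := desiredNodes + maxSurge - existingPods
	if allowed < 0 {
		return 0
	}
	return allowed
}

// removalsAllowed: maxUnavailable is checked against the desired node count,
// and only ready pods count as available.
func removalsAllowed(maxUnavailable, desiredNodes, readyPods int32) int32 {
	minAvailable := desiredNodes - maxUnavailable
	allowed := readyPods - minAvailable
	if allowed < 0 {
		return 0
	}
	return allowed
}

func main() {
	// Scaling up from 3 to 5 with maxSurge=1: up to 3 pods may be created.
	fmt.Println(creationsAllowed(1, 5, 3)) // 3
	// Scaling down to 3 with maxUnavailable=1 and 3 ready pods: 1 removal allowed.
	fmt.Println(removalsAllowed(1, 3, 3)) // 1
}
```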

@david-kow david-kow requested review from sebgl and barkbay September 27, 2019 09:36
@david-kow
Contributor Author

jenkins test this please

@david-kow david-kow force-pushed the max_surge_and_max_unavailability branch from 370f055 to 38a1765 Compare September 27, 2019 13:43
Contributor

@sebgl sebgl left a comment


Overall looks good. I left a few comments.

Some thoughts about the changeBudget schema:

  • Right now, it's a pointer: if unspecified (nil), we use our defaults.
  • Default values are hardcoded in several places (see my comment): maxUnavailable=1, maxSurge=math.MaxInt32
  • math.MaxInt32 seems pretty hard to write in the spec, and would look weird in the docs: "the default value is 2 147 483 647"
  • I'm 👍 with having a default unbounded value for MaxSurge so far. Users can lower it to something acceptable if they want to, as long as they also understand, thanks to the docs, that this comes with some limitations (i.e. the reconciliation can get stuck in some scenarios)
  • In the long term, I can imagine we improve the logic to, optionally, automatically raise maxSurge to allow stuck reconciliations to move on. Something like changeBudget.allowMaxSurgeBreak. As a user, what I would like is probably something like "please perform that mutation using the smallest possible count of extra pods". But I don't really care what that number is, as long as it's the smallest possible one. And it's OK for ECK to make that vary if required. A feature to be kept for later though.

Because of the reasons above, I'm wondering if we should change the schema to:

changeBudget: # <-- not a pointer
    maxUnavailable: 1  # <---- an int32 pointer, optional, defaults to 1 if nil
    maxSurge: 1 # <---- an int32 pointer, optional, defaults to -1 if nil

Then, I would maybe represent the unbounded value of maxSurge as -1 (instead of 2 147 483 647). We may still internally use math.MaxInt32 in the code if that helps. Or not: maybe checking for -1 in the correct upscaleState function is also pretty simple and cleaner to represent "unbounded" (always accept replica increase).
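A minimal sketch of that proposed shape; the type and method names here are illustrative assumptions, not the actual ECK API:

```go
package main

import "fmt"

// ChangeBudget sketches the proposed schema: the budget itself is a value
// (not a pointer), with optional int32 pointer fields and -1 meaning
// "unbounded" for maxSurge.
type ChangeBudget struct {
	MaxUnavailable *int32 // optional, defaults to 1 if nil
	MaxSurge       *int32 // optional, defaults to -1 (unbounded) if nil
}

// GetMaxSurgeOrDefault applies the nil default described above.
func (cb ChangeBudget) GetMaxSurgeOrDefault() int32 {
	if cb.MaxSurge == nil {
		return -1 // unbounded
	}
	return *cb.MaxSurge
}

// IsSurgeUnbounded lets callers (e.g. the upscale logic) accept any replica
// increase without ever comparing against math.MaxInt32.
func (cb ChangeBudget) IsSurgeUnbounded() bool {
	return cb.GetMaxSurgeOrDefault() < 0
}

func main() {
	var cb ChangeBudget // nothing specified: defaults apply
	fmt.Println(cb.IsSurgeUnbounded()) // true
}
```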

Outdated review threads (resolved):
  • pkg/controller/elasticsearch/driver/upscale.go
  • pkg/controller/elasticsearch/driver/upscale_state_test.go
  • test/e2e/test/elasticsearch/checks_es.go
  • test/e2e/test/elasticsearch/checks_k8s.go
  • test/e2e/test/elasticsearch/checks_limiting.go
Contributor

@barkbay barkbay left a comment


It would be great to lift part of the PR description into comments in the code; I think it would help to understand the algorithms.

I have some concerns about the way the nodes are excluded.
Let's take this example:

spec:                                       spec:
  version: 7.2.0                               version: 7.2.0
  updateStrategy:                              updateStrategy:
    changeBudget:                                changeBudget:
      maxUnavailable: 1                            maxUnavailable: 1
      maxSurge: 1                                  maxSurge: 1
  nodes:                                        nodes:
  - name: masters                   =>          - name: masters 
    config:                                       config:
        node.data: false                            node.data: false
  - name: nodes                                 - name: nodes-2
    config:                                       config:
      node.master: false                            node.master: false
    nodeCount: 5                                  nodeCount: 3

It might not be obvious, but in this situation both createdAllowed and expectedReplicas are set to 0: all the data nodes are excluded, so none can be created. Even though removalsAllowed is set to 3, no node can be removed because shards can't be moved.
It's impossible to progress, and since all nodes are excluded it is also impossible to create new indices. It might take some time for the user to figure out that they have shot themselves in the foot.

A solution would be to slightly change the way leavingNodes is computed:

  1. Compute the leaving nodes by immediately taking the budget into account, including the constraints on the masters
  2. Iterate through the leaving nodes to check which nodes still hold some shards

Playing around with the code, it could be something like this: https://gist.github.com/barkbay/bddb531a94088e68049cee7bbcc8c2bc. No node is unnecessarily excluded, and it allows the first 3 nodes to be removed, followed by the creation of the new nodes.

Edit: as pointed out by @pebrc, there must be some data in the nodes to reproduce this situation
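The two steps above could look roughly like this; it is a toy sketch with made-up names (leavingNodes, holdsShards), not the code from the gist:

```go
package main

import "fmt"

// leavingNodes sketches the suggested two-step computation:
// 1. cap the candidate nodes by the change budget first (removalsAllowed),
//    so that at most that many nodes are ever excluded from allocation,
// 2. only then check which of those nodes still hold shards and therefore
//    need data migration before they can actually be removed.
func leavingNodes(candidates []string, removalsAllowed int, holdsShards func(string) bool) []string {
	// Step 1: never take more nodes into account than the budget allows.
	if len(candidates) > removalsAllowed {
		candidates = candidates[:removalsAllowed]
	}
	// Step 2: keep only nodes still awaiting shard migration.
	var leaving []string
	for _, node := range candidates {
		if holdsShards(node) {
			leaving = append(leaving, node)
		}
	}
	return leaving
}

func main() {
	holds := func(n string) bool { return n != "nodes-0" } // nodes-0 is already empty
	fmt.Println(leavingNodes([]string{"nodes-0", "nodes-1", "nodes-2", "nodes-3"}, 3, holds))
}
```

With this shape, only a budget-sized batch of nodes is excluded at a time, so the rest of the cluster stays available for shard allocation.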

@pebrc
Collaborator

pebrc commented Sep 30, 2019

It might not be obvious, but in this situation both createdAllowed and expectedReplicas are set to 0: all the data nodes are excluded, so none can be created. Even though removalsAllowed is set to 3, no node can be removed because shards can't be moved.

@barkbay maybe I am missing something in your example, but I was not able to reproduce the livelock situation you mentioned with the example you gave.

@barkbay
Contributor

barkbay commented Sep 30, 2019

@barkbay maybe I am missing something in your example, but I was not able to reproduce the livelock situation you mentioned with the example you gave.

I have copy/pasted an example here: https://gist.github.com/barkbay/d1ee547d4f79bd9e76c5718ba269778f

You should end up in the following situation:

statefulset.apps/elasticsearch-sample-es-masters   1/1     4m42s   elasticsearch   docker.elastic.co/elasticsearch/elasticsearch:7.2.0
statefulset.apps/elasticsearch-sample-es-nodes     5/5     4m41s   elasticsearch   docker.elastic.co/elasticsearch/elasticsearch:7.2.0
statefulset.apps/elasticsearch-sample-es-nodes-2   0/0     115s    elasticsearch   docker.elastic.co/elasticsearch/elasticsearch:7.2.0

GET /_cluster/settings:

{
  "persistent" : { },
  "transient" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "exclude" : {
            "_name" : "elasticsearch-sample-es-nodes-4,elasticsearch-sample-es-nodes-3,elasticsearch-sample-es-nodes-2,elasticsearch-sample-es-nodes-1,elasticsearch-sample-es-nodes-0"
          }
        }
      }
    }
  }
}

@pebrc
Collaborator

pebrc commented Sep 30, 2019

You should end up in the following situation:

It helps if one uses a cluster with actual data in it ...

@barkbay
Contributor

barkbay commented Sep 30, 2019

It helps if one uses a cluster with actual data in it ...

🤦‍♂ My bad, sorry; I forgot to mention that there should be some data in the nodes

if ctx.es.Spec.UpdateStrategy.ChangeBudget != nil {
createsAllowed = int32(ctx.es.Spec.UpdateStrategy.ChangeBudget.MaxSurge)
createsAllowed += expectedResources.StatefulSets().ExpectedNodeCount()
createsAllowed -= actualStatefulSets.ExpectedNodeCount()
Collaborator


I am a bit confused. Should this not be max(actualStatefulSets.ExpectedNodeCount(), expectedResources.StatefulSets().ExpectedNodeCount()) + maxSurge? How could we otherwise make progress if we are going to replace a StatefulSet?

Collaborator

@pebrc pebrc Sep 30, 2019


🤦‍♂ My bad, sorry; I forgot to mention that there should be some data in the nodes

@barkbay no, that's on me being a bit slow on the uptake :-) Am I right in assuming that your suggestion of excluding nodes step by step as we remove them (which is of course correct) could be complemented by actually creating nodes at the same time? I am not sure if my interpretation of a surge budget of 1 is OK, which is to say we can have one additional node on top of whatever number of nodes currently exist in the cluster.

Otherwise we would just needlessly migrate data to nodes that are about to be removed in the next round. With non-trivial amounts of data there would also be no guarantee that the remaining old nodes could host the data.

Contributor Author


I am a bit confused. Should this not be max(actualStatefulSets.ExpectedNodeCount(), expectedResources.StatefulSets().ExpectedNodeCount()) + maxSurge? How could we otherwise make progress if we are going to replace a StatefulSet?

I'd think that a user applying a spec downscaling from 150 to 100 nodes with a surge of 20 expects to see at most 120 nodes, not 170. The user can make progress by raising the surge so it allows creating nodes despite the actual state. Also, in any case, we can't be sure to make progress.
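For the quoted snippet, the arithmetic in this example works out as follows; a sketch with an illustrative helper name, assuming negative results are clamped to zero:

```go
package main

import "fmt"

// createsAllowed mirrors the quoted calculation: the surge budget plus the
// desired node count, minus the nodes that currently exist. With
// desired=100, actual=150, maxSurge=20 this gives 20 + 100 - 150 = -30,
// clamped to 0, so no new pods are created until the cluster shrinks and
// the total is capped at desired + surge = 120 rather than actual + surge.
func createsAllowed(maxSurge, desiredNodes, actualNodes int32) int32 {
	allowed := maxSurge + desiredNodes - actualNodes
	if allowed < 0 {
		return 0
	}
	return allowed
}

func main() {
	fmt.Println(createsAllowed(20, 100, 150)) // 0
	fmt.Println(createsAllowed(20, 100, 100)) // 20
}
```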

Collaborator


I am willing to accept that there is no optimal solution that covers all cases. The question is how contrived the rename + scale-down case is. Also, I am a bit worried that we don't surface this stuck state visibly enough to the user. A log message might not be enough.

David Kowalski added 11 commits October 2, 2019 07:19
In the spec, maxUnavailable and maxSurge can have three types of values:
- nil - uses the default value,
- negative - means value is unbounded,
- non-negative - means value is used directly.

In code, nil means unbounded, non-negative values are used directly, and
negative values are not expected.

Adjusted tests.
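The spec-to-internal mapping described in this commit message could be sketched like this; specToInternal is an illustrative name, not the actual function:

```go
package main

import "fmt"

// specToInternal converts a spec value into the internal representation:
// in the spec, nil means "use the default" and a negative value means
// "unbounded"; internally, unbounded is represented as nil and any
// returned value is non-negative.
func specToInternal(specValue *int32, defaultValue *int32) *int32 {
	v := defaultValue
	if specValue != nil {
		v = specValue // an explicit spec value overrides the default
	}
	if v == nil || *v < 0 {
		return nil // unbounded
	}
	return v
}

func main() {
	one := int32(1)
	unbounded := int32(-1)
	fmt.Println(*specToInternal(nil, &one))       // default applies: 1
	fmt.Println(specToInternal(&unbounded, &one)) // unbounded: <nil>
}
```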
There seems to be an ongoing issue with CRD generation, so fixing the CRD manually for now
@david-kow david-kow force-pushed the max_surge_and_max_unavailability branch from 4c1ae66 to fa47a4a Compare October 2, 2019 08:30
@david-kow david-kow force-pushed the max_surge_and_max_unavailability branch from 8a42450 to b185dd0 Compare October 2, 2019 11:30
@david-kow david-kow requested a review from sebgl October 2, 2019 14:22
Contributor

@barkbay barkbay left a comment


LGTM 👍

Contributor

@sebgl sebgl left a comment


LGTM

@david-kow david-kow merged commit 0f5a59f into elastic:master Oct 2, 2019
@david-kow david-kow deleted the max_surge_and_max_unavailability branch October 2, 2019 15:20
@pebrc pebrc added >enhancement Enhancement of existing functionality v1.0.0-beta1 labels Oct 8, 2019
Successfully merging this pull request may close these issues.

Account for maxSurge during StatefulSet replacements
4 participants