Add support for MaxSurge and MaxUnavailable during scaling #1812
Conversation
jenkins test this please
force-pushed from 370f055 to 38a1765
Overall looks good. I left a few comments.

Some thoughts about the `changeBudget` schema:

- Right now, it's a pointer: if unspecified (`nil`), we use our defaults.
- Default values are hardcoded in several places (see my comment): `maxUnavailable=1`, `maxSurge=math.MaxInt32`.
- `math.MaxInt32` seems pretty hard to write in the spec, and would look weird in the docs: "the default value is 2 147 483 647".
- I'm 👍 with having a default unbounded value for MaxSurge so far. Users can lower it to something acceptable if they want to, as long as they also understand, thanks to the docs, that this comes with some limitations (i.e. the reconciliation can be stuck in some scenarios).
- In the long term, I can imagine we improve the logic to, optionally, automatically raise maxSurge to allow stuck reconciliations to move on. Something like `changeBudget.allowMaxSurgeBreak`. As a user, what I would like is probably something like "please perform that mutation using the smallest possible count of extra pods". But I don't really care what that number is, as long as it's the smallest possible one. And it's OK for ECK to make that vary if required. A feature to be kept for later though.

Because of the reasons above, I'm wondering if we should change the schema to:

```yaml
changeBudget:       # <-- not a pointer
  maxUnavailable: 1 # <-- an int32 pointer, optional, defaults to 1 if nil
  maxSurge: 1       # <-- an int32 pointer, optional, defaults to -1 if nil
```

Then, I would maybe represent the unbounded value of `maxSurge` as `-1` (instead of `2147483647`). We may still internally use `math.MaxInt32` in the code if that helps. Or not: maybe checking for -1 in the correct upscaleState function is also pretty simple and cleaner to represent "unbounded" (always accept replica increase).
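For illustration, a minimal Go sketch of what the proposed schema could look like; the type and field names here are assumptions for this discussion, not the actual ECK types:

```go
// Hypothetical types illustrating the proposed schema (not the real ECK code).
type UpdateStrategy struct {
	// Not a pointer: an empty changeBudget falls back to the per-field defaults below.
	ChangeBudget ChangeBudget `json:"changeBudget,omitempty"`
}

type ChangeBudget struct {
	// Optional; defaults to 1 if nil.
	MaxUnavailable *int32 `json:"maxUnavailable,omitempty"`
	// Optional; defaults to -1 (unbounded) if nil.
	MaxSurge *int32 `json:"maxSurge,omitempty"`
}
```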
It would be great to include part of the PR description as comments in the code; I think it would help to understand the algorithms.
I have some concerns about the way the nodes are excluded.
Let's take this example:
Before:

```yaml
spec:
  version: 7.2.0
  updateStrategy:
    changeBudget:
      maxUnavailable: 1
      maxSurge: 1
  nodes:
  - name: masters
    config:
      node.data: false
  - name: nodes
    config:
      node.master: false
    nodeCount: 5
```

After:

```yaml
spec:
  version: 7.2.0
  updateStrategy:
    changeBudget:
      maxUnavailable: 1
      maxSurge: 1
  nodes:
  - name: masters
    config:
      node.data: false
  - name: nodes-2
    config:
      node.master: false
    nodeCount: 3
```
It might not be obvious, but in this situation both `createsAllowed` and `expectedReplicas` are set to 0: all the data nodes are excluded, and none can be created. Even though `removalsAllowed` is set to 3, no node can be removed because shards can't be moved. It's impossible to progress, and since all nodes are excluded it is also impossible to create new indices. It might take some time for the user to figure out that they had shot themselves in the foot.
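(To make the arithmetic concrete, using the `createsAllowed` computation quoted later in this thread, and writing `m` for the master count, which cancels out: `createsAllowed = maxSurge + expected - actual = 1 + (m+3) - (m+5) = -1`, clamped to 0, so no replacement data node may be created. And assuming removals are bounded by `actual - (expected - maxUnavailable)`, we get `removalsAllowed = (m+5) - (m+3-1) = 3`, but none of those removals can proceed while the excluded nodes still hold shards.)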
A solution would be to slightly change the way `leavingNodes` is computed (see the sketch after this comment):

- Compute leaving nodes by immediately taking into account the budget, including the constraints on the masters.
- Iterate through the leaving nodes to check which ones still hold some shards.

Playing around with the code, it could be something like this: https://gist.github.com/barkbay/bddb531a94088e68049cee7bbcc8c2bc. No node is unnecessarily excluded, and it allows the first 3 nodes to be removed, followed by the creation of the new nodes.
Edit: as pointed out by @pebrc, there must be some data in the nodes to reproduce this situation.
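For readers following along, here is a minimal sketch of that two-step computation; the function, parameter, and helper names are made up for illustration and differ from the linked gist:

```go
// leavingNodes caps the removal candidates by the change budget first,
// then keeps excluding only those that still hold shards.
// Hypothetical sketch, not the actual ECK implementation.
func leavingNodes(candidates []string, removalsAllowed int, holdsShards func(string) bool) []string {
	// Step 1: apply the budget (and, in the real code, the master
	// constraints) up front, so we never exclude more nodes than can
	// actually be removed in this round.
	if removalsAllowed < 0 {
		removalsAllowed = 0
	}
	if len(candidates) > removalsAllowed {
		candidates = candidates[:removalsAllowed]
	}
	// Step 2: among the budgeted candidates, only the ones still holding
	// shards need to remain in the allocation exclusion list.
	var leaving []string
	for _, node := range candidates {
		if holdsShards(node) {
			leaving = append(leaving, node)
		}
	}
	return leaving
}
```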
@barkbay maybe I am missing something in your example, but I was not able to reproduce the livelock situation you mentioned with the example you gave.
I have copy/pasted an example here: https://gist.github.com/barkbay/d1ee547d4f79bd9e76c5718ba269778f. You should end up in the following situation:
```json
{
  "persistent" : { },
  "transient" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "exclude" : {
            "_name" : "elasticsearch-sample-es-nodes-4,elasticsearch-sample-es-nodes-3,elasticsearch-sample-es-nodes-2,elasticsearch-sample-es-nodes-1,elasticsearch-sample-es-nodes-0"
          }
        }
      }
    }
  }
}
```
It helps if one uses a cluster with actual data in it ...
🤦♂ my bad and sorry, forgot to mention that there should be some data in the nodes |
```go
if ctx.es.Spec.UpdateStrategy.ChangeBudget != nil {
	createsAllowed = int32(ctx.es.Spec.UpdateStrategy.ChangeBudget.MaxSurge)
	createsAllowed += expectedResources.StatefulSets().ExpectedNodeCount()
	createsAllowed -= actualStatefulSets.ExpectedNodeCount()
```
I am a bit confused. Should this not be `max(actualStatefulSets.ExpectedNodeCount(), expectedResources.StatefulSets().ExpectedNodeCount()) + maxSurge`? How could we otherwise make progress, if we are going to replace a stateful set?
> 🤦♂ my bad and sorry, forgot to mention that there should be some data in the nodes
@barkbay no, that's on me being a bit slow in the uptake :-) Am I right in assuming that your suggestion of excluding nodes step by step as we remove them (which is of course correct) could be complemented by actually creating nodes at the same time? I am not sure if my interpretation of the surge budget of 1 is OK, which is to say we can have one additional node in addition to whatever number of nodes currently exist in the cluster.

Otherwise we would just needlessly migrate data to nodes that are about to be removed in the next round. With non-trivial amounts of data there would also be no guarantee that the remaining old nodes could host the data.
> I am a bit confused. Should this not be `max(actualStatefulSets.ExpectedNodeCount(), expectedResources.StatefulSets().ExpectedNodeCount()) + maxSurge`? How could we otherwise make progress, if we are going to replace a stateful set?
I'd think that a user applying a spec downscaling from 150 to 100 nodes with a surge of 20 expects to see at most 120 nodes, not 170. The user can make progress when the surge is adjusted so that it allows creating nodes despite the actual state. Also, in any case, we can't be sure to make progress.
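Concretely, with the formula from the excerpt above: `createsAllowed = maxSurge + expected - actual = 20 + 100 - 150 = -30`, presumably clamped to 0, so no extra node is created and the total never exceeds 120 nodes. Under the `max(actual, expected) + maxSurge` alternative, the ceiling would instead be `max(150, 100) + 20 = 170`.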
I am willing to accept that there is no optimal solution that covers all cases. The question is how contrived the rename + scale-down case is. Also, I am a bit worried that we don't surface the state of being stuck in a way that is visible enough to the user. A log message might not be enough.
In the spec, maxUnavailable and maxSurge can have three types of values:

- nil - uses the default value,
- negative - means the value is unbounded,
- non-negative - means the value is used directly.

In code, nil means unbounded, non-negative is to be used directly, and negative is not expected. Adjusted tests.
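A minimal sketch of that mapping, assuming a hypothetical helper name:

```go
// normalize maps a spec value to its in-code representation:
// nil -> the field's default, negative -> nil (unbounded),
// non-negative -> used as-is. A defaultValue of nil means the
// field defaults to unbounded (as proposed for maxSurge).
// Hypothetical sketch, not the actual ECK code.
func normalize(spec, defaultValue *int32) *int32 {
	if spec == nil {
		spec = defaultValue
	}
	if spec == nil || *spec < 0 {
		return nil // unbounded in code
	}
	v := *spec
	return &v
}
```

Callers would then use something like `normalize(spec.MaxSurge, nil)` (default unbounded) and `normalize(spec.MaxUnavailable, &one)` (default 1).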
There seems to be an ongoing issue with CRD generation, so fixing the CRD manually for now.
force-pushed from 4c1ae66 to fa47a4a
force-pushed from 8a42450 to b185dd0
LGTM 👍
LGTM
This change will cause the operator to prevent creating or removing nodes where doing so would cause the pod counts to violate the MaxSurge/MaxUnavailable settings in the change budget.
Some notes:
Fixes #1292.
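For illustration, a rough, runnable sketch of how such bounds could be computed from the change budget. The function name and the removals formula are assumptions that happen to be consistent with the numbers discussed in the thread (`createsAllowed = 0`, `removalsAllowed = 3` for the rename + scale-down example), not the actual operator code:

```go
package main

import (
	"fmt"
	"math"
)

// boundsFor is a hypothetical illustration of the change budget invariants
// discussed above; nil means "unbounded" for either budget field.
func boundsFor(desired, actual int32, maxSurge, maxUnavailable *int32) (createsAllowed, removalsAllowed int32) {
	createsAllowed = math.MaxInt32
	if maxSurge != nil {
		// total pods may not exceed desired + maxSurge
		createsAllowed = *maxSurge + desired - actual
		if createsAllowed < 0 {
			createsAllowed = 0
		}
	}
	removalsAllowed = math.MaxInt32
	if maxUnavailable != nil {
		// total pods may not drop below desired - maxUnavailable
		removalsAllowed = actual - (desired - *maxUnavailable)
		if removalsAllowed < 0 {
			removalsAllowed = 0
		}
	}
	return createsAllowed, removalsAllowed
}

func main() {
	one := int32(1)
	// The rename + scale-down example from the thread: 5 data nodes
	// replaced by 3 new ones, with maxSurge=1 and maxUnavailable=1.
	c, r := boundsFor(3, 5, &one, &one)
	fmt.Println(c, r) // prints: 0 3
}
```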