
Change RKE upgrade logic for zero downtime #1800

Merged: 2 commits merged into rancher:master on Feb 6, 2020

Conversation

@mrajashree (Contributor) commented Nov 25, 2019

#1772
Change worker plane components upgrade strategy for zero downtime upgrades

  • Accept maxUnavailable from the user; default to 10%, rounded down
  • Calculate powered-down/unreachable hosts first (already done by TunnelHosts). If the number of unreachable hosts equals maxUnavailable, stop the upgrade
  • Adjust maxUnavailable by the number of unreachable nodes: if maxUnavailable is 10 and 3 nodes are unreachable, the effective maxUnavailable for the upgrade is 7
  • Upgrade worker components on nodes with etcd/controlplane roles first, one at a time
  • Upgrade nodes in a sliding window of size maxUnavailable: if maxUnavailable=10, 10 nodes can start upgrading in parallel, and as soon as x of those 10 finish, the next x nodes start (see the sketch after this list)
  • Cordon a node before upgrading it, with an option for the user to drain it too; uncordon the node after the upgrade
  • A node is considered upgraded when it can be listed with the kube client and its status is Ready
  • Keep saving the names of nodes that hit an error during the upgrade
  • If maxUnavailable nodes hit errors during the upgrade, stop the upgrade process
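To make the cordon/drain/upgrade flow and the sliding window concrete, here is a minimal Go sketch of the strategy described above. It is not RKE's actual code: the host type and the cordonNode, drainNode, upgradeWorkerHost and uncordonNode helpers are hypothetical stand-ins for the real Kubernetes API calls and host tunnels. At most maxUnavailable nodes are in flight at once, and no new node starts once maxUnavailable nodes have failed.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
)

type host struct{ Address string }

// Hypothetical per-node steps; the real implementation goes through the
// Kubernetes API and RKE's host tunnels.
func cordonNode(ctx context.Context, h *host) error        { return nil }
func drainNode(ctx context.Context, h *host) error         { return nil }
func upgradeWorkerHost(ctx context.Context, h *host) error { return nil }
func uncordonNode(ctx context.Context, h *host) error      { return nil }

// upgradeOne cordons (and optionally drains) a node, upgrades its worker
// components, and uncordons it again.
func upgradeOne(ctx context.Context, h *host, drain bool) error {
	if err := cordonNode(ctx, h); err != nil {
		return err
	}
	if drain {
		if err := drainNode(ctx, h); err != nil {
			return err
		}
	}
	if err := upgradeWorkerHost(ctx, h); err != nil {
		return err
	}
	return uncordonNode(ctx, h)
}

// upgradeWorkerNodes upgrades hosts in a sliding window of size maxUnavailable:
// at most maxUnavailable hosts are in flight at a time, and a new host starts
// as soon as any in-flight host finishes. Once maxUnavailable hosts have
// failed, no further hosts are started. It returns the addresses of the hosts
// that failed.
func upgradeWorkerNodes(ctx context.Context, hosts []*host, maxUnavailable int, drain bool) []string {
	if maxUnavailable < 1 {
		maxUnavailable = 1
	}
	sem := make(chan struct{}, maxUnavailable) // the sliding window
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		failed   []string
		errCount int64
	)
	for _, h := range hosts {
		if atomic.LoadInt64(&errCount) >= int64(maxUnavailable) {
			break // too many nodes failed; stop starting new upgrades
		}
		sem <- struct{}{} // blocks until a slot in the window frees up
		wg.Add(1)
		go func(h *host) {
			defer wg.Done()
			defer func() { <-sem }()
			if err := upgradeOne(ctx, h, drain); err != nil {
				atomic.AddInt64(&errCount, 1)
				mu.Lock()
				failed = append(failed, h.Address)
				mu.Unlock()
			}
		}(h)
	}
	wg.Wait()
	return failed
}

func main() {
	workers := []*host{{"node1"}, {"node2"}, {"node3"}, {"node4"}}
	failed := upgradeWorkerNodes(context.Background(), workers, 2, true)
	fmt.Println("failed hosts:", failed)
}
```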

For clusters with a large number of nodes, upgrading a percentage of them based on maxUnavailable would spawn a large number of goroutines, which caused errors; the linked issue has the details and explains why RKE switched to a worker pool. So maxUnavailable is respected as long as it is not too large, and is capped at 50, the number of worker threads RKE currently uses.
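Below is a minimal sketch of how the effective maxUnavailable could be derived from the rules above: accept either a count or a percentage, round a percentage down, subtract already-unreachable hosts, and cap the result at the 50-thread worker pool. The function names and the plain string parsing are illustrative assumptions, not RKE's actual implementation.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// maxWorkerThreads is the worker-pool size mentioned above; the cap keeps the
// number of concurrent upgrade goroutines bounded.
const maxWorkerThreads = 50

// parseMaxUnavailable resolves a count ("7") or a percentage ("10%") against
// the total number of worker nodes, rounding percentages down.
func parseMaxUnavailable(value string, totalNodes int) (int, error) {
	if strings.HasSuffix(value, "%") {
		pct, err := strconv.Atoi(strings.TrimSuffix(value, "%"))
		if err != nil {
			return 0, err
		}
		return totalNodes * pct / 100, nil // integer division rounds down
	}
	return strconv.Atoi(value)
}

// effectiveMaxUnavailable subtracts hosts that are already unreachable and caps
// the result at the worker-pool size. If the unreachable count already reaches
// maxUnavailable, the upgrade should not proceed at all.
func effectiveMaxUnavailable(maxUnavailable, unreachable int) (int, error) {
	if unreachable >= maxUnavailable {
		return 0, fmt.Errorf("%d hosts unreachable, >= maxUnavailable (%d): stopping upgrade", unreachable, maxUnavailable)
	}
	adjusted := maxUnavailable - unreachable
	if adjusted > maxWorkerThreads {
		adjusted = maxWorkerThreads
	}
	return adjusted, nil
}

func main() {
	mu, _ := parseMaxUnavailable("10%", 120) // 12
	adjusted, _ := effectiveMaxUnavailable(mu, 3)
	fmt.Println(adjusted) // 9
}
```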

#1734
Upgrade controlplane components one by one for zero downtime upgrades
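As a rough illustration of the one-node-at-a-time controlplane strategy, here is a minimal sketch; upgradeControlHost and waitForNodeReady are hypothetical stand-ins for RKE's real per-host upgrade and health checks, not its actual functions.

```go
package main

import (
	"context"
	"fmt"
)

type cpHost struct{ Address string }

// Hypothetical stand-ins for the real upgrade and readiness checks.
func upgradeControlHost(ctx context.Context, h cpHost) error { return nil }
func waitForNodeReady(ctx context.Context, h cpHost) error   { return nil }

// upgradeControlPlane upgrades controlplane hosts strictly one at a time and
// stops at the first failure, so at most one controlplane node is unavailable
// at any point during the upgrade.
func upgradeControlPlane(ctx context.Context, hosts []cpHost) error {
	for _, h := range hosts {
		if err := upgradeControlHost(ctx, h); err != nil {
			return fmt.Errorf("failed to upgrade controlplane node %s: %w", h.Address, err)
		}
		if err := waitForNodeReady(ctx, h); err != nil {
			return fmt.Errorf("controlplane node %s did not become ready: %w", h.Address, err)
		}
	}
	return nil
}

func main() {
	hosts := []cpHost{{"cp1"}, {"cp2"}, {"cp3"}}
	fmt.Println(upgradeControlPlane(context.Background(), hosts))
}
```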

Types PR for drain input: rancher/types#1069

@riaan53 commented Dec 5, 2019

Does this include the ability to drain nodes before an upgrade to do a graceful rolling cluster upgrade?

@mrajashree force-pushed the workers_upgrade branch 7 times, most recently from d17cb97 to 18849c3, on January 17, 2020 23:21
@superseb self-requested a review on January 17, 2020 23:22
@kinarashah self-requested a review on January 19, 2020 01:42
@mrajashree force-pushed the workers_upgrade branch 2 times, most recently from 3bba464 to 956c67d, on January 20, 2020 06:45
@mrajashree marked this pull request as ready for review on January 20, 2020 06:46
@mrajashree force-pushed the workers_upgrade branch 2 times, most recently from 2a47873 to 8a69634, on January 20, 2020 06:55
@mrajashree changed the title from "Upgrade workers in user configurable batches" to "Change RKE upgrade logic for zero downtime" on Jan 20, 2020
@mrajashree force-pushed the workers_upgrade branch 2 times, most recently from 0e842b9 to eca47c6, on January 20, 2020 18:15
@mrajashree requested a review from a team on January 20, 2020 19:20
@mrajashree force-pushed the workers_upgrade branch 9 times, most recently from df73c49 to 6f4ec3d, on January 22, 2020 00:16
Review comment (Member) on the startWorkerPlane function signature:

    func startWorkerPlane(ctx context.Context, kubeClient *kubernetes.Clientset, allHosts []*hosts.Host, localConnDialerFactory hosts.DialerFactory, prsMap map[string]v3.PrivateRegistry, workerNodePlanMap map[string]v3.RKEConfigNodePlan, certMap map[string]pki.CertificatePKI, updateWorkersOnly bool, alpineImage string,

nit: Maybe the method name could be "startWorkerPlaneUpgrade" to make it clear that this is a part of the upgrade process?

@superseb (Contributor) commented Feb 5, 2020

The outcome of upgrading workers with one node that never gets to Ready state:

DEBU[0097] [worker] Now checking status of node 18.202.232.104                                                                            
DEBU[0097] [worker] Found node by name 18.202.232.104                                                                                     
DEBU[0102] [worker] Now checking status of node 18.202.232.104                                                                            
DEBU[0102] [worker] Found node by name 18.202.232.104                                                                                     
ERRO[0107] Failed to upgrade hosts: 18.202.232.104,18.202.232.104 with error [host 18.202.232.104 not ready] 
...
INFO[0130] Finished building Kubernetes cluster successfully 
  • The log contains the node name twice
  • The end of the run doesn't log that there was an error with a node
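The duplicate address in the error above suggests the same failed host is recorded more than once. As a purely illustrative sketch (not the fix made in this PR), failed hosts could be collected in a set keyed by address and surfaced as a single aggregated error at the end of the run, so each address is reported once and the final log reflects the failure:

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// failedHosts records upgrade failures keyed by host address, so recording the
// same address twice keeps a single entry.
type failedHosts struct {
	mu    sync.Mutex
	hosts map[string]error
}

func newFailedHosts() *failedHosts {
	return &failedHosts{hosts: map[string]error{}}
}

// Add records a failure for a host address.
func (f *failedHosts) Add(address string, err error) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.hosts[address] = err
}

// Err returns one aggregated error, or nil if nothing failed, so the caller
// can fail the run instead of logging that the cluster was built successfully.
func (f *failedHosts) Err() error {
	f.mu.Lock()
	defer f.mu.Unlock()
	if len(f.hosts) == 0 {
		return nil
	}
	addresses := make([]string, 0, len(f.hosts))
	for a := range f.hosts {
		addresses = append(addresses, a)
	}
	sort.Strings(addresses)
	return fmt.Errorf("failed to upgrade hosts: %v", addresses)
}

func main() {
	f := newFailedHosts()
	f.Add("18.202.232.104", fmt.Errorf("host not ready"))
	f.Add("18.202.232.104", fmt.Errorf("host not ready")) // duplicate is collapsed
	fmt.Println(f.Err()) // failed to upgrade hosts: [18.202.232.104]
}
```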

@mrajashree force-pushed the workers_upgrade branch 3 times, most recently from 8b20529 to 2d7d4ad, on February 5, 2020 20:11
@superseb (Contributor) previously approved these changes on Feb 5, 2020 and left a comment:

LGTM

@mrajashree merged commit 92714e5 into rancher:master on Feb 6, 2020