Support rolling upgrade on openstack #9927

Merged

olemarkus
Member

This is a rather brutish approach: it runs something close to a full update cluster between rolls.
To run only smaller parts of update cluster, I think we first need to split ApplyClusterCmd.Run() into smaller parts, perhaps with cloud-specific functions instead of all the switch statements.
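
A minimal sketch of what "cloud-specific functions instead of switch statements" could look like. Everything here is hypothetical and for illustration only: the `CloudPhases` interface, the phase names, and the task strings are assumptions, not kops APIs.

```go
package main

import (
	"errors"
	"fmt"
)

// CloudPhases is a hypothetical interface: each cloud provider supplies its
// own implementation of the steps that a monolithic Run() would otherwise
// select with switch statements.
type CloudPhases interface {
	// BuildNetworkTasks and BuildInstanceTasks are illustrative phase names,
	// not functions that exist in kops.
	BuildNetworkTasks() []string
	BuildInstanceTasks(group string) []string
}

// openstackPhases is an assumed OpenStack implementation of the phases.
type openstackPhases struct{}

func (openstackPhases) BuildNetworkTasks() []string {
	return []string{"network", "subnet"}
}

func (openstackPhases) BuildInstanceTasks(group string) []string {
	return []string{"port/" + group, "instance/" + group}
}

// phasesFor replaces scattered switch statements with a single lookup.
func phasesFor(provider string) (CloudPhases, error) {
	registry := map[string]CloudPhases{
		"openstack": openstackPhases{},
	}
	p, ok := registry[provider]
	if !ok {
		return nil, errors.New("unsupported cloud provider: " + provider)
	}
	return p, nil
}

// applyInstancePhase builds only the instance-level tasks, which is the part
// a rolling update would need between rolls.
func applyInstancePhase(provider, group string) ([]string, error) {
	p, err := phasesFor(provider)
	if err != nil {
		return nil, err
	}
	return p.BuildInstanceTasks(group), nil
}

func main() {
	tasks, err := applyInstancePhase("openstack", "nodes-zone-1")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(tasks)
}
```

With phases separated like this, a rolling update could in principle call only the instance phase rather than the whole apply.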

There are lots of good suggestions in #9635 on how to make the update process smaller, such as writing the bootstrap script to VFS. But there are too many other interdependencies between tasks (such as instance -> port -> subnet -> network) that need splitting first. Constructing tasks for resources we know exist when inside a rolling update should be easy once ApplyClusterCmd has been split into the various "phases".
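
To illustrate why the instance -> port -> subnet -> network chain gets in the way, here is a rough sketch that orders tasks by their dependencies with a depth-first walk. The `deps` map and `orderTasks` function are hypothetical and only mirror the chain named above; this is not how kops resolves task dependencies.

```go
package main

import "fmt"

// deps maps a task to the tasks it depends on. The entries mirror the
// chain mentioned above and are assumptions for illustration.
var deps = map[string][]string{
	"instance": {"port"},
	"port":     {"subnet"},
	"subnet":   {"network"},
	"network":  nil,
}

// orderTasks returns target and everything it depends on, with dependencies
// listed before the tasks that need them. It reports an error on a cycle.
func orderTasks(deps map[string][]string, target string) ([]string, error) {
	const (
		unvisited = iota
		inProgress
		done
	)
	state := map[string]int{}
	var order []string

	var visit func(name string) error
	visit = func(name string) error {
		switch state[name] {
		case done:
			return nil
		case inProgress:
			return fmt.Errorf("dependency cycle at %q", name)
		}
		state[name] = inProgress
		for _, d := range deps[name] {
			if err := visit(d); err != nil {
				return err
			}
		}
		state[name] = done
		order = append(order, name)
		return nil
	}

	if err := visit(target); err != nil {
		return nil, err
	}
	return order, nil
}

func main() {
	order, err := orderTasks(deps, "instance")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(order)
}
```

Asking for just the instance still pulls in the port, subnet and network tasks, which is why the tasks need splitting before a rolling update can build only what it needs.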

Fixes #9635

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 12, 2020
@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/provider/openstack Issues or PRs related to openstack provider labels Sep 12, 2020
@k8s-ci-robot k8s-ci-robot added area/rolling-update size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 12, 2020
@olemarkus olemarkus force-pushed the openstack-rolling-update-full-apply branch from e466b49 to 683de5f Compare September 12, 2020 17:57
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 12, 2020
@olemarkus olemarkus force-pushed the openstack-rolling-update-full-apply branch from 683de5f to b9b75d0 Compare September 20, 2020 06:02
@k8s-ci-robot k8s-ci-robot added the area/provider/aws Issues or PRs related to aws provider label Sep 20, 2020
@olemarkus olemarkus force-pushed the openstack-rolling-update-full-apply branch from b9b75d0 to 65b3899 Compare September 20, 2020 07:33
@olemarkus olemarkus marked this pull request as ready for review September 21, 2020 05:32
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 21, 2020
@olemarkus
Member Author

@zetaab maybe worth having a look at the changes here together with the detach instance work?

@@ -72,10 +74,16 @@ type RollingUpdateCluster struct {

// ValidateCount is the amount of time that a cluster needs to be validated after single node update
ValidateCount int

Ctx context.Context
Contributor

actually there is no need to embed the context within this struct. We can propagate it at every step.

Member Author

I got tired of passing this one along all the time :p What is the benefit of continuing to propagate it?

Contributor

I mean, the docs state that a context.Context should not be embedded within a struct. However, I found golang/go#22602, which points out that it is fine for a parameter struct, i.e. one that exists to avoid a parameter explosion in the function signature. In our case it is not really just a parameter struct, so I'd prefer to keep the context as the first parameter.

Member Author

It is not strictly a parameter struct since we define functions on it, but none of the functions mutate the struct, so for all intents and purposes it is one. The concerns in that issue are not applicable to this struct either, as far as I can tell.
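
For reference, a small sketch of the alternative discussed in this thread: passing the context as the first parameter rather than storing it in the struct. The `updater` type, `validate` and `runBoth` are hypothetical stand-ins and are not taken from the kops code.

```go
package main

import (
	"context"
	"fmt"
)

// updater is a hypothetical stand-in for RollingUpdateCluster. It holds only
// configuration; the context is deliberately not a field.
type updater struct {
	ValidateCount int
}

// validate takes ctx as its first parameter, so each call site decides which
// context (and therefore which deadline or cancellation) applies.
func (u *updater) validate(ctx context.Context) (int, error) {
	for i := 0; i < u.ValidateCount; i++ {
		// Stop early if the caller's context has been cancelled.
		if err := ctx.Err(); err != nil {
			return i, err
		}
	}
	return u.ValidateCount, nil
}

// runBoth calls validate once with a live context and once with a context
// that has already been cancelled, using the same updater value.
func runBoth(u *updater) (completed int, cancelledErr error) {
	completed, _ = u.validate(context.Background())

	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	_, cancelledErr = u.validate(ctx)
	return completed, cancelledErr
}

func main() {
	completed, err := runBoth(&updater{ValidateCount: 3})
	fmt.Println(completed, err)
}
```

The trade-off is the one raised above: the parameter has to be threaded through every call, but a single struct value can then serve calls with different lifetimes.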

@olemarkus
Member Author

/cc @zetaab

@k8s-ci-robot k8s-ci-robot requested a review from zetaab September 22, 2020 18:06
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 23, 2020
@olemarkus olemarkus force-pushed the openstack-rolling-update-full-apply branch from 65b3899 to 6d773d0 Compare September 27, 2020 10:06
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Sep 27, 2020
@hakman hakman added this to the v1.19 milestone Sep 29, 2020
@zetaab
Member

zetaab commented Sep 30, 2020

I cannot test this because of #10003 and #10004

@zetaab
Member

zetaab commented Oct 1, 2020

% ./kops rolling-update cluster rolling2.k8s.local --yes
NAME           STATUS       NEEDUPDATE  READY  MIN  TARGET  MAX  NODES
bastions       NeedsUpdate  1           0      1    1       1    0
master-zone-1  NeedsUpdate  1           0      1    1       1    1
master-zone-2  NeedsUpdate  1           0      1    1       1    1
master-zone-3  NeedsUpdate  1           0      1    1       1    1
nodes-zone-1   NeedsUpdate  1           0      1    1       1    1
nodes-zone-2   NeedsUpdate  1           0      1    1       1    1
nodes-zone-3   Ready        0           0      0    0       0    0
I1001 17:26:58.490257   19777 instancegroups.go:519] Stopping instance "386bf040-b0a7-4734-a71f-66aefcff220d", in group "rolling2.k8s.local-bastions" (this may take a while).
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x3b42ce0]

goroutine 73 [running]:
k8s.io/kops/upup/pkg/fi/cloudup.(*ApplyClusterCmd).Run(0xc0004dbad0, 0x524a840, 0xc00019c010, 0x0, 0x0)
	/Users/jessehaka/go/src/k8s.io/kops/upup/pkg/fi/cloudup/apply_cluster.go:140 +0x7dc0
k8s.io/kops/pkg/instancegroups.(*RollingUpdateCluster).reconcileInstanceGroup(0xc000fd00c0, 0xc0000ec2a0, 0x0)
	/Users/jessehaka/go/src/k8s.io/kops/pkg/instancegroups/instancegroups.go:407 +0x17d
k8s.io/kops/pkg/instancegroups.(*RollingUpdateCluster).drainTerminateAndWait(0xc000fd00c0, 0xc0000ec2a0, 0x37e11d600, 0xc000a96720, 0x0)
	/Users/jessehaka/go/src/k8s.io/kops/pkg/instancegroups/instancegroups.go:377 +0x238
k8s.io/kops/pkg/instancegroups.(*RollingUpdateCluster).rollingUpdateInstanceGroup.func1(0xc000d288a0, 0xc000fd00c0, 0x37e11d600, 0xc0000ec2a0)
	/Users/jessehaka/go/src/k8s.io/kops/pkg/instancegroups/instancegroups.go:168 +0x3f
created by k8s.io/kops/pkg/instancegroups.(*RollingUpdateCluster).rollingUpdateInstanceGroup
	/Users/jessehaka/go/src/k8s.io/kops/pkg/instancegroups/instancegroups.go:167 +0x5d8

Member

@zetaab zetaab left a comment

As mentioned in the last comment, this will panic. The problem is that c.Clientset is nil in the reconcileInstanceGroup function.

You need to pass the clientset to the variables in cmd/kops/rollingupdatecluster.go, line 330.
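
As a general illustration of this failure mode, the sketch below checks a required field up front and returns an error instead of dereferencing nil. The types here (`Clientset`, `fakeClientset`, `rollingUpdate`) are hypothetical stand-ins, and this is not necessarily how the PR resolved the panic.

```go
package main

import (
	"errors"
	"fmt"
)

// Clientset is a hypothetical stand-in for the clientset dependency.
type Clientset interface {
	ClusterName() string
}

// fakeClientset is a trivial implementation used for the example.
type fakeClientset struct{ name string }

func (f fakeClientset) ClusterName() string { return f.name }

// rollingUpdate is a hypothetical stand-in for RollingUpdateCluster.
type rollingUpdate struct {
	Clientset Clientset
}

var errNoClientset = errors.New("rolling update: Clientset must be set before reconciling")

// reconcile validates its dependencies first, so a missing Clientset
// surfaces as an error rather than a nil pointer dereference.
func (r *rollingUpdate) reconcile() (string, error) {
	if r.Clientset == nil {
		return "", errNoClientset
	}
	return "reconciled " + r.Clientset.ClusterName(), nil
}

func main() {
	// Forgetting to populate Clientset yields an error, not a panic.
	if _, err := (&rollingUpdate{}).reconcile(); err != nil {
		fmt.Println("error:", err)
	}

	r := &rollingUpdate{Clientset: fakeClientset{name: "rolling2.k8s.local"}}
	msg, _ := r.reconcile()
	fmt.Println(msg)
}
```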

@olemarkus olemarkus force-pushed the openstack-rolling-update-full-apply branch from 6d773d0 to badf57a Compare October 1, 2020 17:49
@olemarkus olemarkus force-pushed the openstack-rolling-update-full-apply branch from badf57a to aa66c4f Compare October 1, 2020 18:07
@olemarkus
Member Author

Looks like a bit went missing when I factored out some of the BuildCloud stuff. Could you try again now, @zetaab ?

Member

@zetaab zetaab left a comment

Okay, this is not perfect, but it is already much better than before this PR.

Two issues:

  1. If something goes wrong between the deletion of a single instance and the apply cluster run, you need to run kops update cluster once to get the machines up and running.
  2. Running a whole kops cluster update is not ideal. We should be able to run only some parts of it.

However, this is a good start!

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 2, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: olemarkus, zetaab

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 2, 2020
@k8s-ci-robot k8s-ci-robot merged commit c02d37c into kubernetes:master Oct 2, 2020
@kciredor

Will this make it into 1.18.2?

@olemarkus olemarkus deleted the openstack-rolling-update-full-apply branch October 15, 2020 13:08
@olemarkus
Member Author

No, this will not be backported. The change is pretty big.

Successfully merging this pull request may close these issues.

OpenStack specific rolling update should not require separate update cluster process