UPSTREAM: <carry>: cluster-api: use node.Spec.ProviderID for identifying nodegroups #14

frobware · 2018-11-30T15:05:00Z

This change switches from using node.Name to node.Spec.ProviderID when building a cloudprovider.NodeGroup. Using node.Name has been a historical mistake.

There have been several suggestions for adding the provider ID to the machine object at the cluster-api level (kubernetes-sigs/cluster-api#565) and by an actuator implementation (openshift/cluster-api-provider-aws#86). For the moment we can make this mapping without either of those PRs, provided the node object carries the "machine" annotation. This annotation is currently added by the nodelink-controller (https://github.com/openshift/machine-api-operator).

Switching cloudprovider.NodeGroupForNode() to index on node.Spec.ProviderID, and returning provider ID values from cloudprovider.Nodes(), means we no longer hit the case where the nodegroup/node becomes unregistered. That case ultimately causes the autoscaler to stop both scale-up and scale-down operations because it deems the cluster unhealthy.
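To illustrate the idea only, here is a minimal sketch of a nodegroup lookup keyed on provider ID. The `Node` struct, `nodeGroupIndex`, `buildIndex` and `nodeGroupForNode` are hypothetical stand-ins invented for this example; they are not the types or functions used in this PR.

```go
package main

// Node is a hypothetical stand-in for the corev1.Node fields that matter here.
type Node struct {
	Name       string
	ProviderID string // mirrors node.Spec.ProviderID
}

// nodeGroupIndex maps a provider ID to the name of the node group owning it.
type nodeGroupIndex map[string]string

// buildIndex registers every member's provider ID under its node group.
func buildIndex(groups map[string][]Node) nodeGroupIndex {
	idx := make(nodeGroupIndex)
	for group, nodes := range groups {
		for _, n := range nodes {
			if n.ProviderID == "" {
				continue // a node without a provider ID cannot be indexed
			}
			idx[n.ProviderID] = group
		}
	}
	return idx
}

// nodeGroupForNode resolves the node group by provider ID, never by name.
func (idx nodeGroupIndex) nodeGroupForNode(n Node) (string, bool) {
	group, ok := idx[n.ProviderID]
	return group, ok
}
```

In this sketch the node name plays no part in the lookup, so a node whose name differs from what the node group recorded still resolves as long as the provider ID agrees.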

We may want to revisit how this is done in the longer term because it does assume that the nodelink-controller is running on the cluster. But this change requires no additional changes from the cluster-api (so no revendoring) nor does it require a change to the actuators that we use (AWS, libvirt).

frobware · 2018-11-30T15:11:43Z

/cc @derekwaynecarr @ingvagabund @bison @vikaschoudhary16

ingvagabund · 2018-12-04T11:12:40Z

cluster-autoscaler/cloudprovider/clusterapi/clusterapi_controller.go

-	objs, err := machineIndexer.ByIndex(nodeNameIndexKey, node.Name)
+func (c *machineController) findMachine(id string) (*v1alpha1.Machine, error) {
+	store := c.machineInformer.Informer().GetStore()
+	item, exists, err := store.GetByKey(id)


Given the store is used only here, it's more transparent to do c.machineInformer.Informer().GetStore().GetByKey(id) here.
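For readers following along, a rough sketch of a key-based lookup of this shape. It assumes a toy in-memory store; `Machine`, `keyGetter` and `mapStore` are hypothetical stand-ins, and only the `GetByKey` method shape is taken from the diff above.

```go
package main

import "fmt"

// Machine is a hypothetical stand-in for v1alpha1.Machine.
type Machine struct {
	Namespace string
	Name      string
}

// keyGetter models the single store method used in the diff above.
type keyGetter interface {
	GetByKey(key string) (item interface{}, exists bool, err error)
}

// mapStore is a toy in-memory store keyed by "namespace/name".
type mapStore map[string]interface{}

func (s mapStore) GetByKey(key string) (interface{}, bool, error) {
	item, ok := s[key]
	return item, ok, nil
}

// findMachine returns a copy of the machine stored under id.
// A missing key is reported as (nil, nil), not as an error.
func findMachine(store keyGetter, id string) (*Machine, error) {
	item, exists, err := store.GetByKey(id)
	if err != nil {
		return nil, err
	}
	if !exists {
		return nil, nil
	}
	machine, ok := item.(*Machine)
	if !ok {
		return nil, fmt.Errorf("unexpected type %T stored under %q", item, id)
	}
	copied := *machine // a shallow copy is enough for this flat struct
	return &copied, nil
}
```

Returning a copy keeps callers from mutating the cached object; a real Machine has nested fields and would need a deep copy.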

ingvagabund · 2018-12-04T11:16:29Z

cluster-autoscaler/cloudprovider/clusterapi/clusterapi_controller.go

 	return machineSet.DeepCopy(), nil
 }

-// run starts shared informers and waits for the shared informer cache
-// to synchronize.
+// Run starts shared informers and waits for the informer cache to


s/Run/run

ingvagabund · 2018-12-04T11:49:00Z

cluster-autoscaler/cloudprovider/clusterapi/clusterapi_controller.go

+// machineSet. For each machine in the set a DeepCopy() of the object
+// is returned.
+func (c *machineController) MachinesInMachineSet(machineSet *v1alpha1.MachineSet) ([]*v1alpha1.Machine, error) {
+	machines, err := c.machineInformer.Lister().Machines(machineSet.Namespace).List(labels.Everything())


This is fine with not so many machines. Though, with hundreds of machines in the same namespace (which would be insane but still) this can be slow. What's the benefit of listing all instead of listing only the machines that have labels matched by the machineset's label selector?
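A sketch of the selector-based filtering being suggested, under simplifying assumptions: only equality-based label matching is handled (no match expressions), and `Machine`, `matchesSelector` and `machinesMatching` are hypothetical names introduced for this example.

```go
package main

// Machine is a hypothetical stand-in for v1alpha1.Machine.
type Machine struct {
	Name   string
	Labels map[string]string
}

// matchesSelector reports whether every key/value pair in selector is
// present in labels. An empty selector matches everything in this sketch.
func matchesSelector(labels, selector map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// machinesMatching keeps only the machines whose labels satisfy selector.
func machinesMatching(machines []Machine, selector map[string]string) []Machine {
	var out []Machine
	for _, m := range machines {
		if matchesSelector(m.Labels, selector) {
			out = append(out, m)
		}
	}
	return out
}
```

With a real lister the selector would normally be passed to the list call itself, so non-matching machines are never returned in the first place.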

ingvagabund · 2018-12-04T11:52:00Z

cluster-autoscaler/cloudprovider/clusterapi/clusterapi_provider.go

+		}
+
+		if node != nil {
+			nodeNames = append(nodeNames, node.Spec.ProviderID)


Why do you append the ProviderID into the list of node names?

The variable is badly named now.

frobware · 2018-12-05T19:04:40Z

@ingvagabund addressed your issues. @tnozicka had some feedback that I should also be checking UUID values. That may be better done in a follow-up PR, as it's important that we switch to indexing via node.Spec.ProviderID ASAP. Thoughts? @derekwaynecarr

bison · 2018-12-05T22:35:39Z

cluster-autoscaler/cloudprovider/clusterapi/clusterapi_controller.go

+	if err != nil {
+		return nil, err
+	}
+	if len(objs) != 1 {


Should this return an error if the length is greater than one? However unlikely, it would be good to know.
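One possible shape for that check, as a sketch only: zero results mean "not found", while more than one result is surfaced as an error. The helper name `exactlyOne` and its signature are hypothetical and not taken from the PR.

```go
package main

import "fmt"

// exactlyOne returns the single element of objs for the given index key.
// No results is reported as (nil, nil); more than one result is an error,
// because it means the index key is not unique.
func exactlyOne(objs []interface{}, key string) (interface{}, error) {
	switch len(objs) {
	case 0:
		return nil, nil
	case 1:
		return objs[0], nil
	default:
		return nil, fmt.Errorf("index lookup for %q returned %d objects, expected at most 1", key, len(objs))
	}
}
```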

bison · 2018-12-05T22:39:05Z

cluster-autoscaler/cloudprovider/clusterapi/clusterapi_controller.go

 	}

-	return nil
+	machineName, found := node.Annotations["machine"]


Seems like this should probably try the newer cluster.k8s.io/machine annotation first, then fall back to machine.

Done. Thanks for the link.
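A minimal sketch of that lookup order, taking the two annotation keys from this thread. The constant names and the `machineKeyForNode` helper are hypothetical and only serve the example.

```go
package main

const (
	// machineAnnotationKey is the newer annotation mentioned in the review.
	machineAnnotationKey = "cluster.k8s.io/machine"
	// legacyMachineAnnotationKey is the older annotation used as a fallback.
	legacyMachineAnnotationKey = "machine"
)

// machineKeyForNode returns the machine reference recorded in a node's
// annotations, preferring the newer key and falling back to the legacy one.
func machineKeyForNode(annotations map[string]string) (string, bool) {
	if v, ok := annotations[machineAnnotationKey]; ok {
		return v, true
	}
	v, ok := annotations[legacyMachineAnnotationKey]
	return v, ok
}
```

Reading from a nil map is safe in Go, so a node without any annotations simply yields `"", false`.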

frobware · 2018-12-06T12:14:25Z

@tnozicka had some feedback that I should also be checking UUID values

This is now done too.

frobware · 2018-12-06T15:07:58Z

/retest

derekwaynecarr

just a question, generally looks good.

/lgtm
/approve

derekwaynecarr · 2018-12-06T15:22:50Z

cluster-autoscaler/cloudprovider/clusterapi/clusterapi_controller.go

+
+	machineName, found := node.Annotations["cluster.k8s.io/machine"]
+	if !found {
+		machineName, found = node.Annotations["machine"]


q: we seem to fall back on this always currently, is there a separate pr that updates our annotation usage?

Our nodelink-controller is only adding "machine" at the moment, so no other PR right now but will add a Jira ticket.

https://jira.coreos.com/browse/CLOUD-306

openshift-ci-robot added the size/XL label (a PR that changes 500-999 lines, ignoring generated files) Nov 30, 2018

frobware force-pushed the clusterapi-use-providerid branch from 9e16280 to 3e44d91 Compare November 30, 2018 15:05

frobware changed the title ~~UPSTREAM: <carry>: use node.Spec.ProviderID for identifying nodegroups~~ UPSTREAM: <carry>: cluster-api: use node.Spec.ProviderID for identifying nodegroups Nov 30, 2018

frobware mentioned this pull request Nov 30, 2018

Extend MachineStatus to add ProviderID kubernetes-sigs/cluster-api#565

Merged

openshift-ci-robot requested review from bison, derekwaynecarr, ingvagabund and vikaschoudhary16 November 30, 2018 15:11

frobware force-pushed the clusterapi-use-providerid branch from 3e44d91 to ec43a97 Compare December 2, 2018 21:29

openshift-ci-robot added the size/L label (a PR that changes 100-499 lines, ignoring generated files) and removed the size/XL label Dec 2, 2018

frobware force-pushed the clusterapi-use-providerid branch from ec43a97 to 8b89e4a Compare December 2, 2018 21:57

ingvagabund reviewed Dec 4, 2018

frobware mentioned this pull request Dec 5, 2018

cloud provider for cluster-api kubernetes/autoscaler#1481

Closed

bison reviewed Dec 5, 2018

Use node.Spec.ProviderID instead of node.Name

8247728

This switches the lookup of `node` objects to using the node.Spec.ProviderID. The previous index was node.Name but this is wrong as it leads to nodes going unregistered.

frobware force-pushed the clusterapi-use-providerid branch from 4e26565 to 8247728 Compare December 6, 2018 15:09

derekwaynecarr approved these changes Dec 6, 2018

openshift-ci-robot assigned derekwaynecarr Dec 6, 2018

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 6, 2018

openshift-merge-robot merged commit 47787c4 into openshift:master Dec 6, 2018

frobware mentioned this pull request Dec 13, 2018

Add providerID annotation to machine object openshift/cluster-api-provider-aws#86

Closed

frobware deleted the clusterapi-use-providerid branch March 22, 2019 06:19

stuartnelson3 mentioned this pull request Apr 16, 2019

Unregistered nodes present kubernetes/autoscaler#775

Closed
