Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UPSTREAM: <carry>: cluster-api: use node.Spec.ProviderID for identifying nodegroups #14

Merged

Conversation

frobware
Copy link

This change switches from using node.Name to node.Spec.ProviderID when building a cloudprovider.NodeGroup. Using node.Name has been a historical mistake.

There have been several suggestions for adding the provider ID to the machine object at the cluster-api level (kubernetes-sigs/cluster-api#565) and by an actuator implementation (openshift/cluster-api-provider-aws#86). For the moment we can make this mapping without either of those PRs assuming there's the "machine" annotation on the node object. This annotation is currently added by the nodelink-controller (https://github.com/openshift/machine-api-operator).

Switching cloudprovider.NodeGroupForNode() to indexing on node.Spec.ProviderID and also returning provider ID values in cloudprovider.Nodes() means we no longer experience the case where the nodegroup/node becomes unregistered, which ultimately leads to the autoscaler to stop both scale up and down operations as it deems the state of the cluster to be unhealthy.

We may want to revisit how this is done in the longer term because it does assume that the nodelink-controller is running on the cluster. But this change requires no additional changes from the cluster-api (so no revendoring) nor does it require a change to the actuators that we use (AWS, libvirt).

@openshift-ci-robot openshift-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Nov 30, 2018
@frobware frobware force-pushed the clusterapi-use-providerid branch from 9e16280 to 3e44d91 Compare November 30, 2018 15:05
@frobware frobware changed the title UPSTREAM: <carry>: use node.Spec.ProviderID for identifying nodegroups UPSTREAM: <carry>: cluster-api: use node.Spec.ProviderID for identifying nodegroups Nov 30, 2018
@frobware
Copy link
Author

@frobware frobware force-pushed the clusterapi-use-providerid branch from 3e44d91 to ec43a97 Compare December 2, 2018 21:29
@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 2, 2018
@frobware frobware force-pushed the clusterapi-use-providerid branch from ec43a97 to 8b89e4a Compare December 2, 2018 21:57
objs, err := machineIndexer.ByIndex(nodeNameIndexKey, node.Name)
func (c *machineController) findMachine(id string) (*v1alpha1.Machine, error) {
store := c.machineInformer.Informer().GetStore()
item, exists, err := store.GetByKey(id)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the store is used only here, it's more transparent to do c.machineInformer.Informer().GetStore().GetByKey(id) here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

return machineSet.DeepCopy(), nil
}

// run starts shared informers and waits for the shared informer cache
// to synchronize.
// Run starts shared informers and waits for the informer cache to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/Run/run

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// machineSet. For each machine in the set a DeepCopy() of the object
// is returned.
func (c *machineController) MachinesInMachineSet(machineSet *v1alpha1.MachineSet) ([]*v1alpha1.Machine, error) {
machines, err := c.machineInformer.Lister().Machines(machineSet.Namespace).List(labels.Everything())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine with not so many machines. Though, with hundreds of machines in the same namespace (which would be insine but still) this can be slow. What's the benefit of listing all instead of listing all machines that has labels matched by the machineset's label selector?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}

if node != nil {
nodeNames = append(nodeNames, node.Spec.ProviderID)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you append the ProviderID into the list of node names?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable name is badly named now.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed.

@frobware
Copy link
Author

frobware commented Dec 5, 2018

@ingvagabund addressed your issues. @tnozicka had some feedback that I should also be checking UUID values which may be better to do in a follow-up PR as it's important that we switch to indexing via node.spec.ProviderID ASAP. Thoughts? @derekwaynecarr

if err != nil {
return nil, err
}
if len(objs) != 1 {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this return an error if the length is greater than one? However unlikely, it would be good to know.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}

return nil
machineName, found := node.Annotations["machine"]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this should probably try the newer cluster.k8s.io/machine annotation first, then fallback to machine.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Thanks for the link.

@frobware
Copy link
Author

frobware commented Dec 6, 2018

@tnozicka had some feedback that I should also be checking UUID values

This is now done too.

@frobware
Copy link
Author

frobware commented Dec 6, 2018

/retest

This switches the lookup of `node` objects to using the
node.Spec.ProviderID. The previous index was node.Name but this is
wrong as it leads to nodes going unregistered.
@frobware frobware force-pushed the clusterapi-use-providerid branch from 4e26565 to 8247728 Compare December 6, 2018 15:09
Copy link
Member

@derekwaynecarr derekwaynecarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a question, generally looks good.

/lgtm
/approve


machineName, found := node.Annotations["cluster.k8s.io/machine"]
if !found {
machineName, found = node.Annotations["machine"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

q: we seem to fall back on this always currently, is there a separate pr that updates our annotation usage?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our nodelink-controller is only adding "machine" at the moment, so no other PR right now but will add a Jira ticket.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our nodelink-controller is only adding "machine" at the moment, so no other PR right now but will add a Jira ticket.

https://jira.coreos.com/browse/CLOUD-306

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 6, 2018
@openshift-merge-robot openshift-merge-robot merged commit 47787c4 into openshift:master Dec 6, 2018
@frobware frobware deleted the clusterapi-use-providerid branch March 22, 2019 06:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants