
Fix DockerMachinePool rollout flow and deletion behaviour #9655

Closed
killianmuldoon opened this issue Nov 1, 2023 · 7 comments
Labels
  • help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
  • kind/bug: Categorizes issue or PR as related to a bug.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@killianmuldoon
Contributor

When using Docker infrastructure, rolling out a MachinePool deletes the underlying infrastructure straight away, without allowing for any other deletion steps, e.g. draining the node or deleting the Node object.

if totalNumberOfMachines > desiredReplicas || !np.isMachineMatchingInfrastructureSpec(machine) {
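For context, a minimal sketch of how a check like this can drive the rollout is shown below, assuming a hypothetical dockerMachine interface and a matchesSpec helper standing in for np.isMachineMatchingInfrastructureSpec (this is illustrative, not the exact CAPD controller code):

package sketch

import "context"

// dockerMachine is a hypothetical stand-in for a CAPD machine whose Delete
// removes the backing Docker container.
type dockerMachine interface {
	Delete(ctx context.Context) error
}

// scaleDownSketch illustrates the problem: any machine over the desired
// replica count, or not matching the current infrastructure spec, has its
// container removed immediately, with no drain or Node cleanup beforehand.
func scaleDownSketch(ctx context.Context, machines []dockerMachine, desiredReplicas int, matchesSpec func(dockerMachine) bool) error {
	totalNumberOfMachines := len(machines)
	for _, machine := range machines {
		if totalNumberOfMachines > desiredReplicas || !matchesSpec(machine) {
			if err := machine.Delete(ctx); err != nil {
				return err
			}
			totalNumberOfMachines--
		}
	}
	return nil
}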

Once the underlying infrastructure is deleted, the MachinePool controller is supposed to clean up the corresponding Kubernetes Node, treating it as retired:

// deleteRetiredNodes deletes nodes that don't have a corresponding ProviderID in Spec.ProviderIDList.
// A MachinePool infrastructure provider indicates an instance in the set has been deleted by
// removing its ProviderID from the slice.
func (r *MachinePoolReconciler) deleteRetiredNodes(ctx context.Context, c client.Client, nodeRefs []corev1.ObjectReference, providerIDList []string) error {
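Based on that doc comment alone, the cleanup would look roughly like the following sketch (an illustrative reconstruction using controller-runtime, not the actual core CAPI implementation):

package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/util/sets"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteRetiredNodesSketch deletes any Node whose provider ID is no longer
// present in the MachinePool's Spec.ProviderIDList, mirroring the behaviour
// described in the doc comment quoted above.
func deleteRetiredNodesSketch(ctx context.Context, c client.Client, nodeRefs []corev1.ObjectReference, providerIDList []string) error {
	wanted := sets.NewString(providerIDList...)
	for _, ref := range nodeRefs {
		node := &corev1.Node{}
		if err := c.Get(ctx, client.ObjectKey{Name: ref.Name}, node); err != nil {
			if apierrors.IsNotFound(err) {
				continue // Node is already gone.
			}
			return err
		}
		// The infrastructure provider signals that an instance was deleted by
		// removing its provider ID from the list; the Node object is then
		// deleted here, after the infrastructure has already disappeared.
		if !wanted.Has(node.Spec.ProviderID) {
			if err := c.Delete(ctx, node); err != nil && !apierrors.IsNotFound(err) {
				return err
			}
		}
	}
	return nil
}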

This flow seems to work fine for certain MachinePool use cases, but it means that self-hosted clusters running with MachinePools can go offline for an extended period.

As described in #9522 (comment), deletion of the underlying infrastructure can kill CAPI management components if they're running on a DockerMachinePool-backed node. The pod and node are stopped without informing Kubernetes, which then waits until the pod-eviction-timeout has expired before bringing the management components back into a working state.

This is a substantial bug for self-hosted clusters using DockerMachinePools. I'm not sure of the state in other infrastructure providers, or whether the self-hosted MachinePool case is of interest to, and tested by, other infrastructure providers.

The fix for this bug may include changes to the core MachinePool controller. It's not clear to me whether this bug lies strictly in the implementation of the DockerMachinePool controller or in the design of the MachinePool rollout flow.
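As one possible direction, purely a sketch of the ordering rather than an actual CAPI/CAPD API (gracefulTeardownSketch and deleteBackingContainer are hypothetical names), the teardown could cordon and drain the Node and remove the Node object before the backing container is deleted:

package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// gracefulTeardownSketch shows one possible ordering for removing a
// DockerMachinePool instance without abruptly killing its workloads.
func gracefulTeardownSketch(ctx context.Context, workloadClient client.Client, node *corev1.Node, deleteBackingContainer func(context.Context) error) error {
	// 1. Cordon: stop new pods from being scheduled onto the node.
	patched := node.DeepCopy()
	patched.Spec.Unschedulable = true
	if err := workloadClient.Patch(ctx, patched, client.MergeFrom(node)); err != nil {
		return err
	}

	// 2. Drain: evict the pods so they are rescheduled elsewhere before the
	// infrastructure disappears (eviction logic elided in this sketch).

	// 3. Delete the Node object so the cluster doesn't sit waiting for the
	// pod-eviction-timeout on a node that will never come back.
	if err := workloadClient.Delete(ctx, patched); err != nil && !apierrors.IsNotFound(err) {
		return err
	}

	// 4. Only now remove the underlying infrastructure (the Docker container
	// in CAPD's case).
	return deleteBackingContainer(ctx)
}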

/kind bug

@k8s-ci-robot added the kind/bug (Categorizes issue or PR as related to a bug.) and needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.) labels on Nov 1, 2023
@killianmuldoon
Contributor Author

/triage accepted

@k8s-ci-robot added the triage/accepted (Indicates an issue or PR is ready to be actively worked on.) label and removed the needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.) label on Nov 1, 2023
@killianmuldoon
Contributor Author

@nawazkh from the CI team, @willie-yao for the MP test implementation.

@killianmuldoon
Contributor Author

/help

@k8s-ci-robot
Contributor

@killianmuldoon:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the help wanted (Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.) label on Nov 1, 2023
@CecileRobertMichon
Contributor

I believe @Jont828's PR to implement DockerMachinePool Machines (#8842) may be fixing this behavior

@CecileRobertMichon
Contributor

This was fixed by #8842

/close

@k8s-ci-robot
Contributor

@CecileRobertMichon: Closing this issue.

In response to this:

This was fixed by #8842

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
