
Fix DockerMachinePool rollout flow and deletion behaviour #9655

Closed
killianmuldoon opened this issue Nov 1, 2023 · 7 comments
Labels
  • help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
  • kind/bug: Categorizes issue or PR as related to a bug.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@killianmuldoon
Contributor

When using Docker infrastructure, rolling out a MachinePool deletes the underlying infrastructure straight away, without allowing for any other deletion steps, e.g. draining the node or deleting the Node object.

if totalNumberOfMachines > desiredReplicas || !np.isMachineMatchingInfrastructureSpec(machine) {
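For context, a minimal sketch of how a check like this can drive the rollout is shown below, assuming a hypothetical dockerMachine interface and a matchesSpec helper standing in for np.isMachineMatchingInfrastructureSpec (this is illustrative, not the exact CAPD controller code):

package sketch

import "context"

// dockerMachine is a hypothetical stand-in for a CAPD machine whose Delete
// removes the backing Docker container.
type dockerMachine interface {
	Delete(ctx context.Context) error
}

// scaleDownSketch illustrates the problem: any machine over the desired
// replica count, or not matching the current infrastructure spec, has its
// container removed immediately, with no drain or Node cleanup beforehand.
func scaleDownSketch(ctx context.Context, machines []dockerMachine, desiredReplicas int, matchesSpec func(dockerMachine) bool) error {
	totalNumberOfMachines := len(machines)
	for _, machine := range machines {
		if totalNumberOfMachines > desiredReplicas || !matchesSpec(machine) {
			if err := machine.Delete(ctx); err != nil {
				return err
			}
			totalNumberOfMachines--
		}
	}
	return nil
}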

Once the underlying infrastructure is deleted, the MachinePool controller is supposed to clean up the corresponding Kubernetes Node, treating it as retired:

// deleteRetiredNodes deletes nodes that don't have a corresponding ProviderID in Spec.ProviderIDList.
// A MachinePool infrastructure provider indicates an instance in the set has been deleted by
// removing its ProviderID from the slice.
func (r *MachinePoolReconciler) deleteRetiredNodes(ctx context.Context, c client.Client, nodeRefs []corev1.ObjectReference, providerIDList []string) error {
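Based on that doc comment alone, the cleanup would look roughly like the following sketch (an illustrative reconstruction using controller-runtime, not the actual core CAPI implementation):

package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/util/sets"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteRetiredNodesSketch deletes any Node whose provider ID is no longer
// present in the MachinePool's Spec.ProviderIDList, mirroring the behaviour
// described in the doc comment quoted above.
func deleteRetiredNodesSketch(ctx context.Context, c client.Client, nodeRefs []corev1.ObjectReference, providerIDList []string) error {
	wanted := sets.NewString(providerIDList...)
	for _, ref := range nodeRefs {
		node := &corev1.Node{}
		if err := c.Get(ctx, client.ObjectKey{Name: ref.Name}, node); err != nil {
			if apierrors.IsNotFound(err) {
				continue // Node is already gone.
			}
			return err
		}
		// The infrastructure provider signals that an instance was deleted by
		// removing its provider ID from the list; the Node object is then
		// deleted here, after the infrastructure has already disappeared.
		if !wanted.Has(node.Spec.ProviderID) {
			if err := c.Delete(ctx, node); err != nil && !apierrors.IsNotFound(err) {
				return err
			}
		}
	}
	return nil
}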

This flow seems to work fine for certain MachinePool use cases, but it means that self-hosted clusters running with MachinePools can go offline for an extended period.

As described in #9522 (comment), deletion of the underlying infrastructure can kill CAPI management components if they're running on a DockerMachinePool-backed node. The pod and node are stopped without informing Kubernetes, which then waits until the pod-eviction-timeout has expired before bringing the management components back into a working state.

This is a substantial bug for self-hosted clusters using DockerMachinePools. I'm not sure of the state in other infrastructure providers, or whether the self-hosted MachinePool case is of interest to, and tested by, other infrastructure providers.

The fix for this bug may include changes to the core MachinePool controller. It's not clear to me whether this bug lies strictly in the implementation of the DockerMachinePool controller or in the design of the MachinePool rollout flow.
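As one possible direction, purely a sketch of the ordering rather than an actual CAPI/CAPD API (gracefulTeardownSketch and deleteBackingContainer are hypothetical names), the teardown could cordon and drain the Node and remove the Node object before the backing container is deleted:

package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// gracefulTeardownSketch shows one possible ordering for removing a
// DockerMachinePool instance without abruptly killing its workloads.
func gracefulTeardownSketch(ctx context.Context, workloadClient client.Client, node *corev1.Node, deleteBackingContainer func(context.Context) error) error {
	// 1. Cordon: stop new pods from being scheduled onto the node.
	patched := node.DeepCopy()
	patched.Spec.Unschedulable = true
	if err := workloadClient.Patch(ctx, patched, client.MergeFrom(node)); err != nil {
		return err
	}

	// 2. Drain: evict the pods so they are rescheduled elsewhere before the
	// infrastructure disappears (eviction logic elided in this sketch).

	// 3. Delete the Node object so the cluster doesn't sit waiting for the
	// pod-eviction-timeout on a node that will never come back.
	if err := workloadClient.Delete(ctx, patched); err != nil && !apierrors.IsNotFound(err) {
		return err
	}

	// 4. Only now remove the underlying infrastructure (the Docker container
	// in CAPD's case).
	return deleteBackingContainer(ctx)
}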

/kind bug

@k8s-ci-robot added the kind/bug (Categorizes issue or PR as related to a bug.) and needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.) labels on Nov 1, 2023
@killianmuldoon
Contributor Author

/triage accepted

@k8s-ci-robot added the triage/accepted (Indicates an issue or PR is ready to be actively worked on.) label and removed the needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.) label on Nov 1, 2023
@killianmuldoon
Contributor Author

@nawazkh from the CI team, @willie-yao for the MP test implementation.

@killianmuldoon
Contributor Author

/help

@k8s-ci-robot
Contributor

@killianmuldoon:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the help wanted (Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.) label on Nov 1, 2023
@CecileRobertMichon
Contributor

I believe @Jont828's PR to implement DockerMachinePool Machines (#8842) may be fixing this behavior

@CecileRobertMichon
Contributor

This was fixed by #8842

/close

@k8s-ci-robot
Contributor

@CecileRobertMichon: Closing this issue.

In response to this:

This was fixed by #8842

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
