Fix DockerMachinePool rollout flow and deletion behaviour #9655
Comments
/triage accepted
@nawazkh from the CI team, @willie-yao for the MP test implementation.
/help
@killianmuldoon: Guidelines: Please ensure that the issue body includes answers to the following questions:
For more details on the requirements of such an issue, please see here and ensure that they are met. If this request no longer meets these requirements, the label can be removed. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This was fixed by #8842.
/close
@CecileRobertMichon: Closing this issue. In response to this:
When using the Docker infrastructure provider, rolling out a MachinePool deletes the underlying infrastructure straight away, without allowing for any other deletion steps, e.g. draining the node or deleting the Node object.
`cluster-api/test/infrastructure/docker/exp/internal/docker/nodepool.go` (line 92 at `a314957`)
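For illustration, the problematic path boils down to something like the sketch below. This is not the actual nodepool.go code; the type and function names (`machineDeleter`, `deleteOutdatedMachines`) are hypothetical stand-ins, but the behaviour matches: the backing container is removed as soon as it is identified as out of date, with no cordon, drain, or Node deletion first.

```go
package docker

import (
	"context"
	"fmt"
)

// machineDeleter is a hypothetical stand-in for the provider's per-machine
// abstraction; only the behaviour relevant to this issue is shown.
type machineDeleter interface {
	Name() string
	Delete(ctx context.Context) error
}

// deleteOutdatedMachines sketches the current rollout behaviour: containers
// backing out-of-date replicas are deleted immediately, so the workload
// cluster only notices when the kubelet stops reporting.
func deleteOutdatedMachines(ctx context.Context, outdated []machineDeleter) error {
	for _, m := range outdated {
		// No cordon, drain, or Node object deletion happens before this call.
		if err := m.Delete(ctx); err != nil {
			return fmt.Errorf("failed to delete container for %s: %w", m.Name(), err)
		}
	}
	return nil
}
```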
Once the underlying infrastructure is deleted, the MachinePool controller is supposed to clean up the retired Kubernetes Node:

`cluster-api/exp/internal/controllers/machinepool_controller_noderef.go` (lines 120 to 123 at `051b006`)
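Roughly speaking, that cleanup amounts to deleting the Node object from the workload cluster once the provider reports the instance gone. A minimal controller-runtime sketch (the function below is illustrative, not the actual controller code):

```go
package controllers

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteRetiredNode removes the Node object for an instance whose underlying
// infrastructure no longer exists. workloadClient targets the workload
// cluster, not the management cluster.
func deleteRetiredNode(ctx context.Context, workloadClient client.Client, nodeName string) error {
	node := &corev1.Node{ObjectMeta: metav1.ObjectMeta{Name: nodeName}}
	if err := workloadClient.Delete(ctx, node); err != nil && !apierrors.IsNotFound(err) {
		return fmt.Errorf("failed to delete retired Node %q: %w", nodeName, err)
	}
	return nil
}
```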
This flow seems to work fine for certain MachinePool use cases, but it means that self-hosted clusters running with MachinePools can go offline for an extended period.
As described in #9522 (comment), deletion of the underlying infrastructure can kill CAPI management components if they're running on a DockerMachinePool-backed node. The pod and node are stopped without informing Kubernetes, which waits until the pod-eviction-timeout has expired before bringing the management components back into a working state.

This is a substantial bug in self-hosted clusters using DockerMachinePools. I'm not sure of the state in other infrastructure providers, or whether the self-hosted case is of interest and tested in other infrastructure providers for MachinePools.
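For comparison, a drain-before-delete step along the lines of what the core Machine controller does for Machines would let workloads (including the management components) move off the node gracefully before the container is removed, instead of waiting out the eviction timeout. A sketch using the k8s.io/kubectl drain helpers follows; the function name and wiring are illustrative, not a proposal for where such code should live.

```go
package docker

import (
	"context"
	"fmt"
	"os"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/kubectl/pkg/drain"
)

// drainNodeBeforeDelete cordons and drains a workload cluster Node so its
// pods are evicted gracefully before the backing infrastructure is removed.
func drainNodeBeforeDelete(ctx context.Context, cs kubernetes.Interface, node *corev1.Node) error {
	helper := &drain.Helper{
		Ctx:                 ctx,
		Client:              cs,
		Force:               true, // also evict pods not managed by a controller
		IgnoreAllDaemonSets: true,
		DeleteEmptyDirData:  true,
		GracePeriodSeconds:  -1, // honour each pod's terminationGracePeriodSeconds
		Timeout:             2 * time.Minute,
		Out:                 os.Stdout,
		ErrOut:              os.Stderr,
	}
	if err := drain.RunCordonOrUncordon(helper, node, true); err != nil {
		return fmt.Errorf("failed to cordon Node %s: %w", node.Name, err)
	}
	if err := drain.RunNodeDrain(helper, node.Name); err != nil {
		return fmt.Errorf("failed to drain Node %s: %w", node.Name, err)
	}
	return nil
}
```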
The fix for this bug may include changes to the core MachinePool controller. It's not clear to me whether this bug is strictly in the implementation of the DockerMachinePool controller or in the design of the MachinePool rollout flow.
/kind bug