
Cleanup on failed k8s bootstrap and k8s join-cluster attempts #521

Merged
merged 6 commits into from
Jul 1, 2024

Conversation

neoaggelos
Contributor

Summary

Merge after #520

Changes

  • The bootstrap control plane, bootstrap worker, and join control plane hooks are refactored: we always defer a function that checks the result of the hook.
  • In the case of preRemove, we simply log the error and proceed; otherwise the node would be removed from the microcluster peers but not from the underlying dqlite database, breaking the cluster.
  • In the case of k8s bootstrap, remove all configs, stop the control plane services, then use ResetClusterMember to reset the microcluster state. Note that this is run automatically by k8sd; no manual action from the client is required.
  • In the case of k8s join-cluster for worker nodes, similarly revert the configs, then use ResetClusterMember.
  • In the case of k8s join-cluster for control plane nodes: when the postJoin hook runs, the node has already joined microcluster. Therefore, we need to revert the configs, but also make sure to use DeleteClusterMember, so that the failed node is removed from the cluster before resetting.
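The "always defer a function that checks the result of the hook" pattern above can be sketched in plain Go. This is illustrative only; the names and the cleanup steps in the comment are stand-ins, not the actual k8sd hook signatures:

```go
package main

import (
	"errors"
	"fmt"
)

// bootstrap sketches the deferred cleanup-check pattern: the deferred
// function inspects the named return value, and if the hook failed at any
// point it performs the cleanup (e.g. revert configs, stop services, then
// ResetClusterMember in the real hooks).
func bootstrap(fail bool) (err error) {
	defer func() {
		if err != nil {
			fmt.Println("bootstrap failed, cleaning up:", err)
		}
	}()

	if fail {
		return errors.New("could not start control plane services")
	}
	return nil
}

func main() {
	fmt.Println("success case, err is nil:", bootstrap(false) == nil)
	fmt.Println("failure case, err is non-nil:", bootstrap(true) != nil)
}
```

Using a named return value is what lets a single deferred closure observe the outcome of every early `return` in the hook, so no failure path can skip the cleanup.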

Notes

  • The wait for a node to leave the Pending state before it can be removed has been moved into the k8s remove-node command, instead of delaying the completion of the k8s join-cluster command.

@neoaggelos neoaggelos marked this pull request as ready for review July 1, 2024 14:09
@neoaggelos neoaggelos requested a review from a team as a code owner July 1, 2024 14:09
Contributor

@bschimke95 bschimke95 left a comment


LGTM

metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/client-go/util/retry"
)

// DeleteNode will remove a node from the kubernetes cluster.
// DeleteNode will retry if there is a conflict on the resource.
// DeleteNode will not fail if the node does not
Contributor


nit: comment incomplete

Contributor Author


found a couple more things like this, will round them up in a separate typo-fix PR, to avoid unnecessary conflicts across #518, #520, #521

Base automatically changed from KU-1013/timeouts to main July 1, 2024 15:24
@neoaggelos neoaggelos merged commit b07b07a into main Jul 1, 2024
17 checks passed
@neoaggelos neoaggelos deleted the KU-1013/cleanup branch July 1, 2024 16:33
@neoaggelos neoaggelos mentioned this pull request Jul 2, 2024
louiseschmidtgen pushed a commit that referenced this pull request Jul 4, 2024
* cleanup on failed bootstrap

* cleanup on failed worker join

* create a local microcluster client on App

* start cleanup on control plane join

* do not block remove hook
2 participants