
Cleanup on failed k8s bootstrap and k8s join-cluster attempts #521

Merged
merged 6 commits into from
Jul 1, 2024

Conversation

neoaggelos
Contributor

Summary

Merge after #520

Changes

  • The bootstrap control plane, bootstrap worker, and join control plane hooks are refactored: we always defer a function that checks the result of the hook.
  • In the case of preRemove, we simply log the error and proceed; otherwise the node would be removed from the microcluster peers but not from the underlying dqlite database, breaking the cluster.
  • In the case of k8s bootstrap, remove all configs, stop the control plane services, then use ResetClusterMember to reset the microcluster state. Note that this is run automatically by k8sd; no manual action from the client is required.
  • In the case of k8s join-cluster for worker nodes, similarly revert the configs, then use ResetClusterMember.
  • In the case of k8s join-cluster for control plane nodes: when the postJoin hook runs, the node has already joined microcluster. Therefore, we need to revert the configs, but also make sure to use DeleteClusterMember, so that the failed node is removed from the cluster before resetting.
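The "always defer a function that checks the result of the hook" pattern above can be sketched in plain Go. This is illustrative only; the names and the cleanup steps in the comment are stand-ins, not the actual k8sd hook signatures:

```go
package main

import (
	"errors"
	"fmt"
)

// bootstrap sketches the deferred cleanup-check pattern: the deferred
// function inspects the named return value, and if the hook failed at any
// point it performs the cleanup (e.g. revert configs, stop services, then
// ResetClusterMember in the real hooks).
func bootstrap(fail bool) (err error) {
	defer func() {
		if err != nil {
			fmt.Println("bootstrap failed, cleaning up:", err)
		}
	}()

	if fail {
		return errors.New("could not start control plane services")
	}
	return nil
}

func main() {
	fmt.Println("success case, err is nil:", bootstrap(false) == nil)
	fmt.Println("failure case, err is non-nil:", bootstrap(true) != nil)
}
```

Using a named return value is what lets a single deferred closure observe the outcome of every early `return` in the hook, so no failure path can skip the cleanup.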

Notes

  • The wait for a node to leave the Pending state before it can be removed has been moved into the k8s remove-node command, instead of delaying the completion of the k8s join-cluster command.

@neoaggelos neoaggelos marked this pull request as ready for review July 1, 2024 14:09
@neoaggelos neoaggelos requested a review from a team as a code owner July 1, 2024 14:09
Contributor

@bschimke95 bschimke95 left a comment


LGTM

metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/client-go/util/retry"
)

// DeleteNode will remove a node from the kubernetes cluster.
// DeleteNode will retry if there is a conflict on the resource.
// DeleteNode will not fail if the node does not
Contributor


nit: comment incomplete

Contributor Author


found a couple more things like this, will round them up in a separate typo-fix PR, to avoid unnecessary conflicts across #518, #520, #521

Base automatically changed from KU-1013/timeouts to main July 1, 2024 15:24
@neoaggelos neoaggelos merged commit b07b07a into main Jul 1, 2024
17 checks passed
@neoaggelos neoaggelos deleted the KU-1013/cleanup branch July 1, 2024 16:33
@neoaggelos neoaggelos mentioned this pull request Jul 2, 2024
louiseschmidtgen pushed a commit that referenced this pull request Jul 4, 2024
* cleanup on failed bootstrap

* cleanup on failed worker join

* create a local microcluster client on App

* start cleanup on control plane join

* do not block remove hook
2 participants