Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make snap service start more robust. #630

Merged
merged 11 commits into from
Aug 30, 2024
Merged

Conversation

bschimke95
Copy link
Contributor

@bschimke95 bschimke95 commented Aug 27, 2024

Summary

This PR fixed a race-condition that could pop up when bootstrapping or joining a control-plane node.

Aug 27 15:48:13 k8s-integration-226cb2-4 k8s.k8sd[3300]: error: error running snapctl: snap "k8s" has "service-control" change in progress

Details

On startup, the controllers (node controller etc.) are prevented from running by the WaitGroup a.readyWG. This WG count is set to 1 and only decremented in markNodeReay. This function runs in a separate go routine and will wait until the apiserver is reachable and then decrement the WG.

Also on startup, the boostrap or join hook is executed, which at its end starts all snap services sequentially. The services are defined here. The kube-apiserver is started second in the list, kubelet is started last.

So, the race-condition happens if kube-apiserver is ready, markReadyNode sees this (remember that this runs in a separate routine) and marks the node as ready. This will start the controllers, including the node-configuration controller which in turn will restart the kubelet in its reconcilation loop. It may now happen that the hook and node-controller try to start/restart the kubelet at the same time which fails.

This PR adds a couple of improvements to prevent this:

  • in markNodeReady wait until all expected snap services are up
  • start the apiserver last
  • retries the hook services starts

Note: Initially I wanted to monitor in the markNodeReady function if all snap services are up. However, this introduces a lot of implicit dependencies, e.g. you need to wait for different services on cp and worker. For that, we need to check the worker marker file which is also set on bootstrap and may cause subtle bugs if the order is wrong.

@bschimke95 bschimke95 force-pushed the KU-1124/snapctl-service-control branch 3 times, most recently from 69410a1 to bb9c7b5 Compare August 28, 2024 12:47
"kube-controller-manager",
"kube-proxy",
"kube-scheduler",
"kubelet",
"kube-apiserver",
Copy link
Contributor

@louiseschmidtgen louiseschmidtgen Aug 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is kubelet not dependent on the api-server to start first? It needs to register nodes with the api-server?
Should we rather wait to start kubelet ony after the api-server is up to avoid the race condition?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kubelet will wait until the apiserver is up - this is not a problem.

If we would wait until the apiserver is up, we would not avoid the problem as the node-controller could still randomly kick in and kill out snap startup.
The only preferred way to handle this IMHO is to make sure the controllers do not run before all snap services are settled.
I'm open for better ideas, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, waiting for snap services introduces a whole lot of implicit dependencies which we really don't want in the code (see updated description). Let's just retry the hook snap start instead and keep it simple.

@bschimke95 bschimke95 force-pushed the KU-1124/snapctl-service-control branch from f03b706 to 136336e Compare August 29, 2024 07:27
@bschimke95 bschimke95 marked this pull request as ready for review August 29, 2024 07:44
@bschimke95 bschimke95 requested a review from a team as a code owner August 29, 2024 07:44
Copy link
Member

@berkayoz berkayoz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@bschimke95 bschimke95 merged commit e5ca72f into main Aug 30, 2024
19 checks passed
@bschimke95 bschimke95 deleted the KU-1124/snapctl-service-control branch August 30, 2024 06:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants