Make snap service start more robust. #630
Conversation
Force-pushed from 69410a1 to bb9c7b5.
"kube-controller-manager", | ||
"kube-proxy", | ||
"kube-scheduler", | ||
"kubelet", | ||
"kube-apiserver", |
Is kubelet not dependent on the api-server starting first? It needs to register nodes with the api-server.
Should we rather wait to start kubelet only after the api-server is up, to avoid the race condition?
kubelet will wait until the apiserver is up - this is not a problem.
Even if we waited until the apiserver is up, we would not avoid the problem, as the node controller could still randomly kick in and kill our snap startup.
The only preferred way to handle this, IMHO, is to make sure the controllers do not run before all snap services are settled.
I'm open to better ideas, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, waiting for snap services introduces a whole lot of implicit dependencies which we really don't want in the code (see the updated description). Let's just retry the snap start in the hook instead and keep it simple.
Force-pushed from f03b706 to 136336e.
LGTM!
Summary
This PR fixes a race condition that could pop up when bootstrapping or joining a control-plane node.
Details
On startup, the controllers (node controller etc.) are prevented from running by the WaitGroup `a.readyWG`. Its count is set to 1 and only decremented in `markNodeReady`. This function runs in a separate goroutine, waits until the apiserver is reachable, and then decrements the WaitGroup.

Also on startup, the bootstrap or join hook is executed, which at its end starts all snap services sequentially. The services are defined here. The `kube-apiserver` is started second in the list; `kubelet` is started last.

So the race condition happens if kube-apiserver is ready and `markNodeReady` sees this (remember that it runs in a separate goroutine) and marks the node as ready. This starts the controllers, including the node-configuration controller, which in turn restarts the kubelet in its reconciliation loop. It may then happen that the hook and the node-configuration controller try to start/restart the kubelet at the same time, which fails.

This PR adds a couple of improvements to prevent this:
- in `markNodeReady`, wait until all expected snap services are up

Note: Initially I wanted to monitor in the `markNodeReady` function whether all snap services are up. However, this introduces a lot of implicit dependencies, e.g. you need to wait for different services on control-plane and worker nodes. For that, we would need to check the worker marker file, which is also set on bootstrap and may cause subtle bugs if the order is wrong.