-
Notifications
You must be signed in to change notification settings - Fork 39.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix evented pleg mirror pod & use IsEventedPLEGInUse instead of FG status check #122778
base: master
Are you sure you want to change the base?
Conversation
Skipping CI for Draft Pull Request. |
/test pull-kubernetes-e2e-kind-alpha-beta-features |
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the The Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/test pull-kubernetes-e2e-kind-alpha-beta-features |
/test pull-kubernetes-unit |
/test pull-kubernetes-e2e-kind-alpha-beta-features |
2 similar comments
/test pull-kubernetes-e2e-kind-alpha-beta-features |
/test pull-kubernetes-e2e-kind-alpha-beta-features |
/test pull-kubernetes-e2e-kind-alpha-beta-features |
1 similar comment
/test pull-kubernetes-e2e-kind-alpha-beta-features |
This comment was marked as duplicate.
This comment was marked as duplicate.
1 similar comment
This comment was marked as duplicate.
This comment was marked as duplicate.
@pacoxu I am temporarily unable to access kubernets slack for your discussion context, I'm still confused that how the dup container was created, as we know from the codes, only one |
@wxx213 This only happened to static pod. In the test CI failure, we find the apiserver pod cannot be started correctly. This bug doesn't occur to a pod which is not static pod, and this is why some users is using the EventedPLEG in their production without a problem. This bug should be triggered when below two things happened at the same time.
BTW, I tried to reproduce the failure in #127063 to show the problem. |
Thanks, I checked the kubelet log for pod kube-system/kube-apiserver-kind-control-plane, and it seems that the start event of the app container wasn't synced to kubelet by evented pleg timely caused this issue. It's a little same as the rpc timeout situation where sometimes the pod is trapped in "dual running" state that what we are suffering. I summarized the timeline from the log for pod kube-system/kube-apiserver-kind-control-plane:
|
Does this mean that the kubelet got the started event first and then the created event? Is this an expected behavior? |
The created event always arrives in kubelet before started event for the same container. In this case, the started event arrived in kubelet after the second pod sync work started as the timeline described, so the second pod sync work could only got the created state from the pod cache when started the work(the code: podWorkerLoop). Then the second sync work think the app container should be restarted(recreate) in computePodActions I'm not sure why app container's started event didn't arrive in kubelet just when first pod sync worke finished where the start function already be called for the app container, maybe because the start event is send by async like containerd Anyway, kubelet should manage this information gap correctly with runtime. |
Returning to the current PR fix, the current solution f99b04e should correctly handle this scenario. Do we have some better approach? I'm afraid that this is the simplest fix for the bug.
IMO if the current approach is the intended fix, the key point of discussion is whether 1 minute is an appropriate time threshold to trigger the restart. |
#126543 has been recently merged. A similar fix would be necessary for init containers: kubernetes/pkg/kubelet/kuberuntime/kuberuntime_container.go Lines 1086 to 1092 in 17bae91
I reproduced the same issue with an init container on recent master:
This message implies that another init container was created:
|
/test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e |
f99b04e
to
76d8dae
Compare
@hshiina I rebased with the latest code. Do you know if we have an e2e test to address the init container restart problem? Or probably, we should add one. |
/test pull-kubernetes-e2e-kind-beta-features |
There doesn't look to be a test that address this problem explicitly. However, I guess a test that runs a web server on a init container like the following one will fail if we hit this issue. Init containers created later will fail to start with an error like "address already in use" and its restart count will be increased: kubernetes/test/e2e/common/node/container_probe.go Lines 787 to 788 in 8e3adc4
Please note that #124297 made this issue less likely to occur. |
…efore seeting it to PodInitializing
76d8dae
to
e404a28
Compare
after rebasing with master to include some related changes. |
/test pull-kubernetes-node-e2e-containerd-alpha-features |
@pacoxu: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
should it also be handled here: #126543? I remember that PR was introducing something similar for sidecars This is one of the PRs that are hard to reason about - it fixes EventedPLEG feature's bug and introduces changes all over, potentially affecting generic PLEG as well. Please make sure all related places are well-documented that those are only for EventedPLEG. Finally, is the test change covers this fix so we will catch regression in the future? Or we may need to do something more than we already have? |
Yes. This PR originally only fixed the problem of static pod when EventedPLEG is enabled. #126543 mentions that there is a similar issue for sidecar/init container.
This can be mocked after #127495 was merged. I can write a new Test to cover those cases. |
This PR is based on #122763, and we use
IsEventedPLEGInUse()
instead ofutilfeature.DefaultFeatureGate.Enabled(features.EventedPLEG)
.Fixes #123087
Context can be found in #122763.