
Fix evented pleg mirror pod & use IsEventedPLEGInUse instead of FG status check #122778

Open
wants to merge 5 commits into master

Conversation

@pacoxu (Member) commented Jan 15, 2024

This PR is based on #122763, and we use IsEventedPLEGInUse() instead of utilfeature.DefaultFeatureGate.Enabled(features.EventedPLEG).

Fixes #123087

Context can be found in #122763.

kubelet: fix a static pod startup bug with Evented PLEG
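
Illustratively, the substitution described above looks roughly like the sketch below. The package paths and the call-site shape are assumptions based on the usual kubelet layout, not this PR's diff; only the two expressions named in the description are taken from it.

package example

import (
	utilfeature "k8s.io/apiserver/pkg/util/feature"

	"k8s.io/kubernetes/pkg/features"
	"k8s.io/kubernetes/pkg/kubelet/pleg"
)

// eventedPLEGActive is a hypothetical call site showing the changed check.
func eventedPLEGActive() bool {
	// Before: only checks whether the feature gate is enabled. The gate can
	// be on even when the kubelet has fallen back to the generic PLEG.
	gateEnabled := utilfeature.DefaultFeatureGate.Enabled(features.EventedPLEG)
	_ = gateEnabled

	// After: ask whether the evented PLEG is actually in use (assumed to be
	// exported from the kubelet's pleg package, per the PR title).
	return pleg.IsEventedPLEGInUse()
}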

@k8s-ci-robot (Contributor)

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 15, 2024
@pacoxu (Member, Author) commented Jan 15, 2024

/test pull-kubernetes-e2e-kind-alpha-beta-features
/test pull-kubernetes-e2e-kind-alpha-features
/test pull-kubernetes-e2e-kind-beta-features

@k8s-ci-robot (Contributor)

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/kubelet area/test sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 15, 2024
@pacoxu (Member, Author) commented Jan 15, 2024

/test pull-kubernetes-e2e-kind-alpha-beta-features
/test pull-kubernetes-e2e-kind-alpha-features

@pacoxu (Member, Author) commented Jan 15, 2024

/test pull-kubernetes-unit
/test pull-kubernetes-integration

@pacoxu (Member, Author) commented Jan 15, 2024

/test pull-kubernetes-e2e-kind-alpha-beta-features
/test pull-kubernetes-e2e-kind-alpha-features


@pacoxu pacoxu marked this pull request as ready for review January 15, 2024 08:41
@pacoxu (Member, Author) commented Jan 15, 2024

/test pull-kubernetes-e2e-kind-alpha-beta-features
/test pull-kubernetes-e2e-kind-alpha-features


@wxx213 commented Sep 2, 2024

The dual running issue is not caused by an RPC timeout, but by an immediate restart (recreate) of a new container for a static pod; this doesn't happen for pods that are not static pods in the kubernetes manifest. This can be reproduced in the e2e CI using kind: the apiserver/controller-manager will fail to start because a duplicate container is created for the same pod, and the second container fails to start because its port is already bound. More details in #123087.

@pacoxu I am temporarily unable to access Kubernetes Slack for your discussion context, and I'm still confused about how the duplicate container was created. As we know from the code, only one podWorkerLoop can run per pod, and that is where the pod's containers are created and started. Could you share your assumption about what happened here?

@pacoxu (Member, Author) commented Sep 2, 2024

> The dual running issue is not caused by an RPC timeout, but by an immediate restart (recreate) of a new container for a static pod; this doesn't happen for pods that are not static pods in the kubernetes manifest. This can be reproduced in the e2e CI using kind: the apiserver/controller-manager will fail to start because a duplicate container is created for the same pod, and the second container fails to start because its port is already bound. More details in #123087.
>
> @pacoxu I am temporarily unable to access Kubernetes Slack for your discussion context, and I'm still confused about how the duplicate container was created. As we know from the code, only one podWorkerLoop can run per pod, and that is where the pod's containers are created and started. Could you share your assumption about what happened here?

@wxx213 This only happens to static pods. In the CI test failure, we found that the apiserver pod could not be started correctly. The bug does not occur for pods that are not static pods, which is why some users run EventedPLEG in production without hitting it.

This bug should be triggered when the two things below happen at the same time (see the sketch below this list).

  • for the change here, a CREATED event from the container runtime triggers a restart here.
    • with the generic PLEG instead of EventedPLEG, a static pod stays in the CREATED state only for a very short period, so the state is ignored.
    • if it is not a static pod, the logic is as you described: the pod is handled by a single podWorkerLoop.
  • meanwhile, the old CREATED container is started (probably startContainer is called at the same time the restart above happens).

BTW, I tried to reproduce the failure in #127063 to show the problem.
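
To make that race concrete, here is a minimal, self-contained sketch of the restart decision the first bullet refers to. The types and names are simplified stand-ins, not this PR's actual diff; in the kubelet the decision lives in computePodActions and keys off kubecontainer.ContainerStateCreated, as the init-container snippet quoted later in this thread shows.

package example

// containerState is a simplified stand-in for the kubelet's container state.
type containerState string

const stateCreated containerState = "created"

// podActions is a simplified stand-in for the kubelet's podActions struct.
type podActions struct {
	// Indexes of containers that should be (re)started.
	ContainersToStart []int
}

// decideRestart schedules a (re)start for every container observed in the
// "created" state. With the generic PLEG this state is visible only briefly,
// so nothing happens; with the evented PLEG a CREATED event for a static pod
// can arrive while startContainer is still running, and scheduling a restart
// here is what produces the duplicate container.
func decideRestart(states []containerState) podActions {
	var changes podActions
	for idx, s := range states {
		if s == stateCreated {
			changes.ContainersToStart = append(changes.ContainersToStart, idx)
		}
	}
	return changes
}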

@wxx213 commented Sep 2, 2024

> https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/127063/pull-kubernetes-e2e-kind-alpha-features/1830544265197916160
>
> @wxx213 You can see the dup container in https://gcsweb.k8s.io/gcs/kubernetes-jenkins/pr-logs/pull/127063/pull-kubernetes-e2e-kind-alpha-features/1830544265197916160/artifacts/kind-control-plane/pods/kube-system_kube-apiserver-kind-control-plane_6c276a318e8e1532ab2e7dd8dbcfebba/kube-apiserver/
>
> The kubelet log can be found in https://gcsweb.k8s.io/gcs/kubernetes-jenkins/pr-logs/pull/127063/pull-kubernetes-e2e-kind-alpha-features/1830544265197916160/artifacts/kind-control-plane/

Thanks, I checked the kubelet log for pod kube-system/kube-apiserver-kind-control-plane, and it seems this issue was caused by the started event of the app container not being synced to the kubelet by the evented PLEG in time. It's a bit like the RPC timeout situation we are suffering from, where the pod is sometimes trapped in a "dual running" state.

I summarized the timeline from the log for pod kube-system/kube-apiserver-kind-control-plane:

I0902 10:03:22.220403  kubelet.go:1758] "SyncPod enter": the first sync loop for pod kube-system/kube-apiserver-kind-control-plane started.

I0902 10:03:22.710694  evented.go:295] "Received Container Started Event": the evented PLEG got a container started event for
    container id 5dd8a71e5d80f85e0a7c672b7c9a865807c595e403aefa8a33d7266cf5d92811, which is the sandbox container.

I0902 10:03:22.710935  pod_workers.go:963] "Notifying pod of pending update": this then triggered a pending pod work for pod
    kube-system/kube-apiserver-kind-control-plane.

I0902 10:03:23.634078  evented.go:292] "Received Container Created Event": the evented PLEG got a container created event for
    container id eacee74973c186af7e16228a0fb22bb624a25ade9c499e5355910c8659a53968, which is the app container.

I0902 10:03:23.839023  kubelet.go:1764] "SyncPod exit": the first sync loop for pod kube-system/kube-apiserver-kind-control-plane finished.

I0902 10:03:23.839162  kubelet.go:1758] "SyncPod enter": the second sync loop for pod kube-system/kube-apiserver-kind-control-plane started; this
    work only saw the created state of the app container eacee74973c186af7e16228a0fb22bb624a25ade9c499e5355910c8659a53968 from the evented PLEG.

I0902 10:03:23.842087  evented.go:295] "Received Container Started Event", container id: eacee74973c186af7e16228a0fb22bb624a25ade9c499e5355910c8659a53968:
    the evented PLEG got the started event of the app container only now, after the second sync loop had already started.

I0902 10:03:23.845966  kuberuntime_manager.go:968] "Container of pod is not in the desired state and shall be started" containerName="kube-apiserver"
    pod="kube-system/kube-apiserver-kind-control-plane": the second sync loop therefore thought the app container had not been started, tried to create a
    new one, and the new container failed to start because the first one was already running.

@pacoxu (Member, Author) commented Sep 3, 2024

> the started event of the app container wasn't synced to the kubelet by the evented PLEG in time

Does this mean that the kubelet got the started event first and then the created event? Is this expected behavior?

@wxx213 commented Sep 3, 2024

> the started event of the app container wasn't synced to the kubelet by the evented PLEG in time
>
> Does this mean that the kubelet got the started event first and then the created event? Is this expected behavior?

The created event always arrives at the kubelet before the started event for the same container.

In this case, the started event arrived at the kubelet after the second pod sync started, as the timeline describes, so the second sync could only get the created state from the pod cache when its work started (the code: podWorkerLoop). The second sync then decided in computePodActions that the app container should be restarted (recreated).

I'm not sure why the app container's started event hadn't arrived at the kubelet by the time the first pod sync finished, given that the start function had already been called for the app container; maybe the started event is sent asynchronously by the runtime, e.g. containerd.

In any case, the kubelet should handle this information gap with the runtime correctly.
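
That information gap can be stated as a small, self-contained sketch (simplified stand-ins; the real kubelet reads a status snapshot from the pod cache at the start of podWorkerLoop): a sync only acts on the snapshot it read when it started, so a started event recorded after that point cannot change the current sync's decision.

package example

import "time"

// podStatusSnapshot is a simplified stand-in for the pod status a sync reads
// from the kubelet's pod cache when it starts.
type podStatusSnapshot struct {
	appContainerState string // "created", "running", ...
	recordedAt        time.Time
}

// willScheduleRestart mirrors the second sync in the timeline: the snapshot
// still reports "created" because the started event had not been recorded
// when the snapshot was taken, so a restart gets scheduled even though the
// container is already running in the runtime.
func willScheduleRestart(snapshot podStatusSnapshot) bool {
	return snapshot.appContainerState == "created"
}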

@pacoxu (Member, Author) commented Sep 3, 2024

Returning to the current PR, the solution in f99b04e should handle this scenario correctly. Do we have a better approach? I'm afraid this is the simplest fix for the bug.

If a pod hangs for more than 1 minute, we then trigger the restart.

IMO, if the current approach is the intended fix, the key point of discussion is whether 1 minute is an appropriate threshold for triggering the restart.
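
For reference, a hedged sketch of that threshold idea (the names and the one-minute constant are illustrative only, not copied from f99b04e): a container observed in the "created" state is treated as stuck, and restarted, only once it has stayed in that state longer than a grace period, which gives a late started event time to arrive.

package example

import "time"

// createdStateGracePeriod is an illustrative threshold; whether one minute is
// the right value is exactly the open question above.
const createdStateGracePeriod = 1 * time.Minute

// shouldRestartCreated reports whether a container that is still in the
// "created" state should be restarted, given when it entered that state.
func shouldRestartCreated(createdAt, now time.Time) bool {
	return now.Sub(createdAt) > createdStateGracePeriod
}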

@hshiina (Contributor) commented Sep 10, 2024

#126543 has been recently merged. A similar fix would be necessary for init containers:

case kubecontainer.ContainerStateCreated:
// The main sync loop should have created and started the container
// in one step. If the init container is in the 'created' state,
// it is likely that the container runtime failed to start it. To
// prevent the container from getting stuck in the 'created' state,
// restart it.
changes.InitContainersToStart = append(changes.InitContainersToStart, i)

I reproduced the same issue with an init container on recent master:

$ cat pod.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: testpod
  name: testpod
spec:
  containers:
  - command:
    - sh
    - -c
    - trap exit TERM; while true; do sleep 1; done
    image: busybox
    name: testcontainer
    resources: {}
  initContainers:
  - command:
    - mkdir
    - /myvolume/test
    image: busybox
    name: testinitcontainer
    volumeMounts:
    - mountPath: /myvolume
      name: myvolume
  volumes:
  - name: myvolume
    emptyDir: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}

$ kubectl create -f pod.yaml
pod/testpod created

$ kubectl get pod testpod -w
NAME      READY   STATUS                  RESTARTS      AGE
testpod   0/1     Init:CrashLoopBackOff   2 (3s ago)    7s
testpod   0/1     Init:0/1                3 (14s ago)   18s
testpod   0/1     Init:Error              3 (14s ago)   18s
testpod   0/1     Init:CrashLoopBackOff   3 (0s ago)    18s
testpod   0/1     Init:0/1                4 (27s ago)   45s
testpod   0/1     Init:Error              4 (27s ago)   45s
testpod   0/1     Init:CrashLoopBackOff   4 (0s ago)    45s

This message implies that another init container was created:

$ kubectl logs testpod -c testinitcontainer
mkdir: can't create directory '/myvolume/test': File exists

@pacoxu (Member, Author) commented Sep 12, 2024

/test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e

@pacoxu (Member, Author) commented Sep 12, 2024

> #126543 has been recently merged. A similar fix would be necessary for init containers:
>
> I reproduced the same issue with an init container on recent master:

@hshiina I rebased with the latest code.

Do you know if we have an e2e test that covers the init container restart problem? If not, we should probably add one.

@pacoxu (Member, Author) commented Sep 12, 2024

/test pull-kubernetes-e2e-kind-beta-features
/test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e
/test pull-kubernetes-e2e-kind-evented-pleg
/test pull-kubernetes-node-kubelet-serial-containerd
/test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-features
/test pull-kubernetes-cos-cgroupv1-containerd-node-e2e-features

@hshiina (Contributor) commented Sep 12, 2024

> #126543 has been recently merged. A similar fix would be necessary for init containers:
> I reproduced the same issue with an init container on recent master:
>
> @hshiina I rebased with the latest code.
>
> Do you know if we have an e2e test that covers the init container restart problem? If not, we should probably add one.

There doesn't seem to be a test that addresses this problem explicitly. However, I guess a test that runs a web server in an init container, like the following one, would fail if we hit this issue: init containers created later would fail to start with an error like "address already in use", and the restart count would increase:

ginkgo.It("with readiness probe that fails should never be ready and never restart", func(ctx context.Context) {
p := podClient.Create(ctx, testWebServerSidecarPodSpec(probe.withFailing().build(), nil, "test-webserver", 80))

Please note that #124297 made this issue less likely to occur.

@pacoxu (Member, Author) commented Sep 25, 2024

After rebasing on master to include some related changes:
/test pull-kubernetes-e2e-kind-beta-features
/test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e
/test pull-kubernetes-e2e-kind-evented-pleg
/test pull-kubernetes-node-kubelet-serial-containerd
/test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-features
/test pull-kubernetes-cos-cgroupv1-containerd-node-e2e-features

@pacoxu (Member, Author) commented Oct 11, 2024

/test pull-kubernetes-node-e2e-containerd-alpha-features

@k8s-ci-robot (Contributor)

@pacoxu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                                           Commit   Details  Required  Rerun command
pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e   e404a28  link     false     /test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e
pull-kubernetes-node-e2e-containerd-alpha-features  e404a28  link     false     /test pull-kubernetes-node-e2e-containerd-alpha-features

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@SergeyKanzhelev (Member)

Should it also be handled here: #126543? I remember that PR introduced something similar for sidecars.

This is one of those PRs that are hard to reason about: it fixes an EventedPLEG bug but introduces changes all over, potentially affecting the generic PLEG as well. Please make sure all related places are well documented as applying only to EventedPLEG.

Finally, does the test change cover this fix so that we will catch regressions in the future? Or do we need to do something more than we already have?

@pacoxu (Member, Author) commented Oct 12, 2024

> Should it also be handled here: #126543? I remember that PR introduced something similar for sidecars.
>
> This is one of those PRs that are hard to reason about: it fixes an EventedPLEG bug but introduces changes all over, potentially affecting the generic PLEG as well. Please make sure all related places are well documented as applying only to EventedPLEG.

Yes. This PR originally only fixed the static pod problem when EventedPLEG is enabled.

#126543 mentions that there is a similar issue for sidecar/init containers.

> Finally, does the test change cover this fix so that we will catch regressions in the future? Or do we need to do something more than we already have?

This can be mocked now that #127495 has been merged. I can write a new test to cover those cases.

Labels
area/kubelet area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
Status: PRs Waiting on Author
Status: Waiting on Author
Status: Release Blocker
Development

Successfully merging this pull request may close these issues.

kubelet started multi-containers for static pods when EventedPLEG is enabled