Pod shall not transition from terminated phase: "Failed" -> "Succeeded" #17595
Comments
@sjenning it seems to already have #17514. I don't think your condition prevents failed -> succeeded.
There are a few other flakes likely caused by the kubelet not respecting the pod state transition diagram (previous issue #17011).
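For context, the phase diagram in question only allows transitions toward the terminal phases. A rough encoding of it, for illustration only (`Unknown` omitted; this is not code from the kubelet):

```go
package podphase

import v1 "k8s.io/api/core/v1"

// allowedNext is a rough encoding of the pod phase transition diagram.
// The point that matters for this flake: Succeeded and Failed are terminal,
// so Failed -> Succeeded must never happen.
var allowedNext = map[v1.PodPhase][]v1.PodPhase{
	v1.PodPending:   {v1.PodRunning, v1.PodSucceeded, v1.PodFailed},
	v1.PodRunning:   {v1.PodSucceeded, v1.PodFailed},
	v1.PodSucceeded: nil, // terminal
	v1.PodFailed:    nil, // terminal
}
```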
@sjenning any chance the pod team can investigate this? Regardless of flakes, this might affect production deployments, where the deployer pod can go from failed to success.
I'm looking into this now. Sorry for the delay.
@tnozicka just wondering, have you seen this one lately? Doesn't look like we have hit this one for a while now. Did it disappear at the same time we bumped the disk size on the CI instances?
I haven't. @ironcladlou @mfojtik got any flakes looking like this? (Those are usually different tests, as the detector is asynchronous and runs for every deployment test.) Any chance that this might be related to #18233?
Likely not caused by the informers issue I pointed you to, as @deads2k just saw it and the watch cache is already fixed on master.
@tnozicka still trying to get to the root cause on this: kubernetes/kubernetes#58711 (comment)
Opened a PR upstream: |
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

kubelet: check for illegal phase transition

I have been unable to root cause the transition from the `Failed` phase to `Succeeded`, and I have been unable to recreate it. However, our CI in Origin, where we have controllers that look for these transitions and rely on the phase transition rules being respected, clearly indicates that the illegal transition does occur. This PR will prevent the illegal phase transition from propagating into the kubelet caches or the API server.

Fixes #58711
xref openshift/origin#17595

@dashpole @yujuhong @derekwaynecarr @smarterclayton @tnozicka
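The check described above boils down to refusing any status whose phase moves out of a terminal phase, before it reaches the kubelet cache or the API server. A minimal sketch of that idea (names assumed; this is not the merged code):

```go
package podphase

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// checkPhaseTransition returns an error if applying newStatus would move the
// pod out of a terminal phase. A status manager can call this before caching
// or pushing a status, so an illegal Failed -> Succeeded update is dropped
// instead of propagating to the API server.
func checkPhaseTransition(oldStatus, newStatus *v1.PodStatus) error {
	terminal := oldStatus.Phase == v1.PodFailed || oldStatus.Phase == v1.PodSucceeded
	if terminal && newStatus.Phase != oldStatus.Phase {
		return fmt.Errorf("pod shall not transition from terminated phase %q to %q",
			oldStatus.Phase, newStatus.Phase)
	}
	return nil
}
```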
Did this make it into 3.9? If so, still happening (below). If not, we need to get it in before release.
It did go in about 2 weeks ago. Looking into it. If this is legitimately happening even with the check I added, I'm very confused.
I don't think the requested check in the apiserver is present. To check: if I set a pod's status phase by hand, it gets overwritten back to Running. This is a serious bug, and I feel we need that enforcement in the apiserver as well as the fix. It feels like the whole kubelet status manager ignores optimistic concurrency with resourceVersion and just does a GET before updating the status, no matter which resourceVersion was used for computing that status - the precondition might have changed in that case. (See origin/vendor/k8s.io/kubernetes/pkg/kubelet/status/status_manager.go, lines 451 to 476 at 710998e.)
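If the status manager kept the resourceVersion its status was computed from, a stale write would surface as a 409 Conflict instead of silently stomping newer state. A sketch of that pattern with a current client-go (helper name and API versions are assumptions, not the actual status manager code):

```go
package podphase

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// updateStatusWithPrecondition writes a pod status while keeping the
// resourceVersion of the pod the status was computed from. If the pod changed
// in the meantime, the API server rejects the write with a conflict instead
// of letting it overwrite newer state, and the caller can recompute and retry.
func updateStatusWithPrecondition(ctx context.Context, c kubernetes.Interface, basePod *v1.Pod, newStatus v1.PodStatus) error {
	pod := basePod.DeepCopy() // keeps basePod's ResourceVersion as the precondition
	pod.Status = newStatus
	_, err := c.CoreV1().Pods(pod.Namespace).UpdateStatus(ctx, pod, metav1.UpdateOptions{})
	if apierrors.IsConflict(err) {
		return fmt.Errorf("pod %s/%s changed since the status was computed; recompute and retry: %w",
			pod.Namespace, pod.Name, err)
	}
	return err
}
```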
Switching back to this issue.
I swear, I'm losing it. I totally forgot that this PR got reopened and merged upstream: kubernetes/kubernetes#54530. I'll pick it right now.
Automatic merge from submit-queue.

[3.9] UPSTREAM: 54530: api: validate container phase transitions

master PR: #18791 (kubernetes/kubernetes#54530)
fixes #17595
xref https://bugzilla.redhat.com/show_bug.cgi?id=1534492

@tnozicka @smarterclayton @derekwaynecarr
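One piece of that validation is the same terminal-phase rule, now enforced on the API server for any client. A hedged sketch of just that part (the real validation lives in the internal API validation package; names here are illustrative):

```go
package podphase

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/validation/field"
)

// validatePhaseTransition rejects status updates that move a pod out of a
// terminal phase, regardless of which client sent them.
func validatePhaseTransition(oldPod, newPod *v1.Pod) field.ErrorList {
	allErrs := field.ErrorList{}
	oldPhase, newPhase := oldPod.Status.Phase, newPod.Status.Phase
	if (oldPhase == v1.PodFailed || oldPhase == v1.PodSucceeded) && newPhase != oldPhase {
		allErrs = append(allErrs, field.Forbidden(
			field.NewPath("status", "phase"),
			"pod phase may not change once it is Failed or Succeeded"))
	}
	return allErrs
}
```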
Original issue description:
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/17589/test_pull_request_origin_extended_conformance_install/3497/
/sig master
/kind test-flake
/assign mfojtik tnozicka
xref https://bugzilla.redhat.com/show_bug.cgi?id=1534492