Pod shall not transition from terminated phase: "Failed" -> "Succeeded" #17595

Closed · 0xmichalis opened this issue Dec 5, 2017 · 19 comments · Fixed by #18791

0xmichalis commented Dec 5, 2017

/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/deployments/deployments.go:1299
2017-12-05 09:50:41.393147976 +0000 UTC: detected deployer pod transition from terminated phase: "Failed" -> "Succeeded"
Expected
    <bool>: true
to be false
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/deployments/util.go:727

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/17589/test_pull_request_origin_extended_conformance_install/3497/

/sig master
/kind test-flake
/assign mfojtik tnozicka

xref https://bugzilla.redhat.com/show_bug.cgi?id=1534492

tnozicka commented Dec 5, 2017

@sjenning it seems to already include #17514.

I don't think your condition prevents the Failed -> Succeeded transition:

if oldStatus.State.Terminated != nil && newStatus.State.Terminated == nil {
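
For comparison, a guard at the pod level would have to look at the phase itself rather than the container state. A minimal sketch of such a check (hypothetical helper names, not the fix that eventually merged upstream):

// Sketch only: a pod must never leave a terminal phase, so the guard has to
// compare pod phases, not container states.
package podphase

import v1 "k8s.io/api/core/v1"

func isTerminalPhase(p v1.PodPhase) bool {
	return p == v1.PodFailed || p == v1.PodSucceeded
}

// isValidPhaseTransition returns false for transitions such as Failed -> Succeeded.
func isValidPhaseTransition(oldPhase, newPhase v1.PodPhase) bool {
	if isTerminalPhase(oldPhase) {
		return oldPhase == newPhase
	}
	return true
}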

There are a few other flakes likely caused by the kubelet not respecting the pod state transition diagram:

/assign @sjenning
/unassign @tnozicka @mfojtik
/priority P0

tnozicka commented Dec 5, 2017

(previous issue #17011)

@tnozicka tnozicka changed the title deploymentconfigs keep the deployer pod invariant valid [Conformance] should deal with config change in case the deployment is still running [Suite:openshift/conformance/parallel] Pod shall not transition from terminated phase: "Failed" -> "Succeeded" Dec 5, 2017

mfojtik commented Jan 8, 2018

@sjenning any chance the pod team can investigate this? Regardless of flakes, this might affect production deployments where the deployer pod can go from Failed to Succeeded.

@sjenning

I'm looking into this now. Sorry for the delay.

@sjenning

@tnozicka just wondering, have you seen this one lately? Doesn't look like we have hit this one for a while now. Did it disappear at the same time we bumped the disk size on the CI instances?

@tnozicka

I haven't. @ironcladlou @mfojtik, got any flakes looking like this? (Those are usually different tests, as the detector is asynchronous and runs for every deployment test.)

Any chance this might be related to #18233?

tnozicka commented Feb 1, 2018

Likely not caused by the informers issue I pointed you to: @deads2k just saw it, and the watch cache is already fixed on master.

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/18387/test_pull_request_origin_extended_conformance_install/6640/

sjenning commented Feb 1, 2018

@tnozicka still trying to get to the root cause on this: kubernetes/kubernetes#58711 (comment)

@sjenning

Opened a PR upstream:
kubernetes/kubernetes#59767

k8s-github-robot pushed a commit to kubernetes/kubernetes that referenced this issue Feb 12, 2018
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

kubelet: check for illegal phase transition

I have been unable to root cause the transition from `Failed` phase to `Succeeded`. I have also been unable to recreate it. However, our CI in Origin, where we have controllers that look for these transitions and rely on the phase transition rules being respected, clearly indicates that the illegal transition does occur.

This PR will prevent the illegal phase transition from propagating into the kubelet caches or the API server.

Fixes #58711

xref openshift/origin#17595

@dashpole @yujuhong @derekwaynecarr @smarterclayton @tnozicka

@smarterclayton

Did this make it into 3.9? If so, it's still happening (see below). If not, we need to get it in before release.

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/18739/test_pull_request_origin_extended_conformance_install/7966/

@sjenning

It did go in about two weeks ago: #18585

Looking into it. If this is legitimately happening even with the check I added, I'm very confused.

tnozicka commented Mar 1, 2018

I don't think the requested check in the apiserver is present.

To check: if I do

curl -k -v -XPATCH -H "Authorization: Bearer ${TOKEN}" -H "Accept: application/json" -H "Content-Type: application/strategic-merge-patch+json" https://172.16.20.11:8443/api/v1/namespaces/test/pods/busyapp-76589968df-xnthx/status -d '{"status": {"phase": "Failed"}}'

it gets overwritten back to Running.

This is a serious bug, and I feel we need that enforcement in the apiserver as well as the kubelet fix.

It feels like the whole kubelet status manager ignores optimistic concurrency with resourceVersion: it just does a GET before updating the status, regardless of which resourceVersion was used to compute that status, so the precondition might have changed in the meantime.

pod, err := m.kubeClient.CoreV1().Pods(status.podNamespace).Get(status.podName, metav1.GetOptions{})
if errors.IsNotFound(err) {
	glog.V(3).Infof("Pod %q (%s) does not exist on the server", status.podName, uid)
	// If the Pod is deleted the status will be cleared in
	// RemoveOrphanedStatuses, so we just ignore the update here.
	return
}
if err != nil {
	glog.Warningf("Failed to get status for pod %q: %v", format.PodDesc(status.podName, status.podNamespace, uid), err)
	return
}
translatedUID := m.podManager.TranslatePodUID(pod.UID)
// Type convert original uid just for the purpose of comparison.
if len(translatedUID) > 0 && translatedUID != kubetypes.ResolvedPodUID(uid) {
	glog.V(2).Infof("Pod %q was deleted and then recreated, skipping status update; old UID %q, new UID %q", format.Pod(pod), uid, translatedUID)
	m.deletePodStatus(uid)
	return
}
pod.Status = status.status
// TODO: handle conflict as a retry, make that easier too.
newPod, err := m.kubeClient.CoreV1().Pods(pod.Namespace).UpdateStatus(pod)
if err != nil {
	glog.Warningf("Failed to update status for pod %q: %v", format.Pod(pod), err)
	return
}
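
For illustration, a sketch of what respecting optimistic concurrency could look like here, reusing identifiers from the snippet above; observedPod and computedStatus are hypothetical names for the pod object and status as they were when the status was computed, not fields the status manager actually has:

// Sketch only: keep the resourceVersion of the pod the status was computed
// from, so a concurrent change surfaces as a 409 Conflict instead of being
// silently overwritten. observedPod and computedStatus are hypothetical.
podCopy := observedPod.DeepCopy()
podCopy.Status = computedStatus
// podCopy still carries observedPod's resourceVersion; the apiserver will
// reject the update if the pod changed in the meantime.
if _, err := m.kubeClient.CoreV1().Pods(podCopy.Namespace).UpdateStatus(podCopy); err != nil {
	if errors.IsConflict(err) {
		// Re-observe the pod, recompute the status, and retry rather than
		// overwriting whatever is currently on the server.
		return
	}
	glog.Warningf("Failed to update status for pod %q: %v", format.Pod(podCopy), err)
	return
}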

sjenning commented Mar 1, 2018

switching back to this issue

sjenning commented Mar 1, 2018

I swear, I'm losing it. I totally forgot that this PR got reopened and merged upstream: kubernetes/kubernetes#54530. I'll cherry-pick it right now.
