Bug 1596440 - surface OOMKilled pod to build #20297

Conversation

@wozniakjan (Contributor) commented Jul 12, 2018

An init container can be OOMKilled as well; in that case the OOMKilled reason doesn't bubble up to pod.status.reason but remains in pod.status.initContainerStatuses.

I'm comparing against a string because it looks like the kubelet sets this as a raw string as well; I couldn't find any constant representing the OOMKilled reason:

case r.State.OOMKilled:
// TODO: consider exposing OOMKilled via the runtimeAPI.
// Note: if an application handles OOMKilled gracefully, the
// exit code could be zero.
reason = "OOMKilled"
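
For illustration, a minimal sketch of the resulting check (assuming the standard k8s.io/api/core/v1 types imported as corev1; this is a sketch of the approach, not the exact diff in this PR):

    func isOOMKilled(pod *corev1.Pod) bool {
        // The kubelet records the reason as a raw string and exports no
        // constant for it, hence the literal comparison.
        const oomKilledReason = "OOMKilled"

        if pod.Status.Reason == oomKilledReason {
            return true
        }
        // An init container can be OOMKilled too; in that case the reason stays
        // in pod.status.initContainerStatuses and never reaches pod.status.reason.
        for _, status := range pod.Status.InitContainerStatuses {
            if t := status.State.Terminated; t != nil && t.Reason == oomKilledReason {
                return true
            }
        }
        return false
    }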

fixes https://bugzilla.redhat.com/show_bug.cgi?id=1596440
ptal @openshift/sig-developer-experience

@wozniakjan wozniakjan added kind/bug Categorizes issue or PR as related to a bug. component/build labels Jul 12, 2018
@wozniakjan wozniakjan added this to the v3.11 milestone Jul 12, 2018
@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jul 12, 2018
@@ -514,6 +514,10 @@ const (
// range of build failures.
StatusReasonGenericBuildFailed StatusReason = "GenericBuildFailed"

// StatusReasonOOMKilled is the reason associated with the build pod being
// killed for an out of memory condition.
StatusReasonOOMKilled StatusReason = "OOMKilled"
Contributor:

reminder: have to be copied to openshift/api as well

Contributor Author:

Thanks for the reminder, I assume somewhere here

Btw. it appears to work even without the API change, so just out of pure curiosity, what is the purpose of having the constants duplicated in the API? I didn't find any errors or warnings in the logs.

$ oc get build
NAME      TYPE      FROM          STATUS                       STARTED          DURATION
bc1-1     Source    Git           Failed (FetchSourceFailed)   13 minutes ago   5s
bc1-2     Source    Git           Failed (FetchSourceFailed)   12 minutes ago   2s
bc1-3     Source    Git@eb6a686   Failed (AssembleFailed)      12 minutes ago   19s
bc1-4     Source    Git           Failed (OOMKilled)           10 minutes ago   2s
bc1-5     Source    Git           Failed (OOMKilled)           5 seconds ago    1s

$ oc describe build bc1-5 | grep 'Status:'
Status:         Failed (Out of memory)

Contributor:

what is the purpose of having the constants duplicated in the API?

Putting them in the api repo makes them available to other repos, like client tools, that want to use them. Also, technically the types in origin are the internal types, which have no guarantees about not being changed. Constants defined in the openshift/api repo (the external api) are part of the api spec, meaning clients can rely on them and we won't change their values.
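
To make that concrete, here is a small sketch of an external client relying on the published constant (assuming the constant lands in the external build/v1 package as openshift/api#71 proposes; the helper name is illustrative):

    import (
        "fmt"

        buildv1 "github.com/openshift/api/build/v1"
    )

    // describeFailure is a hypothetical client helper: it compares against the
    // exported constants instead of magic strings, which is only practical once
    // those constants live in the external API repo.
    func describeFailure(b *buildv1.Build) string {
        if b.Status.Phase == buildv1.BuildPhaseFailed &&
            b.Status.Reason == buildv1.StatusReasonOutOfMemoryKilled {
            return fmt.Sprintf("build %s was OOM killed", b.Name)
        }
        return fmt.Sprintf("build %s: %s/%s", b.Name, b.Status.Phase, b.Status.Reason)
    }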

@@ -540,6 +544,7 @@ const (
StatusMessageNoBuildContainerStatus = "The pod for this build has no container statuses indicating success or failure."
StatusMessageFailedContainer = "The pod for this build has at least one container with a non-zero exit status."
StatusMessageGenericBuildFailed = "Generic Build failure - check logs for details."
StatusMessageOOMKilled = "Out of memory"
Contributor:

I would maybe put more detail into this message: "what is out of memory?"

@@ -1014,6 +1016,25 @@ func (bc *BuildController) handleActiveBuild(build *buildapi.Build, pod *v1.Pod)
return update, nil
}

func isOOMKilled(pod *v1.Pod) bool {
if pod.Status.Reason == "OOMKilled" {
Contributor:

You have a constant for this, right?

Contributor Author:

I thought that since that constant is in the build context, it shouldn't be used to check the status reason for pods (they currently have the same string, but that could potentially change).

Contributor:

Yeah, technically they are two different constants. One is the "build failure reason" and the other is the "pod status reason", so I agree that keeping them as separate constants makes sense.

Contributor:

It's too bad there's no k8s constant for this that we can reference.
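
For clarity, the distinction discussed above as a self-contained sketch (illustrative only; these declarations mirror, not replace, the real ones):

    // StatusReason mirrors the build API's StatusReason string type.
    type StatusReason string

    const (
        // Build-level failure reason, defined alongside the other build constants.
        StatusReasonOOMKilled StatusReason = "OOMKilled"

        // Raw string the kubelet writes into pod and container statuses. Kept as
        // a separate notion on purpose: same value today, but it belongs to a
        // different API and could diverge independently.
        podReasonOOMKilled = "OOMKilled"
    )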

@wozniakjan (Contributor Author):

/retest


@wozniakjan wozniakjan force-pushed the bug-1596440/surface_oomkilled_in_build branch 4 times, most recently from 1998310 to 80f1364 Compare July 12, 2018 13:08
@@ -540,6 +543,7 @@ const (
StatusMessageNoBuildContainerStatus = "The pod for this build has no container statuses indicating success or failure."
StatusMessageFailedContainer = "The pod for this build has at least one container with a non-zero exit status."
StatusMessageGenericBuildFailed = "Generic Build failure - check logs for details."
StatusMessageOOMKilled = "The build pod was killed for 'Out of memory'"
Contributor:

"The build pod was killed due to an out of memory condition" (we don't actually know if it's the node that ran out of memory or the container exceeded its cgroup memory limit, right? @derekwaynecarr @mfojtik @sjenning)

@bparees (Contributor) commented Jul 12, 2018

One nit for discussion on the message, and let's get the API change in and use the constant here, but otherwise lgtm.

@bparees bparees self-assigned this Jul 12, 2018
@wozniakjan wozniakjan force-pushed the bug-1596440/surface_oomkilled_in_build branch from 80f1364 to a529a8e Compare July 12, 2018 14:18
@wozniakjan wozniakjan changed the title Bug 1596440 - surface OOMKilled pod to build [WIP] Bug 1596440 - surface OOMKilled pod to build Jul 12, 2018
@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 12, 2018
@wozniakjan wozniakjan force-pushed the bug-1596440/surface_oomkilled_in_build branch from a529a8e to a9c8e88 Compare July 12, 2018 15:28
@wozniakjan wozniakjan changed the title [WIP] Bug 1596440 - surface OOMKilled pod to build Bug 1596440 - surface OOMKilled pod to build Jul 12, 2018
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 12, 2018
@wozniakjan (Contributor Author):

@bparees thanks for the review; updated to match openshift/api#71 (OOM -> OutOfMemory).

@bparees (Contributor) commented Jul 12, 2018

@wozniakjan can you add a build extended test for this?

@wozniakjan wozniakjan changed the title Bug 1596440 - surface OOMKilled pod to build [WIP] Bug 1596440 - surface OOMKilled pod to build Jul 12, 2018
@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jul 12, 2018
@wozniakjan (Contributor Author):

/retest

@wozniakjan wozniakjan force-pushed the bug-1596440/surface_oomkilled_in_build branch from 31871e1 to e59dbb8 Compare July 13, 2018 09:15
@wozniakjan (Contributor Author):

/test unit

@wozniakjan wozniakjan changed the title [WIP] Bug 1596440 - surface OOMKilled pod to build Bug 1596440 - surface OOMKilled pod to build Jul 13, 2018
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 13, 2018
@wozniakjan (Contributor Author) commented Jul 13, 2018

@bparees thanks for the reminder to add an extended test; I ran it a couple of times today and it seems to work as expected.

EDIT: actually, I just finished another run of the extended tests and it failed to register the OOM kill:
https://ci.openshift.redhat.com/jenkins/job/test_branch_origin_extended_builds_debug/45/consoleFull#146509343056cbb9a5e4b02b88ae8c2f77

Is it possible that when a pod is OOMKilled, the build logic sets build.Status.Phase to Failed before the controller gets a chance to investigate?

if build.Status.Phase != buildapi.BuildPhaseFailed {
    // If a DeletionTimestamp has been set, it means that the pod will
    // soon be deleted. The build should be transitioned to the Error phase.
    if pod.DeletionTimestamp != nil {
        update = transitionToPhase(buildapi.BuildPhaseError, buildapi.StatusReasonBuildPodDeleted, buildapi.StatusMessageBuildPodDeleted)
    } else if isOOMKilled(pod) {
        update = transitionToPhase(buildapi.BuildPhaseFailed, buildapi.StatusReasonOutOfMemoryKilled, buildapi.StatusMessageOutOfMemoryKilled)
    } else {
        update = transitionToPhase(buildapi.BuildPhaseFailed, buildapi.StatusReasonGenericBuildFailed, buildapi.StatusMessageGenericBuildFailed)
    }
}

Would we maybe want to override the failure status here if we are confident it was OOMKilled, and move the OOMKill check before the build.Status.Phase if branch? A sketch of what I mean follows below.
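
For the record, a sketch of that reordering (illustrative only, reusing the identifiers from the snippet above; not the final diff):

    if isOOMKilled(pod) {
        // Check for an OOM kill first, so the reason can still be surfaced
        // even when the builder already marked the build Failed.
        update = transitionToPhase(buildapi.BuildPhaseFailed, buildapi.StatusReasonOutOfMemoryKilled, buildapi.StatusMessageOutOfMemoryKilled)
    } else if build.Status.Phase != buildapi.BuildPhaseFailed {
        if pod.DeletionTimestamp != nil {
            // If a DeletionTimestamp has been set, the pod will soon be
            // deleted; transition the build to the Error phase.
            update = transitionToPhase(buildapi.BuildPhaseError, buildapi.StatusReasonBuildPodDeleted, buildapi.StatusMessageBuildPodDeleted)
        } else {
            update = transitionToPhase(buildapi.BuildPhaseFailed, buildapi.StatusReasonGenericBuildFailed, buildapi.StatusMessageGenericBuildFailed)
        }
    }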

@wozniakjan wozniakjan changed the title Bug 1596440 - surface OOMKilled pod to build [WIP] Bug 1596440 - surface OOMKilled pod to build Jul 13, 2018
@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jul 13, 2018
@wozniakjan (Contributor Author):

/test extended_builds


@wozniakjan wozniakjan force-pushed the bug-1596440/surface_oomkilled_in_build branch from e2dfc84 to 3ce641c Compare July 13, 2018 13:04
@bparees (Contributor) commented Jul 13, 2018

Is it possible that when a pod is OOMKilled, the build logic sets build.Status.Phase to Failed before the controller gets a chance to investigate?

Generally, the only time the build logic running inside the pod will update build.Status.Phase is when it knows it's going to fail (such as when the git clone fails). The container being OOMKilled would not be such a case; by the time it's OOMKilled there's no way for it to update build.Status.Phase (because the process that would need to do that update just got OOMKilled). At least as near as I can reason.

But some judicious tracing should show you who's setting the phase to failed.

That said, checking whether the pod was OOMKilled when the build controller sees a build updated to Failed, and then adding the status, would not be the worst thing in the world. But it's not great because, like the log-snippet issue I just fixed, it means that someone watching builds will first see the build transition to Failed and then see a follow-on update adding the reason for the failure. It's better if they get all the information in the update that set the build to Failed.

@wozniakjan (Contributor Author) commented Jul 17, 2018

The container being OOMKilled would not be such a case; by the time it's OOMKilled there's no way for it to update build.Status.Phase (because the process that would need to do that update just got OOMKilled). At least as near as I can reason.

I would have thought that as well, but I can now reproduce a situation where the init container is OOMKilled and the build populates its failure status before the pod updates initContainerStatuses, so build_controller.go doesn't yet know about the OOM kill.

It's better if they get all the information in the update that set the build to failed.

I agree, but at the moment I don't know how to achieve this without changing a lot of builder logic. For now, I would like to rewrite the OOMKilled test to wait for the updated status, and I hope this change won't introduce any flakiness.

Description of how I came up with the above statements:
I took your advice about tracing and added a few debug logs in wozniakjan#5. Interpreting the logs shows the following sequence of events:

  1. The build pod fails at:
    if err != nil {
    c.build.Status.Phase = buildapiv1.BuildPhaseFailed
    c.build.Status.Reason = buildapiv1.StatusReasonFetchSourceFailed
    c.build.Status.Message = builderutil.StatusMessageFetchSourceFailed
    fmt.Println("# builder 178")
    return err
    }
  2. builder.go populates build.status to Failed while the pod is still Pending:
I0717 10:38:48.958687       1 build_controller.go:377] ### PRE  12 POD statusfail-oomkilled-17-build[Pending] init: [RUNNING(), WAITING(PodInitializing), ], cont: [WAITING(PodInitializing), ]
I0717 10:38:48.958741       1 build_controller.go:378] ### PRE  12 BUILD statusfail-oomkilled-17[Failed] [FetchSourceFailed]
I0717 10:38:48.958848       1 build_controller.go:381] ### POST 12 POD statusfail-oomkilled-17-build[Pending] init: [RUNNING(), WAITING(PodInitializing), ], cont: [WAITING(PodInitializing), ]
I0717 10:38:48.958933       1 build_controller.go:382] ### POST 12 UPDATE buildUpdate(completionTime: "2018-07-17 10:38:48.958797705 +0000 UTC m=+166.524519037", duration: "2s")
I0717 10:38:48.958952       1 build_controller.go:383] ### POST 12 BUILD statusfail-oomkilled-17[Failed] [FetchSourceFailed]
  3. When build_controller.go sees the OOMKilled pod, the status is already populated as described above:
I0717 10:38:49.948678       1 build_controller.go:377] ### PRE  15 POD statusfail-oomkilled-17-build[Failed] init: [TERMINATED(OOMKilled), WAITING(PodInitializing), ], cont: [WAITING(PodInitializing), ]
I0717 10:38:49.948731       1 build_controller.go:378] ### PRE  15 BUILD statusfail-oomkilled-17[Failed] [FetchSourceFailed]
I0717 10:38:49.948778       1 build_controller.go:381] ### POST 15 POD statusfail-oomkilled-17-build[Failed] init: [TERMINATED(OOMKilled), WAITING(PodInitializing), ], cont: [WAITING(PodInitializing), ]
I0717 10:38:49.948795       1 build_controller.go:382] ### POST 15 UPDATE nil
I0717 10:38:49.948811       1 build_controller.go:383] ### POST 15 BUILD statusfail-oomkilled-17[Failed] [FetchSourceFailed]
  4. build_controller.go never gets a chance to update the reason because of this switch:
    case shouldCancel(build):
    update, err = bc.cancelBuild(build)
    case build.Status.Phase == buildapi.BuildPhaseNew:
    update, err = bc.handleNewBuild(build, pod)
    case build.Status.Phase == buildapi.BuildPhasePending,
    build.Status.Phase == buildapi.BuildPhaseRunning:
    update, err = bc.handleActiveBuild(build, pod)
    case buildutil.IsBuildComplete(build):
    update, err = bc.handleCompletedBuild(build, pod)
    }

Full logs:

augmented controller logs
I0717 10:38:46.757879       1 build_controller.go:377] ### PRE  0 POD nil
I0717 10:38:46.757896       1 build_controller.go:378] ### PRE  0 BUILD statusfail-oomkilled-17[New] []
I0717 10:38:46.781693       1 build_controller.go:1428] ### transitioning to Pending
I0717 10:38:46.781714       1 build_controller.go:381] ### POST 0 POD nil
I0717 10:38:46.781738       1 build_controller.go:382] ### POST 0 UPDATE buildUpdate(phase: "Pending", reason: "", message: "", podName: "statusfail-oomkilled-17-build")
I0717 10:38:46.781753       1 build_controller.go:383] ### POST 0 BUILD statusfail-oomkilled-17[New] []
I0717 10:38:46.781759       1 build_controller.go:390] ###
I0717 10:38:46.804798       1 build_controller.go:377] ### PRE  1 POD statusfail-oomkilled-17-build[Pending] init: [], cont: []
I0717 10:38:46.804813       1 build_controller.go:378] ### PRE  1 BUILD statusfail-oomkilled-17[New] []
I0717 10:38:46.804822       1 build_controller.go:1428] ### transitioning to Pending
I0717 10:38:46.804828       1 build_controller.go:381] ### POST 1 POD statusfail-oomkilled-17-build[Pending] init: [], cont: []
I0717 10:38:46.804836       1 build_controller.go:382] ### POST 1 UPDATE buildUpdate(phase: "Pending", reason: "", message: "")
I0717 10:38:46.804840       1 build_controller.go:383] ### POST 1 BUILD statusfail-oomkilled-17[New] []
I0717 10:38:46.804843       1 build_controller.go:390] ###
I0717 10:38:46.840944       1 build_controller.go:377] ### PRE  2 POD statusfail-oomkilled-17-build[Pending] init: [WAITING(PodInitializing), WAITING(PodInitializing), ], cont: [WAITING(PodInitializing), ]
I0717 10:38:46.840968       1 build_controller.go:378] ### PRE  2 BUILD statusfail-oomkilled-17[New] []
I0717 10:38:46.840982       1 build_controller.go:1428] ### transitioning to Pending
I0717 10:38:46.840993       1 build_controller.go:381] ### POST 2 POD statusfail-oomkilled-17-build[Pending] init: [WAITING(PodInitializing), WAITING(PodInitializing), ], cont: [WAITING(PodInitializing), ]
I0717 10:38:46.841035       1 build_controller.go:382] ### POST 2 UPDATE buildUpdate(phase: "Pending", reason: "", message: "")
I0717 10:38:46.841042       1 build_controller.go:383] ### POST 2 BUILD statusfail-oomkilled-17[New] []
I0717 10:38:46.841048       1 build_controller.go:390] ###
I0717 10:38:46.957944       1 build_controller.go:377] ### PRE  3 POD statusfail-oomkilled-17-build[Pending] init: [WAITING(PodInitializing), WAITING(PodInitializing), ], cont: [WAITING(PodInitializing), ]
I0717 10:38:46.957969       1 build_controller.go:378] ### PRE  3 BUILD statusfail-oomkilled-17[Pending] []
I0717 10:38:46.957981       1 build_controller.go:381] ### POST 3 POD statusfail-oomkilled-17-build[Pending] init: [WAITING(PodInitializing), WAITING(PodInitializing), ], cont: [WAITING(PodInitializing), ]
I0717 10:38:47.850517       1 build_controller.go:377] ### PRE  3 POD statusfail-oomkilled-17-build[Pending] init: [RUNNING(), WAITING(PodInitializing), ], cont: [WAITING(PodInitializing), ]
I0717 10:38:47.850534       1 build_controller.go:378] ### PRE  3 BUILD statusfail-oomkilled-17[Pending] []
I0717 10:38:47.850540       1 build_controller.go:1428] ### transitioning to Running
I0717 10:38:47.850547       1 build_controller.go:381] ### POST 3 POD statusfail-oomkilled-17-build[Pending] init: [RUNNING(), WAITING(PodInitializing), ], cont: [WAITING(PodInitializing), ]
I0717 10:38:47.850573       1 build_controller.go:382] ### POST 3 UPDATE buildUpdate(phase: "Running", reason: "", message: "", startTime: "2018-07-17 10:38:46 +0000 UTC")
I0717 10:38:47.850577       1 build_controller.go:383] ### POST 3 BUILD statusfail-oomkilled-17[Pending] []
I0717 10:38:47.850581       1 build_controller.go:390] ###
I0717 10:38:47.957911       1 build_controller.go:377] ### PRE  4 POD statusfail-oomkilled-17-build[Pending] init: [RUNNING(), WAITING(PodInitializing), ], cont: [WAITING(PodInitializing), ]
I0717 10:38:47.957930       1 build_controller.go:378] ### PRE  4 BUILD statusfail-oomkilled-17[Running] []
I0717 10:38:47.957936       1 build_controller.go:1428] ### transitioning to Pending
I0717 10:38:47.957943       1 build_controller.go:381] ### POST 4 POD statusfail-oomkilled-17-build[Pending] init: [RUNNING(), WAITING(PodInitializing), ], cont: [WAITING(PodInitializing), ]
I0717 10:38:47.957975       1 build_controller.go:382] ### POST 4 UPDATE buildUpdate(phase: "Pending", reason: "", message: "")
I0717 10:38:47.957979       1 build_controller.go:383] ### POST 4 BUILD statusfail-oomkilled-17[Running] []
I0717 10:38:47.957982       1 build_controller.go:390] ###
I0717 10:38:47.963146       1 build_controller.go:377] ### PRE  5 POD statusfail-oomkilled-17-build[Pending] init: [RUNNING(), WAITING(PodInitializing), ], cont: [WAITING(PodInitializing), ]
I0717 10:38:47.963164       1 build_controller.go:378] ### PRE  5 BUILD statusfail-oomkilled-17[Running] []
I0717 10:38:47.963171       1 build_controller.go:1428] ### transitioning to Pending
I0717 10:38:47.963181       1 build_controller.go:381] ### POST 5 POD statusfail-oomkilled-17-build[Pending] init: [RUNNING(), WAITING(PodInitializing), ], cont: [WAITING(PodInitializing), ]
I0717 10:38:47.963216       1 build_controller.go:382] ### POST 5 UPDATE buildUpdate(phase: "Pending", reason: "", message: "")
I0717 10:38:47.963222       1 build_controller.go:383] ### POST 5 BUILD statusfail-oomkilled-17[Running] []
I0717 10:38:47.963227       1 build_controller.go:390] ###
I0717 10:38:47.973417       1 build_controller.go:377] ### PRE  6 POD statusfail-oomkilled-17-build[Pending] init: [RUNNING(), WAITING(PodInitializing), ], cont: [WAITING(PodInitializing), ]
I0717 10:38:47.973432       1 build_controller.go:378] ### PRE  6 BUILD statusfail-oomkilled-17[Running] []
I0717 10:38:47.973437       1 build_controller.go:1428] ### transitioning to Pending
I0717 10:38:47.973445       1 build_controller.go:381] ### POST 6 POD statusfail-oomkilled-17-build[Pending] init: [RUNNING(), WAITING(PodInitializing), ], cont: [WAITING(PodInitializing), ]
I0717 10:38:47.973454       1 build_controller.go:382] ### POST 6 UPDATE buildUpdate(phase: "Pending", reason: "", message: "")
I0717 10:38:47.973458       1 build_controller.go:383] ### POST 6 BUILD statusfail-oomkilled-17[Running] []
I0717 10:38:47.973468       1 build_controller.go:390] ###
I0717 10:38:47.994058       1 build_controller.go:377] ### PRE  7 POD statusfail-oomkilled-17-build[Pending] init: [RUNNING(), WAITING(PodInitializing), ], cont: [WAITING(PodInitializing), ]
I0717 10:38:47.994384       1 build_controller.go:378] ### PRE  7 BUILD statusfail-oomkilled-17[Running] []
I0717 10:38:47.994453       1 build_controller.go:1428] ### transitioning to Pending
I0717 10:38:47.994499       1 build_controller.go:381] ### POST 7 POD statusfail-oomkilled-17-build[Pending] init: [RUNNING(), WAITING(PodInitializing), ], cont: [WAITING(PodInitializing), ]
I0717 10:38:47.994544       1 build_controller.go:382] ### POST 7 UPDATE buildUpdate(phase: "Pending", reason: "", message: "")
I0717 10:38:47.994567       1 build_controller.go:383] ### POST 7 BUILD statusfail-oomkilled-17[Running] []
I0717 10:38:47.994587       1 build_controller.go:390] ###
I0717 10:38:48.035044       1 build_controller.go:377] ### PRE  8 POD statusfail-oomkilled-17-build[Pending] init: [RUNNING(), WAITING(PodInitializing), ], cont: [WAITING(PodInitializing), ]
I0717 10:38:48.035117       1 build_controller.go:378] ### PRE  8 BUILD statusfail-oomkilled-17[Running] []
I0717 10:38:48.035147       1 build_controller.go:1428] ### transitioning to Pending
I0717 10:38:48.035265       1 build_controller.go:381] ### POST 8 POD statusfail-oomkilled-17-build[Pending] init: [RUNNING(), WAITING(PodInitializing), ], cont: [WAITING(PodInitializing), ]
I0717 10:38:48.035315       1 build_controller.go:382] ### POST 8 UPDATE buildUpdate(phase: "Pending", reason: "", message: "")
I0717 10:38:48.035350       1 build_controller.go:383] ### POST 8 BUILD statusfail-oomkilled-17[Running] []
I0717 10:38:48.035370       1 build_controller.go:390] ###
I0717 10:38:48.116138       1 build_controller.go:377] ### PRE  9 POD statusfail-oomkilled-17-build[Pending] init: [RUNNING(), WAITING(PodInitializing), ], cont: [WAITING(PodInitializing), ]
I0717 10:38:48.116364       1 build_controller.go:378] ### PRE  9 BUILD statusfail-oomkilled-17[Running] []
I0717 10:38:48.116395       1 build_controller.go:1428] ### transitioning to Pending
I0717 10:38:48.116424       1 build_controller.go:381] ### POST 9 POD statusfail-oomkilled-17-build[Pending] init: [RUNNING(), WAITING(PodInitializing), ], cont: [WAITING(PodInitializing), ]
I0717 10:38:48.116455       1 build_controller.go:382] ### POST 9 UPDATE buildUpdate(phase: "Pending", reason: "", message: "")
I0717 10:38:48.116470       1 build_controller.go:383] ### POST 9 BUILD statusfail-oomkilled-17[Running] []
I0717 10:38:48.116483       1 build_controller.go:390] ###
I0717 10:38:48.276963       1 build_controller.go:377] ### PRE  10 POD statusfail-oomkilled-17-build[Pending] init: [RUNNING(), WAITING(PodInitializing), ], cont: [WAITING(PodInitializing), ]
I0717 10:38:48.277031       1 build_controller.go:378] ### PRE  10 BUILD statusfail-oomkilled-17[Running] []
I0717 10:38:48.277062       1 build_controller.go:1428] ### transitioning to Pending
I0717 10:38:48.277098       1 build_controller.go:381] ### POST 10 POD statusfail-oomkilled-17-build[Pending] init: [RUNNING(), WAITING(PodInitializing), ], cont: [WAITING(PodInitializing), ]
I0717 10:38:48.277141       1 build_controller.go:382] ### POST 10 UPDATE buildUpdate(phase: "Pending", reason: "", message: "")
I0717 10:38:48.277165       1 build_controller.go:383] ### POST 10 BUILD statusfail-oomkilled-17[Running] []
I0717 10:38:48.277261       1 build_controller.go:390] ###
I0717 10:38:48.597614       1 build_controller.go:377] ### PRE  11 POD statusfail-oomkilled-17-build[Pending] init: [RUNNING(), WAITING(PodInitializing), ], cont: [WAITING(PodInitializing), ]
I0717 10:38:48.597674       1 build_controller.go:378] ### PRE  11 BUILD statusfail-oomkilled-17[Running] []
I0717 10:38:48.597694       1 build_controller.go:1428] ### transitioning to Pending
I0717 10:38:48.597717       1 build_controller.go:381] ### POST 11 POD statusfail-oomkilled-17-build[Pending] init: [RUNNING(), WAITING(PodInitializing), ], cont: [WAITING(PodInitializing), ]
I0717 10:38:48.597771       1 build_controller.go:382] ### POST 11 UPDATE buildUpdate(phase: "Pending", reason: "", message: "")
I0717 10:38:48.597788       1 build_controller.go:383] ### POST 11 BUILD statusfail-oomkilled-17[Running] []
I0717 10:38:48.597801       1 build_controller.go:390] ###
I0717 10:38:48.958687       1 build_controller.go:377] ### PRE  12 POD statusfail-oomkilled-17-build[Pending] init: [RUNNING(), WAITING(PodInitializing), ], cont: [WAITING(PodInitializing), ]
I0717 10:38:48.958741       1 build_controller.go:378] ### PRE  12 BUILD statusfail-oomkilled-17[Failed] [FetchSourceFailed]
I0717 10:38:48.958848       1 build_controller.go:381] ### POST 12 POD statusfail-oomkilled-17-build[Pending] init: [RUNNING(), WAITING(PodInitializing), ], cont: [WAITING(PodInitializing), ]
I0717 10:38:48.958933       1 build_controller.go:382] ### POST 12 UPDATE buildUpdate(completionTime: "2018-07-17 10:38:48.958797705 +0000 UTC m=+166.524519037", duration: "2s")
I0717 10:38:48.958952       1 build_controller.go:383] ### POST 12 BUILD statusfail-oomkilled-17[Failed] [FetchSourceFailed]
I0717 10:38:48.958967       1 build_controller.go:390] ###
I0717 10:38:49.158942       1 build_controller.go:377] ### PRE  13 POD statusfail-oomkilled-17-build[Pending] init: [RUNNING(), WAITING(PodInitializing), ], cont: [WAITING(PodInitializing), ]
I0717 10:38:49.159003       1 build_controller.go:378] ### PRE  13 BUILD statusfail-oomkilled-17[Failed] [FetchSourceFailed]
I0717 10:38:49.159031       1 build_controller.go:381] ### POST 13 POD statusfail-oomkilled-17-build[Pending] init: [RUNNING(), WAITING(PodInitializing), ], cont: [WAITING(PodInitializing), ]
I0717 10:38:49.159046       1 build_controller.go:382] ### POST 13 UPDATE nil
I0717 10:38:49.159061       1 build_controller.go:383] ### POST 13 BUILD statusfail-oomkilled-17[Failed] [FetchSourceFailed]
I0717 10:38:49.159084       1 build_controller.go:390] ###
I0717 10:38:49.238181       1 build_controller.go:377] ### PRE  14 POD statusfail-oomkilled-17-build[Pending] init: [RUNNING(), WAITING(PodInitializing), ], cont: [WAITING(PodInitializing), ]
I0717 10:38:49.238308       1 build_controller.go:378] ### PRE  14 BUILD statusfail-oomkilled-17[Failed] [FetchSourceFailed]
I0717 10:38:49.238341       1 build_controller.go:381] ### POST 14 POD statusfail-oomkilled-17-build[Pending] init: [RUNNING(), WAITING(PodInitializing), ], cont: [WAITING(PodInitializing), ]
I0717 10:38:49.238387       1 build_controller.go:382] ### POST 14 UPDATE nil
I0717 10:38:49.238407       1 build_controller.go:383] ### POST 14 BUILD statusfail-oomkilled-17[Failed] [FetchSourceFailed]
I0717 10:38:49.238426       1 build_controller.go:390] ###
I0717 10:38:49.948678       1 build_controller.go:377] ### PRE  15 POD statusfail-oomkilled-17-build[Failed] init: [TERMINATED(OOMKilled), WAITING(PodInitializing), ], cont: [WAITING(PodInitializing), ]
I0717 10:38:49.948731       1 build_controller.go:378] ### PRE  15 BUILD statusfail-oomkilled-17[Failed] [FetchSourceFailed]
I0717 10:38:49.948778       1 build_controller.go:381] ### POST 15 POD statusfail-oomkilled-17-build[Failed] init: [TERMINATED(OOMKilled), WAITING(PodInitializing), ], cont: [WAITING(PodInitializing), ]
I0717 10:38:49.948795       1 build_controller.go:382] ### POST 15 UPDATE nil
I0717 10:38:49.948811       1 build_controller.go:383] ### POST 15 BUILD statusfail-oomkilled-17[Failed] [FetchSourceFailed]
I0717 10:38:49.948825       1 build_controller.go:390] ###
$ oc logs build/statusfail-oomkilled-17
Cloning "https://github.com/openshift/ruby-hello-world" ...
# builder 178
error: signal: killed
$ oc describe pod
Name:           statusfail-oomkilled-17-build
Namespace:      myproject
Node:           localhost/192.168.122.76
Start Time:     Tue, 17 Jul 2018 06:38:46 -0400
Labels:         openshift.io/build.name=statusfail-oomkilled-17
Annotations:    openshift.io/build.name=statusfail-oomkilled-17
                openshift.io/scc=privileged
Status:         Failed
IP:             172.17.0.4
Controlled By:  Build/statusfail-oomkilled-17
Init Containers:
  git-clone:
    Container ID:  docker://88e1e186ee188428132e2cbcd6684a54d2e6598d62f11c6a44a6413f9cf3bb29
    Image:         openshift/origin-docker-builder:v3.11
    Image ID:      docker://sha256:944baf0428aa27e155f26a8d8ea3f9a3c0363773e9a4cc0b8efc441c3d12369e
    Port:          <none>
    Host Port:     <none>
    Command:
      openshift-git-clone
    Args:
      --loglevel=0
    State:      Terminated
      Reason:   OOMKilled
      Message:  Cloning "https://github.com/openshift/ruby-hello-world" ...
# builder 178
error: signal: killed

      Exit Code:    1
      Started:      Tue, 17 Jul 2018 06:38:47 -0400
      Finished:     Tue, 17 Jul 2018 06:38:48 -0400
    Ready:          False
    Restart Count:  0
    Limits:
      memory:  20Mi
    Requests:
      memory:  20Mi
    Environment:
$ oc describe build
Name:           statusfail-oomkilled-17
Namespace:      myproject
Created:        About a minute ago
Labels:         buildconfig=statusfail-oomkilled
                openshift.io/build-config.name=statusfail-oomkilled
                openshift.io/build.start-policy=Serial
Annotations:    openshift.io/build-config.name=statusfail-oomkilled
                openshift.io/build.number=17
                openshift.io/build.pod-name=statusfail-oomkilled-17-build

Status:         Failed (Failed to fetch the input source.)
Started:        Tue, 17 Jul 2018 06:38:46 EDT
Duration:       2s

Build Config:   statusfail-oomkilled
Build Pod:      statusfail-oomkilled-17-build

Strategy:       Source
URL:            https://github.com/openshift/ruby-hello-world
From Image:     DockerImage centos/ruby-23-centos7:latest

Build trigger cause:    Manually triggered

Events:
  Type          Reason          Age     From                    Message
  ----          ------          ----    ----                    -------
  Normal        Scheduled       1m      default-scheduler       Successfully assigned myproject/statusfail-oomkilled-17-build to localhost
  Normal        BuildStarted    1m      build-controller        Build myproject/statusfail-oomkilled-17 is now running
  Normal        Pulled          1m      kubelet, localhost      Container image "openshift/origin-docker-builder:v3.11" already present on machine
  Normal        Created         1m      kubelet, localhost      Created container
  Normal        Started         1m      kubelet, localhost      Started container
  Normal        BuildFailed     1m      build-controller        Build myproject/statusfail-oomkilled-17 failed

@bparees (Contributor) commented Jul 17, 2018

I would like to rewrite the OOMKilled test to wait for the updated status and I hope this change won't introduce any flakiness.

Discussed on IRC. What is happening here is that the git clone process is being killed, but the container's PID 1 (the build logic) is not, so it sees the git clone process exit and marks the build failed (without indicating OOM killed as the reason, because it only knows the git clone died, not why).

So in this case we'll see two updates to the build (one indicating reason FetchSourceFailed, and a second indicating OOM killed). I think we can live with that.

@wozniakjan wozniakjan force-pushed the bug-1596440/surface_oomkilled_in_build branch from 3ce641c to 01eab5b Compare July 18, 2018 14:56
@wozniakjan wozniakjan changed the title [WIP] Bug 1596440 - surface OOMKilled pod to build Bug 1596440 - surface OOMKilled pod to build Jul 18, 2018
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 18, 2018
@wozniakjan (Contributor Author):

/test gcp

// can get OOMKilled, which confuses the builder into prematurely populating
// the failure reason
if oldBuild.Status.Phase == buildapi.BuildPhaseFailed &&
newBuild.Status.Reason != buildapi.StatusReasonOutOfMemoryKilled {
Contributor Author:

@bparees I think the API was overriding the reason and message here. With this changed, I don't get any flakes testing this (though both the reason and the message get updated after being set from builder.go).

Contributor:

So without this change, was the reason never getting set to OOM, or was it only sometimes getting set?

Contributor Author:

Most of the time it was setting OOM; only sometimes, when builder.go set it first, it wasn't.

Contributor:

Well, it's an ugly hack, but I don't have a better idea at the moment :)
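
For context, an illustrative completion of the guard in the hunk above (assumed shape only; the surrounding update logic is not shown in this diff):

    // The builder can prematurely record FetchSourceFailed when its child
    // process is OOMKilled; preserve the old reason/message on updates to an
    // already-Failed build unless the incoming update is the OutOfMemoryKilled
    // correction from the build controller.
    if oldBuild.Status.Phase == buildapi.BuildPhaseFailed &&
        newBuild.Status.Reason != buildapi.StatusReasonOutOfMemoryKilled {
        newBuild.Status.Reason = oldBuild.Status.Reason
        newBuild.Status.Message = oldBuild.Status.Message
    }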

@bparees (Contributor) commented Jul 18, 2018

/approve
/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 18, 2018
@openshift-ci-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bparees, wozniakjan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 18, 2018
@openshift-ci-robot commented Jul 18, 2018

@wozniakjan: The following test failed, say /retest to rerun them all:

Test name: ci/prow/cross · Commit: 3bd1d95 · Rerun command: /test cross

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@wozniakjan (Contributor Author):

/test gcp

@openshift-merge-robot openshift-merge-robot merged commit 6b0c540 into openshift:master Jul 19, 2018