-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug 1596440 - surface OOMKilled pod to build #20297
Bug 1596440 - surface OOMKilled pod to build #20297
Conversation
pkg/build/apis/build/types.go
Outdated
@@ -514,6 +514,10 @@ const ( | |||
// range of build failures. | |||
StatusReasonGenericBuildFailed StatusReason = "GenericBuildFailed" | |||
|
|||
// StatusReasonGenericBuildFailed is the reason associated with a broad | |||
// range of build failures. | |||
StatusReasonOOMKilled StatusReason = "OOMKilled" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reminder: have to be copied to openshift/api as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the reminder, I assume somewhere here
Btw. it appears to work even without the API change, so just out of pure curiosity, what is the purpose of having the constants duplicated in the API? I didn't find any errors or warnings in the logs.
$ oc get build
NAME TYPE FROM STATUS STARTED DURATION
bc1-1 Source Git Failed (FetchSourceFailed) 13 minutes ago 5s
bc1-2 Source Git Failed (FetchSourceFailed) 12 minutes ago 2s
bc1-3 Source Git@eb6a686 Failed (AssembleFailed) 12 minutes ago 19s
bc1-4 Source Git Failed (OOMKilled) 10 minutes ago 2s
bc1-5 Source Git Failed (OOMKilled) 5 seconds ago 1s
$ oc describe build bc1-5 | grep 'Status:'
Status: Failed (Out of memory)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the purpose of having the constants duplicated in the API?
putting them in the api repo makes them available to other repos like client tools that want to use them. Also technically the types in origin are the internal types which do not have any guarantees about not being changed. Constants defined in the openshift/api repo (the external api) are part of the api spec, meaning we clients can rely on them and we won't change their values.
pkg/build/apis/build/types.go
Outdated
@@ -540,6 +544,7 @@ const ( | |||
StatusMessageNoBuildContainerStatus = "The pod for this build has no container statuses indicating success or failure." | |||
StatusMessageFailedContainer = "The pod for this build has at least one container with a non-zero exit status." | |||
StatusMessageGenericBuildFailed = "Generic Build failure - check logs for details." | |||
StatusMessageOOMKilled = "Out of memory" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i would maybe put more details into this message: "what is out of memory?"
@@ -1014,6 +1016,25 @@ func (bc *BuildController) handleActiveBuild(build *buildapi.Build, pod *v1.Pod) | |||
return update, nil | |||
} | |||
|
|||
func isOOMKilled(pod *v1.Pod) bool { | |||
if pod.Status.Reason == "OOMKilled" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you have constant for this right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought since that constant is in builds
context, it shouldn't be used to check the status reason for pods
(they currently have the same string describing it but potentially this could change, maybe?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah technically they are two different constants. One is the "build failure reason" and the other is the "pod status reason" so i agree that keeping them as separate constants makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's too bad there's no k8s constant for this that we can reference.
/retest
|
1998310
to
80f1364
Compare
pkg/build/apis/build/types.go
Outdated
@@ -540,6 +543,7 @@ const ( | |||
StatusMessageNoBuildContainerStatus = "The pod for this build has no container statuses indicating success or failure." | |||
StatusMessageFailedContainer = "The pod for this build has at least one container with a non-zero exit status." | |||
StatusMessageGenericBuildFailed = "Generic Build failure - check logs for details." | |||
StatusMessageOOMKilled = "The build pod was killed for 'Out of memory'" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"The build pod was killed due to an out of memory condition" (we don't actually know if it's the node that ran out of memory or the container exceeded its cgroup memory limit, right? @derekwaynecarr @mfojtik @sjenning)
one nit for discussion on the message and let's get the api change in and use the constant here, but otherwise lgtm. |
80f1364
to
a529a8e
Compare
a529a8e
to
a9c8e88
Compare
@bparees thanks for the review, updated to match the openshift/api#71 |
@wozniakjan can you add a build extended test for this? |
/retest |
31871e1
to
e59dbb8
Compare
/test unit |
@bparees thanks for reminding to add an extended test, ran it couple of times today and seems to work as expected. EDIT: actually, just finished another run of extended tests and it failed to register oomkill is it possible that when OOMKilled pod happens, build logic sets the build.Status.Phase to failed before the controller gets a chance to investigate? origin/pkg/build/controller/build/build_controller.go Lines 1004 to 1014 in e59dbb8
Would we maybe want to override the failure status here if we are confident it was OOMKilled and move the OOMKill check before |
/test extended_builds
|
e2dfc84
to
3ce641c
Compare
generally the only time the build logic running inside the pod will update the build.Status.Phase is when it knows it's going to fail (such as when the git clone fails). The container being OOMkilled would not be such a case, by the time it's oomkilled there's no way for it to update the build.status.phase (because the process that would need to do that update, just got oomkilled). At least near as i can reason. But some judicious tracing should show you who's setting the phase to failed. That said, checking to see if the pod was oomkilled, when the build controller sees a build updated to failed, and then adding the status would not be the worst thing in the world. But it's not great because like the logsnippet issue I just fixed, it means that someone watching builds will first see the build transition to failed, and then see a follow-on update adding the reason for the failure. It's better if they get all the information in the update that set the build to failed. |
I would have thought that as well, but I can now reproduce a situation, where init container is being OOMKilled and the build populates its failure status before the pod updates
I agree, but at this moment, I don't know how to achieve this without changing a lot of builder logic. For now, I would like to rewrite the OOMKilled test to wait for the updated status and I hope this change won't introduce any flakiness. Description of how I came up with the above statements:
Full logs: augmented controller logs
$ oc logs build/statusfail-oomkilled-17
$ oc describe pod
$ oc describe build
|
discussed on irc. what is happening here is the git clone process is being killed, but the container's pid 1 (the build logic) is not, so it sees the git clone process exited and marks the build failed (w/o indicating oom killed as the reason because it just knows the git clone died, not why). so in this case we'll see two updates to the build (one indicating reason fetch source failed, and the second indicating oom killed). I think we can live w/ that. |
3ce641c
to
01eab5b
Compare
/test gcp |
// can get OOMKilled and this confuses builder to prematurely populate | ||
// failure reason | ||
if oldBuild.Status.Phase == buildapi.BuildPhaseFailed && | ||
newBuild.Status.Reason != buildapi.StatusReasonOutOfMemoryKilled { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bparees I think the API was overriding the reason
and message
here. With this changed, I don't get any flakes testing this (but both reason
and message
updates after being set from builder.go
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so without this change the reason was never getting set to OOM, or it only sometimes getting set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
most of the time it was setting OOM, only sometimes, when builder.go
set it first, it wasn't
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, it's an ugly hack, but i don't have a better idea at the moment :)
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bparees, wozniakjan The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@wozniakjan: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
/test gcp |
Init container can be OOMKilled as well, that way the OOMKilled doesn't bubble up to the
pod.status.reason
but remains amongpod.status.initcontainerstatuses
.Comparing as a string because it looks like kubelet is setting this as a raw string as well, I couldn't find any constants representing OOMKilled reason
origin/vendor/k8s.io/kubernetes/pkg/kubelet/dockershim/docker_container.go
Lines 355 to 359 in ea877ac
fixes https://bugzilla.redhat.com/show_bug.cgi?id=1596440
ptal @openshift/sig-developer-experience