Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Buildkitd logs "failed to kill process in container id <cid>: buildkit-runc did not terminate successfully: container not running" for hours #4483

Open
dnephin opened this issue Dec 13, 2023 · 19 comments

Comments

@dnephin
Copy link
Member

dnephin commented Dec 13, 2023

Hello! 👋

Problem

We are seeing the buildkitd logs full of the following lines:

level=debug msg="sending sigkill to process in container 7l2ykurlf4pziiza5pnf97raw"
level=error msg="failed to kill process in container id 7l2ykurlf4pziiza5pnf97raw: buildkit-runc did not terminate successfully: exit status 1: container not running\n"

The container id is always the same. These lines are repeated for many hours, sometimes days. The frequency is about 1000/minute (~15/s). We are seeing the buildkitd healthcheck fail, and it's not clear if the cause is these errors or if the errors are simply making it more difficult to determine the underlying cause.

A pod restart clears out the problem for a bit, but it generally happens again within a few hours.

Seems like the logs come from:

bklog.G(ctx).Debugf("sending sigkill to process in container %s", k.id)
defer func() {
if err != nil {
bklog.G(ctx).Errorf("failed to kill process in container id %s: %+v", k.id, err)

Before the log spam starts I don't see anything particularlly interesting in the logs

debug time="2023-12-12T08:38:23Z" level=debug msg="removed snapshot" key=buildkit/10364/nekk455whng4seiauy6yz3w4k-view snapshotter=overlayfs
debug time="2023-12-12T08:38:23Z" level=debug msg="content garbage collected" d=29.017431ms
debug time="2023-12-12T08:38:23Z" level=debug msg="snapshot garbage collected" d=415.90514ms snapshotter=overlayfs
debug time="2023-12-12T08:38:23Z" level=debug msg="gc cleaned up 2099235886 bytes"
debug time="2023-12-12T08:39:20Z" level=debug msg="session finished: <nil>" spanID=e936167880750bc7 traceID=ec6a365de81b3b59b41ae2dc13e6e08c
debug time="2023-12-12T08:39:20Z" level=debug msg="sending sigkill to process in container 7l2ykurlf4pziiza5pnf97raw"

Details

Version: buildkit:v0.12.3-rootless (we also saw this problem with 0.11.0 before upgrade)

We run 4 buildkitd runner with --oci-worker-no-process-sandbox. All of these seem to manifest this behaviour at one time or another (not necessarily at the same time).

Any ideas about how to fix this? Thank you!

@tonistiigi
Copy link
Member

👋

@coryb Can you take a look? I've seen some of these logs myself but not in a loop as it is described here. I've also not been able to understand what exact case causes this. Even if there is some weird condition that stops from killing container cleanly, going to loop when the process is already gone should be avoided.

--oci-worker-no-process-sandbox

This might be related as it does change the parent-child structure for subprocess.

@dnephin
Copy link
Member Author

dnephin commented Dec 14, 2023

Looking at this for loop

for {
select {
case <-ctx.Done():
killCtx, timeout := context.WithTimeout(context.Background(), 7*time.Second)
if err := p.killer.Kill(killCtx); err != nil {
select {
case <-killCtx.Done():
timeout()
cancel()
return
default:
}
}
timeout()
select {
case <-time.After(50 * time.Millisecond):
case <-p.ended:
return
}
case <-p.ended:
return
}
}

We know that killer.Kill on line 619 is returning before the timeout with an error (the one we see in the logs). In the select on 620 that would mean it goes into the default case and does not cancel or return. Assuming p.ended on 631 is also not yet closed, that select on 629 would exit from the time.After(50ms) case and also not return, and the loop would continue. I think this is roughly the frequency of logging we are seeing, and likely what's happening.

p.ended I guess is closed by procHandle.Release, which is deferred from the caller . If I'm reading this right the caller is supposed to be cancelled by the cancel() on 623, but in this flow it seems like it would be skipped. I'm not sure if there are other possible ways this context would be cancelled.

I don't understand exactly the scenario that leads to this, but I believe this could explain the behaviour. Maybe before going into the select on 620 we could inspect the error and cancel and return if the error indicates the container is already killed?

I can also try to capture a goroutine stack dump if that would help.

@tonistiigi
Copy link
Member

we could inspect the error and cancel and return if the error indicates the container is already killed?

I think this makes sense. Looks like the actual error is coming from runc directly (via runc kill) so we do need either some string matching or some follow-up runc command to check that it doesn't see the container anymore.

We do need to make sure that the container is actually cleaned up and this isn't some issue where runc kill does not work well with the oci-worker-no-process-sandbox flag.

@dnephin
Copy link
Member Author

dnephin commented Jan 13, 2024

Thank you for confirming! This hasn't been a huge problem just yet, but I will be doing more work with our buildkit runners soon. I may come back to this and attempt a fix if it's getting in the way of the work I am doing.

@fullykubed
Copy link

Confirmed still occurring in v0.14.1-rootless.

We are seeing this behavior occur when the buildctl client is terminated in the middle of a RUN stanza in a build.

In addition to generating massive amounts of error messages, the buildkitd server can no longer terminate gracefully immediately. Instead it waits a few minutes before finally exiting with this message:

....
time="2024-06-24T19:02:15Z" level=error msg="failed to kill process in container id 8ila2y0gzo081gpwj1b6ble0q: buildkit-runc did not terminate successfully: exit status 1: container not running\n"
time="2024-06-24T19:02:15Z" level=error msg="failed to kill process in container id 8ila2y0gzo081gpwj1b6ble0q: buildkit-runc did not terminate successfully: exit status 1: container not running\n"
time="2024-06-24T19:02:16Z" level=error msg="/moby.buildkit.v1.Control/Solve returned error: rpc error: code = Canceled desc = context canceled"
buildkitd: got 1 SIGTERM/SIGINTs, forcing shutdown
github.com/moby/buildkit/util/appcontext.Context.func1.1
    /src/util/appcontext/appcontext.go:38
runtime.goexit
    /usr/local/go/src/runtime/asm_arm64.s:1197
[rootlesskit:child ] error: command [buildkitd --addr tcp://0.0.0.0:1234 --addr unix:///run/user/1000/buildkit/buildkitd.sock --oci-worker-no-process-sandbox] exited: exit status 1
[rootlesskit:parent] error: child exited: exit status 1
Stream closed EOF for buildkit/buildkit-arm64-0 (buildkitd)

@tonistiigi tonistiigi added this to the v0.15.0 milestone Jun 24, 2024
@fullykubed
Copy link

Interestingly, it does resolve itself eventually, sometimes in as short as 30 seconds but other times continues for many minutes.

It does not appear to have any functional impact. The buildkitd daemon is still able to accept new builds even when this is occurring.

@tonistiigi
Copy link
Member

Looked into it more but still quite confusing what case this is.

Pretty clear it comes from this loop/sleep https://github.com/moby/buildkit/blob/v0.14.1/executor/runcexecutor/executor.go#L618-L633 , but this is only possible if p.ended has not been closed. Otherwise, you could get one or two such logs but not get stuck there. p.ended is closed in defer https://github.com/moby/buildkit/blob/v0.14.1/executor/runcexecutor/executor_linux.go#L61 . So it looks like something is blocked in this function and this goroutine is not returning.

There is no stacktrace but possible candidates where the goroutine is stuck is errgroup.Wait in https://github.com/moby/buildkit/blob/v0.14.1/executor/runcexecutor/executor_linux.go#L65-L78 or the runc.Run(). The errgroup does not seem possible, for RUN handleSignals path is skipped and WaitForStart has own early return on cancellation.

That leaves runc.Run. Looking at library, I thought that maybe it hangs in https://github.com/containerd/go-runc/blob/main/runc.go#L341 channel send because the receiver side has already returned. But that doesn't seem possible because the channel is buffered https://github.com/moby/buildkit/blob/v0.14.1/executor/runcexecutor/executor_linux.go#L71 . I fixed another case #5106 when looking at this, that could have caused process cancellation to be skipped if it appeared before the process was started, but this should not affect this case in here. I also thought that maybe the stdout/stderr streams are blocked after cancellation and runc.Run can not return but does not look like this is possible.

Then last remaining part is actual runc bug where runc run process keeps running and doesn't return while runc kill anwers "no such container running". That is possible in a small window but should not remain in error state repeatedly.

So, although we could fix the log issue with some detection of specific error, it is not clear if the runc process has actually returned, and if it has not then fixing the log would just hide the error and also avoid sending further kill commands. It should also mean that the build request is still hanging while the repeated runc kill commands are being attempted.

@fullykubed
Copy link

While the specifics of the bug are over my head, I am happy to build and deploy a debug build if that might help capture more specific information such as a stack trace.

@thompson-shaun thompson-shaun modified the milestones: v0.16.0, v0.future Aug 29, 2024
@thompson-shaun
Copy link
Collaborator

Any additional insights or work on this @fullykubed?

@fullykubed
Copy link

@thompson-shaun I am afraid I am not well-versed enough in the internals to drive this forward independently. However, if anyone wants help catching more runtime information (@tonistiigi), I have a setup where I can reliably reproduce the issue.

@tonistiigi
Copy link
Member

tonistiigi commented Aug 29, 2024

@fullykubed Can you put together some repro steps for us? If not, then I guess one option is that we will send you a special version of buildkit with some extra logging that you could run. Or maybe there is a way you can give us access to the setup that reproduces this.

@fullykubed
Copy link

@tonistiigi Unfortunately, the setup I have is going to be somewhat cumbersome to deploy as BuildKit is running in an EKS cluster and builds are submitting over an SSH tunnel. That all said, I am having to give anyone working on this issue direct access to a cluster where this issue occurs via a kubeconfig file.

Alternatively, if you have an image / binary with the extra logging, I am happy to deploy it to the cluster, run the reproduction steps and post the logs here.

@thompson-shaun
Copy link
Collaborator

Any chance these nodes are using a containerd image store @fullykubed ?

@fullykubed
Copy link

@thompson-shaun They are.

@tonistiigi
Copy link
Member

@fullykubed I think this is caused by runc not returning properly after the container is already gone. I created a debug version with extra logs https://github.com/tonistiigi/buildkit/commits/proc-kill-debug/ . It is pushed to tonistiigi/buildkit:runc-kill-debug in Docker Hub. Make sure you enable debug logs as well.

In buildx config would look like docker buildx create --driver-opt image=tonistiigi/buildkit:runc-kill-debug --buildkitd-flags '--debug' . Once you hit the problem, please report back the daemon logs.

@fullykubed
Copy link

fullykubed commented Sep 13, 2024

@tonistiigi Just want to clarify something:

I believe this problem occurs specifically when of buildkit is started with rootlesskit with the --oci-worker-no-process-sandbox flag. However, rootlesskit is not included in the tonistiigi/buildkit:runc-kill-debug image.

We are running our buildkit instances as follows:

rootlesskit \
  buildkitd \
  --addr tcp://0.0.0.0:1234 \
  --addr unix:///run/user/1000/buildkit/buildkitd.sock \
  --oci-worker-no-process-sandbox \
  --oci-worker-gc-keepstorage 50000

Using these settings, I am able to replicate the problem reliably by terminating the build in the middle of executing a RUN stanza.


That said, I did execute the privileged version of your buildkitd with the --debug flag. I was not able to replicate the issue, but did get the following logs:

time="2024-09-13T13:10:57Z" level=warning msg="TLS is not enabled for tcp://0.0.0.0:1234. enabling mutual TLS authentication is highly recommended"
time="2024-09-13T13:10:57Z" level=info msg="auto snapshotter: using overlayfs"
time="2024-09-13T13:10:57Z" level=warning msg="using host network as the default"
time="2024-09-13T13:10:57Z" level=info msg="found worker \"jj1z7z0hdc13p8lplson6fghe\", labels=map[org.mobyproject.buildkit.worker.executor:oci org.mobyproject.buildkit.worker.hostname:buildkit-amd64-0 org.mobyproject.buildkit.worker.network:host org.mobyproject.buildkit.worker.oci.process-mode:sandbox org.mobyproject.buildkit.worker.selinux.enabled:false org.mobyproject.buildkit.worker.snapshotter:overlayfs], platforms=[linux/amd64 linux/amd64/v2 linux/amd64/v3 linux/amd64/v4 linux/386]"
time="2024-09-13T13:10:57Z" level=warning msg="skipping containerd worker, as \"/run/containerd/containerd.sock\" does not exist"
time="2024-09-13T13:10:57Z" level=info msg="found 1 workers, default=\"jj1z7z0hdc13p8lplson6fghe\""
time="2024-09-13T13:10:57Z" level=warning msg="currently, only the default worker can be used."
time="2024-09-13T13:10:57Z" level=info msg="running server on [::]:1234"
time="2024-09-13T13:10:57Z" level=info msg="running server on /run/buildkit/buildkitd.sock"
time="2024-09-13T13:12:18Z" level=debug msg="session started" spanID=b14bdfaf38cbd8c0 traceID=63dec637cc25194ba06b3077cfafd98b
time="2024-09-13T13:12:18Z" level=debug msg="resolve exporter image with map[name:891377197483.dkr.ecr.us-east-2.amazonaws.com/891377197483.dkr.ecr.us-east-2.amazonaws.com/bastion:test-amd64 push:true source-date-epoch:315532800]" spanID=e219f48b8502ce4c traceID=63dec637cc25194ba06b3077cfafd98b
time="2024-09-13T13:12:18Z" level=debug msg="new ref for local: sby9wiv04wqz05txzuwo6q4sx" span="[internal] load build definition from ./Dockerfile" spanID=2062be68a5e7685d traceID=63dec637cc25194ba06b3077cfafd98b
time="2024-09-13T13:12:18Z" level=debug msg="diffcopy took: 188.928225ms" span="[internal] load build definition from ./Dockerfile" spanID=4fc338e951ef3334 traceID=63dec637cc25194ba06b3077cfafd98b
time="2024-09-13T13:12:18Z" level=debug msg="saved sby9wiv04wqz05txzuwo6q4sx as dockerfile:dockerfile:" span="[internal] load build definition from ./Dockerfile" spanID=2062be68a5e7685d traceID=63dec637cc25194ba06b3077cfafd98b
time="2024-09-13T13:12:18Z" level=debug msg="checked for cached auth handler namespace" cached=false key="docker.io/library/ubuntu::pull" name=docker.io/library/ubuntu scope=pull
time="2024-09-13T13:12:18Z" level=debug msg=resolving span="resolving docker.io/library/ubuntu:latest" spanID=f1f20a9c1605a2d1 traceID=63dec637cc25194ba06b3077cfafd98b
time="2024-09-13T13:12:18Z" level=debug msg="do request" request.header.accept="application/vnd.docker.distribution.manifest.v2+json, application/vnd.docker.distribution.manifest.list.v2+json, application/vnd.oci.image.manifest.v1+json, application/vnd.oci.image.index.v1+json, */*" request.header.user-agent=buildkit/v0.0.0+unknown request.method=HEAD span="resolving docker.io/library/ubuntu:latest" spanID=f1f20a9c1605a2d1 traceID=63dec637cc25194ba06b3077cfafd98b
time="2024-09-13T13:12:18Z" level=debug msg="fetch response received" response.header.content-length=157 response.header.content-type=application/json response.header.date="Fri, 13 Sep 2024 13:12:18 GMT" response.header.docker-distribution-api-version=registry/2.0 response.header.docker-ratelimit-source=18.218.214.155 response.header.strict-transport-security="max-age=31536000" response.header.www-authenticate="Bearer realm=\"https://auth.docker.io/token\",service=\"registry.docker.io\",scope=\"repository:library/ubuntu:pull\"" response.status="401 Unauthorized" span="resolving docker.io/library/ubuntu:latest" spanID=f1f20a9c1605a2d1 traceID=63dec637cc25194ba06b3077cfafd98b
time="2024-09-13T13:12:18Z" level=debug msg=Unauthorized header="Bearer realm=\"https://auth.docker.io/token\",service=\"registry.docker.io\",scope=\"repository:library/ubuntu:pull\"" span="resolving docker.io/library/ubuntu:latest" spanID=f1f20a9c1605a2d1 traceID=63dec637cc25194ba06b3077cfafd98b
time="2024-09-13T13:12:18Z" level=debug msg="do request" request.header.accept="application/vnd.docker.distribution.manifest.v2+json, application/vnd.docker.distribution.manifest.list.v2+json, application/vnd.oci.image.manifest.v1+json, application/vnd.oci.image.index.v1+json, */*" request.header.user-agent=buildkit/v0.0.0+unknown request.method=HEAD span="resolving docker.io/library/ubuntu:latest" spanID=f1f20a9c1605a2d1 traceID=63dec637cc25194ba06b3077cfafd98b
time="2024-09-13T13:12:18Z" level=debug msg="fetch response received" response.header.content-length=1338 response.header.content-type=application/vnd.oci.image.index.v1+json response.header.date="Fri, 13 Sep 2024 13:12:18 GMT" response.header.docker-content-digest="sha256:8a37d68f4f73ebf3d4efafbcf66379bf3728902a8038616808f04e34a9ab63ee" response.header.docker-distribution-api-version=registry/2.0 response.header.docker-ratelimit-source=18.218.214.155 response.header.etag="\"sha256:8a37d68f4f73ebf3d4efafbcf66379bf3728902a8038616808f04e34a9ab63ee\"" response.header.ratelimit-limit="100;w=21600" response.header.ratelimit-remaining="91;w=21600" response.header.strict-transport-security="max-age=31536000" response.status="200 OK" span="resolving docker.io/library/ubuntu:latest" spanID=f1f20a9c1605a2d1 traceID=63dec637cc25194ba06b3077cfafd98b
time="2024-09-13T13:12:18Z" level=debug msg=resolved desc.digest="sha256:8a37d68f4f73ebf3d4efafbcf66379bf3728902a8038616808f04e34a9ab63ee" span="resolving docker.io/library/ubuntu:latest" spanID=f1f20a9c1605a2d1 traceID=63dec637cc25194ba06b3077cfafd98b
time="2024-09-13T13:12:18Z" level=debug msg=fetch span="resolving docker.io/library/ubuntu:latest" spanID=f1f20a9c1605a2d1 traceID=63dec637cc25194ba06b3077cfafd98b
time="2024-09-13T13:12:18Z" level=debug msg="do request" request.header.accept="application/vnd.oci.image.index.v1+json, */*" request.header.user-agent=buildkit/v0.0.0+unknown request.method=GET span="resolving docker.io/library/ubuntu:latest" spanID=f1f20a9c1605a2d1 traceID=63dec637cc25194ba06b3077cfafd98b
time="2024-09-13T13:12:18Z" level=debug msg="fetch response received" response.header.content-length=1338 response.header.content-type=application/vnd.oci.image.index.v1+json response.header.date="Fri, 13 Sep 2024 13:12:18 GMT" response.header.docker-content-digest="sha256:8a37d68f4f73ebf3d4efafbcf66379bf3728902a8038616808f04e34a9ab63ee" response.header.docker-distribution-api-version=registry/2.0 response.header.docker-ratelimit-source=18.218.214.155 response.header.etag="\"sha256:8a37d68f4f73ebf3d4efafbcf66379bf3728902a8038616808f04e34a9ab63ee\"" response.header.ratelimit-limit="100;w=21600" response.header.ratelimit-remaining="91;w=21600" response.header.strict-transport-security="max-age=31536000" response.status="200 OK" span="resolving docker.io/library/ubuntu:latest" spanID=f1f20a9c1605a2d1 traceID=63dec637cc25194ba06b3077cfafd98b
time="2024-09-13T13:12:18Z" level=debug msg=fetch span="resolving docker.io/library/ubuntu:latest" spanID=f1f20a9c1605a2d1 traceID=63dec637cc25194ba06b3077cfafd98b
time="2024-09-13T13:12:18Z" level=debug msg="do request" request.header.accept="application/vnd.oci.image.manifest.v1+json, */*" request.header.user-agent=buildkit/v0.0.0+unknown request.method=GET span="resolving docker.io/library/ubuntu:latest" spanID=f1f20a9c1605a2d1 traceID=63dec637cc25194ba06b3077cfafd98b
time="2024-09-13T13:12:18Z" level=debug msg="fetch response received" response.header.content-length=424 response.header.content-type=application/vnd.oci.image.manifest.v1+json response.header.date="Fri, 13 Sep 2024 13:12:18 GMT" response.header.docker-content-digest="sha256:d35dfc2fe3ef66bcc085ca00d3152b482e6cafb23cdda1864154caf3b19094ba" response.header.docker-distribution-api-version=registry/2.0 response.header.docker-ratelimit-source=18.218.214.155 response.header.etag="\"sha256:d35dfc2fe3ef66bcc085ca00d3152b482e6cafb23cdda1864154caf3b19094ba\"" response.header.ratelimit-limit="100;w=21600" response.header.ratelimit-remaining="91;w=21600" response.header.strict-transport-security="max-age=31536000" response.status="200 OK" span="resolving docker.io/library/ubuntu:latest" spanID=f1f20a9c1605a2d1 traceID=63dec637cc25194ba06b3077cfafd98b
time="2024-09-13T13:12:18Z" level=debug msg=fetch span="resolving docker.io/library/ubuntu:latest" spanID=f1f20a9c1605a2d1 traceID=63dec637cc25194ba06b3077cfafd98b
time="2024-09-13T13:12:18Z" level=debug msg="do request" request.header.accept="application/vnd.oci.image.config.v1+json, */*" request.header.user-agent=buildkit/v0.0.0+unknown request.method=GET span="resolving docker.io/library/ubuntu:latest" spanID=f1f20a9c1605a2d1 traceID=63dec637cc25194ba06b3077cfafd98b
time="2024-09-13T13:12:19Z" level=debug msg="fetch response received" response.header.accept-ranges=bytes response.header.age=2376105 response.header.cache-control="public, max-age=14400" response.header.cf-cache-status=HIT response.header.cf-ray=8c28635f2f6160a2-ORD response.header.connection=keep-alive response.header.content-length=2297 response.header.content-type=application/octet-stream response.header.date="Fri, 13 Sep 2024 13:12:19 GMT" response.header.etag="\"6816f90355932e7c37011fcb928f42e4\"" response.header.expires="Fri, 13 Sep 2024 17:12:19 GMT" response.header.last-modified="Thu, 01 Aug 2024 15:42:13 GMT" response.header.server=cloudflare response.header.vary=Accept-Encoding response.header.x-amz-id-2="CyCA968dl9flKEKFhfRvjXSObaIn7giCGSWk9P8e6JoJQLJqXBrT90wcDTwXBEFowg+cnatNFzw=" response.header.x-amz-request-id=SDRWHHRM29A58W9T response.header.x-amz-server-side-encryption=AES256 response.header.x-amz-version-id=VvqeYTa4p0U7_iTriRNw8HPg6qGp9Cm3 response.status="200 OK" span="resolving docker.io/library/ubuntu:latest" spanID=f1f20a9c1605a2d1 traceID=63dec637cc25194ba06b3077cfafd98b
time="2024-09-13T13:12:19Z" level=debug msg="new ref for local: kz9ts5enkk4o5fxfvdttxwdl8" span="[internal] load .dockerignore" spanID=1f23a1266a26f856 traceID=63dec637cc25194ba06b3077cfafd98b
time="2024-09-13T13:12:19Z" level=debug msg="diffcopy took: 163.731047ms" span="[internal] load .dockerignore" spanID=961b6b22ea246206 traceID=63dec637cc25194ba06b3077cfafd98b
time="2024-09-13T13:12:19Z" level=debug msg="saved kz9ts5enkk4o5fxfvdttxwdl8 as context:context-.dockerignore:" span="[internal] load .dockerignore" spanID=1f23a1266a26f856 traceID=63dec637cc25194ba06b3077cfafd98b
time="2024-09-13T13:12:19Z" level=debug msg="checked for cached auth handler namespace" cached=true key="docker.io/library/ubuntu::pull" name=docker.io/library/ubuntu scope=pull
time="2024-09-13T13:12:19Z" level=debug msg=fetch digest="sha256:8a37d68f4f73ebf3d4efafbcf66379bf3728902a8038616808f04e34a9ab63ee" mediatype=application/vnd.oci.image.index.v1+json size=1338 spanID=e219f48b8502ce4c traceID=63dec637cc25194ba06b3077cfafd98b
time="2024-09-13T13:12:19Z" level=debug msg=fetch digest="sha256:d35dfc2fe3ef66bcc085ca00d3152b482e6cafb23cdda1864154caf3b19094ba" mediatype=application/vnd.oci.image.manifest.v1+json size=424 spanID=e219f48b8502ce4c traceID=63dec637cc25194ba06b3077cfafd98b
time="2024-09-13T13:12:19Z" level=debug msg=fetch digest="sha256:edbfe74c41f8a3501ce542e137cf28ea04dd03e6df8c9d66519b6ad761c2598a" mediatype=application/vnd.oci.image.config.v1+json size=2297 spanID=e219f48b8502ce4c traceID=63dec637cc25194ba06b3077cfafd98b
time="2024-09-13T13:12:19Z" level=debug msg="checked for cached auth handler namespace" cached=true key="docker.io/library/ubuntu::pull" name=docker.io/library/ubuntu scope=pull
time="2024-09-13T13:12:19Z" level=debug msg="checked for cached auth handler namespace" cached=true key="docker.io/library/ubuntu::pull" name=docker.io/library/ubuntu scope=pull
time="2024-09-13T13:12:19Z" level=debug msg=fetch digest="sha256:31e907dcc94a592a57796786399eb004dcbba714389fa615f5efa05a91316356" mediatype=application/vnd.oci.image.layer.v1.tar+gzip size=29706298 spanID=e219f48b8502ce4c traceID=63dec637cc25194ba06b3077cfafd98b
time="2024-09-13T13:12:19Z" level=debug msg="do request" digest="sha256:31e907dcc94a592a57796786399eb004dcbba714389fa615f5efa05a91316356" mediatype=application/vnd.oci.image.layer.v1.tar+gzip request.header.accept="application/vnd.oci.image.layer.v1.tar+gzip, */*" request.header.user-agent=containerd/1.7.21+unknown request.method=GET size=29706298 spanID=e219f48b8502ce4c traceID=63dec637cc25194ba06b3077cfafd98b url="https://registry-1.docker.io/v2/library/ubuntu/blobs/sha256:31e907dcc94a592a57796786399eb004dcbba714389fa615f5efa05a91316356"
time="2024-09-13T13:12:19Z" level=debug msg="fetch response received" digest="sha256:31e907dcc94a592a57796786399eb004dcbba714389fa615f5efa05a91316356" mediatype=application/vnd.oci.image.layer.v1.tar+gzip response.header.accept-ranges=bytes response.header.age=1255003 response.header.cache-control="public, max-age=14400" response.header.cf-cache-status=HIT response.header.cf-ray=8c28636179f960a2-ORD response.header.connection=keep-alive response.header.content-length=29706298 response.header.content-type=application/octet-stream response.header.date="Fri, 13 Sep 2024 13:12:19 GMT" response.header.etag="\"3203ab8dce387ad2c8e28418d6b06b72\"" response.header.expires="Fri, 13 Sep 2024 17:12:19 GMT" response.header.last-modified="Thu, 01 Aug 2024 15:42:11 GMT" response.header.server=cloudflare response.header.vary=Accept-Encoding response.header.x-amz-id-2=yD7+QiTLVu88NGtqU+4FkUs205LxNjOoBcIvRRsuttxs54tVQsHrhFkd22U2OOxErVSzr4XwWPjyEnpvUcfiXFuziqpECi9L response.header.x-amz-request-id=PJZCYC80RPHP331S response.header.x-amz-server-side-encryption=AES256 response.header.x-amz-version-id=B353OJcwFGgvWU6lrbjXkTsaIBod.fTn response.status="200 OK" size=29706298 spanID=e219f48b8502ce4c traceID=63dec637cc25194ba06b3077cfafd98b url="https://registry-1.docker.io/v2/library/ubuntu/blobs/sha256:31e907dcc94a592a57796786399eb004dcbba714389fa615f5efa05a91316356"
time="2024-09-13T13:12:19Z" level=debug msg="using pigz for decompression"
time="2024-09-13T13:12:20Z" level=debug msg="diff applied" d=542.22265ms digest="sha256:31e907dcc94a592a57796786399eb004dcbba714389fa615f5efa05a91316356" media=application/vnd.oci.image.layer.v1.tar+gzip size=29706298 spanID=e219f48b8502ce4c traceID=63dec637cc25194ba06b3077cfafd98b
time="2024-09-13T13:12:20Z" level=debug msg="> creating n5enwgexkbkkpi6pkx05c1bcj [/bin/sh -c sleep 60]" span="[2/2] RUN sleep 60" spanID=1e4973b167872fa3 traceID=63dec637cc25194ba06b3077cfafd98b
time="2024-09-13T13:12:20Z" level=debug msg="starting runc process monitor" span="[2/2] RUN sleep 60" spanID=1e4973b167872fa3 traceID=63dec637cc25194ba06b3077cfafd98b
time="2024-09-13T13:12:20Z" level=debug msg="runc run n5enwgexkbkkpi6pkx05c1bcj /var/lib/buildkit/runc-overlayfs/executor/n5enwgexkbkkpi6pkx05c1bcj" span="[2/2] RUN sleep 60" spanID=1e4973b167872fa3 traceID=63dec637cc25194ba06b3077cfafd98b
time="2024-09-13T13:12:20Z" level=debug msg="handleSignals done: <nil>" span="[2/2] RUN sleep 60" spanID=1e4973b167872fa3 traceID=63dec637cc25194ba06b3077cfafd98b
time="2024-09-13T13:12:20Z" level=debug msg="waitForStart done: <nil>" span="[2/2] RUN sleep 60" spanID=1e4973b167872fa3 traceID=63dec637cc25194ba06b3077cfafd98b
time="2024-09-13T13:12:26Z" level=debug msg="sending sigkill to process in container n5enwgexkbkkpi6pkx05c1bcj"
time="2024-09-13T13:12:26Z" level=debug msg="sending sigkill via runc process n5enwgexkbkkpi6pkx05c1bcj"
time="2024-09-13T13:12:26Z" level=debug msg="session finished: <nil>" spanID=b14bdfaf38cbd8c0 traceID=63dec637cc25194ba06b3077cfafd98b
time="2024-09-13T13:12:26Z" level=error msg="completed to kill process in container id n5enwgexkbkkpi6pkx05c1bcj: <nil>"
time="2024-09-13T13:12:26Z" level=debug msg="killed process" span="[2/2] RUN sleep 60" spanID=1e4973b167872fa3 traceID=63dec637cc25194ba06b3077cfafd98b
time="2024-09-13T13:12:26Z" level=debug msg="runc run n5enwgexkbkkpi6pkx05c1bcj /var/lib/buildkit/runc-overlayfs/executor/n5enwgexkbkkpi6pkx05c1bcj took 6.128745182s, err: buildkit-runc did not terminate successfully: exit status 137" span="[2/2] RUN sleep 60" spanID=1e4973b167872fa3 traceID=63dec637cc25194ba06b3077cfafd98b
time="2024-09-13T13:12:26Z" level=debug msg="call done: buildkit-runc did not terminate successfully: exit status 137" span="[2/2] RUN sleep 60" spanID=1e4973b167872fa3 traceID=63dec637cc25194ba06b3077cfafd98b
time="2024-09-13T13:12:26Z" level=debug msg="releasing runc process monitor"
time="2024-09-13T13:12:26Z" level=debug msg="runc process monitor ended (2)" span="[2/2] RUN sleep 60" spanID=1e4973b167872fa3 traceID=63dec637cc25194ba06b3077cfafd98b
time="2024-09-13T13:12:26Z" level=error msg="/moby.buildkit.v1.Control/Solve returned error: rpc error: code = Canceled desc = context canceled" spanID=e219f48b8502ce4c traceID=63dec637cc25194ba06b3077cfafd98b
Canceled: context canceled
1 290af9a buildkitd --addr tcp://0.0.0.0:1234 --addr unix:///run/buildkit/buildkitd.sock --debug
main.unaryInterceptor
	/src/cmd/buildkitd/main.go:713
google.golang.org/grpc.NewServer.chainUnaryServerInterceptors.chainUnaryInterceptors.func1
	/src/vendor/google.golang.org/grpc/server.go:1194
github.com/moby/buildkit/api/services/control._Control_Solve_Handler
	/src/api/services/control/control.pb.go:2493
google.golang.org/grpc.(*Server).processUnaryRPC
	/src/vendor/google.golang.org/grpc/server.go:1383
google.golang.org/grpc.(*Server).handleStream
	/src/vendor/google.golang.org/grpc/server.go:1794
google.golang.org/grpc.(*Server).serveStreams.func2.1
	/src/vendor/google.golang.org/grpc/server.go:1027
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1695

time="2024-09-13T13:12:27Z" level=debug msg="remove snapshot" key=mf73lxktlj8rb5izchj6ctejl snapshotter=overlayfs
time="2024-09-13T13:12:27Z" level=debug msg="schedule snapshotter cleanup" snapshotter=overlayfs
time="2024-09-13T13:12:27Z" level=debug msg="removed snapshot" key=buildkit/5/mf73lxktlj8rb5izchj6ctejl snapshotter=overlayfs
time="2024-09-13T13:12:57Z" level=debug msg="session started" spanID=42917d6c8450b569 traceID=cbb3c2fc5e2e5a1d9755795642a7387c
time="2024-09-13T13:12:57Z" level=debug msg="resolve exporter image with map[name:891377197483.dkr.ecr.us-east-2.amazonaws.com/891377197483.dkr.ecr.us-east-2.amazonaws.com/bastion:test-amd64 push:true source-date-epoch:315532800]" spanID=eba934ca8a5cdb84 traceID=cbb3c2fc5e2e5a1d9755795642a7387c
time="2024-09-13T13:12:57Z" level=debug msg="reusing ref for local: sby9wiv04wqz05txzuwo6q4sx" span="[internal] load build definition from ./Dockerfile" spanID=2b948af4ef9ae248 traceID=cbb3c2fc5e2e5a1d9755795642a7387c
time="2024-09-13T13:12:57Z" level=debug msg="diffcopy took: 224.957113ms" span="[internal] load build definition from ./Dockerfile" spanID=26fa7d91985ff8a2 traceID=cbb3c2fc5e2e5a1d9755795642a7387c
time="2024-09-13T13:12:57Z" level=debug msg="checked for cached auth handler namespace" cached=true key="docker.io/library/ubuntu::pull" name=docker.io/library/ubuntu scope=pull
time="2024-09-13T13:12:57Z" level=debug msg=resolving span="resolving docker.io/library/ubuntu:latest" spanID=73c526f172fe364e traceID=cbb3c2fc5e2e5a1d9755795642a7387c
time="2024-09-13T13:12:57Z" level=debug msg="do request" request.header.accept="application/vnd.docker.distribution.manifest.v2+json, application/vnd.docker.distribution.manifest.list.v2+json, application/vnd.oci.image.manifest.v1+json, application/vnd.oci.image.index.v1+json, */*" request.header.user-agent=buildkit/v0.0.0+unknown request.method=HEAD span="resolving docker.io/library/ubuntu:latest" spanID=73c526f172fe364e traceID=cbb3c2fc5e2e5a1d9755795642a7387c
time="2024-09-13T13:12:57Z" level=debug msg="fetch response received" response.header.content-length=1338 response.header.content-type=application/vnd.oci.image.index.v1+json response.header.date="Fri, 13 Sep 2024 13:12:57 GMT" response.header.docker-content-digest="sha256:8a37d68f4f73ebf3d4efafbcf66379bf3728902a8038616808f04e34a9ab63ee" response.header.docker-distribution-api-version=registry/2.0 response.header.docker-ratelimit-source=18.218.214.155 response.header.etag="\"sha256:8a37d68f4f73ebf3d4efafbcf66379bf3728902a8038616808f04e34a9ab63ee\"" response.header.ratelimit-limit="100;w=21600" response.header.ratelimit-remaining="90;w=21600" response.header.strict-transport-security="max-age=31536000" response.status="200 OK" span="resolving docker.io/library/ubuntu:latest" spanID=73c526f172fe364e traceID=cbb3c2fc5e2e5a1d9755795642a7387c
time="2024-09-13T13:12:57Z" level=debug msg=resolved desc.digest="sha256:8a37d68f4f73ebf3d4efafbcf66379bf3728902a8038616808f04e34a9ab63ee" span="resolving docker.io/library/ubuntu:latest" spanID=73c526f172fe364e traceID=cbb3c2fc5e2e5a1d9755795642a7387c
time="2024-09-13T13:12:57Z" level=debug msg=fetch span="resolving docker.io/library/ubuntu:latest" spanID=73c526f172fe364e traceID=cbb3c2fc5e2e5a1d9755795642a7387c
time="2024-09-13T13:12:57Z" level=debug msg=fetch span="resolving docker.io/library/ubuntu:latest" spanID=73c526f172fe364e traceID=cbb3c2fc5e2e5a1d9755795642a7387c
time="2024-09-13T13:12:57Z" level=debug msg=fetch span="resolving docker.io/library/ubuntu:latest" spanID=73c526f172fe364e traceID=cbb3c2fc5e2e5a1d9755795642a7387c
time="2024-09-13T13:12:57Z" level=debug msg="reusing ref for local: kz9ts5enkk4o5fxfvdttxwdl8" span="[internal] load .dockerignore" spanID=2d508480f7a08036 traceID=cbb3c2fc5e2e5a1d9755795642a7387c
time="2024-09-13T13:12:57Z" level=debug msg="diffcopy took: 130.91932ms" span="[internal] load .dockerignore" spanID=fe90dbb59f88a906 traceID=cbb3c2fc5e2e5a1d9755795642a7387c
time="2024-09-13T13:12:57Z" level=debug msg="checked for cached auth handler namespace" cached=true key="docker.io/library/ubuntu::pull" name=docker.io/library/ubuntu scope=pull
time="2024-09-13T13:12:57Z" level=debug msg=fetch digest="sha256:8a37d68f4f73ebf3d4efafbcf66379bf3728902a8038616808f04e34a9ab63ee" mediatype=application/vnd.oci.image.index.v1+json size=1338 spanID=eba934ca8a5cdb84 traceID=cbb3c2fc5e2e5a1d9755795642a7387c
time="2024-09-13T13:12:57Z" level=debug msg=fetch digest="sha256:d35dfc2fe3ef66bcc085ca00d3152b482e6cafb23cdda1864154caf3b19094ba" mediatype=application/vnd.oci.image.manifest.v1+json size=424 spanID=eba934ca8a5cdb84 traceID=cbb3c2fc5e2e5a1d9755795642a7387c
time="2024-09-13T13:12:57Z" level=debug msg=fetch digest="sha256:edbfe74c41f8a3501ce542e137cf28ea04dd03e6df8c9d66519b6ad761c2598a" mediatype=application/vnd.oci.image.config.v1+json size=2297 spanID=eba934ca8a5cdb84 traceID=cbb3c2fc5e2e5a1d9755795642a7387c
time="2024-09-13T13:12:57Z" level=debug msg="load cache for [1/2] FROM docker.io/library/ubuntu:latest@sha256:8a37d68f4f73ebf3d4efafbcf66379bf3728902a8038616808f04e34a9ab63ee with jj1z7z0hdc13p8lplson6fghe::qrnfdkn16u528xf9k12jomhqh"
time="2024-09-13T13:12:58Z" level=debug msg="> creating oys6n2rrp1dwgclvv5b9pf5wy [/bin/sh -c sleep 75]" span="[2/2] RUN sleep 75" spanID=deba44c1fff943a2 traceID=cbb3c2fc5e2e5a1d9755795642a7387c
time="2024-09-13T13:12:58Z" level=debug msg="starting runc process monitor" span="[2/2] RUN sleep 75" spanID=deba44c1fff943a2 traceID=cbb3c2fc5e2e5a1d9755795642a7387c
time="2024-09-13T13:12:58Z" level=debug msg="runc run oys6n2rrp1dwgclvv5b9pf5wy /var/lib/buildkit/runc-overlayfs/executor/oys6n2rrp1dwgclvv5b9pf5wy" span="[2/2] RUN sleep 75" spanID=deba44c1fff943a2 traceID=cbb3c2fc5e2e5a1d9755795642a7387c
time="2024-09-13T13:12:58Z" level=debug msg="handleSignals done: <nil>" span="[2/2] RUN sleep 75" spanID=deba44c1fff943a2 traceID=cbb3c2fc5e2e5a1d9755795642a7387c
time="2024-09-13T13:12:58Z" level=debug msg="waitForStart done: <nil>" span="[2/2] RUN sleep 75" spanID=deba44c1fff943a2 traceID=cbb3c2fc5e2e5a1d9755795642a7387c
time="2024-09-13T13:13:14Z" level=debug msg="sending sigkill to process in container oys6n2rrp1dwgclvv5b9pf5wy"
time="2024-09-13T13:13:14Z" level=debug msg="sending sigkill via runc process oys6n2rrp1dwgclvv5b9pf5wy"
time="2024-09-13T13:13:14Z" level=debug msg="session finished: <nil>" spanID=42917d6c8450b569 traceID=cbb3c2fc5e2e5a1d9755795642a7387c
time="2024-09-13T13:13:14Z" level=error msg="completed to kill process in container id oys6n2rrp1dwgclvv5b9pf5wy: <nil>"
time="2024-09-13T13:13:14Z" level=debug msg="killed process" span="[2/2] RUN sleep 75" spanID=deba44c1fff943a2 traceID=cbb3c2fc5e2e5a1d9755795642a7387c
time="2024-09-13T13:13:14Z" level=debug msg="runc run oys6n2rrp1dwgclvv5b9pf5wy /var/lib/buildkit/runc-overlayfs/executor/oys6n2rrp1dwgclvv5b9pf5wy took 16.805789068s, err: buildkit-runc did not terminate successfully: exit status 137" span="[2/2] RUN sleep 75" spanID=deba44c1fff943a2 traceID=cbb3c2fc5e2e5a1d9755795642a7387c
time="2024-09-13T13:13:14Z" level=debug msg="call done: buildkit-runc did not terminate successfully: exit status 137" span="[2/2] RUN sleep 75" spanID=deba44c1fff943a2 traceID=cbb3c2fc5e2e5a1d9755795642a7387c
time="2024-09-13T13:13:14Z" level=debug msg="releasing runc process monitor"
time="2024-09-13T13:13:14Z" level=debug msg="runc process monitor ended (2)" span="[2/2] RUN sleep 75" spanID=deba44c1fff943a2 traceID=cbb3c2fc5e2e5a1d9755795642a7387c
time="2024-09-13T13:13:14Z" level=error msg="/moby.buildkit.v1.Control/Solve returned error: rpc error: code = Canceled desc = context canceled" spanID=eba934ca8a5cdb84 traceID=cbb3c2fc5e2e5a1d9755795642a7387c
Canceled: context canceled
1 290af9a buildkitd --addr tcp://0.0.0.0:1234 --addr unix:///run/buildkit/buildkitd.sock --debug
main.unaryInterceptor
	/src/cmd/buildkitd/main.go:713
google.golang.org/grpc.NewServer.chainUnaryServerInterceptors.chainUnaryInterceptors.func1
	/src/vendor/google.golang.org/grpc/server.go:1194
github.com/moby/buildkit/api/services/control._Control_Solve_Handler
	/src/api/services/control/control.pb.go:2493
google.golang.org/grpc.(*Server).processUnaryRPC
	/src/vendor/google.golang.org/grpc/server.go:1383
google.golang.org/grpc.(*Server).handleStream
	/src/vendor/google.golang.org/grpc/server.go:1794
google.golang.org/grpc.(*Server).serveStreams.func2.1
	/src/vendor/google.golang.org/grpc/server.go:1027
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1695

time="2024-09-13T13:13:27Z" level=debug msg="remove snapshot" key=x5psvprsidvd8n3vwqkiiab73 snapshotter=overlayfs
time="2024-09-13T13:13:27Z" level=debug msg="schedule snapshotter cleanup" snapshotter=overlayfs
time="2024-09-13T13:13:27Z" level=debug msg="removed snapshot" key=buildkit/6/x5psvprsidvd8n3vwqkiiab73 snapshotter=overlayfs

@tonistiigi
Copy link
Member

@fullykubed Pushed tonistiigi/buildkit:runc-kill-debug-rootless for rootless version.

When I added sleep for runc shutdown, the issue reproduced for me with or without rootless. But it is possible that the rootless config is what is causing runc to not shut down in a timely manner.

@thompson-shaun thompson-shaun added the area/rootless rootless mode label Sep 19, 2024
@schdief
Copy link

schdief commented Nov 5, 2024

@fullykubed Pushed tonistiigi/buildkit:runc-kill-debug-rootless for rootless version.

When I added sleep for runc shutdown, the issue reproduced for me with or without rootless. But it is possible that the rootless config is what is causing runc to not shut down in a timely manner.

@tonistiigi thanks a lot for providing this fix, today we suddenly had this issue with all our pipelines failing - are there any plans to merge this fix and release it?

@zachspar
Copy link

zachspar commented Nov 20, 2024

Any update on a fix for this? For me, it seems like this happens if there is a build error during emulation with QEMU.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants