Improved debugging support #1472

tonistiigi · 2020-05-02T00:50:20Z

addresses #1053
addresses #1470

An issue with the current build environment is that we often assume everyone can write a perfect Dockerfile from scratch without any mistakes. In the real world, there is a lot of trial and error in writing a complex Dockerfile. Users get errors, need to understand what is causing them, and react accordingly.

In the legacy builder, one way of dealing with this situation was to use --rm=false, or to look up the image ID of the last image layer from the build output and start a docker run session with it to understand what was wrong. BuildKit does not create intermediate images, nor does it make the containers it runs visible in docker run (both for very good reasons). Therefore this is even more complicated now and usually requires the user to set --target to do a partial build and then debug the output of it.

To improve this, we shouldn't try to bring back --rm=false, which makes all builds significantly slower and makes it impossible to manage storage for the build cache. Instead, we could provide a better solution with a new --debugger flag.

Using --debugger on a build, should that build error, will take the user into a debugger shell similar to interactive docker run experience. There the user can see the error and use control commands to debug the actual cause.

If the error happened on a RUN command (execop in LLB), the user can use the shell to rerun the command and keep tweaking it. This happens in an environment identical to the one where the execop runs, which means access to secrets, ssh, cache mounts, etc. They can also inspect the environment variables and files in the system that might be causing the issue. Using control commands, a user can switch between the broken state that was left behind by the failed command and the initial base state for that command. So in the case where they try many possible fixes but end up in a bad state, they can just restore the initial state and start again.

If the error happened on a copy (or other file operation like rm), they can run ls and similar tools to find out why the file path is not correct and not working.

For implementation, this depends on #749 for support to run processes on build mounts directly without going through the solver. We would start by modifying the Executor and ExecOp so that, instead of releasing the mounts after an error, they return them together with the error. I believe typed errors support (#1454) can be reused for this. They should be returned up to the client Solve method, which can then decide to call llb.Exec with these mounts. If mounts are left unhandled, they are released with the gateway API release.

Once the debugging has completed and the user has made changes to the source files, it is easy to trigger a restart of the build with exactly the same settings. This is also useful if you think you might be hitting a temporary error. If the retry didn't fix it, the user is brought back to the debugger.

It might make sense to introduce a concept of "debugger image" that is used as a basis of the debugging environment. This would allow avoiding hardcoded logic in an opinionated area.

Later this could be extended with the step-based debugger, and source mapping support could be used to make source code changes directly in the editor or tracking dependencies in the build graph.

@hinshun


hinshun · 2020-05-02T01:23:29Z

Regarding the "debugger image", my colleague @slushie did some interesting work on sharing a mount namespace (partial containers) with an image that has debugging tools: https://github.com/slushie/cdbg

In that repository, there's a prototype of gdb in the debugging image, attaching to the process of a running container.

This may be useful to debug scratch images or minimal images that may not have the basic tools like a shell binary.

fuweid · 2020-05-02T07:30:18Z

/cc

tonistiigi · 2020-10-07T23:56:12Z

@coryb Now that Exec support has landed, how big a job do you estimate it to be to return the typed errors from execop/fileop that would allow running exec from the error position and from the start of the op? Wondering if we should target that for v0.8 or not. We could potentially continue working on the client-side UX after v0.8 is out. Already added #1714 to v0.8, which I think is a requirement.

coryb · 2020-10-08T00:50:31Z

I am working on #1714 now, I am guessing a week+ before I have something viable for that.

I have not really looked into the change required for this yet. I think @hinshun has some ideas and is generally more familiar with this than I am. I will sync up with him and maybe twist his arm to help out 😄 I think we can try to break down what is remaining for this and try to come up with some estimates.

ag-TJNII · 2020-10-23T03:02:52Z

Using --debugger on a build, should that build error, will take the user into a debugger shell similar to interactive docker run experience. There the user can see the error and use control commands to debug the actual cause.

Interactive shells being the only option is going to leave much to be desired when building in CI pipelines. I often use Docker in CI pipelines where the build command has no terminal to drop to or is a direct API call; having the only option be "run interactive" is not in line with current automated build best practices. Please consider an option to allow sideband inspection of buildkit layers, similar to how the legacy docker build works. Thanks.

lyager · 2021-03-18T12:49:58Z

I've just upgraded Docker for Mac, which uses BUILDKIT as its default engine. Not feeling very comfortable with the suggested nsenter solution since the project is deprecated (or at least marked 'read-only'). Just wanted to give a +1 for getting this fixed. --debugger sounds like a great solution, maybe even letting it switch directly into interactive shell when a build step fails.

lyager · 2021-03-18T13:17:41Z

Just wanted to follow up, changing the backend while building works for me: DOCKER_BUILDKIT=0 docker build . - but I must admit the speed of using buildkit is nice!

JoelTrain · 2021-03-23T15:45:47Z

I agree.
Having the image of the layer immediately prior to the issue makes it incredibly handy to run an interactive container immediately prior to the problem to poke around.

I guess for now I will run DOCKER_BUILDKIT=0 docker build . as a workaround when debugging new Dockerfiles, so that I can get the image IDs in the output again:

Step 2/12 : WORKDIR /usr/src/app
---> Running in 14307a565858
Removing intermediate container 14307a565858
---> 472b33608107
Step 3/12 : COPY ./package.json .
---> 40293e6966f5
Step 4/12 : COPY ./package-lock.json .
---> e91be6e9c9c6
Step 5/12 : RUN npm install
---> Running in dc762b24b192

$ docker run -it --rm e91be6e9c9c6 sh
/usr/src/app #

gtmtech · 2021-03-23T18:10:39Z

Is there any solution in this space yet (that doesn't involve nsenter or regressing to DOCKER_BUILDKIT=0)? I can't quite believe that it's coming up on 2 years since #1053 was raised and nobody has been able to debug docker buildkit builds since. It sounds like as common a use case as you could get.

Can't find any example of active work to resolve this issue, might step in and help out if there's nothing in the pipeline

tonistiigi · 2021-03-23T19:57:49Z

I don't know what you mean by nsenter solution but that is not recommended. What you can do is create a named target to the position of the dockerfile you want to debug, build that target with --target and run it with docker run.

matt2000 · 2021-04-06T19:50:54Z

Just chiming in with a user perspective, after being put in a new environment where BUILDKIT appears to be the default, this is a decidedly worse experience than the past. Clearly the layers are being cached. I'd guess the simplest solution with a "backward compatible user experience" might be to just automatically export the last cached layer to the image store, and display its hash, whenever there is an error in docker build. Named targets for debugging feel like an awkward misuse of the feature, since the old way was "automatic."

strelga · 2021-04-14T14:11:10Z

@tonistiigi
Do you plan to take up this issue in the near future?
Does it have blockers now?

itcarroll · 2021-04-23T18:33:40Z

The --target option is not recognized by docker-compose build (version 1.28.5), so I'm sadly resorting to DOCKER_BUILDKIT=0.

KevOrr · 2021-04-23T18:40:29Z

The --target option is not recognized by docker-compose build (version 1.28.5), so I'm sadly resorting to DOCKER_BUILDKIT=0.

IIRC, when using Compose, target is a field in the build: subsection of a service definition.

edit: https://github.com/compose-spec/compose-spec/blob/master/build.md#target

willemm · 2021-04-29T14:27:44Z

The proposed option mentioned in #1053 , where you can specify that it should create the image even on failure, would be very helpful. It would even be helpful if you could just enhance the --output option with a flag that it also outputs on failure.

emmahyde · 2021-05-29T04:47:35Z

This would be fantastic. It's the only thing holding me back from moving over to buildkit full time!

NicolasDorier · 2021-06-10T04:06:13Z

Just want to say that it is VERY painful to not be able to interactively debug intermediate images...
It really turns a 5-minute debugging task into a 2-hour process...

cburgard · 2021-07-01T06:29:42Z

After switching to buildkit recently because of the secret-mount option, I've just spent about half an hour trying to figure out what magical command I need to show the images in the buildkit cache, the apparent answer being "it's not possible". I find it hard to believe that this issue still persists...

tonistiigi · 2021-07-01T06:34:50Z

You can add a multi-stage split anywhere in your Dockerfile and use --target to build the portion you want to turn into a debug image.
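A minimal sketch of that workaround, assuming a hypothetical Node.js project. The paths, the stage names before-install and final, the debug-img tag, and the base image are all illustrative, not taken from this thread. The script writes a Dockerfile that ends one stage just before the step under investigation, then builds only up to that boundary and opens a shell there.

```shell
#!/bin/sh
set -eu

# All names here (paths, stage names, image tag, base image) are illustrative assumptions.
workdir="${TMPDIR:-/tmp}/buildkit-debug-example"
mkdir -p "$workdir"

# Split the Dockerfile just before the step under investigation by ending
# one stage there and continuing from it in the next stage.
cat > "$workdir/Dockerfile.debug" <<'EOF'
FROM node:20-alpine AS before-install
WORKDIR /usr/src/app
COPY package.json package-lock.json ./

FROM before-install AS final
RUN npm install
EOF

# Set RUN_DOCKER=1 to actually build and enter the stage; the default only
# prints the commands so the sketch can be read without a Docker daemon.
if [ "${RUN_DOCKER:-0}" = "1" ]; then
  docker build -f "$workdir/Dockerfile.debug" --target before-install -t debug-img .
  docker run --rm -it debug-img sh
else
  echo "docker build -f $workdir/Dockerfile.debug --target before-install -t debug-img ."
  echo "docker run --rm -it debug-img sh"
fi
```

Building with --target before-install stops before RUN npm install, so the resulting image is roughly what the legacy builder's last intermediate image ID would have pointed at.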

hraban · 2021-07-21T10:35:50Z

A temporary workaround is docker-compose, which (as of writing, v1.29.2) still doesn't use BuildKit when you do docker-compose run. You can create a simple docker-compose file with context: ., then use docker-compose run --rm yourservice, which will try to build it and print hash IDs along the way. But if you use docker-compose build, it already uses BuildKit, so this workaround is most likely on its way out. As is docker-compose itself, IIRC?

chrisawad · 2022-03-22T21:30:56Z

This can give you a look at the point after a successfully completed stage:

export DOCKER_BUILDKIT=1
docker build --target <stage> -t test .
docker run --rm -it test bash

But unlike when DOCKER_BUILDKIT=0, I don't think there's a way to see the hash for each layer created in the image so you can't just jump in right before the error and test at the moment of failure.

Highly unfortunate, and a big deal if you ask me!

kingbuzzman · 2022-04-18T20:03:54Z

$ docker --version
Docker version 20.10.14

DOCKER_BUILDKIT=0 docker build .. doesn't seem to work anymore. I no longer get the hashes

ktock · 2022-05-10T13:32:54Z

FYI:

I recently implemented an experimental interactive debugger for Dockerfile: buildg https://github.com/ktock/buildg

Also in buildx, discussion is ongoing towards interactive debugger support and UI/UX: docker/buildx#1104

yambottle · 2022-07-15T17:29:04Z

If BuildKit removes the intermediate container when a build fails, how can I docker commit to debug that layer?
- DOCKER_BUILDKIT=0 works for me in this case
But is there an official best practice for debugging a failed build layer with BuildKit on? (I do like BuildKit's logging, though.)

terekcampbell · 2023-01-31T20:28:15Z

It's been quite some time since there's been movement here. Can we get an update on this?

ptrxyz · 2023-02-13T10:40:05Z

I fully support the idea of getting the hashes of each layer back. Maybe a good compromise would be to at least display the hash of the layer a failing command was run in?

rfay · 2023-02-13T15:11:52Z

Hashes of each layer would help so much.

Derekt2 · 2023-03-28T02:00:33Z

Still using DOCKER_BUILDKIT=0 to get image layer hashes. Why not at least give the hashes when --progress=plain is specified?

TBBle · 2023-03-28T02:49:33Z

Because it's not simply "give the hashes": those hashes (i.e. what you see in the legacy builder) do not exist until the export stage of the build. Generating them by exporting each layer as it's built into an image would be a non-trivial operation that makes BuildKit slower for everyone, and would require redesigning the BuildKit build process to know about and use the chosen image exporter much earlier in the build than it does now.

As mentioned earlier, the solution for your actual problem (debugging failed builds in docker buildx) is being worked on over in docker/buildx#1104; PR6 landed last month, and PR7+8 are currently under review.

Given that the BuildKit work to implement debugging was completed almost a year ago (Exec in the gateway API, and resolving and passing-up content IDs to the client when a build fails), I'd suggest closing this issue and redirecting people to follow the remaining work in buildx, as it does not seem like there's remaining scope for productive discussion in this ticket.

mmerickel · 2023-03-28T02:52:30Z

I just want the hash of the last layer built prior to the failure. Don’t need the hash of every layer exported.

TBBle · 2023-03-28T03:09:42Z

That's what #1472 (comment) does now, by making the "last layer" the final layer, so BuildKit can export an image, since that's all it knows how to do. Anything more would only be workable when BuildKit is being used with Docker directly (and knows it), and buildx exists to contain those cases.

What other use do you have for intermediate image generation and hash output that isn't hand-implementing docker/buildx#1104 and isn't trying to build #1472 (comment) directly into BuildKit instead of buildx?

willemm · 2023-03-28T07:08:11Z

My use case is actually to access the test report files after a failed unit test step. At the moment we use a separate target that has the unit test as last step with a " || echo failed" at the end to always succeed so we have an image to extract the test report from. But that requires building the dockerfile twice in each build, and specially tuning all the dockerfiles to support this. So access from an automated script to the build/state/files after a failed build would be very useful.
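The always-succeed pattern described here can be sketched in plain shell. The run_tests function and the report paths below are hypothetical stand-ins for a real test command; inside a Dockerfile the same logic would sit in a RUN instruction. Recording the exit status in a file, rather than only echoing "failed", is an assumption added for illustration: it lets a later CI step fail the job after the report has been extracted.

```shell
#!/bin/sh
set -u

# Illustrative location for the test report and the recorded exit status.
report_dir="${TMPDIR:-/tmp}/buildkit-test-report-example"
mkdir -p "$report_dir"

# Hypothetical stand-in for the real unit test command: writes a report and fails.
run_tests() {
  echo "1 test failed" > "$report_dir/report.txt"
  return 1
}

# Same idea as `<tests> || echo failed`: the step as a whole succeeds,
# but the real exit status is kept on disk next to the report.
status=0
run_tests || status=$?
echo "$status" > "$report_dir/exit-code"

echo "tests exited with $status; report kept in $report_dir"
```

A script that extracts the report from the resulting image can then read exit-code and decide whether to fail the pipeline.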

TBBle · 2023-03-28T10:34:09Z

Okay, so that's a use-case that isn't supported by the legacy builder either, AFAIR, it never created an image out of a failed step.

I hope you'll be pleased to know that PR8 of docker/buildx#1104 is implementing both "Execute in container at start of failed step" (similar to the legacy builder's "write down the layer ID and docker run it") and "Execute in container after failed step" (new! and the default) in the monitor via the proposed docker buildx build --invoke=on-error, so you can get access to those files through this, I expect. It's currently being worked on (and you can see a more detailed usage example) in docker/buildx#1640.

Based on this work, it would probably also be possible to implement in buildx something that can actually export an image from either the start or end of a failed step, since (I think) BuildKit now sends enough information on failure for buildx to request an image export of the container state, and buildx has enough information to tell BuildKit where to send such an image.

I don't immediately see an open feature-request in buildx for that, and I suspect it wouldn't be worked on until docker/buildx#1104 is completed (since the work heavily overlaps).

It's also possible that I'm wrong and the infrastructure that supports docker/buildx#1104 is not sufficient to support buildx exporting either or both of the before and after images of a failed build step.

So yeah, I suggest you open a feature request for your use-case on buildx, and see what the buildx maintainers think. (I'm not a buildx maintainer; I'm not super familiar with that codebase, and I have no particularly strong prediction on what they'll think of it. I hope they like it, it seems useful to me for, e.g., tests-run-during-container-build workflows.)

willemm · 2023-03-28T11:38:43Z

True, legacy didn't support that either. I was just throwing it out there as a use-case, and I am indeed pleased to know that information about PR8, thank you ^^

opinionmachine · 2023-05-12T16:39:36Z

So my use case is to use docker build to run all the package restore, build, and test steps (including coverage, static code analysis, static security analysis, etc.) and finally put the built artifact in a lightweight image. The only issue is that I'd need to access the test output from the intermediate layer to push to the CI system, and that is possible with buildkit = 0, but as far as this discussion goes, not possible with buildkit. Now I'm all for performance, but I'd love it if it were possible to label and publish an intermediate layer manually for this specific case. Otherwise I need multiple dockerfiles, like a barbarian.

tonistiigi · 2023-05-12T18:01:44Z

You can use #1472 (comment) instead of multiple Dockerfiles. Or you can PR a change that adds an option to stop at a specific Dockerfile line.

opinionmachine · 2023-05-12T20:29:17Z

You can use #1472 (comment) instead of multiple Dockerfiles. Or you can PR a change that adds an option to stop at a specific Dockerfile line.

I don’t know how you do test coverage and test results, but I’d like to have the output every run, not just when tests break.

tonistiigi · 2023-05-12T20:40:40Z

If your case is that you want to build multiple things (stages) and push their results to different locations, not only your final build result then you can look into docker buildx bake https://docs.docker.com/build/bake/reference/ . Define all the points you want to access as separate targets and a single command will build them all together and push where needed.
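A rough sketch of what such a bake setup might look like, assuming a Dockerfile with hypothetical stages named test and app. The target names, the output directory, and the example/app:latest tag are made up for illustration. The test-report target exports the filesystem of the test stage to a local directory on every run, while the app target produces the final image.

```shell
#!/bin/sh
set -eu

# Illustrative working directory for the generated bake file.
workdir="${TMPDIR:-/tmp}/buildkit-bake-example"
mkdir -p "$workdir"

# Two targets built by a single `docker buildx bake` invocation.
# Stage names "test" and "app" are assumed to exist in the Dockerfile.
cat > "$workdir/docker-bake.hcl" <<'EOF'
group "default" {
  targets = ["test-report", "app"]
}

target "test-report" {
  dockerfile = "Dockerfile"
  target     = "test"
  output     = ["type=local,dest=./test-report"]
}

target "app" {
  dockerfile = "Dockerfile"
  target     = "app"
  tags       = ["example/app:latest"]
}
EOF

# Set RUN_DOCKER=1 to actually run the build from the project directory;
# the default only prints the command.
if [ "${RUN_DOCKER:-0}" = "1" ]; then
  docker buildx bake -f "$workdir/docker-bake.hcl"
else
  echo "docker buildx bake -f $workdir/docker-bake.hcl"
fi
```

Because both targets share the same Dockerfile, steps common to the test and app stages are built once and reused from cache.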

tonistiigi · 2023-05-12T20:42:36Z

There are some new (experimental for now) debug options in new buildx release candidate: https://github.com/docker/buildx/releases/tag/v0.11.0-rc1

andyneff · 2023-06-16T12:17:36Z

If your case is that you want to build multiple things (stages) and push their results to different locations, not only your final build result then you can look into docker buildx bake https://docs.docker.com/build/bake/reference/ . Define all the points you want to access as separate targets and a single command will build them all together and push where needed.

I finally needed to use the experimental debug invoke, and I really like how it works! I hope it gets added to the bake command too, eventually. (And this too)

shapirus · 2023-10-27T07:48:22Z

So, considering all the experimental features, is there now a possibility to run a command (typically a shell) inside a build container?

With the normal builder, I can run docker ps, get the build container's ID from the output, then run docker exec -it <id> sh and get a shell running inside that container to inspect or run whatever I need there.

Does buildkit support this in any way, other than running an ssh reverse tunnel from inside the container in a RUN build step? It would be nice for it to support it before the normal builder is removed.

TBBle · 2023-10-27T08:37:27Z

@shapirus Does https://github.com/docker/buildx/blob/v0.11.2/docs/guides/debugging.md do what you want? The BuildKit-side requirements (low-level bits) are implemented; the buildx side is being built-out, was shipped experimentally in buildx 0.11 and hence Docker Desktop 4.22.0, and is looking for feedback at docker/buildx#1104.

I'd suggest trying buildx 0.12.0-rc1 if you're interested in this feature, as the command-line was changed and the relevant docs are now at https://github.com/docker/buildx/blob/v0.12.0-rc1/docs/guides/debugging.md. That way any feedback you give is relative to the current state of development.

jedevc · 2023-10-27T08:39:34Z

@tonistiigi does it make sense to close this issue? Now that we're tracking things in docker/buildx#1104, and the area/debug tag on buildx.

shapirus · 2023-10-27T08:43:35Z

Does https://github.com/docker/buildx/blob/v0.11.2/docs/guides/debugging.md do what you want?

Yes, from what I read there, it should solve it, as far as practical use cases are concerned. Thanks for the hint.

tonistiigi added the kind/enhancement label May 2, 2020

tonistiigi mentioned this issue May 11, 2020

--rm=false does not keep intermediate containers as expected #1470

Closed

coryb mentioned this issue Jun 26, 2020

add tty proxy to allow attaching to container ExecOp #1546

Closed

tonistiigi mentioned this issue Jul 2, 2020

always display image hashes #1053

Closed

tonistiigi mentioned this issue Aug 25, 2020

Debugging failed builds docker/buildx#227

Closed

manics mentioned this issue Oct 3, 2020

Persist repository build logs for later access jupyterhub/binderhub#1156

Open

hinshun mentioned this issue Oct 13, 2020

Allow gateway exec-ing into a failed solve with an exec op #1732

Merged

tonistiigi mentioned this issue Oct 22, 2020

How do I get the SHA / Container ID of failed builds docker/buildx#424

Closed


tonistiigi mentioned this issue Dec 29, 2020

How to debug build command? #1922

Closed

tonistiigi mentioned this issue Jun 11, 2021

intermediate (build) container image not visible when using buildkit docker/buildx#628

Closed


DMRobertson mentioned this issue Apr 6, 2022

Use buildkit's cache feature to speed up docker builds matrix-org/synapse#11691

Merged

tonistiigi mentioned this issue Apr 20, 2022

Support launching shell on a build step for debugging #2813

Closed

tonistiigi mentioned this issue May 10, 2022

Proposal: build debugging in buildx (interactive sessions) docker/buildx#1104

Open

gaocegege mentioned this issue Jul 13, 2022

feat(CLI): Support debug command tensorchord/envd#124

Open

sam-thibault mentioned this issue Feb 27, 2023

Partially-built image not created if error occurs during build process moby/moby#45012

Closed

colinhemmings mentioned this issue Feb 19, 2024

Flag To Display Intermediate Containers on Docker Build docker/roadmap#594

Open
