Improved debugging support #1472

Open
tonistiigi opened this issue May 2, 2020 · 47 comments

Comments

@tonistiigi
Member

addresses #1053
addresses #1470

An issue with the current build environment is that we often assume everyone can write a perfect Dockerfile from scratch without any mistakes. In the real world, writing a complex Dockerfile involves a lot of trial and error. Users get errors, need to understand what is causing them, and react accordingly.

In the legacy builder, one of the methods for dealing with this situation was to use --rm=false, or to look up the image ID of the last image layer from the build output and start a docker run session with it to understand what was wrong. BuildKit does not create intermediate images, nor does it make the containers it runs visible in docker run (both for very good reasons). Therefore this is even more complicated now and usually requires the user to set --target to do a partial build and then debug the output of it.

To improve this, we shouldn't try to bring back --rm=false, which makes all builds significantly slower and makes it impossible to manage storage for the build cache. Instead, we could provide a better solution with a new --debugger flag.

Using --debugger on a build, should that build error, will take the user into a debugger shell similar to interactive docker run experience. There the user can see the error and use control commands to debug the actual cause.

If the error happened on a RUN command (execop in LLB), the user can use the shell to rerun the command and keep tweaking it. This will happen in an environment identical to the one where the execop runs; for example, this means access to secrets, ssh, cache mounts, etc. They can also inspect the environment variables and files in the system that might be causing the issue. Using control commands, a user can switch between the broken state that was left behind by the failed command and the initial base state for that command. So if they try many possible fixes but end up in a bad state, they can just restore the initial state and start again.

If the error happened on a copy (or other file operation like rm), they can run ls and similar tools to find out why the file path is not correct and not working.

For implementation, this depends on #749 for support for running processes on build mounts directly without going through the solver. We would start by modifying the Executor and ExecOp so that, instead of releasing the mounts after an error, they return them together with the error. I believe typed errors support (#1454) can be reused for this. They should be returned up to the client Solve method, which can then decide to call llb.Exec with these mounts. If mounts are left unhandled, they are released with the gateway API release.

Once the debugging has completed, and the user has made changes to the source files, it is easy to trigger a restart of the build with exactly the same settings. This is also useful if you think you might be hitting a temporary error. If the retry didn't fix it, the user is brought back to the debugger.

It might make sense to introduce a concept of "debugger image" that is used as a basis of the debugging environment. This would allow avoiding hardcoded logic in an opinionated area.

Later this could be extended with a step-based debugger, and source mapping support could be used to make source code changes directly in the editor or to track dependencies in the build graph.

@hinshun
Collaborator

hinshun commented May 2, 2020

Regarding the "debugger image", my colleague @slushie did some interesting work with sharing a mount namespace (partial containers) with an image that has debugging tools: https://github.com/slushie/cdbg

In that repository, there's a prototype of gdb in the debugging image, attaching to the process of a running container.

This may be useful to debug scratch images or minimal images that may not have the basic tools like a shell binary.

@fuweid
Contributor

fuweid commented May 2, 2020

/cc

@tonistiigi
Member Author

@coryb Now that Exec support has landed, how big a job do you estimate it would be to return the typed errors from execop/fileop that would allow running exec from the error position and from the position at the start of the op? I'm wondering whether we should target that for v0.8 or not. We could potentially continue working on the client-side UX after v0.8 is out. I've already added #1714 to v0.8, which I think is a requirement.

@coryb
Collaborator

coryb commented Oct 8, 2020

I am working on #1714 now, I am guessing a week+ before I have something viable for that.

I have not really looked into the change required for this yet. I think @hinshun has some ideas and is generally more familiar with this than I am. I will sync up with him and maybe twist his arm to help out 😄 I think we can try to break down what is remaining for this and try to come up with some estimates.

@ag-TJNII

ag-TJNII commented Oct 23, 2020

Using --debugger on a build, should that build error, will take the user into a debugger shell similar to interactive docker run experience. There the user can see the error and use control commands to debug the actual cause.

Interactive shells being the only option is going to leave much to be desired when building in CI pipelines. I often use Docker in CI pipelines where the build command has no terminal to drop to or is a direct API call; having the only option be "run interactive" is not in line with current automated build best practices. Please consider an option to allow sideband inspection of BuildKit layers, similar to how the legacy docker build works. Thanks.

@lyager

lyager commented Mar 18, 2021

I've just upgraded Docker for Mac, which uses BUILDKIT as its default engine. Not feeling very comfortable with the suggested nsenter solution since the project is deprecated (or at least marked 'read-only'). Just wanted to give a +1 for getting this fixed. --debugger sounds like a great solution, maybe even letting it switch directly into interactive shell when a build step fails.

@lyager

lyager commented Mar 18, 2021

Just wanted to follow up, changing the backend while building works for me: DOCKER_BUILDKIT=0 docker build . - but I must admit the speed of using buildkit is nice!

@JoelTrain

I agree.
Having the image of the layer immediately prior to the issue makes it incredibly handy to run an interactive container immediately prior to the problem to poke around.

I guess for now I will run DOCKER_BUILDKIT=0 docker build . as a workaround when debugging new Dockerfiles, so that I can get the image IDs in the output again:

Step 2/12 : WORKDIR /usr/src/app
---> Running in 14307a565858
Removing intermediate container 14307a565858
---> 472b33608107
Step 3/12 : COPY ./package.json .
---> 40293e6966f5
Step 4/12 : COPY ./package-lock.json .
---> e91be6e9c9c6
Step 5/12 : RUN npm install
---> Running in dc762b24b192

$ docker run -it --rm e91be6e9c9c6 sh
/usr/src/app #

@gtmtech

gtmtech commented Mar 23, 2021

Is there any solution in this space yet (that doesn't involve nsenter or regressing to DOCKER_BUILDKIT=0)? I can't quite believe that it's coming up on 2 years since #1053 was raised and nobody has been able to debug docker buildkit builds since. It sounds like about as common a use case as you could get.

Can't find any example of active work to resolve this issue, might step in and help out if there's nothing in the pipeline

@tonistiigi
Member Author

I don't know what you mean by nsenter solution but that is not recommended. What you can do is create a named target to the position of the dockerfile you want to debug, build that target with --target and run it with docker run.
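
A minimal sketch of this workaround, for illustration only. The stage names (debug, final), the file name Dockerfile.debug, the image tag build-debug, the base image, and the helper function debug_shell are all hypothetical; only --target and docker run come from the comment above.

```shell
#!/bin/sh
# Hypothetical Dockerfile: the named stage "debug" ends right before the
# step that fails, so building it yields an image of the state just before.
cat > Dockerfile.debug <<'EOF'
FROM node:20-alpine AS debug
WORKDIR /usr/src/app
COPY package.json package-lock.json ./

# The failing step is moved into a stage that continues from "debug".
FROM debug AS final
RUN npm install
EOF

# Build only up to the "debug" stage, then open a shell in the result.
# (Hypothetical helper; it is defined here but not called.)
debug_shell() {
    docker build -f Dockerfile.debug --target debug -t build-debug . &&
        docker run --rm -it build-debug sh
}
```

Calling debug_shell from the directory that holds the build context should drop into a shell where the failing RUN command can be retried by hand.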

@matt2000

matt2000 commented Apr 6, 2021

Just chiming in with a user perspective, after being put in a new environment where BUILDKIT appears to be the default, this is a decidedly worse experience than the past. Clearly the layers are being cached. I'd guess the simplest solution with a "backward compatible user experience" might be to just automatically export the last cached layer to the image store, and display its hash, whenever there is an error in docker build. Named targets for debugging feel like an awkward misuse of the feature, since the old way was "automatic."

@strelga

strelga commented Apr 14, 2021

@tonistiigi
Do you plan to take this issue into development in the near future?
Does it have blockers now?

@itcarroll

The --target option is not recognized by docker-compose build (version 1.28.5), so I'm sadly resorting to DOCKER_BUILDKIT=0.

@KevOrr

KevOrr commented Apr 23, 2021

The --target option is not recognized by docker-compose build (version 1.28.5), so I'm sadly resorting to DOCKER_BUILDKIT=0.

Iirc, when using Compose, target is a field in the build: subsection of a service definition

edit: https://github.com/compose-spec/compose-spec/blob/master/build.md#target
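
As a rough illustration of that field, assuming a Dockerfile that already has a stage named debug. The service name app, the stage name, the file name docker-compose.debug.yml, and the helper compose_build_debug are placeholders, not taken from the thread.

```shell
#!/bin/sh
# Hypothetical Compose file: build.target selects the Dockerfile stage to
# stop at, mirroring `docker build --target debug`.
cat > docker-compose.debug.yml <<'EOF'
services:
  app:
    build:
      context: .
      target: debug
EOF

# Build the service up to the chosen stage.
# (Hypothetical helper; it is defined here but not called.)
compose_build_debug() {
    docker-compose -f docker-compose.debug.yml build app
}
```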

@willemm

willemm commented Apr 29, 2021

The proposed option mentioned in #1053 , where you can specify that it should create the image even on failure, would be very helpful. It would even be helpful if you could just enhance the --output option with a flag that it also outputs on failure.

@emmahyde

This would be fantastic. It's the only thing holding me back from moving over to buildkit full time!

@NicolasDorier

NicolasDorier commented Jun 10, 2021

Just want to say that it is VERY painful to not be able to interactively debug intermediate images...
It really turns a 5-minute debugging problem into a 2-hour-long process...

@cburgard

cburgard commented Jul 1, 2021

After switching to buildkit recently because of the secret-mount option, I've just spent about half an hour trying to figure out what magical command I need to show the images in the buildkit cache, the apparent answer being "it's not possible". I find it hard to believe that this issue still persists...

@tonistiigi
Member Author

You can add a multi-stage split anywhere in your Dockerfile and use --target to build the portion you want to turn into a debug image.

@hraban

hraban commented Jul 21, 2021

A temporary work-around is docker-compose, which (as of writing, v1.29.2) still doesn't use build kit when you do docker-compose run. You can create a simple docker-compose file with context: ., use docker-compose run --rm yourservice, which will then try to build it and print hash ids along the way. But if you use docker-compose build, it already uses buildkit, so this workaround is most likely on its way out. As is docker-compose itself, iirc?

@chrisawad

This can give you a look at the point after a successfully completed stage:

export DOCKER_BUILDKIT=1
docker build --target <stage> -t test .
docker run --rm -it test bash

But unlike when DOCKER_BUILDKIT=0, I don't think there's a way to see the hash for each layer created in the image so you can't just jump in right before the error and test at the moment of failure.

Highly unfortunate, and a big deal if you ask me!

@kingbuzzman

$ docker --version
Docker version 20.10.14

DOCKER_BUILDKIT=0 docker build .. doesn't seem to work anymore. I no longer get the hashes

@ktock
Collaborator

ktock commented May 10, 2022

FYI:

I recently implemented an experimental interactive debugger for Dockerfiles: buildg https://github.com/ktock/buildg

Also in buildx, discussion is ongoing towards interactive debugger support and UI/UX: docker/buildx#1104

@yambottle

yambottle commented Jul 15, 2022

  • If BuildKit removes the intermediate container on build failure, how can I docker commit to debug that layer?
    • DOCKER_BUILDKIT=0 works for me in this case
  • But is there an official best practice for debugging a failed build layer with BuildKit on? (I do like BuildKit's logging, though.)

@terekcampbell

It's been quite some time since there's been movement here. Can we get an update on this?

@ptrxyz

ptrxyz commented Feb 13, 2023

I fully support the idea of getting the hashes of each layer back. Maybe a good compromise would be to at least display the hash of the layer a failing command was run in?

@rfay

rfay commented Feb 13, 2023

Hashes of each layer would help so much.

@Derekt2

Derekt2 commented Mar 28, 2023

Still using DOCKER_BUILDKIT=0 to get image layer hashes. Why not at least give the hashes when --progress=plain is specified?

@TBBle
Collaborator

TBBle commented Mar 28, 2023

Because it's not simply "give the hashes": those hashes (i.e. what you see in the legacy builder) do not exist until the export stage of the build. Generating them by exporting each layer as it's built into an image would be a non-trivial operation that makes BuildKit slower for everyone, and would require redesigning the BuildKit build process to know about and use the chosen image exporter much earlier in the build than it does now.

As mentioned earlier, the solution for your actual problem (debugging failed builds in docker buildx) is being worked on over in docker/buildx#1104; PR6 landed last month, and PR7+8 are currently under review.

Given that the BuildKit work to implement debugging was completed almost a year ago (Exec in the gateway API, and resolving and passing-up content IDs to the client when a build fails), I'd suggest closing this issue and redirecting people to follow the remaining work in buildx, as it does not seem like there's remaining scope for productive discussion in this ticket.

@mmerickel

I just want the hash of the last layer built prior to the failure. Don’t need the hash of every layer exported.

@TBBle
Collaborator

TBBle commented Mar 28, 2023

That's what #1472 (comment) does now, by making the "last layer" the final layer, so BuildKit can export an image, since that's all it knows how to do. Anything more would only be workable when BuildKit is being used with Docker directly (and knows it), and buildx exists to contain those cases.

What other use do you have for intermediate image generation and hash output that isn't hand-implementing docker/buildx#1104 and isn't trying to build #1472 (comment) directly into BuildKit instead of buildx?

@willemm

willemm commented Mar 28, 2023

My use case is actually to access the test report files after a failed unit test step. At the moment we use a separate target that has the unit test as last step with a " || echo failed" at the end to always succeed so we have an image to extract the test report from. But that requires building the dockerfile twice in each build, and specially tuning all the dockerfiles to support this. So access from an automated script to the build/state/files after a failed build would be very useful.
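
One possible shape for this pattern, sketched under assumptions rather than taken from the commenter's setup. Instead of a bare " || echo failed", the test step records the real exit status next to the log, and a scratch stage plus BuildKit's local exporter copies the report out without a docker run. The stage names (test, report), the paths, the npm commands, the file name Dockerfile.test, and the helpers export_reports and tests_passed are all hypothetical.

```shell
#!/bin/sh
# Hypothetical Dockerfile with a test stage that never fails the build.
cat > Dockerfile.test <<'EOF'
FROM node:20-alpine AS test
WORKDIR /usr/src/app
COPY . .
RUN npm ci
# Keep the log and the real exit status instead of failing the step.
RUN mkdir -p /reports && \
    (npm test > /reports/test.log 2>&1; echo $? > /reports/exit-code)

# A scratch stage holding only the report files, for export.
FROM scratch AS report
COPY --from=test /reports /
EOF

# Export the contents of the "report" stage to ./reports on the host.
# (Hypothetical helper; it is defined here but not called.)
export_reports() {
    DOCKER_BUILDKIT=1 docker build -f Dockerfile.test --target report \
        --output type=local,dest=./reports .
}

# Succeed only if the recorded exit status of the test run was 0.
tests_passed() {
    [ "$(cat "${1:-./reports}/exit-code")" = "0" ]
}
```

A CI script could call export_reports, publish ./reports/test.log, and then use tests_passed to decide whether to fail the job. This still needs a dedicated stage in each Dockerfile, so it only addresses part of the concern raised above.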

@TBBle
Collaborator

TBBle commented Mar 28, 2023

Okay, so that's a use-case that isn't supported by the legacy builder either, AFAIR, it never created an image out of a failed step.

I hope you'll be pleased to know that PR8 of docker/buildx#1104 is implementing both "Execute in container at start of failed step" (similar to legacy builder "write-down layer ID and docker run it") and "Execute in container after failed step" (new! and the default) in the monitor via proposed docker buildx build --invoke=on-error, so you can get access to those files through this, I expect. It's currently being worked on (and you can see a more-detailed usage example) in docker/buildx#1640.
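
For orientation, a small wrapper around the proposed invocation might look like the sketch below. The --invoke=on-error flag is quoted from the comment above; the BUILDX_EXPERIMENTAL=1 variable is an assumption based on the feature being experimental, and the helper name debug_on_error and the DRY_RUN switch are hypothetical. Later buildx releases changed this command line, so the debugging guide for the installed version should take precedence.

```shell
#!/bin/sh
# Hypothetical helper: run a build that opens the monitor when a step fails.
# With DRY_RUN=1 it only prints the command it would run.
debug_on_error() {
    context="${1:-.}"
    # Flag as proposed in the thread; it may differ in newer buildx versions.
    set -- docker buildx build --invoke=on-error "$context"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "BUILDX_EXPERIMENTAL=1 $*"
    else
        # Assumption: experimental buildx features are enabled via this variable.
        BUILDX_EXPERIMENTAL=1 "$@"
    fi
}
```

Running DRY_RUN=1 debug_on_error ./app prints the command line without contacting Docker, which is a cheap way to check it against the installed buildx documentation first.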

Based on this work, it would probably also be possible to implement in buildx something that can actually export an image from either the start or end of a failed step, since (I think) BuildKit now sends enough information on failure for buildx to request an image export of the container state, and buildx has enough information to tell BuildKit where to send such an image.

I don't immediately see an open feature-request in buildx for that, and I suspect it wouldn't be worked on until docker/buildx#1104 is completed (since the work heavily overlaps).

It's also possible that I'm wrong and the infrastructure that supports docker/buildx#1104 is not sufficient to support buildx exporting either or both of the before and after images of a failed build step.

So yeah, I suggest you open a feature request for your use-case on buildx, and see what the buildx maintainers think. (I'm not a buildx maintainer; I'm not super familiar with that codebase, and I have no particularly strong prediction on what they'll think of it. I hope they like it, it seems useful to me for, e.g., tests-run-during-container-build workflows.)

@willemm

willemm commented Mar 28, 2023

True, legacy didn't support that either. I was just throwing it out there as a use-case, and I am indeed pleased to know that information about PR8, thank you ^^

@opinionmachine

So my use case is to use docker build to run all the package restore, build, test (including coverage, static code analysis, static security analysis, etc.) and finally put the built artifact in a lightweight image. The only issue is I'd need to access the test output from the intermediate layer to push to the CI system, and that is possible with buildkit = 0, but as far as this discussion goes not possible with buildkit. Now I'm all for performance, but I'd love it if it were possible to label and publish an intermediate layer manually for this specific case. Otherwise I need multiple Dockerfiles, like a barbarian.

@tonistiigi
Member Author

You can use #1472 (comment) instead of multiple Dockerfiles. Or you can PR a change that adds an option to stop at a specific Dockerfile line.

@opinionmachine

You can use #1472 (comment) instead of multiple Dockerfiles. Or you can PR a change that adds an option to stop at a specific Dockerfile line.

I don’t know how you do test coverage and test results, but I’d like to have the output every run, not just when tests break.

@tonistiigi
Member Author

If your case is that you want to build multiple things (stages) and push their results to different locations, not only your final build result then you can look into docker buildx bake https://docs.docker.com/build/bake/reference/ . Define all the points you want to access as separate targets and a single command will build them all together and push where needed.
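
A hedged sketch of what such a bake definition could look like, assuming a single Dockerfile with stages named final and report. The target names, the tag example/app:latest, the output directory, the file name docker-bake.debug.hcl, and the helper bake_all are illustrative placeholders.

```shell
#!/bin/sh
# Hypothetical bake file: one target per point in the Dockerfile whose
# result should be kept, all built by a single command.
cat > docker-bake.debug.hcl <<'EOF'
group "default" {
  targets = ["app", "test-report"]
}

# The final image.
target "app" {
  dockerfile = "Dockerfile"
  target     = "final"
  tags       = ["example/app:latest"]
}

# An intermediate stage exported as plain files instead of an image.
target "test-report" {
  dockerfile = "Dockerfile"
  target     = "report"
  output     = ["type=local,dest=./reports"]
}
EOF

# Build every target in the default group.
# (Hypothetical helper; it is defined here but not called.)
bake_all() {
    docker buildx bake -f docker-bake.debug.hcl
}
```

Because both targets read the same Dockerfile, the stages they have in common should only need to be built once per invocation.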

@tonistiigi
Member Author

There are some new (experimental for now) debug options in new buildx release candidate: https://github.com/docker/buildx/releases/tag/v0.11.0-rc1

@andyneff

If your case is that you want to build multiple things (stages) and push their results to different locations, not only your final build result then you can look into docker buildx bake https://docs.docker.com/build/bake/reference/ . Define all the points you want to access as separate targets and a single command will build them all together and push where needed.

I finally needed to use the experimental debug invoke, and I really like how it works! I hope it gets added to the bake command too, eventually. (And this too)

@shapirus

So, considering all the experimental features, is there now a possibility to run a command (typically a shell) inside a build container?

With the normal builder, I can run docker ps, get the build container's ID from the output, then run docker exec -it <id> sh and get a shell running inside that container to inspect or run whatever I need there.

Does buildkit support this in any way, other than running an ssh reverse tunnel from inside the container in a RUN build step? It would be nice for it to support it before the normal builder is removed.

@TBBle
Collaborator

TBBle commented Oct 27, 2023

@shapirus Does https://github.com/docker/buildx/blob/v0.11.2/docs/guides/debugging.md do what you want? The BuildKit-side requirements (low-level bits) are implemented; the buildx side is being built-out, was shipped experimentally in buildx 0.11 and hence Docker Desktop 4.22.0, and is looking for feedback at docker/buildx#1104.

I'd suggest trying buildx 0.12.0-rc1 if you're interested in this feature, as the command-line was changed and the relevant docs are now at https://github.com/docker/buildx/blob/v0.12.0-rc1/docs/guides/debugging.md. That way any feedback you give is relative to the current state of development.

@jedevc
Member

jedevc commented Oct 27, 2023

@tonistiigi does it make sense to close this issue? Now that we're tracking things in docker/buildx#1104, and the area/debug tag on buildx.

@shapirus

Does https://github.com/docker/buildx/blob/v0.11.2/docs/guides/debugging.md do what you want?

Yes, from what I read there, it should solve it, as far as practical use cases are concerned. Thanks for the hint.
