Background
This has been affecting us over the past few days and has crippled our build system, which uses --cache-from with BuildKit to reduce a 30-minute docker build to about 5 minutes. The image is around 17 GB; we are using AWS CodeBuild (with Docker 19 and BuildKit) and AWS ECR as the Docker registry.
Summary
When building the image, if at most a few layers have changed since the previous image, docker build produces an image that is much smaller than it should be, yet completes apparently successfully. Examining the layers with docker image history (either before pushing or on the resulting image pulled back from the registry) shows many layers listed with 0B size even though they should have changed the filesystem, often with many GB missing in total. The empty layers are sometimes ones listed as CACHED in the build log, and sometimes ones that were supposedly built immediately before examining.
When the image is run, it finds that files are missing. I suspect that whole layers go missing at a time, but which layers go missing seems to be somewhat random. That said, I haven't seen anything missing from near the start of the Dockerfile; it's usually something towards the end.
Reduced Dockerfile and build scripts
This is not the actual Dockerfile, which has a few extra bits and pieces, but it shows the key moving parts.
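A minimal sketch of the shape of the setup (the base image tag, the repository name, and the data-fetching step are placeholders, not our real ones; the pip install step is a real one from our build log):

# Dockerfile (illustrative sketch; the real file has more steps)
FROM nvidia/cuda:10.2-cudnn7-devel-ubuntu18.04
ARG AWS_DEFAULT_REGION
COPY project/ /project/
RUN pip install -r /project/requirements.txt
RUN /project/fetch_data.sh   # hypothetical step that adds several GB

#!/bin/sh
# build.sh -- illustrative; ECR_REPO stands in for our real registry URI
export DOCKER_BUILDKIT=1
docker pull "$ECR_REPO:latest" || true   # best-effort pull of the previous image
docker build \
  --build-arg BUILDKIT_INLINE_CACHE=1 \
  --build-arg AWS_DEFAULT_REGION="$AWS_DEFAULT_REGION" \
  --cache-from "$ECR_REPO:latest" \
  -t "$ECR_REPO:latest" .
docker push "$ECR_REPO:latest"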
We have other builds that use the same mechanism but produce much smaller images, and they haven't had this problem. My best guess is that it's something to do with the size of the image. We did change the parent image from one CUDA image to another around the time the problems started, and the new image is bigger.
Observation
In a build which has used a substantial amount of cache from the registry copy, docker image history -H --no-trunc <image> shows some lines like this:
<missing> 2 hours ago RUN |2 AWS_DEFAULT_REGION=<...> /bin/sh -c pip install -r /project/requirements.txt # buildkit 0B buildkit.dockerfile.v0
I don't think it'll be easy to make a reproducible case, so if anyone has any ideas that could help me work out which parts might be important and which definitely won't be, that would be useful. The requirement to push and pull images in the tens of GB to test it makes iterating very time-consuming.
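If size does turn out to matter, a synthetic repro might look something like this (an untested sketch that fabricates large layers instead of using our real build; ECR_REPO is again a placeholder):

#!/bin/sh
# repro.sh -- hypothetical size-based repro (untested)
export DOCKER_BUILDKIT=1
cat > Dockerfile.repro <<'EOF'
FROM ubuntu:18.04
# several multi-GB layers of random data to mimic the shape of our image
RUN dd if=/dev/urandom of=/layer1.bin bs=1M count=4096
RUN dd if=/dev/urandom of=/layer2.bin bs=1M count=4096
RUN dd if=/dev/urandom of=/layer3.bin bs=1M count=4096
EOF
docker build --build-arg BUILDKIT_INLINE_CACHE=1 \
  -t "$ECR_REPO/repro:base" -f Dockerfile.repro .
docker push "$ECR_REPO/repro:base"
# then append one trailing RUN step, rebuild with --cache-from "$ECR_REPO/repro:base",
# and compare the docker image history output of the two builds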
Expectation
I would expect the resulting image to be the same whether or not --cache-from is used, especially as nothing external changes between builds.
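One way to check that expectation directly (assuming a cached and an uncached build are both available locally, under hypothetical tags here) is to compare the layer digests rather than trusting the sizes in the history output:

# identical builds should list identical diff IDs in the same order
docker image inspect --format '{{json .RootFS.Layers}}' myimage:cached
docker image inspect --format '{{json .RootFS.Layers}}' myimage:nocache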
I don't think it'll be easy to make a reproducible case, so if anyone has any ideas that could help me work out which parts might be important and which definitely won't be, that would be useful.
I understand, but we do need a test case to understand this. It could be an error in display, in size accounting, a corrupt cache graph, etc. I guess it is possible for it to be size-related if some errors get dropped, but that seems unlikely atm.
The problem seems to have vanished now. My only guesses are that something we changed in the Dockerfile fixed it (we moved some chmods around), or that there was some poisonous image in the registry that was breaking things. Either way, unless it happens again I have no way to get more information, and it's no longer affecting us, so I'm going to close this.