
Large images using --cache-from have some layers come through with 0 bytes unexpectedly #1540

Closed
tobymiller opened this issue Jun 23, 2020 · 2 comments


tobymiller commented Jun 23, 2020

Background

This has been affecting us over the past few days and has crippled our build system, which was using --cache-from with BuildKit to reduce a 30-minute docker build to about 5 minutes. The image is around 17 GB, and we are using AWS CodeBuild (with Docker 19 and BuildKit) and AWS ECR as the Docker registry.

Summary

When building the image, if only a few layers have changed since the previous image, docker build produces an image that is much smaller than it should be, and completes, apparently successfully. Examining the layers with docker image history (either before pushing, or by looking at the resulting image in the registry) shows many layers listed with 0B size even though they should have affected the filesystem, with many GB missing in some cases. The empty layers are sometimes ones listed as CACHED in the build log, and sometimes ones that were supposedly built immediately before examining.

When the image is run, files are found to be missing. I suspect it is always a whole layer at a time that goes missing, but which layers are affected seems somewhat random. That said, I haven't seen anything missing from near the start of the Dockerfile; it's usually something towards the end.
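One quick way to check whether the layer contents really differ (rather than just being misreported by docker image history) is to compare the layer digests recorded in a known-good image against a suspect one. A rough sketch, with placeholder image tags:

# List the layer diff IDs recorded in each image's config (tags are placeholders)
docker image inspect --format '{{range .RootFS.Layers}}{{println .}}{{end}}' <repo>:good > good-layers.txt
docker image inspect --format '{{range .RootFS.Layers}}{{println .}}{{end}}' <repo>:bad > bad-layers.txt
diff good-layers.txt bad-layers.txt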

Reduced Dockerfile and build scripts

This is not the actual Dockerfile, which has a few extra bits and pieces, but it shows the key moving parts.

# syntax = docker/dockerfile:1.0-experimental
FROM nvidia/cuda:10.0-cudnn7-runtime-ubuntu18.04

RUN apt-get <stuff>
RUN --mount=type=secret,id=accessKey --mount=type=secret,id=accessSecret --mount=type=secret,id=accessToken \ 
    AWS_ACCESS_KEY_ID=`cat /run/secrets/accessKey` \
    AWS_SECRET_ACCESS_KEY=`cat /run/secrets/accessSecret` \
    AWS_SESSION_TOKEN=`cat /run/secrets/accessToken` \
    aws s3 cp <...> - | tar -C <...> -xzv

ENV PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
ENV <more env vars>

COPY ./requirements.txt /project/requirements.txt
RUN pip install -r /project/requirements.txt

COPY / /project
CMD ["python", "..."]

The docker build command is

DOCKER_BUILDKIT=1 docker build --build-arg BUILDKIT_INLINE_CACHE=1 --secret id=accessKey,src=<...> <other secrets> -t <...> .
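With BUILDKIT_INLINE_CACHE=1 embedding cache metadata in the pushed image, the cached rebuild typically adds --cache-from pointing at that previously pushed image, roughly like this (the registry reference below is a placeholder, not our real value):

DOCKER_BUILDKIT=1 docker build \
  --build-arg BUILDKIT_INLINE_CACHE=1 \
  --cache-from <registry>/<repo>:latest \
  --secret id=accessKey,src=<...> <other secrets> \
  -t <registry>/<repo>:latest .
docker push <registry>/<repo>:latest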

We have other builds that use the same mechanism but produce much smaller images, and they haven't had this problem. My best guess is that it's something to do with the size of the image. We did change the parent image from one CUDA image to another around the time the problems started, and the new image is bigger.

Observation

In a build which has used a substantial amount of cache from the registry copy, docker image history -H --no-trunc <image> shows some lines like this:

<missing> 2 hours ago RUN |2 AWS_DEFAULT_REGION=<...> /bin/sh -c pip install -r /project/requirements.txt # buildkit 0B buildkit.dockerfile.v0
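A quick way to pull out just the empty layers from that output is to format the history and filter on the size column; a sketch with a placeholder image name:

# Print only the history entries that report a 0B layer
docker image history --no-trunc --format '{{.Size}} {{.CreatedBy}}' <image> | awk '$1 == "0B"'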

I don't think it'll be easy to make a reproducible case, so if anyone has ideas that could help me work out which parts might be important and which definitely won't be, that would be useful. The need to push and pull images in the tens of GB to test it makes iterating very time-consuming.

Expectation

I would expect the resulting image to be the same whether or not --cache-from is used, especially as nothing external is changing between builds.
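One way to check that would be to rebuild the same context with --no-cache and compare total image sizes; a multi-GB discrepancy would confirm the cached build really is dropping content rather than just misreporting it. A sketch (the nocache tag is a placeholder):

DOCKER_BUILDKIT=1 docker build --no-cache --secret id=accessKey,src=<...> <other secrets> -t check:nocache .
docker image inspect --format '{{.Size}}' check:nocache
docker image inspect --format '{{.Size}}' <cached image>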

@tonistiigi
Member

> I don't think it'll be easy to make a reproducible case, so if anyone has ideas that could help me work out which parts might be important and which definitely won't be, that would be useful.

I understand, but we do need a test case to understand this. It could be an error in display, size accounting, a corrupt cache graph, etc. I guess it is possible for it to be size-related if some errors get dropped, but that seems unlikely at the moment.
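One thing that would help narrow it down is confirming whether the data is genuinely absent from the pulled image or only misreported in the history, e.g. something like (image reference and path are placeholders):

docker pull <registry>/<repo>:<tag>
docker run --rm <registry>/<repo>:<tag> ls -la /project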

@tobymiller
Author

The problem seems to have vanished now. My only guesses are that something we changed in the Dockerfile fixed it (we moved some chmods around), or that there was some poisonous image in the registry that was breaking things. Either way, unless it happens again I have no way to get more information, and it's not affecting us, so I'm going to close this.
