
Large images using --cache-from have some layers come through with 0 bytes unexpectedly #1540

Closed
tobymiller opened this issue Jun 23, 2020 · 2 comments


tobymiller commented Jun 23, 2020

Background

This has been affecting us over the past few days and has crippled our build system, which was using --cache-from with BuildKit to reduce a 30-minute docker build to about 5 minutes. The image is around 17 GB, and we are using AWS CodeBuild (with Docker 19 and BuildKit) and AWS ECR as the Docker registry.

Summary

When building the image, if only a few layers have changed since the previous image, docker build produces an image that is much smaller than it should be, and completes, apparently successfully. Examining the layers with docker image history (either before pushing, or by looking at the resulting image in the registry) shows many layers listed with 0B size even though they should have affected the filesystem, with many GB missing in some cases. The empty layers are sometimes ones listed as CACHED in the build log, and sometimes ones that were supposedly built immediately before examining.

When the image is run, files are found to be missing. I suspect it is always a whole layer at a time that goes missing, but which layers are affected seems somewhat random. That said, I haven't seen anything missing from near the start of the Dockerfile; it's usually something towards the end.
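One quick way to check whether the layer contents really differ (rather than just being misreported by docker image history) is to compare the layer digests recorded in a known-good image against a suspect one. A rough sketch, with placeholder image tags:

# List the layer diff IDs recorded in each image's config (tags are placeholders)
docker image inspect --format '{{range .RootFS.Layers}}{{println .}}{{end}}' <repo>:good > good-layers.txt
docker image inspect --format '{{range .RootFS.Layers}}{{println .}}{{end}}' <repo>:bad > bad-layers.txt
diff good-layers.txt bad-layers.txt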

Reduced Dockerfile and build scripts

This is not the actual Dockerfile, which has a few extra bits and pieces, but it shows the key moving parts.

# syntax = docker/dockerfile:1.0-experimental
FROM nvidia/cuda:10.0-cudnn7-runtime-ubuntu18.04

RUN apt-get <stuff>
RUN --mount=type=secret,id=accessKey --mount=type=secret,id=accessSecret --mount=type=secret,id=accessToken \ 
    AWS_ACCESS_KEY_ID=`cat /run/secrets/accessKey` \
    AWS_SECRET_ACCESS_KEY=`cat /run/secrets/accessSecret` \
    AWS_SESSION_TOKEN=`cat /run/secrets/accessToken` \
    aws s3 cp <...> - | tar -C <...> -xzv

ENV PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
ENV <more env vars>

COPY ./requirements.txt /project/requirements.txt
RUN pip install -r /project/requirements.txt

COPY / /project
CMD ["python", "..."]

The docker build command is

DOCKER_BUILDKIT=1 docker build --build-arg BUILDKIT_INLINE_CACHE=1 --secret id=accessKey,src=<...> <other secrets> -t <...> .
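With BUILDKIT_INLINE_CACHE=1 embedding cache metadata in the pushed image, the cached rebuild typically adds --cache-from pointing at that previously pushed image, roughly like this (the registry reference below is a placeholder, not our real value):

DOCKER_BUILDKIT=1 docker build \
  --build-arg BUILDKIT_INLINE_CACHE=1 \
  --cache-from <registry>/<repo>:latest \
  --secret id=accessKey,src=<...> <other secrets> \
  -t <registry>/<repo>:latest .
docker push <registry>/<repo>:latest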

We have other builds that use the same mechanism but produce much smaller images, and they haven't had this problem. My best guess is that it's something to do with the size of the image. We did change the parent image from one CUDA image to another around the time the problems started, and the new image is bigger.

Observation

In a build which has used a substantial amount of cache from the registry copy, docker image history -H --no-trunc <image> shows some lines like this:

<missing> 2 hours ago RUN |2 AWS_DEFAULT_REGION=<...> /bin/sh -c pip install -r /project/requirements.txt # buildkit 0B buildkit.dockerfile.v0
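A quick way to pull out just the empty layers from that output is to format the history and filter on the size column; a sketch with a placeholder image name:

# Print only the history entries that report a 0B layer
docker image history --no-trunc --format '{{.Size}} {{.CreatedBy}}' <image> | awk '$1 == "0B"'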

I don't think it'll be easy to make a reproducible case, so if anyone has ideas that could help me work out which parts might be important and which definitely won't be, that would be useful. The need to push and pull images in the tens of GB to test it makes iterating very time-consuming.

Expectation

I would expect the resulting image to be the same whether or not --cache-from is used, especially as nothing external is changing between builds.
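One way to check that would be to rebuild the same context with --no-cache and compare total image sizes; a multi-GB discrepancy would confirm the cached build really is dropping content rather than just misreporting it. A sketch (the nocache tag is a placeholder):

DOCKER_BUILDKIT=1 docker build --no-cache --secret id=accessKey,src=<...> <other secrets> -t check:nocache .
docker image inspect --format '{{.Size}}' check:nocache
docker image inspect --format '{{.Size}}' <cached image>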

@tonistiigi
Member

> I don't think it'll be easy to make a reproducible case, so if anyone has ideas that could help me work out which parts might be important and which definitely won't be, that would be useful.

I understand, but we do need a test case to understand this. It could be an error in display, size accounting, a corrupt cache graph, etc. I guess it is possible for it to be size-related if some errors get dropped, but that seems unlikely at the moment.
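One thing that would help narrow it down is confirming whether the data is genuinely absent from the pulled image or only misreported in the history, e.g. something like (image reference and path are placeholders):

docker pull <registry>/<repo>:<tag>
docker run --rm <registry>/<repo>:<tag> ls -la /project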

@tobymiller
Author

The problem seems to have vanished now. My only guesses are that something we changed in the Dockerfile fixed it (we moved some chmods around), or that there was some poisonous image in the registry that was breaking things. Either way, unless it happens again I have no way to get more information, and it's not affecting us, so I'm going to close this.
