
(aws-lambda-python) docker build is not working #12610

Closed
ncaq opened this issue Jan 20, 2021 · 30 comments
Labels
@aws-cdk/aws-lambda-python blocked Work is blocked on this issue for this codebase. Other labels or comments may indicate why. bug This issue is a bug. effort/medium Medium work item – several days of effort p2

Comments

@ncaq
Contributor

ncaq commented Jan 20, 2021

Maybe this is the Python version of
[aws-lambda-nodejs] docker build is not working · Issue #10881 · aws/aws-cdk.

Reproduction Steps

new PythonFunction(this, "handler");
cdk deploy FooStack
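
For context, a fuller sketch of the same setup (not from the original report; it assumes CDK v1 module paths and a ./handler directory containing index.py with a handler function):

import * as cdk from "@aws-cdk/core";
import * as lambda from "@aws-cdk/aws-lambda";
import { PythonFunction } from "@aws-cdk/aws-lambda-python";

class FooStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string, props?: cdk.StackProps) {
    super(scope, id, props);
    // Bundling the function code runs inside a docker container based on
    // amazon/aws-sam-cli-build-image-python3.7, which is where the error below surfaces.
    new PythonFunction(this, "handler", {
      entry: "handler",                    // assumed directory containing index.py
      runtime: lambda.Runtime.PYTHON_3_7,  // the runtime selects the SAM build image tag
    });
  }
}

const app = new cdk.App();
new FooStack(app, "FooStack");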

What did you expect to happen?

The deploy succeeds.

What actually happened?

docker: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: exec: "bash": executable file not found in $PATH: unknown.

Environment

Other


This is a 🐛 Bug Report

@ncaq ncaq added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Jan 20, 2021
@NGL321 NGL321 changed the title [aws-lambda-python] docker build is not working (aws-lambda-python) docker build is not working Jan 25, 2021
@christophgysin
Contributor

christophgysin commented Feb 5, 2021

I just wiped my docker cache, and everything seems to work again.

$  docker system prune --all

I noticed that it uses amazon/aws-sam-cli-build-image-python3.7:latest, so maybe I was using an older cached version of the image that caused the issue?

@christophgysin
Contributor

christophgysin commented Feb 5, 2021

It seems I managed to reproduce it again. I added a line that prints out the docker command, and ran it manually. It seems the --user option is what causes the image amazon/aws-sam-cli-build-image-python3.7 to fail:

This works:

$ docker run --rm amazon/aws-sam-cli-build-image-python3.7 true

But this fails:

$ docker run --rm -u 1000:1000 amazon/aws-sam-cli-build-image-python3.7 true
docker: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: exec: "true": executable file not found in $PATH: unknown.

This is on ArchLinux with docker:

$ docker --version
Docker version 20.10.2, build 2291f610ae

@christophgysin
Contributor

It seems the issue is that the root (/) is not accessible to anyone other than root (uid 0):

$ docker run amazon/aws-sam-cli-build-image-python3.7 ls -ld /
drwx------ 1 root root 238 Feb  6 20:23 /

I tried adding a chmod 755 / to the Dockerfile, but strangely that doesn't seem to have any effect:

$ cat > Dockerfile <<EOF
FROM amazon/aws-sam-cli-build-image-python3.7
RUN chmod 755 /
EOF
$ docker run $(docker build -q .) ls -ld /
drwx------ 1 root root 238 Feb  6 20:27 /

Any idea how I could work around the issue?

@christophgysin
Contributor

I noticed that the reference issue #10881 fixed this by adding chmod 711:
https://github.com/aws/aws-cdk/blob/master/packages/%40aws-cdk/aws-lambda-nodejs/lib/Dockerfile#L28

@christophgysin
Contributor

It seems that the above behavior, where the root mode can't be changed, only occurs when using the btrfs storage driver. As a workaround, I switched to overlay2.

Now I'm curious what should be done to fix this:
a) fix the aws-sam-cli-emulation-image-* images to have a sensible root mode
b) fix the btrfs storage driver to respect new root modes of additional layers
c) both of the above?

@ncaq
Contributor Author

ncaq commented Feb 11, 2021

I searched for the location where the aws-sam-cli-emulation-image-* images are built, as was done when fixing the nodejs issue.
I couldn't find it.

@eladb eladb added effort/small Small work item – less than a day of effort p1 labels Feb 15, 2021
@eladb eladb removed their assignment Feb 25, 2021
@ryparker
Contributor

ryparker commented Jun 3, 2021

Hey @ncaq 👋🏻

Thanks for bringing this bug to our attention! I'm going to leave this issue as p1 because the potential impact of this is significant. The fix suggested above looks like an appropriate way of fixing this.

If anyone is interested in lending a hand this is a great first issue to tackle. A good place to start is by following the steps described in our contribution guidelines.

@ryparker ryparker removed the needs-triage This issue or PR still needs to be triaged. label Jun 3, 2021
@christophgysin
Contributor

christophgysin commented Jun 7, 2021

@ryparker Could you clarify which fix you are referring to?

a) fix the aws-sam-cli-emulation-image-* images to have a sensible root mode

If you are referring to this, could you please point me to the code that builds this image? IIRC from my research in February, the image is not built as part of any publicly available repository.

@hariseldon78

I have the same problem. I'm using Arch Linux. The same build works fine on an Ubuntu VM.
For 4 or 5 months now I've had to do my CDK development over ssh into a VM, using an sshfs file system to edit the code. docker system prune doesn't help. Can anyone suggest a workaround?

@christophgysin
Contributor

christophgysin commented Jun 14, 2021

@hariseldon78

It seems that the above behavior where the root mode can't be changed is only when using the btrfs storage driver. As a workaround, I switched to overlay2.

$ cat /etc/docker/daemon.json 
{
  "storage-driver": "overlay2"
}

@hariseldon78

As a workaround, I switched to overlay2.

Thanks. In my case it was configured with "overlay", and I confirm that switching to "overlay2" fixed the problem.

@christophgysin
Contributor

The fix suggested above looks like an appropriate way of fixing this.

@ryparker Could you please clarify what fix you were referring to? It seems that's all that is standing in the way of getting this fixed.

@alexrashed

alexrashed commented Sep 29, 2021

Unfortunately, I'm facing the same issue with the ZFS storage driver. Is there any other workaround which doesn't involve changing my file system?

I noticed that the reference issue #10881 fixed this by adding chmod 711: https://github.com/aws/aws-cdk/blob/master/packages/%40aws-cdk/aws-lambda-nodejs/lib/Dockerfile#L28

@ryparker I guess the same fix could easily be applied here as well. The only remaining question is the location of the Dockerfile:

@ryparker Could you clarify which fix you are referring to?

a) fix the aws-sam-cli-emulation-image-* images to have a sensible root mode

If you are referring to this, could you please point my to the code that builds this image? IIRC from my research in February, the image is not built as part of any publicly available repository.

I'd be happy to file a PR if someone could point me to the Dockerfile 😄

@christophgysin
Contributor

Is there any other workaround which doesn't involve changing my file system?

Yes, change the docker storage driver.

I'd be happy to file a PR if someone could point me to the Dockerfile

It doesn't seem to be public, so there is no way for us to make PRs. Someone from AWS needs to step up and fix this.

It's a shame that @ryparker isn't responding to this anymore. The community did all the work and the root cause has been found. A fix has been suggested. All it needs is someone on the inside to apply the fix 😞

@alexrashed

alexrashed commented Sep 29, 2021

Yes, change the docker storage driver.

The overlay2 storage driver does not support all types of upper file systems (including mine). So unfortunately the workaround doesn't work for me.

@ryparker
Contributor

ryparker commented Oct 5, 2021

I apologize for the delay in response.

I'm unable to reproduce this; however, I've applied the same fix that was applied to the nodejs image to the python image.

Would anyone be able to test this fix or provide the full reproduction code?

@christophgysin
Contributor

christophgysin commented Oct 5, 2021

@ryparker Thanks for your reply.

Here are the steps to reproduce the core issue:

$ cat /etc/docker/daemon.json 
{
  "storage-driver": "btrfs"
}
$ docker run --rm -u 1000:1000 amazon/aws-sam-cli-build-image-python3.7 true
docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: exec: "true": executable file not found in $PATH: unknown.
$ docker run amazon/aws-sam-cli-build-image-python3.7 ls -ld /
drwx------ 1 root root 238 Oct  5 19:01 /

It seems that with the btrfs storage driver, overriding the mode of the root dir is not possible.

$ cat > Dockerfile <<EOF
FROM amazon/aws-sam-cli-build-image-python3.7
RUN chmod 755 /
EOF
$ docker run $(docker build -q .) ls -ld /
drwx------ 1 root root 238 Oct  5 19:08 /

This might be a bug in the btrfs storage driver. But it could be fixed by creating the aws-sam-cli-emulation-image-* images with a more sensible mode for the root.

EDIT: I don't have ZFS, so I can't verify that, but it seems to suffer from the same issue.

@alexrashed

Unfortunately, I can reproduce the issue with the steps described by @christophgysin using the ZFS storage driver.
This in turn means that the fix in #16804 might not have any effect, right?
But to be honest, I haven't fully tried to verify the bugfix from the PR, since building the PR branch is quite tedious.

@ryparker
Contributor

ryparker commented Oct 6, 2021

Thanks @christophgysin, I'm working on reproducing this on a fresh Linux machine. I'll hold the PR until we can confirm it fixes this.

@kellertk
Contributor

kellertk commented Oct 6, 2021

I was also able to reproduce this with the ZFS storage driver on Ubuntu 20.04:

root@host:/# docker run --rm -u 1000:1000 amazon/aws-sam-cli-build-image-python3.7 true
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: exec: "true": executable file not found in $PATH: unknown.
root@host:/# docker run amazon/aws-sam-cli-build-image-python3.7 ls -ld /
drwx------ 21 root root 24 Oct  6 16:58 /
root@host:~/test# cat > Dockerfile <<EOF
> FROM amazon/aws-sam-cli-build-image-python3.7
> RUN chmod 755 /
> EOF
root@host:~/test# docker run $(docker build -q .) ls -ld /
drwx------ 21 root root 24 Oct  6 17:02 /

Interestingly, switching to the overlay2 driver does allow execution as uid 1000 within the container. So this is some sort of interaction between non-root uids and the way that docker handles CoW storage layers. It looks like the image actually is built with sensible / permissions after all:

root@host:~/test# sed -i 's/zfs/overlay2/g' /var/snap/docker/current/config/daemon.json
root@host:~/test# snap restart docker
Restarted.
root@host:~/test# docker run --rm -u 1000:1000 amazon/aws-sam-cli-build-image-python3.7 true
root@host:~/test# echo $?
0
root@host:~/test# docker run amazon/aws-sam-cli-build-image-python3.7 ls -ld /
drwxr-xr-x 1 root root 6 Oct  6 17:18 /

As a test, I tried applying the changes in https://github.com/aws/aws-cdk/blob/master/packages/%40aws-cdk/aws-lambda-nodejs/lib/Dockerfile#L31 (/sbin/useradd doesn't exist in this container so I did it manually):

root@host:~/test# sed -i 's/overlay2/zfs/g' /var/snap/docker/current/config/daemon.json
root@host:~/test# snap restart docker
Restarted.
root@host:~/test# cat > Dockerfile <<EOF
FROM amazon/aws-sam-cli-build-image-python3.7
RUN echo "user:x:1000:1000:user:/:/bin/bash" >> /etc/passwd && echo "group:x:1000:" >> /etc/group && chmod 711 / 
EOF
root@host:~/test# docker run --rm -u 1000:1000 $(docker build -q .) true
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: exec: "true": executable file not found in $PATH: unknown.

This still fails, so I don't think just adding the user is going to fix this.

@christophgysin
Contributor

christophgysin commented Oct 6, 2021

Because the btrfs and zfs storage drivers do not allow changing the mode of the root directory in a layer, the PR won't fix this for those storage drivers. The only ways to fix this are to either fix the mode of the root in the initial layer, or change the btrfs and zfs (and possibly other) storage drivers to allow overriding the root mode in a layer.

@nija-at nija-at removed their assignment Oct 14, 2021
@ryparker ryparker added effort/medium Medium work item – several days of effort and removed effort/small Small work item – less than a day of effort labels Nov 9, 2021
@corymhall
Contributor

@ncaq @christophgysin @alexrashed lambda-python now supports customizing the docker image that is used to perform bundling. From the discussion on this issue it sounds like changing the image would solve the issue. Can you let me know if you are able to get it to work by providing your own image?

new lambda.PythonFunction(this, 'function', {
  entry,
  runtime: Runtime.PYTHON_3_8,
  bundling: {
    image: DockerImage.fromBuild('/path/to/dockerfile'),
    // or DockerImage.fromRegistry('...'),
  },
});
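
For reference, here is a self-contained sketch of wiring in a custom bundling image, under assumptions not stated in the thread: CDK v2 module paths, a hypothetical ./bundling-image directory containing its own Dockerfile, and a hypothetical handler entry directory. Note from the discussion above that on btrfs/ZFS a derived layer that only runs chmod 711 / is ignored, so the custom image's base layer itself has to have a traversable root directory.

import { App, Stack, DockerImage } from 'aws-cdk-lib';
import { Runtime } from 'aws-cdk-lib/aws-lambda';
import { PythonFunction } from '@aws-cdk/aws-lambda-python-alpha';

const app = new App();
const stack = new Stack(app, 'WorkaroundStack');

new PythonFunction(stack, 'function', {
  entry: 'handler',              // hypothetical directory containing index.py
  runtime: Runtime.PYTHON_3_8,
  bundling: {
    // ./bundling-image is a hypothetical docker build context whose resulting
    // image has a root directory traversable by non-root uids; per the thread,
    // a plain `RUN chmod 711 /` on top of the SAM build image is not enough on
    // btrfs/ZFS, so the base image must already have sane root permissions.
    image: DockerImage.fromBuild('./bundling-image'),
    // or: image: DockerImage.fromRegistry('...'),
  },
});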

@corymhall corymhall added p2 response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. and removed p1 labels Jan 27, 2022
@christophgysin
Contributor

@corymhall Without trying, I'm sure that this works, as the bug is in the docker image used by default.

While this provides a workaround, is there something preventing us from fixing the official docker images?

@corymhall
Contributor

corymhall commented Jan 27, 2022

@christophgysin the official docker images are maintained by the sam team (I think in this repo). I would recommend reaching out there to see if that is something that they would consider.

@christophgysin
Contributor

@corymhall As I commented previously, that repo does not seem to contain the root image though, see e.g. https://github.com/aws/aws-sam-build-images/blob/develop/build-image-src/Dockerfile-python39#L2

It seems the root image that contains the bug is not built from any publicly available source. Any chance you could find out what team at AWS is responsible for that image and get their attention to this issue?

@corymhall
Contributor

@christophgysin I've created an issue so we'll see what they will say.

@corymhall corymhall removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Jan 27, 2022
@corymhall corymhall added the blocked Work is blocked on this issue for this codebase. Other labels or comments may indicate why. label Mar 8, 2022
@Krzysztow

The same problem happens when building a PythonFunction on Fedora 36.
The error I was getting:

docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "bash": executable file not found in $PATH: unknown.                                                                    
/home/krzysztow/projects/monthly-receipt/infra/node_modules/aws-cdk-lib/core/lib/asset-staging.js:192
			throw fs.existsSync(bundleErrorDir) && fs.removeSync(bundleErrorDir), fs.renameSync(bundleDir, bundleErrorDir), new Error(`Failed to bundle asset ${this.node.path}, bundle output is located at ${bundleErrorDir}: ${err}`)
                                                                                                                   ^
Error: Failed to bundle asset BillFetcherStack/bill-sender/Code/Stage, bundle output is located at /home/krzysztow/projects/monthly-receipt/infra/cdk.out/asset.f3be281c94d2f053035043662b24e8aca59cc253cb06e73b7352d250d72c99f6-error: Error: docker exited with status 127
    at AssetStaging.bundle (/home/krzysztow/projects/monthly-receipt/infra/node_modules/aws-cdk-lib/core/lib/asset-staging.js:192:116)
    at AssetStaging.stageByBundling (/home/krzysztow/projects/monthly-receipt/infra/node_modules/aws-cdk-lib/core/lib/asset-staging.js:114:8)
    at stageThisAsset (/home/krzysztow/projects/monthly-receipt/infra/node_modules/aws-cdk-lib/core/lib/asset-staging.js:46:32)
    at Cache.obtain (/home/krzysztow/projects/monthly-receipt/infra/node_modules/aws-cdk-lib/core/lib/private/cache.js:1:242)
    at new AssetStaging (/home/krzysztow/projects/monthly-receipt/infra/node_modules/aws-cdk-lib/core/lib/asset-staging.js:61:42)
    at new Asset (/home/krzysztow/projects/monthly-receipt/infra/node_modules/aws-cdk-lib/aws-s3-assets/lib/asset.js:1:736)
    at AssetCode.bind (/home/krzysztow/projects/monthly-receipt/infra/node_modules/aws-cdk-lib/aws-lambda/lib/code.js:1:4628)
    at new Function (/home/krzysztow/projects/monthly-receipt/infra/node_modules/aws-cdk-lib/aws-lambda/lib/function.js:1:2803)
    at new PythonFunction (/home/krzysztow/projects/monthly-receipt/infra/node_modules/@aws-cdk/aws-lambda-python-alpha/lib/function.ts:73:5)
    at new BillFetcherStack (/home/krzysztow/projects/monthly-receipt/infra/lib/infra-stack.ts:85:32)

Changing the default storage-driver in /etc/docker/daemon.json to "storage-driver": "overlay2" fixes the issue.

@thiagobasilio-nanga

@hariseldon78

It seems that the above behavior where the root mode can't be changed is only when using the btrfs storage driver. As a workaround, I switched to overlay2.

$ cat /etc/docker/daemon.json 
{
  "storage-driver": "overlay2"
}

Thank you. It worked for me! Before, I was using the btrfs storage driver.

@madeline-k
Contributor

Closing this issue, as there is no action that can be taken in the AWS CDK. Internal ref: V783772357

@github-actions

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.
