Cuda Ubuntu Arm64 and PPC64LE Image builds failing on VM with Buildkit Only #1943

Closed
demizer opened this issue Jan 14, 2021 · 4 comments

demizer commented Jan 14, 2021

Hello!

This seems to be related to #1929, but my setup is quite different.

VM (Doesn't work)

Tried multiple kernel, qemu, docker, and moby/buildkit versions...

kernel: 5.4.89-1.el7.elrepo.x86_64, 5.8.13-1.el7.elrepo.x86_64
docker: 19.03.13, 20.10.2
qemu:   5.0.1, 5.1, 5.2
moby/buildkit: 0.7.1, 0.7.2, 0.8.0, 0.8.1, master, nightly

failing message:

#5 23.26 Error while loading /usr/sbin/dpkg-split: No such file or directory
#5 23.26 Error while loading /usr/sbin/dpkg-deb: No such file or directory

...

#5 23.37 E: Sub-process /usr/bin/dpkg returned an error code (1)
#5 ERROR: executor failed running [/dev/.buildkit_qemu_emulator /bin/sh -c apt-get update && apt-get install -y --no-install-recommends     gnupg2 curl ca-certificates]: exit code: 100
------
 > [2/2] RUN apt-get update && apt-get install -y --no-install-recommends     gnupg2 curl ca-certificates:
------
Dockerfile2:9
--------------------
   8 |
   9 | >>> RUN apt-get update && apt-get install -y --no-install-recommends \
  10 | >>>     gnupg2 curl ca-certificates
  11 |     # && \
--------------------
error: failed to solve: rpc error: code = Unknown desc = executor failed running [/dev/.buildkit_qemu_emulator /bin/sh -c apt-get update && apt-get install -y --no-install-recommends     gnupg2 curl ca-certificates]: exit code: 100

Docker info:

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Build with BuildKit (Docker Inc., v0.5.1-docker)

Server:
 Containers: 67
  Running: 1
  Paused: 0
  Stopped: 66
 Images: 27
 Server Version: 20.10.2
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runtime.v1.linux runc io.containerd.runc.v2
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 269548fa27e0089a8b8278fc4fc781d7f65a939b
 runc version: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 5.4.89-1.el7.elrepo.x86_64
 Operating System: CentOS Linux 7 (Core)
 OSType: linux
 Architecture: x86_64
 CPUs: 12
 Total Memory: 23.49GiB
 Name: <redacted>
 ID: <redacted>
 Docker Root Dir: /data/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: true
 Insecure Registries:
  127.0.0.0/8
 Registry Mirrors:
	<redacted>
 Live Restore Enabled: false

WARNING: No blkio weight support
WARNING: No blkio weight_device support
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled

Dockerfile

FROM ubuntu:20.04
RUN apt-get update && apt-get install -y --no-install-recommends gnupg2 curl ca-certificates

The buildx commands:

$ docker buildx create --use --platform linux/arm64 --driver-opt image=moby/buildkit:nightly --name cuda-arm64
$ docker buildx build --progress plain --load --pull --platform linux/arm64 -t cuda-test -f Dockerfile .

Builder:

$ docker buildx inspect
Name:   cuda-arm64
Driver: docker-container

Nodes:
Name:      cuda-arm640
Endpoint:  unix:///var/run/docker.sock
Status:    running
Flags:     --debug
Platforms: linux/arm64*, linux/amd64, linux/386

Building on bare metal with docker buildx works...

Bare metal machine config:

Host: Ubuntu 18.04
Kernel: 4.15.0-124-generic
Qemu: 2.11.1(Debian 1:2.11+dfsg-1ubuntu7.34)

Docker Info:

Client:
 Debug Mode: false
 Plugins:
  app: Docker Application (Docker Inc., v0.8.0)
  buildx: Build with BuildKit (Docker Inc., v0.4.2-tp-docker)

Server:
 Containers: 9
  Running: 4
  Paused: 0
  Stopped: 5
 Images: 11
 Server Version: 19.03.13
 Storage Driver: btrfs
  Build Version: Btrfs v4.15.1
  Library Version: 102
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: nvidia runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 8fba4e9a7d01810a393d5d25a3621dc101981175
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 4.15.0-124-generic
 Operating System: Ubuntu 18.04.5 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 12
 Total Memory: 31.33GiB
 Name: <redacted>
 ID: <redacted>
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Username: cudatest
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: true
 Insecure Registries:
  127.0.0.0/8
 Registry Mirrors:
  <redacted>
 Live Restore Enabled: false

WARNING: No swap limit support

Builder info:

$ docker buildx inspect
Name:   cuda-arm64
Driver: docker-container

Nodes:
Name:      cuda-arm640
Endpoint:  unix:///var/run/docker.sock
Status:    running
Platforms: linux/arm64*, linux/amd64, linux/ppc64le, linux/386

Buildx command:

$ docker buildx build --progress plain --load --pull --platform linux/arm64 -t cuda-test -f Dockerfile .

It worked:

$ docker inspect cuda-test
[
    {
        "Id": "sha256:8e8bc395285758ddd6f5584cbebba39f9c0cec598bc4ee98d72149ead2f7fcc7",
        "RepoTags": [
            "cuda-test:latest"
        ],
        "RepoDigests": [],
        "Parent": "",
        "Comment": "buildkit.dockerfile.v0",
        "Created": "2021-01-14T00:34:38.826299631Z",
        "Container": "",
        "ContainerConfig": {
            "Hostname": "",
            "Domainname": "",
            "User": "",
            "AttachStdin": false,
            "AttachStdout": false,
            "AttachStderr": false,
            "Tty": false,
            "OpenStdin": false,
            "StdinOnce": false,
            "Env": null,
            "Cmd": null,
            "Image": "",
            "Volumes": null,
            "WorkingDir": "",
            "Entrypoint": null,
            "OnBuild": null,
            "Labels": null
        },
        "DockerVersion": "",
        "Author": "",
        "Config": {
            "Hostname": "",
            "Domainname": "",
            "User": "",
            "AttachStdin": false,
            "AttachStdout": false,
            "AttachStderr": false,
            "Tty": false,
            "OpenStdin": false,
            "StdinOnce": false,
            "Env": [
                "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
            ],
            "Cmd": [
                "/bin/bash"
            ],
            "Image": "",
            "Volumes": null,
            "WorkingDir": "",
            "Entrypoint": null,
            "OnBuild": null,
            "Labels": null
        },
        "Architecture": "arm64",
        "Os": "linux",
        "Size": 112017149,
        "VirtualSize": 112017149,
        "GraphDriver": {
            "Data": null,
            "Name": "btrfs"
        },
        "RootFS": {
            "Type": "layers",
            "Layers": [
                "sha256:632c5d4dce7493801a746c1b8920afbcbbc965a0783e3119be028e8eb7485950",
                "sha256:a8c0e76b77c13f619c2d26086ff8e1c3b24b15a8cb24d301c470322e3fab61f1",
                "sha256:9475888a34750ab1808e94bec71d5955da17048c61709f571d5c6ad51ad22807",
                "sha256:1ae90461ccd477c699baa84201341b2ebc9169b02e9f26126c871c453c3ca0c9"
            ]
        },
        "Metadata": {
            "LastTagTime": "0001-01-01T00:00:00Z"
        }
    }
]

Before this everything was working great! This just started happening a few weeks ago (before winter break at Nvidia). I am not sure what caused it.

The only difference I can see between bare metal and the VM is that the bare-metal machine uses the qemu-user-static package from the Ubuntu repositories, which is really old.

Running images works just fine.

tonistiigi (Member) commented

Do you have emulators installed on that machine? To check, run docker run --privileged --rm tonistiigi/binfmt (from https://github.com/tonistiigi/binfmt).

demizer commented Jan 14, 2021

Sorry, forgot to post that.

From the machine that works:

{
  "supported": [
    "linux/amd64",
    "linux/arm64",
    "linux/ppc64le",
    "linux/386"
  ],
  "emulators": [
    "jar",
    "llvm-12-runtime.binfmt",
    "llvm-6.0-runtime.binfmt",
    "llvm-7-runtime.binfmt",
    "llvm-9-runtime.binfmt",
    "python2.7",
    "python3.6",
    "qemu-aarch64",
    "qemu-alpha",
    "qemu-arm",
    "qemu-armeb",
    "qemu-cris",
    "qemu-m68k",
    "qemu-microblaze",
    "qemu-mips",
    "qemu-mips64",
    "qemu-mips64el",
    "qemu-mipsel",
    "qemu-ppc",
    "qemu-ppc64",
    "qemu-ppc64abi32",
    "qemu-ppc64le",
    "qemu-s390x",
    "qemu-sh4",
    "qemu-sh4eb",
    "qemu-sparc",
    "qemu-sparc32plus",
    "qemu-sparc64"
  ]
}

From the VM that doesn't:

{
  "supported": [
    "linux/amd64",
    "linux/386"
  ],
  "emulators": [
    "qemu-aarch64",
    "qemu-ppc64le"
  ]
}

I am not sure why "linux/arm64" and "linux/ppc64le" are missing from the VM list. I am using systemd-binfmt to install the emulators.
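
The comparison between the two reports can be automated. The following is a minimal sketch, not part of the original thread: `missing_platforms` is a hypothetical helper name, and it assumes the report is pretty-printed one entry per line, as in the output quoted above.

```shell
#!/bin/sh
# Print each required platform that is absent from the "supported" list
# of a saved binfmt report.
# Usage: missing_platforms REPORT_FILE PLATFORM...
missing_platforms() {
    report=$1
    shift
    # Keep only the lines from "supported": [ up to the first closing bracket,
    # so names in the "emulators" list are never considered.
    supported=$(sed -n '/"supported"/,/]/p' "$report")
    for platform in "$@"; do
        # grep -F matches the quoted platform string literally.
        if ! printf '%s\n' "$supported" | grep -qF "\"$platform\""; then
            printf '%s\n' "$platform"
        fi
    done
}

# Example (needs Docker, so it is left commented out):
# docker run --privileged --rm tonistiigi/binfmt > report.json
# missing_platforms report.json linux/arm64 linux/ppc64le
```

Run against the VM report above, this would list both linux/arm64 and linux/ppc64le as missing, even though qemu-aarch64 and qemu-ppc64le appear under "emulators".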

tonistiigi (Member) commented

Distro packages often do not install emulator support in a way that works inside containers. Uninstall the package (or follow https://github.com/tonistiigi/binfmt#uninstalling-emulators), then install the emulators you need with https://github.com/tonistiigi/binfmt#installing-emulators
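
A minimal sketch of the second option, using the `--uninstall` and `--install` flags described in the tonistiigi/binfmt README. The `run` and `reinstall_emulators` helpers and the `RUN` variable are hypothetical names introduced here for illustration. By default the script only prints the commands; it executes them when `RUN=1` is set, which needs Docker and a privileged container.

```shell
#!/bin/sh
# Print the command by default; execute it only when RUN=1 is set.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Usage: reinstall_emulators ARCH_LIST   (comma-separated, e.g. arm64,ppc64le)
reinstall_emulators() {
    # 'qemu-*' is quoted so the local shell does not expand it as a glob;
    # the pattern is passed to binfmt, which removes the matching handlers.
    run docker run --privileged --rm tonistiigi/binfmt --uninstall 'qemu-*'
    # Register only the emulators that are actually needed.
    run docker run --privileged --rm tonistiigi/binfmt --install "$1"
}
```

Calling `reinstall_emulators arm64,ppc64le` covers the two platforms from this issue. If a distro package or systemd-binfmt configuration is still present on the host, it may register its own handlers again at boot, so removing that package remains the more durable route.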

demizer commented Jan 14, 2021

I was able to reproduce the issue on my working machine by installing qemu 5.2 from source. Using the privileged container fixed the problem. Thanks!
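
One way to inspect how a handler is registered is to read its entry under /proc/sys/fs/binfmt_misc. The kernel's binfmt_misc `F` (fix-binary) flag makes the kernel open the interpreter at registration time, so it remains usable from other mount namespaces such as containers. A missing `F` flag is a common reason host-installed handlers fail inside containers, but the thread does not confirm that this was the cause here. The sketch below uses a hypothetical helper name, `has_fix_binary_flag`.

```shell
#!/bin/sh
# Succeed if the given binfmt_misc entry file has the F (fix-binary) flag.
# Usage: has_fix_binary_flag ENTRY_FILE
has_fix_binary_flag() {
    # An entry contains a line such as "flags: OCF"; strip the label.
    flags=$(sed -n 's/^flags:[[:space:]]*//p' "$1")
    case $flags in
        *F*) return 0 ;;
        *)   return 1 ;;
    esac
}

# Example on a host with binfmt_misc mounted (left commented out):
# has_fix_binary_flag /proc/sys/fs/binfmt_misc/qemu-aarch64 \
#     && echo "qemu-aarch64 is registered with the F flag"
```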
