NVIDIA Docker failed to start container due to cgroup v2 #127146

Open
Abdillah opened this issue Jun 16, 2021 · 1 comment
Labels
0.kind: bug Something is broken

Comments

@Abdillah
Contributor

Describe the bug
NVIDIA Docker (virtualisation.docker.enableNvidia) cannot be used with the default NixOS options because cgroup v2 is not supported by libnvidia-container (the error, root cause). The container refuses to start because of this runtime error:

$ nvidia-docker run -it -p 3000:3000 mycroft/mimic2:gpu

docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #1:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: container error: cgroup subsystem devices not found: unknown.
ERRO[0003] error waiting for container: context canceled 
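
To confirm that the host really is on the unified (v2) hierarchy before trying a workaround, it may help to check which filesystem type is mounted at the cgroup root. The following is only a minimal sketch: it assumes cgroups are mounted at the conventional /sys/fs/cgroup path and that GNU `stat` is available, and the function name `cgroup_mode` is made up for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map the filesystem type of the cgroup mount to a label.
# cgroup2fs -> unified (v2) hierarchy; tmpfs -> v1 or hybrid layout.
cgroup_mode() {
  case "$1" in
    cgroup2fs) echo "v2" ;;
    tmpfs)     echo "v1-or-hybrid" ;;
    *)         echo "unknown" ;;
  esac
}

# Assumption: cgroups are mounted at the conventional /sys/fs/cgroup path.
fs_type="$(stat -fc %T /sys/fs/cgroup 2>/dev/null || echo none)"
echo "cgroup mode: $(cgroup_mode "$fs_type")"
```

If this reports v2, the "cgroup subsystem devices not found" error above is consistent with the missing cgroup v2 support described in this issue.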

There are two potential solutions, as described in NVIDIA/libnvidia-container#111 (comment):

  1. Switch systemd to hybrid cgroup mode when NVIDIA is enabled (systemd.enableUnifiedCgroupHierarchy = false;).
  2. Switch off cgroup handling in nvidia-container-runtime, per "Non-default nvidia-container-runtime-hook config file" NVIDIA/nvidia-container-runtime#47 (comment):

    I encountered this issue exactly because I'm running rootless docker with nvidia runtime using your usernetes. Everything works, except have to set no-cgroups = true in /etc/nvidia-container-runtime/config.toml
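
For the second option, the quoted comment boils down to a one-line change in config.toml. Below is a rough sketch of that edit, run against a scratch file rather than the real /etc/nvidia-container-runtime/config.toml. The helper name `set_no_cgroups` is hypothetical, the sketch assumes GNU `sed` and that the file already contains a (possibly commented-out) no-cgroups line, and on NixOS files under /etc are normally generated from the system configuration, so this illustrates the setting rather than a NixOS recipe.

```bash
#!/usr/bin/env bash
# Hypothetical helper: force `no-cgroups = true` in the given config file.
# Assumes the file already has a (possibly commented-out) no-cgroups line;
# if no such line exists, the file is left unchanged.
set_no_cgroups() {
  sed -E -i 's/^[[:space:]]*#?[[:space:]]*no-cgroups[[:space:]]*=.*/no-cgroups = true/' "$1"
}

# Demonstrate on a scratch file instead of the real /etc path.
demo="$(mktemp)"
printf '[nvidia-container-cli]\n#no-cgroups = false\n' > "$demo"
set_no_cgroups "$demo"
cat "$demo"
```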

To Reproduce
Steps to reproduce the behavior:

  1. Clone the mycroft/mimic2 repository and enter the directory. Any repository or Docker image with GPU requirements should do.
  2. Execute the build command docker build -t mycroft/mimic2:gpu -f gpu.Dockerfile .
  3. Execute the run command nvidia-docker run -it -p 3000:3000 mycroft/mimic2:gpu.

Expected behavior
Run happily ever after.

Metadata
Please run nix-shell -p nix-info --run "nix-info -m" and paste the result.

$ nix-shell -p nix-info --run "nix-info -m"
 - system: `"x86_64-linux"`
 - host os: `Linux 5.10.40, NixOS, 21.11pre293089.1c2986bbb80 (Porcupine)`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.4pre20210503_6d2553a`
 - channels(root): `"nixos-21.11pre293089.1c2986bbb80, nixos-hardware, nixos-unstable-21.05pre283367.0a5f5bab0e0"`
 - nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixos`

Maintainer information:

# a list of nixpkgs attributes affected by the problem
attribute:
- systemd.enableUnifiedCgroupHierarchy
- virtualisation.docker.enableNvidia

# a list of nixos modules affected by the problem
module:
- systemd
- nvidia-docker
@Abdillah
Contributor Author

Abdillah commented Feb 5, 2022

cgroup v2 is now supported by libnvidia-container (https://github.com/NVIDIA/libnvidia-container) as of the v1.8.0 release.
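
To decide whether a workaround is still needed, one could compare the installed libnvidia-container version against 1.8.0. This is only a sketch: the function name `supports_cgroup_v2` is invented here, it relies on GNU `sort -V`, and obtaining the installed version string is left to the reader.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if the given version is >= 1.8.0, the release
# named in this comment as adding cgroup v2 support.
supports_cgroup_v2() {
  local min="1.8.0"
  # sort -V orders version strings; if the minimum sorts first, $1 >= min.
  [ "$(printf '%s\n' "$min" "$1" | sort -V | head -n1)" = "$min" ]
}

# Example usage with a made-up version string.
if supports_cgroup_v2 "1.7.0"; then
  echo "cgroup v2 should work without the workaround"
else
  echo "older than 1.8.0: a workaround is still needed"
fi
```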

YodaEmbedding added a commit to YodaEmbedding/nixos that referenced this issue Mar 24, 2022
Took an entire day to find the linked comment [1] by @biggs, which says:

> Fix on NixOS (where cgroup v2 is also now default): add
> `systemd.enableUnifiedCgroupHierarchy = false;`
> and restart.

Indeed, after applying this commit and then running
`sudo systemctl restart docker`, any of the following commands works:

```bash
sudo docker run --gpus=all nvidia/cuda:10.0-runtime nvidia-smi
sudo docker run --runtime=nvidia nvidia/cuda:10.0-runtime nvidia-smi
sudo nvidia-docker run nvidia/cuda:10.0-runtime nvidia-smi
```

ARGH!!!1

Links:
[1] NVIDIA/nvidia-docker#1447 (comment)
[2] NixOS/nixpkgs#127146
[3] NixOS/nixpkgs#73800
[4] https://blog.zentria.company/posts/nixos-cgroupsv2/

P.S.
I use Colemak, but typing arstarstarst doesn't have the same ring to it.