
Updating CPU quota causes NVML unknown error #138

Open
Tracked by #364
dvenza opened this issue Nov 2, 2017 · 10 comments
Labels
bug Issue/PR to expose/discuss/fix a bug

Comments

@dvenza

dvenza commented Nov 2, 2017

I'm testing nvidia-docker 2, starting containers through Zoe Analytics, which uses the Docker API over the network.
Zoe dynamically adjusts CPU quotas to redistribute spare capacity, but doing so makes nvidia-docker break down:

Start a container (the nvidia plugin is set as default in daemon.json):

$ docker run -d -e NVIDIA_VISIBLE_DEVICES=all -p 8888 gcr.io/tensorflow/tensorflow:1.3.0-gpu-py3

Test with nvidia-smi (it works):

$ docker exec -it 9e nvidia-smi
Thu Nov  2 08:03:25 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:04:00.0 Off |                    0 |
| N/A   26C    P0    31W / 250W |      0MiB / 16276MiB |      0%      Default |

[...]

Change the CPU quota:

$ docker update --cpu-quota 640000 9e

Test with nvidia-smi (it breaks):

$ docker exec -it 9e nvidia-smi
Failed to initialize NVML: Unknown Error
  • If I set the CPU quota when the container is created, it works.
  • I tried different values for the quota; it always breaks.
  • I could find no messages in the logs.
  • The same happens when updating the memory soft limit (--memory-reservation).
dvenza referenced this issue in DistributedSystemsGroup/zoe Nov 2, 2017
@3XX0
Member

3XX0 commented Nov 2, 2017

Good catch, it looks like Docker is resetting all of the cgroups when it only needs to update one (the CPU quota in this case).
Not sure how we can work around that though.
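
One way to confirm this is to compare the container's device cgroup before and after the update. A minimal sketch, assuming cgroup v1 with the cgroupfs driver (the path is different under the systemd driver or cgroup v2); the NVIDIA character devices use major number 195:

$ CID=$(docker inspect -f '{{.Id}}' 9e)
$ grep 195 /sys/fs/cgroup/devices/docker/$CID/devices.list   # NVIDIA entries present
$ docker update --cpu-quota 640000 9e
$ grep 195 /sys/fs/cgroup/devices/docker/$CID/devices.list   # entries gone: the whole device cgroup was rewritten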

@3XX0 3XX0 added the bug Issue/PR to expose/discuss/fix a bug label Nov 14, 2017
@mrjackbo

Has there been any progress on this? It seems I ran into the same problem while trying to set up the Kubernetes cpu-manager with the "static" policy.
(https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/)

@klueska
Contributor

klueska commented Apr 4, 2019

@3XX0 I think it is unlikely that this will ever be addressed upstream.

From docker's perspective, it owns and controls all of the cgroups/devices set up for the containers it launches. If something comes along (in this case, libnvidia-container) and changes those cgroup/device settings outside of docker, then docker should be free to resolve the discrepancy in order to keep its state in sync.

The long-term solution should probably involve making libnvidia-container "docker-aware" in some way, so that it can propagate the necessary state changes via:

https://docs.docker.com/engine/api/v1.25/#operation/ContainerUpdate

I know this goes against the current design (i.e. making libnvidia-container container runtime agnostic), but I don't see any other way around this.

For example, if you do a docker inspect on a functioning GPU-enabled container today, you will see that its device list is empty, even though it clearly has the nvidia devices injected into it and the cgroup access to those devices set up properly. However, once some external entity hits docker's ContainerUpdate API (whether directly via the CLI or through an API call such as the one the CPUManager makes in Kubernetes), docker re-applies this empty device list, essentially "undoing" what libnvidia-container had set up with regard to these devices.
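
For illustration, a minimal sketch of how to observe this from the CLI (the container name gpu-test is hypothetical):

$ docker inspect --format '{{json .HostConfig.Devices}}' gpu-test   # empty: docker is unaware of the injected devices
$ docker exec gpu-test nvidia-smi -L                                # GPUs are visible nonetheless
$ docker update --cpu-quota 640000 gpu-test                         # any ContainerUpdate call will do
$ docker exec gpu-test nvidia-smi -L                                # now fails with the NVML error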

@RenaudWasTaken How does the new --gpus flag for docker handle the fact that libnvidia-container is messing with cgroups/devices outside of docker's control?

@klueska
Contributor

klueska commented Apr 4, 2019

@mrjackbo if your setup is constrained such that GPUs will only ever be used by containers that have CPUsets assigned to them via the static allocation policy, then the following patch to Kubernetes will avoid having docker update its cgroups after these containers are initially launched.

diff --git a/pkg/kubelet/cm/cpumanager/cpu_manager.go b/pkg/kubelet/cm/cpumanager/cpu_manager.go
index 4ccddd5..ff3fbdf 100644
--- a/pkg/kubelet/cm/cpumanager/cpu_manager.go
+++ b/pkg/kubelet/cm/cpumanager/cpu_manager.go
@@ -242,7 +242,8 @@ func (m *manager) reconcileState() (success []reconciledContainer, failure []rec
                        // - policy does not want to track the container
                        // - kubelet has just been restarted - and there is no previous state file
                        // - container has been removed from state by RemoveContainer call (DeletionTimestamp is set)
-                       if _, ok := m.state.GetCPUSet(containerID); !ok {
+                       cset, ok := m.state.GetCPUSet(containerID)
+                       if !ok {
                                if status.Phase == v1.PodRunning && pod.DeletionTimestamp == nil {
                                        klog.V(4).Infof("[cpumanager] reconcileState: container is not present in state - trying to add (pod: %s, container: %s, container id: %s)", pod.Name, container.Name, containerID)
                                        err := m.AddContainer(pod, &container, containerID)
@@ -258,7 +259,13 @@ func (m *manager) reconcileState() (success []reconciledContainer, failure []rec
                                }
                        }

-                       cset := m.state.GetCPUSetOrDefault(containerID)
+                       if !cset.IsEmpty() && m.policy.Name() == string(PolicyStatic) {
+                               klog.V(4).Infof("[cpumanager] reconcileState: skipping container; assigned cpuset unchanged (pod: %s, container: %s, container id: %s, cpuset: \"%v\")", pod.Name, container.Name, containerID, cset)
+                               success = append(success, reconciledContainer{pod.Name, container.Name, containerID})
+                               continue
+                       }
+
+                       cset = m.state.GetDefaultCPUSet()
                        if cset.IsEmpty() {
                                // NOTE: This should not happen outside of tests.
                                klog.Infof("[cpumanager] reconcileState: skipping container; assigned cpuset is empty (pod: %s, container: %s)", pod.Name, container.Name)

@arlofaria

arlofaria commented Aug 22, 2019

Here's a workaround that might be helpful:
docker run --device /dev/nvidiactl:/dev/nvidiactl --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia0:/dev/nvidia0 ...
(Replace/repeat nvidia0 with other/more devices as needed.)

This seems to fix the problem with both --runtime=nvidia and the newer --gpus option.
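
On hosts with several GPUs it may be easier to generate the flags than to list them by hand. A rough sketch in plain shell, assuming the usual /dev/nvidia* device names:

DEVICE_ARGS=""
for d in /dev/nvidiactl /dev/nvidia-uvm /dev/nvidia-uvm-tools /dev/nvidia[0-9]*; do
  [ -e "$d" ] && DEVICE_ARGS="$DEVICE_ARGS --device $d:$d"   # skip devices that don't exist on this host
done
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all $DEVICE_ARGS ...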

@lucidyan

@elezar @klueska

The official guide on how to deal with this problem says:

You can use the following steps to confirm that your system is affected. After you implement one of the workarounds (mentioned in the next section), you can repeat the steps to confirm that the error is no longer reproducible.

For Docker environments
Run a test container:

$ docker run -d --rm --runtime=nvidia --gpus all \
    --device=/dev/nvidia-uvm \
    --device=/dev/nvidia-uvm-tools \
    --device=/dev/nvidia-modeset \
    --device=/dev/nvidiactl \
    --device=/dev/nvidia0 \
    nvcr.io/nvidia/cuda:12.0.0-base-ubuntu20.04 bash -c "while [ true ]; do nvidia-smi -L; sleep 5; done"

But logically, and in my experience, we should not use the --device flags, because they fix the problem instead of making the bug reproducible. Am I missing something?

@klueska
Contributor

klueska commented Feb 14, 2024

For the issue related to missing /dev/char/* devices, the bug would occur even if you added the --device nodes as above.

@lucidyan

For the issue related to missing /dev/char/* devices, the bug would occur even if you added the --device nodes as above.

It's interesting, because I was able to reproduce the NVML bug only without the --device arguments, as stated above. And this was also fixed by changing the cgroup driver from systemd to cgroupfs in the docker daemon config.
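
For reference, that switch is a daemon-wide setting. A minimal sketch of /etc/docker/daemon.json (merge it with your existing runtimes/default-runtime entries rather than replacing them, and restart dockerd afterwards):

$ cat /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "exec-opts": ["native.cgroupdriver=cgroupfs"]
}
$ sudo systemctl restart docker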

@Cocaynee90 Cocaynee90 mentioned this issue Feb 15, 2024
@klueska
Contributor

klueska commented Feb 15, 2024

Don't get me wrong -- it will definitely happen if you don't pass --device. But without the /dev/char symlinks it will also happen even if you do pass --device.
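
For completeness, the missing /dev/char/* symlinks can be created with the NVIDIA Container Toolkit CLI, assuming a toolkit version recent enough to ship nvidia-ctk (this is one of the workarounds described in the official guide referenced above):

$ sudo nvidia-ctk system create-dev-char-symlinks --create-all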

@zlianzhuang

I set cgroup-driver=cgroupfs on docker and k8s to fix my cluster.
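
For anyone following the same route: the kubelet's cgroup driver has to match docker's. A rough sketch, assuming a kubeadm-style /var/lib/kubelet/config.yaml with a cgroupDriver field in the KubeletConfiguration:

$ sudo sed -i 's/cgroupDriver: systemd/cgroupDriver: cgroupfs/' /var/lib/kubelet/config.yaml
$ sudo systemctl restart kubelet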
