
Updating CPU quota causes NVML unknown error #138

Open
Tracked by #364
dvenza opened this issue Nov 2, 2017 · 10 comments
Labels
bug Issue/PR to expose/discuss/fix a bug

Comments

@dvenza

dvenza commented Nov 2, 2017

I'm testing nvidia-docker 2, starting containers through Zoe Analytics, which uses the Docker API over the network.
Zoe dynamically adjusts CPU quotas to redistribute spare capacity, but doing so makes nvidia-docker break down:

Start a container (the nvidia plugin is set as default in daemon.json):

$ docker run -d -e NVIDIA_VISIBLE_DEVICES=all -p 8888 gcr.io/tensorflow/tensorflow:1.3.0-gpu-py3

Test with nvidia-smi (it works):

$ docker exec -it 9e nvidia-smi
Thu Nov  2 08:03:25 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:04:00.0 Off |                    0 |
| N/A   26C    P0    31W / 250W |      0MiB / 16276MiB |      0%      Default |

[...]

Change the CPU quota:

$ docker update --cpu-quota 640000 9e

Test with nvidia-smi (it breaks):

$ docker exec -it 9e nvidia-smi
Failed to initialize NVML: Unknown Error
  • If I set the CPU quota when the container is created, it works.
  • I tried different values for the quota; it always breaks.
  • I could find no messages in the logs.
  • The same happens when updating the memory soft limit (--memory-reservation).
dvenza referenced this issue in DistributedSystemsGroup/zoe Nov 2, 2017
@3XX0
Member

3XX0 commented Nov 2, 2017

Good catch, it looks like Docker is resetting all of the cgroups when it only needs to update one (the CPU quota in this case).
Not sure how we can work around that though.
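
One way to confirm this is to compare the container's device cgroup before and after the update. A minimal sketch, assuming cgroup v1 with the cgroupfs driver (the path is different under the systemd driver or cgroup v2); the NVIDIA character devices use major number 195:

$ CID=$(docker inspect -f '{{.Id}}' 9e)
$ grep 195 /sys/fs/cgroup/devices/docker/$CID/devices.list   # NVIDIA entries present
$ docker update --cpu-quota 640000 9e
$ grep 195 /sys/fs/cgroup/devices/docker/$CID/devices.list   # entries gone: the whole device cgroup was rewritten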

@3XX0 3XX0 added the bug Issue/PR to expose/discuss/fix a bug label Nov 14, 2017
@mrjackbo

Has there been any progress on this? It seems I ran into the same problem while trying to set up the Kubernetes cpu-manager with the "static" policy.
(https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/)

@klueska
Contributor

klueska commented Apr 4, 2019

@3XX0 I think it is unlikely that this will ever be addressed upstream.

From docker's perspective, it owns and controls all of the cgroups/devices set up for the containers it launches. If something comes along (in this case, libnvidia-container) and changes those cgroup/device settings outside of docker, then docker should be free to resolve the discrepancy in order to keep its state in sync.

The long-term solution should probably involve making libnvidia-container "docker-aware" in some way, so that it can propagate the necessary state changes via:

https://docs.docker.com/engine/api/v1.25/#operation/ContainerUpdate

I know this goes against the current design (i.e. making libnvidia-container container runtime agnostic), but I don't see any other way around this.

For example, if you do a docker inspect on a functioning GPU-enabled container today, you will see that its device list is empty, even though it clearly has the nvidia devices injected into it and the cgroup access to those devices set up properly. However, once some external entity hits docker's ContainerUpdate API (whether directly via the CLI or through an API call such as the one the CPUManager makes in Kubernetes), docker re-applies this empty device list, essentially "undoing" what libnvidia-container had set up with regard to these devices.
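
For illustration, a minimal sketch of how to observe this from the CLI (the container name gpu-test is hypothetical):

$ docker inspect --format '{{json .HostConfig.Devices}}' gpu-test   # empty: docker is unaware of the injected devices
$ docker exec gpu-test nvidia-smi -L                                # GPUs are visible nonetheless
$ docker update --cpu-quota 640000 gpu-test                         # any ContainerUpdate call will do
$ docker exec gpu-test nvidia-smi -L                                # now fails with the NVML error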

@RenaudWasTaken How does the new --gpus flag for docker handle the fact that libnvidia-container is messing with cgroups/devices outside of docker's control?

@klueska
Contributor

klueska commented Apr 4, 2019

@mrjackbo if your setup is constrained such that GPUs will only ever be used by containers that have CPUsets assigned to them via the static allocation policy, then the following patch to Kubernetes will avoid having docker update its cgroups after these containers are initially launched.

diff --git a/pkg/kubelet/cm/cpumanager/cpu_manager.go b/pkg/kubelet/cm/cpumanager/cpu_manager.go
index 4ccddd5..ff3fbdf 100644
--- a/pkg/kubelet/cm/cpumanager/cpu_manager.go
+++ b/pkg/kubelet/cm/cpumanager/cpu_manager.go
@@ -242,7 +242,8 @@ func (m *manager) reconcileState() (success []reconciledContainer, failure []rec
                        // - policy does not want to track the container
                        // - kubelet has just been restarted - and there is no previous state file
                        // - container has been removed from state by RemoveContainer call (DeletionTimestamp is set)
-                       if _, ok := m.state.GetCPUSet(containerID); !ok {
+                       cset, ok := m.state.GetCPUSet(containerID)
+                       if !ok {
                                if status.Phase == v1.PodRunning && pod.DeletionTimestamp == nil {
                                        klog.V(4).Infof("[cpumanager] reconcileState: container is not present in state - trying to add (pod: %s, container: %s, container id: %s)", pod.Name, container.Name, containerID)
                                        err := m.AddContainer(pod, &container, containerID)
@@ -258,7 +259,13 @@ func (m *manager) reconcileState() (success []reconciledContainer, failure []rec
                                }
                        }

-                       cset := m.state.GetCPUSetOrDefault(containerID)
+                       if !cset.IsEmpty() && m.policy.Name() == string(PolicyStatic) {
+                               klog.V(4).Infof("[cpumanager] reconcileState: skipping container; assigned cpuset unchanged (pod: %s, container: %s, container id: %s, cpuset: \"%v\")", pod.Name, container.Name, containerID, cset)
+                               success = append(success, reconciledContainer{pod.Name, container.Name, containerID})
+                               continue
+                       }
+
+                       cset = m.state.GetDefaultCPUSet()
                        if cset.IsEmpty() {
                                // NOTE: This should not happen outside of tests.
                                klog.Infof("[cpumanager] reconcileState: skipping container; assigned cpuset is empty (pod: %s, container: %s)", pod.Name, container.Name)

@arlofaria

arlofaria commented Aug 22, 2019

Here's a workaround that might be helpful:
docker run --device /dev/nvidiactl:/dev/nvidiactl --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia0:/dev/nvidia0 ...
(Replace/repeat nvidia0 with other/more devices as needed.)

This seems to fix the problem with both --runtime=nvidia and the newer --gpus option.
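
On hosts with several GPUs it may be easier to generate the flags than to list them by hand. A rough sketch in plain shell, assuming the usual /dev/nvidia* device names:

DEVICE_ARGS=""
for d in /dev/nvidiactl /dev/nvidia-uvm /dev/nvidia-uvm-tools /dev/nvidia[0-9]*; do
  [ -e "$d" ] && DEVICE_ARGS="$DEVICE_ARGS --device $d:$d"   # skip devices that don't exist on this host
done
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all $DEVICE_ARGS ...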

@lucidyan

@elezar @klueska

The official guide on how to deal with this problem says:

You can use the following steps to confirm that your system is affected. After you implement one of the workarounds (mentioned in the next section), you can repeat the steps to confirm that the error is no longer reproducible.

For Docker environments
Run a test container:

$ docker run -d --rm --runtime=nvidia --gpus all \
    --device=/dev/nvidia-uvm \
    --device=/dev/nvidia-uvm-tools \
    --device=/dev/nvidia-modeset \
    --device=/dev/nvidiactl \
    --device=/dev/nvidia0 \
    nvcr.io/nvidia/cuda:12.0.0-base-ubuntu20.04 bash -c "while [ true ]; do nvidia-smi -L; sleep 5; done"

But logically, and in my experience, we should not use the --device flags, because they fix the problem instead of making the bug reproducible. Am I missing something?

@klueska
Contributor

klueska commented Feb 14, 2024

For the issue related to missing /dev/char/* devices, the bug would occur even if you added the --device nodes as above.

@lucidyan

For the issue related to missing /dev/char/* devices, the bug would occur even if you added the --device nodes as above.

It's interesting, because I was able to reproduce the NVML bug only without the --device arguments, as stated above. And this was also fixed by changing the cgroup driver from systemd to cgroupfs in the docker daemon config.
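
For reference, that switch is a daemon-wide setting. A minimal sketch of /etc/docker/daemon.json (merge it with your existing runtimes/default-runtime entries rather than replacing them, and restart dockerd afterwards):

$ cat /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "exec-opts": ["native.cgroupdriver=cgroupfs"]
}
$ sudo systemctl restart docker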

@Cocaynee90 Cocaynee90 mentioned this issue Feb 15, 2024
@klueska
Contributor

klueska commented Feb 15, 2024

Don't get me wrong -- it will definitely happen if you don't pass --device. But without the /dev/char symlinks it will also happen even if you do pass --device.
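
For completeness, the missing /dev/char/* symlinks can be created with the NVIDIA Container Toolkit CLI, assuming a toolkit version recent enough to ship nvidia-ctk (this is one of the workarounds described in the official guide referenced above):

$ sudo nvidia-ctk system create-dev-char-symlinks --create-all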

@zlianzhuang

I set cgroup-driver=cgroupfs on docker and k8s to fix my cluster.
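
For anyone following the same route: the kubelet's cgroup driver has to match docker's. A rough sketch, assuming a kubeadm-style /var/lib/kubelet/config.yaml with a cgroupDriver field in the KubeletConfiguration:

$ sudo sed -i 's/cgroupDriver: systemd/cgroupDriver: cgroupfs/' /var/lib/kubelet/config.yaml
$ sudo systemctl restart kubelet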
