Updating CPU quota causes NVML unknown error #138
Workaround bug in nvidia-docker: https://github.com/NVIDIA/nvidia-docker/issues/515 See merge request !41
Good catch, it looks like Docker is resetting all the cgroups when it only needs to update one (CPU quota in this case).
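One way to see this is to compare the container's device whitelist before and after an update. A minimal sketch, assuming cgroup v1 with the cgroupfs driver and a container named gpu-test started with the nvidia runtime (both names are placeholders):

```sh
CID=$(docker inspect --format '{{.Id}}' gpu-test)

# The whitelist contains the NVIDIA character devices (e.g. major 195 for
# /dev/nvidiactl and /dev/nvidia0) injected by the NVIDIA prestart hook.
cat "/sys/fs/cgroup/devices/docker/$CID/devices.list"

# Update an unrelated limit...
docker update --cpu-quota 50000 gpu-test

# ...and the GPU entries are gone from the regenerated whitelist.
cat "/sys/fs/cgroup/devices/docker/$CID/devices.list"
```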
Has there been any progress on this? It seems I ran into the same problem while trying to set up the Kubernetes cpu-manager with "static" policy.
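For context, a minimal sketch of how the static policy is switched on for a kubelet (example values; older releases also needed the CPUManager feature gate, and the policy only activates when some CPU is explicitly reserved):

```sh
# Reserving CPU for system/kube daemons is required by the static policy.
kubelet --cpu-manager-policy=static \
        --kube-reserved=cpu=500m,memory=1Gi
```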
@3XX0 I think it is unlikely that this will ever be addressed upstream. From docker's perspective, they own and control all of the cgroups/devices set up for the containers they launch. If something comes along (in this case, the NVIDIA prestart hook) and modifies those cgroups out of band, docker knows nothing about it and will happily overwrite them the next time it applies an update. The long-term solution should probably involve making the GPU setup aware of container updates, i.e. hooking into https://docs.docker.com/engine/api/v1.25/#operation/ContainerUpdate so that device access can be reapplied whenever docker rewrites the cgroups. I know this goes against the current design (i.e. making the NVIDIA tooling act only at container creation time and stay out of the picture afterwards). For example, if you do a docker update that only touches the CPU quota, docker still rewrites the devices cgroup, the container loses access to the /dev/nvidia* nodes, and NVML starts returning an unknown error. @RenaudWasTaken How does the new device plugin handle this?
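For illustration, the update path in question is a single call to that ContainerUpdate endpoint. A minimal sketch (the API version and CpuQuota field come from the linked docs; the container ID and quota value are placeholders):

```sh
# POST /containers/{id}/update with only a new CpuQuota; docker nevertheless
# reapplies all of the container's resource cgroups, not just the CPU controller.
curl --unix-socket /var/run/docker.sock \
     -X POST -H 'Content-Type: application/json' \
     -d '{"CpuQuota": 50000}' \
     http://localhost/v1.25/containers/<container-id>/update
```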
@mrjackbo if your setup is constrained such that GPUs will only ever be used by containers that have CPUsets assigned to them via the static allocation policy, then the following patch to Kubernetes will avoid having docker update its cgroups after these containers are initially launched.
Here's a workaround that might be helpful. It seems to fix the problem in both cases.
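One mitigation in this spirit (a sketch, not necessarily the exact workaround meant above; device paths and image are examples) is to pass the GPU device nodes to docker explicitly, so docker itself tracks them and puts them back into the devices cgroup whenever it rewrites it:

```sh
# Because the devices are part of the container's HostConfig, docker
# re-whitelists them after a docker update instead of dropping them.
docker run --runtime=nvidia \
  --device /dev/nvidiactl \
  --device /dev/nvidia-uvm \
  --device /dev/nvidia0 \
  -e NVIDIA_VISIBLE_DEVICES=0 \
  nvidia/cuda:9.0-base nvidia-smi
```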
The official guide on how to deal with that problem recommends a workaround, but logically and in my experience we should not use it.
For the issue related to missing …
It's interesting because I was able to reproduce the error.
Don't get me wrong -- it will definitely happen if you don't pass …
I set cgroup-driver=cgroupfs on docker and k8s to fix my cluster.
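For concreteness, a sketch of that change (file locations and restart commands vary by distro; the daemon.json below also keeps the nvidia runtime as the default, so merge rather than overwrite if yours differs):

```sh
# Docker: use the cgroupfs cgroup driver (and keep nvidia as the default runtime).
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "exec-opts": ["native.cgroupdriver=cgroupfs"]
}
EOF
sudo systemctl restart docker

# Kubelet: make it use the same driver (the --cgroup-driver flag on older
# releases, the cgroupDriver field of KubeletConfiguration on newer ones).
sudo systemctl restart kubelet
```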
I'm testing nvidia-docker 2, starting containers through Zoe Analytics, which uses the Docker API over the network.
What Zoe does is dynamically set CPU quotas to redistribute spare capacity, but doing so makes nvidia-docker break down (a sketch of the full sequence follows the steps below):
Start a container (the nvidia plugin is set as default in daemon.json):
Test with nvidia-smi (it works):
Change the CPU quota:
Test with nvidia-smi (it breaks):
The same problem appears when updating other parameters (e.g. --memory-reservation).
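A minimal sketch of that sequence (container name, image, and quota value are placeholders; no --runtime flag is needed since nvidia is the default runtime in daemon.json):

```sh
# 1. Start a container.
docker run -d --name gpu-test nvidia/cuda:9.0-base sleep infinity

# 2. NVML works at this point.
docker exec gpu-test nvidia-smi

# 3. Change the CPU quota out of band.
docker update --cpu-quota 50000 gpu-test

# 4. The devices cgroup has been rewritten and the GPU is no longer reachable:
docker exec gpu-test nvidia-smi
#    Failed to initialize NVML: Unknown Error
```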