k8s-device-plugin fails with k8s static CPU policy #145
Comments
This is a known issue and has been reported before. Unfortunately, there is no upstream fix for this yet. The plan is to address it as part of the upcoming redesign for the device plugins: https://docs.google.com/document/d/1wPlJL8DsVpHnbVbTaad35ILB-jqoMLkGFLnQpWWNduc/edit
Thank you for the links. I read through the ticket and its further links to gain more context, including your PRs for Kubernetes upstream. This is a more nuanced issue than I expected. In our case we'd like the static policy but it's not required, so we'll watch as this develops. Glad to see the document about the redesign of the device plugins; I'd been wondering where that was heading while dealing with the plugin for RDMA as well in a fork.
I should have updated this issue back in April with this comment:
Is this fixed?
There was a flag added a while back called compatWithCPUManager. It should be explained in the README. Are you setting this when you run the plugin?
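For reference, a hedged sketch of how that flag is typically enabled. The Helm repo and release names below are assumptions, not something stated in this thread; on older deployments the same behaviour is exposed through the plugin's --pass-device-specs argument / PASS_DEVICE_SPECS environment variable.

$ helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
$ helm upgrade --install nvidia-device-plugin nvdp/nvidia-device-plugin \
    --namespace kube-system \
    --set compatWithCPUManager=true
# If you deploy the static daemonset instead, set the equivalent container argument
# (assumed mapping): args: ["--pass-device-specs"]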
Yes, we set compatWithCPUManager=true. It works without MIG enabled, but it does not work when we enable MIG, in either single or mixed mode. MIG works without the CPU manager policy.
I see. I think I can picture what the issue might be. Let me confirm it later today and I'll provide an update here. Thanks.
Yes, I can confirm that this is an issue. MIG support in the
Without going into too much detail, when the underlying driver switched its implementation for this, it broke
The fix should be fairly straightforward and will involve listing out the set of device nodes associated with the
I have added this to our list of tasks for
In the meantime, if you need this to work today, you can follow the advice in "Working with nvidia-capabilities".
That should get things working again until a fix comes out. It is not a long-term fix, however, as support for
Thanks for reporting!
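As an aside, a hedged way to see which MIG capability device nodes are involved on a recent (r450+) driver. These paths come from NVIDIA's nvidia-capabilities documentation and may differ on your system; treat this as a diagnostic sketch, not the workaround itself.

$ ls -l /dev/nvidia-caps/
$ cat /proc/driver/nvidia/capabilities/mig/config
$ cat /proc/driver/nvidia/capabilities/gpu0/mig/gi0/access
# Each capability file reports a DeviceFileMinor; the matching /dev/nvidia-caps/nvidia-cap<minor>
# node is what the plugin would need to pass along as a device spec once the fix lands.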
Thanks for your reply! It's good to have confirmation that this is an issue in the current version.
Please do -- and if it doesn't work, let me know (it should).
Another tricky workaround is to disable runc's modification of the cgroup device list when it sets the cgroup cpuset. See https://github.com/NVIDIA/nvidia-container-runtime/pull/55/files. Apply the patch to runc and it works; I'm using it in my system. I don't have a V100 GPU, so I'm not sure whether it works with the current version.
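If you want to try that route, a hedged sketch of applying the linked change to a runc checkout and swapping in the resulting binary. The patch file name, build tags, and install path are assumptions; the diff may need adapting to your runc version, and where the binary goes depends on your distro and container runtime configuration.

$ git clone https://github.com/opencontainers/runc && cd runc
$ # save the diff from the pull request above locally first, then apply it
$ git apply /tmp/skip-device-cgroup-update.patch
$ make BUILDTAGS='seccomp'
$ sudo install -m 0755 runc /usr/local/bin/runc   # point docker/containerd at this binary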
@xial-thu the nvidia-container-runtime moved away from a fork of runc.
Do the same thing to runc and it still works. After all, the origin of the issue is that kubelet's operation bypasses the nvidia-container-runtime.
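To see that mechanism in action, a hedged cgroup v1 diagnostic; the cgroup path below is a placeholder and will differ per node, QoS class, and container runtime.

$ CID=$(docker ps --no-trunc | grep <your-gpu-container> | awk '{print $1}')   # <your-gpu-container> is a placeholder
$ sudo cat /sys/fs/cgroup/devices/kubepods/pod<POD_UID>/$CID/devices.list
# Shortly after the container starts, the list contains entries for the NVIDIA character
# devices (major 195, e.g. "c 195:0 rw"). Once the kubelet CPU manager reconciles cpusets
# through runc, those entries are gone and nvidia-smi inside the container fails with
# "Failed to initialize NVML: Unknown Error".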
The PR to fix this is tested and ready to be merged: https://gitlab.com/nvidia/kubernetes/device-plugin/-/merge_requests/80
This has now been merged.
1. Issue or feature description
Kubelet configured with a static CPU policy (e.g. --cpu-manager-policy=static --kube-reserved cpu=0.1) will cause nvidia-smi to fail after a short delay.
Configure a test pod to request an nvidia.com/gpu resource, then run a simple nvidia-smi command such as "sleep 30; nvidia-smi"; this will always fail with:
"Failed to initialize NVML: Unknown Error"
Running the same command without the sleep works, and nvidia-smi returns the expected info.
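For concreteness, a minimal pod along those lines; this is an illustrative sketch, not the reporter's original YAML, and the pod name and image tag are assumptions.

$ cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-cpuset-repro
spec:
  restartPolicy: Never
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:10.1-base
    command: ["sh", "-c", "sleep 30; nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
$ kubectl logs -f gpu-cpuset-repro
# On a node with --cpu-manager-policy=static this ends with:
#   Failed to initialize NVML: Unknown Error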
2. Steps to reproduce the issue
Kubernetes 1.14
$ kubelet --version
Kubernetes v1.14.8
Device plugin: nvidia/k8s-device-plugin:1.11 (also tried with 1.0.0-beta4)
Apply the daemonset for the NVIDIA device plugin, then apply a pod YAML for a pod requesting one device:
Then follow the pod logs:
The pod persists in this state.
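When reproducing, a hedged way to confirm the node is really running the static CPU manager policy; the state-file path is the usual kubelet default and may differ on your distribution.

$ ps -ef | grep [k]ubelet | grep -o 'cpu-manager-policy=[a-z]*'
$ sudo cat /var/lib/kubelet/cpu_manager_state
# The state file should contain "policyName":"static" when the static policy is active.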
3. Information to attach (optional if deemed irrelevant)
Common error checking:
- The output of nvidia-smi -a on your host
- Docker configuration file (/etc/docker/daemon.json)
- Kubelet logs (sudo journalctl -r -u kubelet), repeated:
Additional information that might help better understand your environment and reproduce the bug:
- Docker version from docker version: Version: 18.09.1
- Docker command, image and tag used
- Kernel version from uname -a
- Relevant kernel output lines from dmesg
- NVIDIA packages from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
- NVIDIA container library version from nvidia-container-cli -V