dcgmi version and dcgm-exporter version #319
Comments
The first set of numbers in the DCGM-Exporter version corresponds to the DCGM library version used in the container and in testing (3.3.5 in your case). The second set of numbers (3.4.0) corresponds to the DCGM-Exporter version itself. However, DCGM follows semver compatibility guidelines, so any 3.x version should be compatible.
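As a concrete illustration of the tag scheme described above, here is a small shell sketch. The tag string is the one discussed in this thread; the split logic assumes the `<DCGM library>-<exporter>-<base image>` naming convention just described.

```shell
# Split a dcgm-exporter image tag into its parts, assuming the
# "<DCGM lib version>-<exporter version>-<base image>" convention described above.
tag="3.3.5-3.4.0-ubuntu22.04"

dcgm_version="${tag%%-*}"        # DCGM library version
rest="${tag#*-}"
exporter_version="${rest%%-*}"   # DCGM-Exporter version
base_image="${rest#*-}"          # base image

echo "DCGM=${dcgm_version} exporter=${exporter_version} base=${base_image}"
```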
Thank you for the response, helpful info on versions. :-) When I try running this container with
For GPU 0, which shows ERR!, the NVSMI log shows:
You need to install and configure the NVIDIA Container Toolkit. It seems that it is not configured correctly, and that is why you see the error:
Thanks for the response. nvidia-container-toolkit is installed.
Sounds like I will need to debug this further. I will report back if I determine a root cause.
@nghtm, try running the sample workload as suggested here: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/sample-workload.html#running-a-sample-workload-with-docker. This will tell us whether the NVIDIA runtime is configured correctly.
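A minimal sketch of that check, assuming Docker is installed; the `docker run` line is the sample-workload command from the linked page, and the surrounding guard is an assumption about how one might avoid running it when the nvidia runtime is not registered.

```shell
# Run the Container Toolkit sample workload only if Docker reports an nvidia runtime.
if command -v docker >/dev/null 2>&1 && docker info 2>/dev/null | grep -qi nvidia; then
  docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
  runtime_status="nvidia runtime detected"
else
  runtime_status="nvidia runtime not detected"
fi
echo "$runtime_status"
```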
We are installing nvidia-container-toolkit on the node via this script: The Docker configuration defaults to:
But I can typically run NVIDIA commands via Docker with this. For example: However, when I try launching the dcgmi container and tracking
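For completeness, a sketch of the documented way to (re)register the NVIDIA runtime with Docker, assuming nvidia-container-toolkit is already installed. These are the standard `nvidia-ctk` commands from the Container Toolkit docs, not the install script mentioned above.

```shell
# Register the NVIDIA runtime in /etc/docker/daemon.json and restart Docker.
configure_cmd="nvidia-ctk runtime configure --runtime=docker"
if command -v nvidia-ctk >/dev/null 2>&1; then
  sudo $configure_cmd
  sudo systemctl restart docker
else
  echo "nvidia-ctk not found; install nvidia-container-toolkit first" >&2
fi
```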
Trying to go back to the base dcgm-exporter container, which uses
For reference, this is the install script for dcgm-exporter, which has been causing the container failures on g5.48xlarge (A10 GPUs):
It seems to be working without issues on H100s, so perhaps some of the custom metrics are not available on A10s (just a hypothesis).
Repeated the error trying to run the container on A10 GPUs; it works on H100 GPUs. On A10s, the docker logs show:
On H100s, the docker logs show:
Reporting findings from today: on H100 nodes (8x GPU), no issue; all versions of DCGM exporter appear to be working. All versions above
Root cause determined: it is an issue with the OS version of NVIDIA driver 535.161.08 on the g5.48xlarge (8x A10) instances together with NVIDIA DCGM 3.3.5-3.4.0-ubuntu22.04. We were able to run DCGM-Exporter by installing the proprietary driver 535.161.08, or by using 2.1.4-2.3.1-ubuntu20.04, but 3.3.5-3.4.0-ubuntu22.04 was failing consistently with the OS driver on g5.48xlarge, shown by GSP errors in dmesg. Similar to this issue reporter: awslabs/amazon-eks-ami#1523. Anyway, thanks for the help and quick responses.
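A hedged sketch of pinning the tag that was reported working on the A10 nodes. The nvcr.io path is the official dcgm-exporter registry location; the choice of tag is taken from the findings above, and the echoed docker invocation is illustrative, not a command from this thread.

```shell
# Pin the exporter image tag reported working on g5.48xlarge (A10) nodes above.
WORKING_TAG="2.1.4-2.3.1-ubuntu20.04"
IMAGE="nvcr.io/nvidia/k8s/dcgm-exporter:${WORKING_TAG}"

# Illustrative launch command (9400 is dcgm-exporter's default metrics port).
echo "would run: docker run -d --gpus all -p 9400:9400 ${IMAGE}"
```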
@nghtm Thank you for the update. I am closing the issue as solved. |
Ask your question
Hi,
I am hoping to understand the difference between the dcgmi -v version and the version of dcgm-exporter which should be used. I want to understand what version of dcgm-exporter I should specify for my Docker container. When I run the following, I see dcgmi version = 3.3.5
When I create my Docker container, what version should I specify?