
dcgmi version and dcgm-exporter version #319

Closed
nghtm opened this issue Apr 30, 2024 · 13 comments
Labels: question (Further information is requested)

Comments

@nghtm

nghtm commented Apr 30, 2024

Ask your question

Hi,

I am hoping to understand the relationship between the version reported by dcgmi -v and the version of dcgm-exporter that should be used.

I want to understand which version of dcgm-exporter I should specify for my Docker container. When I run the following, I see DCGM version 3.3.5:

ubuntu@ip-10-1-22-213:~$ dcgmi -v
Version : 3.3.5
Build ID : 14
Build Date : 2024-02-24
Build Type : Release
Commit ID : 93088b0e1286c6e7723af1930251298870e26c19
Branch Name : rel_dcgm_3_3
CPU Arch : x86_64
Build Platform : Linux 4.15.0-180-generic #189-Ubuntu SMP Wed May 18 14:13:57 UTC 2022 x86_64
CRC : 08a0d9624b562a1342bf5f8828939294

When I create my Docker container, what version should I specify?

    # Set DCGM Exporter version
    DCGM_EXPORTER_VERSION=3.3.5-3.4.0-ubuntu22.04

    # Run the DCGM Exporter Docker container
    sudo docker run -d --restart always \
       --gpus all \
       --net host \
       --cap-add SYS_ADMIN \
       -v /opt/dcgm-exporter/dcgm-golden-metrics.csv:/etc/dcgm-exporter/dcgm-golden-metrics.csv \
       nvcr.io/nvidia/k8s/dcgm-exporter:${DCGM_EXPORTER_VERSION} \
       -f /etc/dcgm-exporter/dcgm-golden-metrics.csv || { echo "Failed to run DCGM Exporter Docker container"; exit 1; }
nghtm added the question (Further information is requested) label on Apr 30, 2024
@glowkey
Collaborator

glowkey commented Apr 30, 2024

The first set of numbers in the DCGM-Exporter version corresponds to the DCGM library version used in the container and in testing (3.3.5 in your case). The second set of numbers (3.4.0) corresponds to the DCGM-Exporter version. However, DCGM follows semver compatibility guidelines, so any 3.x version should be compatible.
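
To make the convention concrete, here is a minimal shell sketch (the variable name is reused from the question above; the parsing is just for illustration):

    # Image tag layout: <DCGM library version>-<dcgm-exporter version>-<base image>
    DCGM_EXPORTER_VERSION=3.3.5-3.4.0-ubuntu22.04

    # DCGM library version embedded in the tag (the part before the first '-')
    echo "${DCGM_EXPORTER_VERSION%%-*}"              # -> 3.3.5

    # Host-side DCGM version, for comparison with the value above
    dcgmi -v | awk '/^Version/ {print $3}'           # -> 3.3.5 on this host

Per the semver note, these two values only need to share the same major version (3.x).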

@nghtm
Author

nghtm commented Apr 30, 2024

Thank you for the response, helpful info on versions. :-)

When I try running this container with DCGM_EXPORTER_VERSION=3.3.5-3.4.0-ubuntu22.04 and dcgmi -v = 3.3.5, it fails and causes nvidia-smi to throw errors on GPU 0. Prior to running the container, nvidia-smi showed all GPUs as healthy. I examined nvidia-bug-report and found the following message:

Apr 30 21:13:04 ip-10-1-5-148 dockerd[10261]: time="2024-04-30T21:13:04.829815111Z" level=error msg="restartmanger wait error: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'\nnvidia-container-cli: detection error: nvml error: unknown error: unknown"

For GPU 0, which shows ERR!, the NVSMI log shows:

==============NVSMI LOG==============

Timestamp                                 : Tue Apr 30 21:16:21 2024
Driver Version                            : 535.161.08
CUDA Version                              : 12.2

Attached GPUs                             : 8
GPU 00000000:00:16.0
    Product Name                          : NVIDIA A10G
    Product Brand                         : Unknown Error
    Product Architecture                  : Ampere
    Display Mode                          : N/A
    Display Active                        : N/A
    Persistence Mode                      : Enabled
    Addressing Mode                       : None
    MIG Mode
        Current                           : Unknown Error
        Pending                           : Unknown Error
    Accounting Mode                       : N/A
    Accounting Mode Buffer Size           : N/A
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1652222014738
    GPU UUID                              : Unknown Error
    Minor Number                          : 0
    VBIOS Version                         : Unknown Error
    MultiGPU Board                        : N/A
    Board ID                              : N/A
    Board Part Number                     : 900-2G133-A840-100
    GPU Part Number                       : 2237-892-A1
    FRU Part Number                       : N/A
    Module ID                             : Unknown Error
    Inforom Version
        Image Version                     : N/A
        OEM Object                        : N/A
        ECC Object                        : N/A
        Power Management Object           : N/A
    Inforom BBX Object Flush
        Latest Timestamp                  : N/A
        Latest Duration                   : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : 535.161.08
    GPU Virtualization Mode
        Virtualization Mode               : N/A
        Host VGPU Mode                    : N/A
    GPU Reset Status
        Reset Required                    : N/A
        Drain and Reset Recommended       : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x00
        Device                            : 0x16
        Domain                            : 0x0000
        Device Id                         : 0x223710DE
        Bus Id                            : 00000000:00:16.0
        Sub System Id                     : 0x152F10DE
        GPU Link Info
            PCIe Generation
                Max                       : N/A
                Current                   : N/A
                Device Current            : N/A
                Device Max                : N/A
                Host Max                  : N/A
            Link Width
                Max                       : N/A
                Current                   : N/A
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : Unknown Error
        Replay Number Rollovers           : Unknown Error
        Tx Throughput                     : Unknown Error
        Rx Throughput                     : Unknown Error
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : N/A
    Fan Speed                             : Unknown Error
    Performance State                     : Unknown Error
    Clocks Event Reasons                  : N/A
    Sparse Operation Mode                 : Unknown Error
    FB Memory Usage
        Total                             : 23028 MiB
        Reserved                          : 512 MiB
        Used                              : 0 MiB
        Free                              : 22515 MiB
    BAR1 Memory Usage
        Total                             : N/A
        Used                              : N/A
        Free                              : N/A
    Conf Compute Protected Memory Usage
        Total                             : 0 MiB
        Used                              : 0 MiB
        Free                              : 0 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : N/A
        Memory                            : N/A
        Encoder                           : N/A
        Decoder                           : N/A
        JPEG                              : N/A
        OFA                               : N/A
    Encoder Stats
        Active Sessions                   : N/A
        Average FPS                       : N/A
        Average Latency                   : N/A
    FBC Stats
        Active Sessions                   : N/A
        Average FPS                       : N/A
        Average Latency                   : N/A
    ECC Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable Parity     : N/A
            SRAM Uncorrectable SEC-DED    : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable Parity     : N/A
            SRAM Uncorrectable SEC-DED    : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
            SRAM Threshold Exceeded       : N/A
        Aggregate Uncorrectable SRAM Sources
            SRAM L2                       : N/A
            SRAM SM                       : N/A
            SRAM Microcontroller          : N/A
            SRAM PCIE                     : N/A
            SRAM Other                    : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : Unknown Error
    Temperature
        GPU Current Temp                  : Unknown Error
        GPU T.Limit Temp                  : Unknown Error
        GPU Shutdown T.Limit Temp         : Unknown Error
        GPU Slowdown T.Limit Temp         : Unknown Error
        GPU Max Operating T.Limit Temp    : 0 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : N/A
        Memory Max Operating T.Limit Temp : Unknown Error
    GPU Power Readings
        Power Draw                        : N/A
        Current Power Limit               : 670166.31 W
        Requested Power Limit             : 0.00 W
        Default Power Limit               : Unknown Error
        Min Power Limit                   : Unknown Error
        Max Power Limit                   : Unknown Error
    Module Power Readings
        Power Draw                        : Unknown Error
        Current Power Limit               : Unknown Error
        Requested Power Limit             : 0.00 W
        Default Power Limit               : Unknown Error
        Min Power Limit                   : Unknown Error
        Max Power Limit                   : Unknown Error
    Clocks
        Graphics                          : N/A
        SM                                : N/A
        Memory                            : N/A
        Video                             : N/A
    Applications Clocks
        Graphics                          : Unknown Error
        Memory                            : Unknown Error
    Default Applications Clocks
        Graphics                          : Unknown Error
        Memory                            : Unknown Error
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : N/A
        SM                                : N/A
        Memory                            : N/A
        Video                             : N/A
    Max Customer Boost Clocks
        Graphics                          : Unknown Error
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : Unknown Error
    Fabric
        State                             : N/A
        Status                            : N/A
    Processes                             : None

@nvvfedorov
Collaborator

You need to install and configure the NVIDIA Container Toolkit. It seems that it is not configured correctly, which is why you see the error:

Apr 30 21:13:04 ip-10-1-5-148 dockerd[10261]: time="2024-04-30T21:13:04.829815111Z" level=error msg="restartmanger wait error: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'\nnvidia-container-cli: detection error: nvml error: unknown error: unknown"
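
If the toolkit package is already installed, the standard configuration steps (a sketch based on the NVIDIA Container Toolkit docs; adjust for your setup) are:

    # Register the NVIDIA runtime with Docker and restart the daemon
    sudo nvidia-ctk runtime configure --runtime=docker
    sudo systemctl restart docker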

@nghtm
Author

nghtm commented Apr 30, 2024

Thanks for the response.

nvidia-container-toolkit is installed.

ubuntu@ip-10-1-5-148:/var/log$ dpkg -l | grep nvidia-container-toolkit
ii  nvidia-container-toolkit               1.15.0-1                              amd64        NVIDIA Container toolkit
ii  nvidia-container-toolkit-base          1.15.0-1                              amd64        NVIDIA Container Toolkit Base
ubuntu@ip-10-1-5-148:/var/log$ cat /etc/nvidia-container-runtime/config.toml
#accept-nvidia-visible-devices-as-volume-mounts = false
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"

[nvidia-container-cli]
#debug = "/var/log/nvidia-container-toolkit.log"
environment = []
#ldcache = "/etc/ld.so.cache"
ldconfig = "@/sbin/ldconfig.real"
load-kmods = true
#no-cgroups = false
#path = "/usr/bin/nvidia-container-cli"
#root = "/run/nvidia/driver"
#user = "root:video"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"
mode = "auto"
runtimes = ["docker-runc", "runc"]

[nvidia-container-runtime.modes]

Sounds like I will need to debug this further. I will report back if I determine a root cause.
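
One low-effort step I may try (just a sketch based on the commented-out defaults in the config above) is to turn on the toolkit's debug logs before re-running the container:

    # Uncomment the debug log paths in the runtime config (paths are the
    # commented-out defaults shown above), then re-run the failing container
    sudo sed -i 's|^#debug = |debug = |' /etc/nvidia-container-runtime/config.toml
    sudo tail -F /var/log/nvidia-container-toolkit.log /var/log/nvidia-container-runtime.log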

@nvvfedorov
Collaborator

@nghtm, try running the sample workload as suggested here: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/sample-workload.html#running-a-sample-workload-with-docker. This will tell us whether the NVIDIA runtime is configured correctly.
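
For reference, if I remember the linked page correctly, the sample workload boils down to:

    # Should print the nvidia-smi table from inside the container if the
    # NVIDIA runtime is wired up correctly
    sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi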

@nghtm
Author

nghtm commented Apr 30, 2024

We are installing nvidia-container-toolkit on the node via this script:

The docker configuration defaults to:

{
    "data-root": "/opt/dlami/nvme/docker/data-root"
}

But I can typically run NVIDIA commands via Docker with this configuration. For example, sudo docker run --rm --gpus all ubuntu nvidia-smi works.

However, when I try launching the dcgm-exporter container and tracking the docker logs, it fails after about 1 minute:

docker logs 92c05c0f81ba
time="2024-04-30T22:16:32Z" level=info msg="Starting dcgm-exporter"
time="2024-04-30T22:16:32Z" level=info msg="DCGM successfully initialized!"
time="2024-04-30T22:16:33Z" level=info msg="Collecting DCP Metrics"
time="2024-04-30T22:16:33Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcgm-golden-metrics.csv'"
time="2024-04-30T22:16:33Z" level=info msg="Initializing system entities of type: GPU"
time="2024-04-30T22:16:33Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-04-30T22:16:33Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2024-04-30T22:16:33Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-04-30T22:16:33Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-04-30T22:17:15Z" level=fatal msg="Failed to watch metrics: Error watching fields: The third-party Profiling module returned an unrecoverable error"

@nghtm
Author

nghtm commented Apr 30, 2024

Trying to go back to the base dcgm-exporter configuration, which uses /etc/dcgm-exporter/dcp-metrics-included.csv instead of the custom CSV file I have written, to see if that fixes the container.

    sudo docker run -d --rm \
       --gpus all \
       --net host \
       --cap-add SYS_ADMIN \
       nvcr.io/nvidia/k8s/dcgm-exporter:${DCGM_EXPORTER_VERSION}-ubuntu20.04 \
       -f /etc/dcgm-exporter/dcp-metrics-included.csv 

@nghtm
Author

nghtm commented Apr 30, 2024

For reference, this is the install script for dcgm-exporter that has been causing the container failures on g5.48xlarge (A10G GPUs):

https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils/install_dcgm_exporter.sh

@nghtm
Author

nghtm commented Apr 30, 2024

It seems to be working without issues on H100s, so perhaps some of the custom metrics are not available on A10Gs (just a hypothesis).

@nghtm
Author

nghtm commented Apr 30, 2024

Repeated the error trying to run the container on A10G GPUs, but it works on H100 GPUs.

On A10Gs, the docker logs show:

ubuntu@ip-10-1-5-148:~$ docker logs ca88122482d5
time="2024-04-30T23:14:28Z" level=info msg="Starting dcgm-exporter"
time="2024-04-30T23:14:28Z" level=info msg="DCGM successfully initialized!"
time="2024-04-30T23:14:29Z" level=info msg="Collecting DCP Metrics"
time="2024-04-30T23:14:29Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcp-metrics-included.csv'"
time="2024-04-30T23:14:29Z" level=info msg="Initializing system entities of type: GPU"
time="2024-04-30T23:14:29Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-04-30T23:14:29Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2024-04-30T23:14:29Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-04-30T23:14:29Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-04-30T23:15:06Z" level=fatal msg="Failed to watch metrics: Error watching fields: The third-party Profiling module returned an unrecoverable error"

On H100s, the docker logs show:

ubuntu@ip-10-1-22-213:~$ docker logs 01a9236f1495
time="2024-04-30T23:05:43Z" level=info msg="Starting dcgm-exporter"
time="2024-04-30T23:05:43Z" level=info msg="DCGM successfully initialized!"
time="2024-04-30T23:05:43Z" level=info msg="Collecting DCP Metrics"
time="2024-04-30T23:05:43Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcgm-golden-metrics.csv'"
time="2024-04-30T23:05:43Z" level=info msg="Initializing system entities of type: GPU"
time="2024-04-30T23:05:45Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-04-30T23:05:45Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2024-04-30T23:05:45Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-04-30T23:05:45Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-04-30T23:05:46Z" level=info msg="Pipeline starting"
time="2024-04-30T23:05:46Z" level=info msg="Starting webserver"
level=info ts=2024-04-30T23:05:46.033Z caller=tls_config.go:313 msg="Listening on" address=[::]:9400
level=info ts=2024-04-30T23:05:46.034Z caller=tls_config.go:316 msg="TLS is disabled." http2=false address=[::]:9400

@nghtm
Author

nghtm commented Apr 30, 2024

Reporting findings from today:

- H100 nodes (8x GPU): no issues; all versions of DCGM-Exporter appear to be working.
- A10G nodes (8x GPU): the older exporter version 2.1.4-2.3.1-ubuntu20.04 works, but all versions above 3.1.6-3.1.3-ubuntu20.04 are failing, and the docker logs show the following (see the diagnostic sketch below):

level=fatal msg="Failed to watch metrics: Error watching fields: The third-party Profiling module returned an unrecoverable error"
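
One way to dig into that error (a sketch, assuming dcgmi 3.x is installed on the host as shown earlier) is to list the profiling (DCP) metrics that the driver/GPU combination actually supports:

    # Lists the supported profiling metrics per GPU; if this fails or comes
    # back empty on the A10G nodes, the profiling module itself is the problem
    dcgmi profile --list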

nghtm closed this as completed on May 3, 2024
@nghtm
Author

nghtm commented May 3, 2024

Root cause determined: it is an issue with the open-source (OSS) version of NVIDIA driver 535.161.08 on the g5.48xlarge (8x A10G) instances together with DCGM-Exporter 3.3.5-3.4.0-ubuntu22.04.

We were able to run DCGM-Exporter by installing the proprietary driver 535.161.08 or by using exporter version 2.1.4-2.3.1-ubuntu20.04, but 3.3.5-3.4.0-ubuntu22.04 was failing consistently with the open-source driver on g5.48xlarge, as evidenced by GSP errors in dmesg.
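
A quick way to check for these symptoms on a node (a sketch; the "Open Kernel Module" wording is an assumption about the driver's version string):

    # Shows whether the open-source or proprietary kernel module is loaded
    cat /proc/driver/nvidia/version

    # GSP-related errors like the ones mentioned above show up in the kernel log
    sudo dmesg | grep -i gsp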

Similar to the issue reported here: awslabs/amazon-eks-ami#1523

Anyways, thanks for the help and quick responses

nghtm reopened this on May 3, 2024
@nvvfedorov
Collaborator

@nghtm Thank you for the update. I am closing the issue as solved.
