Error watching fields: The third-party Profiling module returned an unrecoverable error #148

Closed
AlanFokCo opened this issue Mar 27, 2023 · 6 comments

Comments

@AlanFokCo

When I run dcgm-exporter from the container image, it exits with this fatal error:
F0327 08:39:37.211158 120084 run.go:196] Error watching fields: The third-party Profiling module returned an unrecoverable error
The full Go stack trace is:
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0x1)
/go/src/github.com/NVIDIA/dcgm-exporter/vendor/k8s.io/klog/v2/klog.go:1026 +0x8a
k8s.io/klog/v2.(*loggingT).output(0x2361980, 0x3, {0x0, 0x0}, 0xc00026c070, 0x1, {0x1b375c4, 0xc000283000}, 0xc0005302a0, 0x0)
/go/src/github.com/NVIDIA/dcgm-exporter/vendor/k8s.io/klog/v2/klog.go:975 +0x63d
k8s.io/klog/v2.(*loggingT).printDepth(0x0, 0x0, {0x0, 0x0}, {0x0, 0x0}, 0x163cfb0, {0xc0005302a0, 0x1, 0x1})
/go/src/github.com/NVIDIA/dcgm-exporter/vendor/k8s.io/klog/v2/klog.go:735 +0x1ba
k8s.io/klog/v2.(*loggingT).print(...)
/go/src/github.com/NVIDIA/dcgm-exporter/vendor/k8s.io/klog/v2/klog.go:717
k8s.io/klog/v2.Fatal(...)
/go/src/github.com/NVIDIA/dcgm-exporter/vendor/k8s.io/klog/v2/klog.go:1494
main.RunCollector(0xc0004acd80, 0xc0004acd80)
/go/src/github.com/NVIDIA/dcgm-exporter/cmd/dcgm-exporter/run.go:196 +0xbc5
main.main.func1(0xc0004a8160)
/go/src/github.com/NVIDIA/dcgm-exporter/cmd/dcgm-exporter/main.go:178 +0x45
github.com/urfave/cli/v2.(*App).RunContext(0xc0002356c0, {0x180da48, 0xc0002aa000}, {0xc000292150, 0x3, 0x3})
/go/src/github.com/NVIDIA/dcgm-exporter/vendor/github.com/urfave/cli/v2/app.go:322 +0x7a8
github.com/urfave/cli/v2.(*App).Run(...)
/go/src/github.com/NVIDIA/dcgm-exporter/vendor/github.com/urfave/cli/v2/app.go:224
main.main()
/go/src/github.com/NVIDIA/dcgm-exporter/cmd/dcgm-exporter/main.go:181 +0x12d0

goroutine 35 [chan receive]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0x0)
/go/src/github.com/NVIDIA/dcgm-exporter/vendor/k8s.io/klog/v2/klog.go:1169 +0x6a
created by k8s.io/klog/v2.init.0
/go/src/github.com/NVIDIA/dcgm-exporter/vendor/k8s.io/klog/v2/klog.go:420 +0xfb

goroutine 164 [select]:
net/http.(*persistConn).writeLoop(0xc00031d560)
/usr/local/go/src/net/http/transport.go:2386 +0xfb
created by net/http.(*Transport).dialConn
/usr/local/go/src/net/http/transport.go:1748 +0x1e65

goroutine 163 [IO wait]:
internal/poll.runtime_pollWait(0x7f2819ed07c0, 0x72)
/usr/local/go/src/runtime/netpoll.go:234 +0x89
internal/poll.(*pollDesc).wait(0xc0002cec00, 0xc0005c4000, 0x0)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x32
internal/poll.(*pollDesc).waitRead(...)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc0002cec00, {0xc0005c4000, 0x3850, 0x3850})
/usr/local/go/src/internal/poll/fd_unix.go:167 +0x25a
net.(*netFD).Read(0xc0002cec00, {0xc0005c4000, 0xc000318905, 0x166})
/usr/local/go/src/net/fd_posix.go:56 +0x29
net.(*conn).Read(0xc000010008, {0xc0005c4000, 0x6, 0xc00055f7f8})
/usr/local/go/src/net/net.go:183 +0x45
crypto/tls.(*atLeastReader).Read(0xc000584138, {0xc0005c4000, 0x0, 0x41810d})
/usr/local/go/src/crypto/tls/conn.go:777 +0x3d
bytes.(*Buffer).ReadFrom(0xc0003433f8, {0x17e2040, 0xc000584138})
/usr/local/go/src/bytes/buffer.go:204 +0x98
crypto/tls.(*Conn).readFromUntil(0xc000343180, {0x17e3c40, 0xc000010008}, 0x7f281c097700)
/usr/local/go/src/crypto/tls/conn.go:799 +0xe5
crypto/tls.(*Conn).readRecordOrCCS(0xc000343180, 0x0)
/usr/local/go/src/crypto/tls/conn.go:606 +0x112
crypto/tls.(*Conn).readRecord(...)
/usr/local/go/src/crypto/tls/conn.go:574
crypto/tls.(*Conn).Read(0xc000343180, {0xc0004c6000, 0x1000, 0x1})
/usr/local/go/src/crypto/tls/conn.go:1277 +0x16f
net/http.(*persistConn).Read(0xc00031d560, {0xc0004c6000, 0xc0005a0000, 0xc00055fd30})
/usr/local/go/src/net/http/transport.go:1926 +0x4e
bufio.(*Reader).fill(0xc000213440)
/usr/local/go/src/bufio/bufio.go:101 +0x103
bufio.(*Reader).Peek(0xc000213440, 0x1)
/usr/local/go/src/bufio/bufio.go:139 +0x5d
net/http.(*persistConn).readLoop(0xc00031d560)
/usr/local/go/src/net/http/transport.go:2087 +0x1ac
created by net/http.(*Transport).dialConn
/usr/local/go/src/net/http/transport.go:1747 +0x1e05

goroutine 139 [IO wait]:
internal/poll.runtime_pollWait(0x7f2819ed06d8, 0x72)
/usr/local/go/src/runtime/netpoll.go:234 +0x89
internal/poll.(*pollDesc).wait(0xc000785580, 0xc0007d8000, 0x0)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x32
internal/poll.(*pollDesc).waitRead(...)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc000785580, {0xc0007d8000, 0x1000, 0x1000})
/usr/local/go/src/internal/poll/fd_unix.go:167 +0x25a
net.(*netFD).Read(0xc000785580, {0xc0007d8000, 0x4455c7, 0xc0001a7c30})
/usr/local/go/src/net/fd_posix.go:56 +0x29
net.(*conn).Read(0xc00058eb08, {0xc0007d8000, 0x18, 0xc0000001a0})
/usr/local/go/src/net/net.go:183 +0x45
net/http.(*persistConn).Read(0xc00049fe60, {0xc0007d8000, 0xc0002809c0, 0xc0001a7d30})
/usr/local/go/src/net/http/transport.go:1926 +0x4e
bufio.(*Reader).fill(0xc0001cb320)
/usr/local/go/src/bufio/bufio.go:101 +0x103
bufio.(*Reader).Peek(0xc0001cb320, 0x1)
/usr/local/go/src/bufio/bufio.go:139 +0x5d
net/http.(*persistConn).readLoop(0xc00049fe60)
/usr/local/go/src/net/http/transport.go:2087 +0x1ac
created by net/http.(*Transport).dialConn
/usr/local/go/src/net/http/transport.go:1747 +0x1e05

goroutine 140 [select]:
net/http.(*persistConn).writeLoop(0xc00049fe60)
/usr/local/go/src/net/http/transport.go:2386 +0xfb
created by net/http.(*Transport).dialConn
/usr/local/go/src/net/http/transport.go:1748 +0x1e65

goroutine 178 [select]:
google.golang.org/grpc.(*ccBalancerWrapper).watcher(0xc0001d50c0)
/go/src/github.com/NVIDIA/dcgm-exporter/vendor/google.golang.org/grpc/balancer_conn_wrappers.go:69 +0x95
created by google.golang.org/grpc.newCCBalancerWrapper
/go/src/github.com/NVIDIA/dcgm-exporter/vendor/google.golang.org/grpc/balancer_conn_wrappers.go:60 +0x1d5

goroutine 179 [chan receive]:
google.golang.org/grpc.(*addrConn).resetTransport(0xc000546000)
/go/src/github.com/NVIDIA/dcgm-exporter/vendor/google.golang.org/grpc/clientconn.go:1213 +0x48d
created by google.golang.org/grpc.(*addrConn).connect
/go/src/github.com/NVIDIA/dcgm-exporter/vendor/google.golang.org/grpc/clientconn.go:843 +0x147

goroutine 180 [IO wait]:
internal/poll.runtime_pollWait(0x7f2819ed05f0, 0x72)
/usr/local/go/src/runtime/netpoll.go:234 +0x89
internal/poll.(*pollDesc).wait(0xc0002cee00, 0xc000888000, 0x0)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x32
internal/poll.(*pollDesc).waitRead(...)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc0002cee00, {0xc000888000, 0x8000, 0x8000})
/usr/local/go/src/internal/poll/fd_unix.go:167 +0x25a
net.(*netFD).Read(0xc0002cee00, {0xc000888000, 0x1040100000000, 0x0})
/usr/local/go/src/net/fd_posix.go:56 +0x29
net.(*conn).Read(0xc0005324a8, {0xc000888000, 0xc000281258, 0x18})
/usr/local/go/src/net/net.go:183 +0x45
bufio.(*Reader).Read(0xc0004c17a0, {0xc0003642d8, 0x9, 0xc000052c00})
/usr/local/go/src/bufio/bufio.go:227 +0x1b4
io.ReadAtLeast({0x17e1ea0, 0xc0004c17a0}, {0xc0003642d8, 0x9, 0x9}, 0x9)
/usr/local/go/src/io/io.go:328 +0x9a
io.ReadFull(...)
/usr/local/go/src/io/io.go:347
golang.org/x/net/http2.readFrameHeader({0xc0003642d8, 0x9, 0xc000783a40}, {0x17e1ea0, 0xc0004c17a0})
/go/src/github.com/NVIDIA/dcgm-exporter/vendor/golang.org/x/net/http2/frame.go:237 +0x6e
golang.org/x/net/http2.(*Framer).ReadFrame(0xc0003642a0)
/go/src/github.com/NVIDIA/dcgm-exporter/vendor/golang.org/x/net/http2/frame.go:492 +0x95
google.golang.org/grpc/internal/transport.(*http2Client).reader(0xc000548000)
/go/src/github.com/NVIDIA/dcgm-exporter/vendor/google.golang.org/grpc/internal/transport/http2_client.go:1330 +0x233
created by google.golang.org/grpc/internal/transport.newHTTP2Client
/go/src/github.com/NVIDIA/dcgm-exporter/vendor/google.golang.org/grpc/internal/transport/http2_client.go:345 +0x176f

goroutine 181 [select]:
google.golang.org/grpc/internal/transport.(*controlBuffer).get(0xc0001cd9f0, 0x1)
/go/src/github.com/NVIDIA/dcgm-exporter/vendor/google.golang.org/grpc/internal/transport/controlbuf.go:395 +0x11b
google.golang.org/grpc/internal/transport.(*loopyWriter).run(0xc0004c1a40)
/go/src/github.com/NVIDIA/dcgm-exporter/vendor/google.golang.org/grpc/internal/transport/controlbuf.go:515 +0x85
google.golang.org/grpc/internal/transport.newHTTP2Client.func3()
/go/src/github.com/NVIDIA/dcgm-exporter/vendor/google.golang.org/grpc/internal/transport/http2_client.go:391 +0x65
created by google.golang.org/grpc/internal/transport.newHTTP2Client
/go/src/github.com/NVIDIA/dcgm-exporter/vendor/google.golang.org/grpc/internal/transport/http2_client.go:389 +0x1991

@AlanFokCo
Author

What can I do about this error?

@AlanFokCo
Author

My dcgm version is 2.3.6, and my exporter version is 2.6.6.

@avickars

avickars commented Jun 1, 2023

I am also experiencing this issue.

@quanguachong

I am experiencing the same issue. In addition, after dcgm-exporter fails, running nvidia-smi on the node shows ERR! for GPU 0.

dcgm-exporter log

$ k -n gpu-operator logs nvidia-dcgm-exporter-a45 -f
time="2023-08-01T02:47:12Z" level=info msg="Starting dcgm-exporter"
time="2023-08-01T02:47:13Z" level=info msg="DCGM successfully initialized!"
time="2023-08-01T02:47:13Z" level=info msg="Collecting DCP Metrics"
time="2023-08-01T02:47:13Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/dcp-metrics-included.csv"
time="2023-08-01T02:47:13Z" level=info msg="Initializing system entities of type: GPU"
time="2023-08-01T02:47:46Z" level=fatal msg="Failed to watch metrics: Error watching fields: The third-party Profiling module returned an unrecoverable error"
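For anyone triaging many exporter pods, the fatal line can be picked out of a log like the one above mechanically. The following is only a minimal sketch: it assumes the `key="value"` line layout shown in this log, and the helper names (`parse_line`, `first_fatal`, `seconds_between`) are made up for illustration and are not part of dcgm-exporter.

```python
import re
from datetime import datetime

# Matches key="quoted value" or key=bare_value pairs, as seen in the log above.
PAIR = re.compile(r'(\w+)=(?:"((?:[^"\\]|\\.)*)"|(\S+))')


def parse_line(line):
    """Return the key=value fields of one log line as a dict."""
    return {key: quoted or bare for key, quoted, bare in PAIR.findall(line)}


def first_fatal(lines):
    """Return the parsed fields of the first level=fatal line, or None."""
    for line in lines:
        fields = parse_line(line)
        if fields.get("level") == "fatal":
            return fields
    return None


def seconds_between(earlier, later):
    """Seconds between two timestamps in the format used by the log above."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    delta = datetime.strptime(later, fmt) - datetime.strptime(earlier, fmt)
    return delta.total_seconds()


# Two lines copied from the log above.
LOG = [
    'time="2023-08-01T02:47:13Z" level=info msg="Initializing system entities of type: GPU"',
    'time="2023-08-01T02:47:46Z" level=fatal msg="Failed to watch metrics: '
    'Error watching fields: The third-party Profiling module returned an unrecoverable error"',
]

fatal = first_fatal(LOG)
print(fatal["msg"])
# Time spent between entity initialization and the fatal error.
print(seconds_between(parse_line(LOG[0])["time"], fatal["time"]))  # 33.0
```

In this log the exporter spent 33 seconds between initializing GPU entities and giving up, which suggests the field watch blocked for a while before failing instead of being rejected immediately.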

nvidia-smi output on the node

$ nvidia-smi 
Mon Jul 31 12:48:18 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A40                     On  | 00000000:19:00.0 Off |                 ERR! |
|ERR!  ERR! ERR!              ERR! / ERR! |      4MiB / 46068MiB |    ERR!      Default |
|                                         |                      |                 ERR! |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A40                     On  | 00000000:1A:00.0 Off |                    0 |
|  0%   22C    P8              12W / 300W |      4MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A40                     On  | 00000000:1B:00.0 Off |                    0 |
|  0%   24C    P8              21W / 300W |      4MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A40                     On  | 00000000:1C:00.0 Off |                    0 |
|  0%   24C    P8              21W / 300W |      4MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A40                     On  | 00000000:B3:00.0 Off |                    0 |
|  0%   24C    P8              21W / 300W |      4MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A40                     On  | 00000000:B4:00.0 Off |                    0 |
|  0%   24C    P8              21W / 300W |      4MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A40                     On  | 00000000:B5:00.0 Off |                    0 |
|  0%   22C    P8              12W / 300W |      4MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A40                     On  | 00000000:B6:00.0 Off |                    0 |
|  0%   24C    P8              21W / 300W |      4MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
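On a node with many GPUs it can help to flag the affected device automatically. Here is a rough sketch that scans the human-readable table for `ERR!` cells. It assumes the table layout shown above, which can differ between driver versions, and the function name `gpus_with_errors` is hypothetical. For anything long-lived, the machine-readable `nvidia-smi --query-gpu=... --format=csv` mode is probably a sturdier input.

```python
import re

# First row of a GPU block: "|   0  NVIDIA A40   On  | ...".
# Assumes the layout printed above; other driver versions may differ.
GPU_ROW = re.compile(r"^\|\s+(\d+)\s+(.+?)\s+(\S+)\s+\|")


def gpus_with_errors(smi_output):
    """Return the sorted indices of GPUs whose table cells contain 'ERR!'."""
    bad = set()
    current = None
    for line in smi_output.splitlines():
        if "Processes:" in line:
            break  # the process table follows; stop before it
        match = GPU_ROW.match(line)
        if match:
            current = int(match.group(1))
        if current is not None and "ERR!" in line:
            bad.add(current)
    return sorted(bad)


# Rows for GPU 0 and GPU 1 copied from the output above.
SAMPLE = "\n".join([
    "|   0  NVIDIA A40                     On  | 00000000:19:00.0 Off |                 ERR! |",
    "|ERR!  ERR! ERR!              ERR! / ERR! |      4MiB / 46068MiB |    ERR!      Default |",
    "|                                         |                      |                 ERR! |",
    "+-----------------------------------------+----------------------+----------------------+",
    "|   1  NVIDIA A40                     On  | 00000000:1A:00.0 Off |                    0 |",
    "|  0%   22C    P8              12W / 300W |      4MiB / 46068MiB |      0%      Default |",
    "|                                         |                      |                  N/A |",
])

print(gpus_with_errors(SAMPLE))  # [0]
```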

@nikkon-dev
Collaborator

Hi @quanguachong,

I have a few questions to help me understand the issue better:

  • Can you confirm if this behavior is reproducible?
  • Is this issue only affecting one system or multiple systems?
  • Does the failure always occur with the A40 GPU, or have you encountered it with other GPUs as well?
  • Can you run DCGM on the system with debug logs (nv-hostengine -f host.debug.log --log-level debug) and then execute dcgmi dmon -e 1001? Please provide the host.debug.log file.
  • Could you also share the dmesg output from the affected system? Additionally, it would be helpful if you could provide the results of nvidia-bug-report.sh (if it's okay to send that data).

@quanguachong

quanguachong commented Aug 2, 2023

Thanks for the reply. The error I encountered is caused by a bug in the NVIDIA driver, not by dcgm-exporter. The driver bug is reported in NVIDIA/open-gpu-kernel-modules#446.

dmesg snippet from when dcgm-exporter prints the error

lenovo@a45:~$ dmesg | grep NVRM
[   14.266736] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  525.125.06  Tue May 30 05:11:37 UTC 2023
[ 7042.270398] NVRM: GPU at PCI:0000:19:00: GPU-1b159a19-1e87-b522-4668-7cc9c22fa49f
[ 7042.270402] NVRM: GPU Board Serial Number: 1323422077588
[ 7042.270407] NVRM: Xid (PCI:0000:19:00): 119, pid=192928, name=dcgm-exporter, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x90cc0301 0xc).
[ 7048.290460] NVRM: Xid (PCI:0000:19:00): 119, pid=192928, name=dcgm-exporter, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xc0000005 0x0).
[ 7054.294543] NVRM: Xid (PCI:0000:19:00): 119, pid=192928, name=dcgm-exporter, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xc0000002 0x0).
[ 7060.298606] NVRM: Rate limiting GSP RPC error prints for GPU at PCI:0000:19:00 (printing 1 of every 30).  The GPU likely needs to be reset.
[ 7216.816816] NVRM: Xid (PCI:0000:19:00): 119, pid=216541, name=nvc:[driver], Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800110 0x84).
[ 7397.359409] NVRM: Xid (PCI:0000:19:00): 119, pid=235851, name=nvc:[driver], Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0020 0x0).
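To check whether other nodes show the same pattern, the Xid lines can be extracted from `dmesg` output and counted per device and process. The sketch below assumes the exact line format shown in this snippet (newer drivers may print the fields differently), and the names `parse_xid_events` and `summarize` are illustrative only.

```python
import re
from collections import Counter

# Matches lines such as:
# [ 7042.270407] NVRM: Xid (PCI:0000:19:00): 119, pid=192928, name=dcgm-exporter, Timeout ...
XID = re.compile(
    r"\[\s*(?P<ts>[\d.]+)\]\s+NVRM: Xid \((?P<pci>PCI:[0-9A-Fa-f:.]+)\): "
    r"(?P<xid>\d+), pid=(?P<pid>[^,]+), name=(?P<name>.+?), (?P<msg>.*)"
)


def parse_xid_events(dmesg_text):
    """Return one dict per NVRM Xid line found in the dmesg text."""
    events = []
    for line in dmesg_text.splitlines():
        match = XID.search(line)
        if match:
            event = match.groupdict()
            event["ts"] = float(event["ts"])
            event["xid"] = int(event["xid"])
            events.append(event)
    return events


def summarize(events):
    """Count events per (PCI address, Xid code, process name)."""
    return Counter((e["pci"], e["xid"], e["name"]) for e in events)


# Lines copied from the dmesg snippet above.
SAMPLE = """\
[ 7042.270407] NVRM: Xid (PCI:0000:19:00): 119, pid=192928, name=dcgm-exporter, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x90cc0301 0xc).
[ 7048.290460] NVRM: Xid (PCI:0000:19:00): 119, pid=192928, name=dcgm-exporter, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xc0000005 0x0).
[ 7060.298606] NVRM: Rate limiting GSP RPC error prints for GPU at PCI:0000:19:00 (printing 1 of every 30).  The GPU likely needs to be reset.
[ 7216.816816] NVRM: Xid (PCI:0000:19:00): 119, pid=216541, name=nvc:[driver], Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800110 0x84).
"""

for key, count in summarize(parse_xid_events(SAMPLE)).items():
    print(key, count)
```

Every event here carries Xid 119 on PCI:0000:19:00, the same bus ID that nvidia-smi lists for the failing GPU 0, and the first timeouts are attributed to the dcgm-exporter process itself.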

My Solution

dcgm-exporter works fine after disabling the GSP feature (reference: NVIDIA/open-gpu-kernel-modules#446 (comment)).
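The comment above does not spell out how GSP was disabled. The workaround usually cited for the NVIDIA Linux driver is the kernel module parameter `NVreg_EnableGpuFirmware=0`, followed by a driver reload or reboot, but treat that as an assumption here and confirm it against the linked issue. One possible way to verify the setting afterwards is sketched below. It assumes the loaded driver exposes its parameters in `/proc/driver/nvidia/params` with an `EnableGpuFirmware` key, and the helper names are hypothetical.

```python
from pathlib import Path

# Assumed location of the loaded driver's parameters; verify on your system.
PARAMS_PATH = Path("/proc/driver/nvidia/params")


def parse_params(text):
    """Parse 'Key: value' lines into a dict of strings."""
    params = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            params[key.strip()] = value.strip()
    return params


def gsp_disabled(text):
    """True if EnableGpuFirmware is 0, False for any other value, None if absent."""
    value = parse_params(text).get("EnableGpuFirmware")
    if value is None:
        return None
    return value == "0"


if __name__ == "__main__":
    if PARAMS_PATH.exists():
        print("GSP disabled:", gsp_disabled(PARAMS_PATH.read_text()))
    else:
        print(f"{PARAMS_PATH} not found; is the NVIDIA driver loaded?")
```

A `None` result means the key was not found, in which case this check says nothing about GSP either way.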
