Error watching fields: The third-party Profiling module returned an unrecoverable error #148
Comments
What can I do about this error?

My DCGM version is 2.3.6, and my exporter version is 2.6.6.

I am experiencing this issue as well.

I am experiencing the same issue. In addition, after dcgm-exporter fails, running nvidia-smi on the node shows ERR! for GPU 0.
dcgm-exporter log:
$ k -n gpu-operator logs nvidia-dcgm-exporter-a45 -f
time="2023-08-01T02:47:12Z" level=info msg="Starting dcgm-exporter"
time="2023-08-01T02:47:13Z" level=info msg="DCGM successfully initialized!"
time="2023-08-01T02:47:13Z" level=info msg="Collecting DCP Metrics"
time="2023-08-01T02:47:13Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/dcp-metrics-included.csv"
time="2023-08-01T02:47:13Z" level=info msg="Initializing system entities of type: GPU"
time="2023-08-01T02:47:46Z" level=fatal msg="Failed to watch metrics: Error watching fields: The third-party Profiling module returned an unrecoverable error"
Running nvidia-smi on the node:
$ nvidia-smi
Mon Jul 31 12:48:18 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A40 On | 00000000:19:00.0 Off | ERR! |
|ERR! ERR! ERR! ERR! / ERR! | 4MiB / 46068MiB | ERR! Default |
| | | ERR! |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A40 On | 00000000:1A:00.0 Off | 0 |
| 0% 22C P8 12W / 300W | 4MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A40 On | 00000000:1B:00.0 Off | 0 |
| 0% 24C P8 21W / 300W | 4MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A40 On | 00000000:1C:00.0 Off | 0 |
| 0% 24C P8 21W / 300W | 4MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA A40 On | 00000000:B3:00.0 Off | 0 |
| 0% 24C P8 21W / 300W | 4MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA A40 On | 00000000:B4:00.0 Off | 0 |
| 0% 24C P8 21W / 300W | 4MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA A40 On | 00000000:B5:00.0 Off | 0 |
| 0% 22C P8 12W / 300W | 4MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA A40 On | 00000000:B6:00.0 Off | 0 |
| 0% 24C P8 21W / 300W | 4MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
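On a node with many GPUs, the failed device can be picked out of the nvidia-smi table above automatically. The following is a minimal Python sketch, not an NVIDIA tool: the function name `gpus_with_errors` and the row pattern are illustrative assumptions, and the pattern assumes the table layout shown in this thread with device names that start with "NVIDIA".

```python
import re

# Assumed layout: the first row of each GPU block looks like
# "| <index> NVIDIA <model> <On|Off> | <bus id> ... |" (as in the output above).
GPU_ROW = re.compile(r"^\|\s*(\d+)\s+(NVIDIA .+?)\s+(On|Off)\s*\|")


def gpus_with_errors(smi_text):
    """Return the indices of GPUs whose table rows contain 'ERR!' fields."""
    bad = []
    current = None
    for line in smi_text.splitlines():
        match = GPU_ROW.match(line)
        if match:
            current = int(match.group(1))  # start of a new GPU block
        elif line.startswith("+"):
            current = None  # a border line ends the current GPU block
        if current is not None and "ERR!" in line and current not in bad:
            bad.append(current)
    return bad


sample = """\
| 0 NVIDIA A40 On | 00000000:19:00.0 Off | ERR! |
|ERR! ERR! ERR! ERR! / ERR! | 4MiB / 46068MiB | ERR! Default |
| | | ERR! |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A40 On | 00000000:1A:00.0 Off | 0 |
| 0% 22C P8 12W / 300W | 4MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
"""
print(gpus_with_errors(sample))  # prints [0]
```

In practice the text would come from capturing the output of `nvidia-smi` on the node; the sample here reuses two GPU blocks from the output above.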
Hi @quanguachong, I have a few questions to help me understand the issue better:

Thanks for the reply. The error I encountered is caused by a bug in the NVIDIA driver, not by dcgm-exporter. The driver bug is reported in NVIDIA/open-gpu-kernel-modules#446.
dmesg snippet from when dcgm-exporter prints the error:
lenovo@a45:~$ dmesg | grep NVRM
[ 14.266736] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 525.125.06 Tue May 30 05:11:37 UTC 2023
[ 7042.270398] NVRM: GPU at PCI:0000:19:00: GPU-1b159a19-1e87-b522-4668-7cc9c22fa49f
[ 7042.270402] NVRM: GPU Board Serial Number: 1323422077588
[ 7042.270407] NVRM: Xid (PCI:0000:19:00): 119, pid=192928, name=dcgm-exporter, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x90cc0301 0xc).
[ 7048.290460] NVRM: Xid (PCI:0000:19:00): 119, pid=192928, name=dcgm-exporter, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xc0000005 0x0).
[ 7054.294543] NVRM: Xid (PCI:0000:19:00): 119, pid=192928, name=dcgm-exporter, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xc0000002 0x0).
[ 7060.298606] NVRM: Rate limiting GSP RPC error prints for GPU at PCI:0000:19:00 (printing 1 of every 30). The GPU likely needs to be reset.
[ 7216.816816] NVRM: Xid (PCI:0000:19:00): 119, pid=216541, name=nvc:[driver], Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800110 0x84).
[ 7397.359409] NVRM: Xid (PCI:0000:19:00): 119, pid=235851, name=nvc:[driver], Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0020 0x0).
My solution: dcgm-exporter works fine after disabling the GSP feature (reference: NVIDIA/open-gpu-kernel-modules#446 (comment)).
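To check whether another node shows the same symptom before trying the GSP workaround, the Xid lines in the dmesg excerpt above can be scanned for GSP RPC timeouts. This is a rough Python sketch under stated assumptions: the regular expression is derived only from the line format quoted in this thread, and `find_gsp_timeouts` is a hypothetical helper name, not part of the driver or DCGM.

```python
import re

# Pattern derived from the NVRM Xid lines quoted in this thread (an assumption,
# not an official format specification).
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+), "
    r"pid=(?P<pid>\d+), name=(?P<name>[^,]+), (?P<msg>.*)"
)


def find_gsp_timeouts(dmesg_text):
    """Return (pci, pid, process name) for each Xid 119 GSP RPC timeout line."""
    hits = []
    for line in dmesg_text.splitlines():
        match = XID_RE.search(line)
        if not match:
            continue
        is_gsp_timeout = "Timeout waiting for RPC from GSP" in match.group("msg")
        if match.group("xid") == "119" and is_gsp_timeout:
            hits.append((match.group("pci"), int(match.group("pid")), match.group("name")))
    return hits


sample = """\
[ 14.266736] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 525.125.06 Tue May 30 05:11:37 UTC 2023
[ 7042.270407] NVRM: Xid (PCI:0000:19:00): 119, pid=192928, name=dcgm-exporter, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x90cc0301 0xc).
[ 7216.816816] NVRM: Xid (PCI:0000:19:00): 119, pid=216541, name=nvc:[driver], Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800110 0x84).
"""
for pci, pid, name in find_gsp_timeouts(sample):
    print(pci, pid, name)
```

If the scan reports the same PCI address repeatedly, as it does for 0000:19:00 here, that matches the GPU that nvidia-smi later shows in an error state.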
When I run dcgm-exporter from the container image, it fails with the following error:
F0327 08:39:37.211158 120084 run.go:196] Error watching fields: The third-party Profiling module returned an unrecoverable error
The detailed Go stack trace is:
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0x1)
/go/src/github.com/NVIDIA/dcgm-exporter/vendor/k8s.io/klog/v2/klog.go:1026 +0x8a
k8s.io/klog/v2.(*loggingT).output(0x2361980, 0x3, {0x0, 0x0}, 0xc00026c070, 0x1, {0x1b375c4, 0xc000283000}, 0xc0005302a0, 0x0)
/go/src/github.com/NVIDIA/dcgm-exporter/vendor/k8s.io/klog/v2/klog.go:975 +0x63d
k8s.io/klog/v2.(*loggingT).printDepth(0x0, 0x0, {0x0, 0x0}, {0x0, 0x0}, 0x163cfb0, {0xc0005302a0, 0x1, 0x1})
/go/src/github.com/NVIDIA/dcgm-exporter/vendor/k8s.io/klog/v2/klog.go:735 +0x1ba
k8s.io/klog/v2.(*loggingT).print(...)
/go/src/github.com/NVIDIA/dcgm-exporter/vendor/k8s.io/klog/v2/klog.go:717
k8s.io/klog/v2.Fatal(...)
/go/src/github.com/NVIDIA/dcgm-exporter/vendor/k8s.io/klog/v2/klog.go:1494
main.RunCollector(0xc0004acd80, 0xc0004acd80)
/go/src/github.com/NVIDIA/dcgm-exporter/cmd/dcgm-exporter/run.go:196 +0xbc5
main.main.func1(0xc0004a8160)
/go/src/github.com/NVIDIA/dcgm-exporter/cmd/dcgm-exporter/main.go:178 +0x45
github.com/urfave/cli/v2.(*App).RunContext(0xc0002356c0, {0x180da48, 0xc0002aa000}, {0xc000292150, 0x3, 0x3})
/go/src/github.com/NVIDIA/dcgm-exporter/vendor/github.com/urfave/cli/v2/app.go:322 +0x7a8
github.com/urfave/cli/v2.(*App).Run(...)
/go/src/github.com/NVIDIA/dcgm-exporter/vendor/github.com/urfave/cli/v2/app.go:224
main.main()
/go/src/github.com/NVIDIA/dcgm-exporter/cmd/dcgm-exporter/main.go:181 +0x12d0
goroutine 35 [chan receive]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0x0)
/go/src/github.com/NVIDIA/dcgm-exporter/vendor/k8s.io/klog/v2/klog.go:1169 +0x6a
created by k8s.io/klog/v2.init.0
/go/src/github.com/NVIDIA/dcgm-exporter/vendor/k8s.io/klog/v2/klog.go:420 +0xfb
goroutine 164 [select]:
net/http.(*persistConn).writeLoop(0xc00031d560)
/usr/local/go/src/net/http/transport.go:2386 +0xfb
created by net/http.(*Transport).dialConn
/usr/local/go/src/net/http/transport.go:1748 +0x1e65
goroutine 163 [IO wait]:
internal/poll.runtime_pollWait(0x7f2819ed07c0, 0x72)
/usr/local/go/src/runtime/netpoll.go:234 +0x89
internal/poll.(*pollDesc).wait(0xc0002cec00, 0xc0005c4000, 0x0)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x32
internal/poll.(*pollDesc).waitRead(...)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc0002cec00, {0xc0005c4000, 0x3850, 0x3850})
/usr/local/go/src/internal/poll/fd_unix.go:167 +0x25a
net.(*netFD).Read(0xc0002cec00, {0xc0005c4000, 0xc000318905, 0x166})
/usr/local/go/src/net/fd_posix.go:56 +0x29
net.(*conn).Read(0xc000010008, {0xc0005c4000, 0x6, 0xc00055f7f8})
/usr/local/go/src/net/net.go:183 +0x45
crypto/tls.(*atLeastReader).Read(0xc000584138, {0xc0005c4000, 0x0, 0x41810d})
/usr/local/go/src/crypto/tls/conn.go:777 +0x3d
bytes.(*Buffer).ReadFrom(0xc0003433f8, {0x17e2040, 0xc000584138})
/usr/local/go/src/bytes/buffer.go:204 +0x98
crypto/tls.(*Conn).readFromUntil(0xc000343180, {0x17e3c40, 0xc000010008}, 0x7f281c097700)
/usr/local/go/src/crypto/tls/conn.go:799 +0xe5
crypto/tls.(*Conn).readRecordOrCCS(0xc000343180, 0x0)
/usr/local/go/src/crypto/tls/conn.go:606 +0x112
crypto/tls.(*Conn).readRecord(...)
/usr/local/go/src/crypto/tls/conn.go:574
crypto/tls.(*Conn).Read(0xc000343180, {0xc0004c6000, 0x1000, 0x1})
/usr/local/go/src/crypto/tls/conn.go:1277 +0x16f
net/http.(*persistConn).Read(0xc00031d560, {0xc0004c6000, 0xc0005a0000, 0xc00055fd30})
/usr/local/go/src/net/http/transport.go:1926 +0x4e
bufio.(*Reader).fill(0xc000213440)
/usr/local/go/src/bufio/bufio.go:101 +0x103
bufio.(*Reader).Peek(0xc000213440, 0x1)
/usr/local/go/src/bufio/bufio.go:139 +0x5d
net/http.(*persistConn).readLoop(0xc00031d560)
/usr/local/go/src/net/http/transport.go:2087 +0x1ac
created by net/http.(*Transport).dialConn
/usr/local/go/src/net/http/transport.go:1747 +0x1e05
goroutine 139 [IO wait]:
internal/poll.runtime_pollWait(0x7f2819ed06d8, 0x72)
/usr/local/go/src/runtime/netpoll.go:234 +0x89
internal/poll.(*pollDesc).wait(0xc000785580, 0xc0007d8000, 0x0)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x32
internal/poll.(*pollDesc).waitRead(...)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc000785580, {0xc0007d8000, 0x1000, 0x1000})
/usr/local/go/src/internal/poll/fd_unix.go:167 +0x25a
net.(*netFD).Read(0xc000785580, {0xc0007d8000, 0x4455c7, 0xc0001a7c30})
/usr/local/go/src/net/fd_posix.go:56 +0x29
net.(*conn).Read(0xc00058eb08, {0xc0007d8000, 0x18, 0xc0000001a0})
/usr/local/go/src/net/net.go:183 +0x45
net/http.(*persistConn).Read(0xc00049fe60, {0xc0007d8000, 0xc0002809c0, 0xc0001a7d30})
/usr/local/go/src/net/http/transport.go:1926 +0x4e
bufio.(*Reader).fill(0xc0001cb320)
/usr/local/go/src/bufio/bufio.go:101 +0x103
bufio.(*Reader).Peek(0xc0001cb320, 0x1)
/usr/local/go/src/bufio/bufio.go:139 +0x5d
net/http.(*persistConn).readLoop(0xc00049fe60)
/usr/local/go/src/net/http/transport.go:2087 +0x1ac
created by net/http.(*Transport).dialConn
/usr/local/go/src/net/http/transport.go:1747 +0x1e05
goroutine 140 [select]:
net/http.(*persistConn).writeLoop(0xc00049fe60)
/usr/local/go/src/net/http/transport.go:2386 +0xfb
created by net/http.(*Transport).dialConn
/usr/local/go/src/net/http/transport.go:1748 +0x1e65
goroutine 178 [select]:
google.golang.org/grpc.(*ccBalancerWrapper).watcher(0xc0001d50c0)
/go/src/github.com/NVIDIA/dcgm-exporter/vendor/google.golang.org/grpc/balancer_conn_wrappers.go:69 +0x95
created by google.golang.org/grpc.newCCBalancerWrapper
/go/src/github.com/NVIDIA/dcgm-exporter/vendor/google.golang.org/grpc/balancer_conn_wrappers.go:60 +0x1d5
goroutine 179 [chan receive]:
google.golang.org/grpc.(*addrConn).resetTransport(0xc000546000)
/go/src/github.com/NVIDIA/dcgm-exporter/vendor/google.golang.org/grpc/clientconn.go:1213 +0x48d
created by google.golang.org/grpc.(*addrConn).connect
/go/src/github.com/NVIDIA/dcgm-exporter/vendor/google.golang.org/grpc/clientconn.go:843 +0x147
goroutine 180 [IO wait]:
internal/poll.runtime_pollWait(0x7f2819ed05f0, 0x72)
/usr/local/go/src/runtime/netpoll.go:234 +0x89
internal/poll.(*pollDesc).wait(0xc0002cee00, 0xc000888000, 0x0)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x32
internal/poll.(*pollDesc).waitRead(...)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc0002cee00, {0xc000888000, 0x8000, 0x8000})
/usr/local/go/src/internal/poll/fd_unix.go:167 +0x25a
net.(*netFD).Read(0xc0002cee00, {0xc000888000, 0x1040100000000, 0x0})
/usr/local/go/src/net/fd_posix.go:56 +0x29
net.(*conn).Read(0xc0005324a8, {0xc000888000, 0xc000281258, 0x18})
/usr/local/go/src/net/net.go:183 +0x45
bufio.(*Reader).Read(0xc0004c17a0, {0xc0003642d8, 0x9, 0xc000052c00})
/usr/local/go/src/bufio/bufio.go:227 +0x1b4
io.ReadAtLeast({0x17e1ea0, 0xc0004c17a0}, {0xc0003642d8, 0x9, 0x9}, 0x9)
/usr/local/go/src/io/io.go:328 +0x9a
io.ReadFull(...)
/usr/local/go/src/io/io.go:347
golang.org/x/net/http2.readFrameHeader({0xc0003642d8, 0x9, 0xc000783a40}, {0x17e1ea0, 0xc0004c17a0})
/go/src/github.com/NVIDIA/dcgm-exporter/vendor/golang.org/x/net/http2/frame.go:237 +0x6e
golang.org/x/net/http2.(*Framer).ReadFrame(0xc0003642a0)
/go/src/github.com/NVIDIA/dcgm-exporter/vendor/golang.org/x/net/http2/frame.go:492 +0x95
google.golang.org/grpc/internal/transport.(*http2Client).reader(0xc000548000)
/go/src/github.com/NVIDIA/dcgm-exporter/vendor/google.golang.org/grpc/internal/transport/http2_client.go:1330 +0x233
created by google.golang.org/grpc/internal/transport.newHTTP2Client
/go/src/github.com/NVIDIA/dcgm-exporter/vendor/google.golang.org/grpc/internal/transport/http2_client.go:345 +0x176f
goroutine 181 [select]:
google.golang.org/grpc/internal/transport.(*controlBuffer).get(0xc0001cd9f0, 0x1)
/go/src/github.com/NVIDIA/dcgm-exporter/vendor/google.golang.org/grpc/internal/transport/controlbuf.go:395 +0x11b
google.golang.org/grpc/internal/transport.(*loopyWriter).run(0xc0004c1a40)
/go/src/github.com/NVIDIA/dcgm-exporter/vendor/google.golang.org/grpc/internal/transport/controlbuf.go:515 +0x85
google.golang.org/grpc/internal/transport.newHTTP2Client.func3()
/go/src/github.com/NVIDIA/dcgm-exporter/vendor/google.golang.org/grpc/internal/transport/http2_client.go:391 +0x65
created by google.golang.org/grpc/internal/transport.newHTTP2Client
/go/src/github.com/NVIDIA/dcgm-exporter/vendor/google.golang.org/grpc/internal/transport/http2_client.go:389 +0x1991