This repository has been archived by the owner on Nov 2, 2021. It is now read-only.

dcgm-exporter reports stale metrics if nv-hostengine is restarted #188

Open
bchess opened this issue May 12, 2021 · 0 comments

bchess commented May 12, 2021

Running dcgm-exporter 2.1.8, connected to nv-hostengine via DCGM_REMOTE_HOSTENGINE_INFO=localhost:5555.

If nv-hostengine is restarted, dcgm-exporter starts logging the message below every 30 seconds. Meanwhile, it keeps serving the last metrics it collected before the restart, and the /health endpoint still reports that everything is fine.

time="2021-05-12T17:31:12Z" level=error msg="Failed to collect metrics with error: Failed to collect metrics with error: Error getting the latest value for fields: Host engine connection invalid/disconnected"
time="2021-05-12T17:31:42Z" level=error msg="Failed to collect metrics with error: Failed to collect metrics with error: Error getting the latest value for fields: Host engine connection invalid/disconnected"

dcgm-exporter should either crash hard in response to this error, or re-connect to nv-hostengine. It should not continue to report stale metrics.
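To illustrate the behavior requested above, here is a minimal Go sketch of a metrics cache that drops its contents on a failed collection and reports unhealthy after repeated failures. It is not dcgm-exporter's actual code: `MetricsCache`, `Record`, `Metrics`, `Healthy`, `ErrDisconnected`, and the `maxFailures` threshold are all hypothetical names invented for this example, and the metric string is a placeholder.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrDisconnected is a stand-in (assumption) for the
// "Host engine connection invalid/disconnected" error from the logs.
var ErrDisconnected = errors.New("host engine connection invalid/disconnected")

// MetricsCache is a hypothetical holder for the most recent scrape result.
type MetricsCache struct {
	mu          sync.Mutex
	metrics     string
	fresh       bool
	failures    int
	maxFailures int
}

// NewMetricsCache returns a cache that reports unhealthy after
// maxFailures consecutive failed collections.
func NewMetricsCache(maxFailures int) *MetricsCache {
	return &MetricsCache{maxFailures: maxFailures}
}

// Record stores the outcome of one collection attempt. On error it
// discards the cached metrics so stale values are never served.
func (c *MetricsCache) Record(metrics string, err error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if err != nil {
		c.failures++
		c.metrics = ""
		c.fresh = false
		return
	}
	c.failures = 0
	c.metrics = metrics
	c.fresh = true
}

// Metrics returns the cached metrics and whether they came from the
// latest collection attempt.
func (c *MetricsCache) Metrics() (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.metrics, c.fresh
}

// Healthy reports false once consecutive failures reach the threshold,
// so a health endpoint or supervisor could restart the process.
func (c *MetricsCache) Healthy() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.failures < c.maxFailures
}

func main() {
	cache := NewMetricsCache(2)
	cache.Record("example_metric 1", nil) // placeholder metric text
	cache.Record("", ErrDisconnected)     // simulated host engine restart
	_, ok := cache.Metrics()
	fmt.Println("fresh:", ok, "healthy:", cache.Healthy())
}
```

In a design like this, the /metrics handler would return an error status when `Metrics` reports no fresh data, and /health would return a failure when `Healthy` is false, letting an orchestrator restart the exporter. Reconnecting to nv-hostengine on failure is the other option and is not shown here.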
