Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request] Allow excluding certain hwmon data sources #2681

Closed
xen0n opened this issue May 6, 2023 · 8 comments
Closed

[Feature Request] Allow excluding certain hwmon data sources #2681

xen0n opened this issue May 6, 2023 · 8 comments

Comments

@xen0n
Copy link

xen0n commented May 6, 2023

Host operating system

Linux lily 6.3.0-next-20230505-14814-g94feb5819da7 #1 SMP PREEMPT Tue Aug 30 11:11:44 AM CST 2022 loongarch64 GNU/Linux

This behavior is not tied to architecture quirks though, even with this exotic arch.

node_exporter version:

2023/05/06 14:50:14 github.com/josharian/native: unrecognized arch loong64 (LittleEndian), please file an issue
node_exporter, version 1.5.0 (branch: non-git, revision: 1b48970ffcf5630534fb00bb0687d73c66d1c959)
  build user:       portage@lily
  build date:       20230104-05:23:50
  go version:       go1.19.4
  platform:         linux/loong64

node_exporter command line flags

/usr/sbin/node_exporter --collector.textfile.directory=/var/lib/node_exporter/

node_exporter log output

(not really applicable, current behavior is expected)

Are you running node_exporter in Docker?

No

What did you do that produced an error?

Just running node_exporter as a regular systemd service without any additional config.

What did you expect to see?

hwmon data collection wouldn't pull from an amdgpu source that's runtime suspended in D3hot state, or at least could be configured to exclude this source.

What did you see instead?

Each metrics pull wakes up the GPU that's only going back to sleep because there's no monitor attached, flooding the dmesg log. (Each wakeup produces 25 lines of amdgpu log, and in my case Prometheus pulls every 8~10 seconds, just enough for the GPU to go back to sleep again.) See also the finding on amd-gfx and dri-devel.

Right now without modifying sources I can only either (1) disable the hwmon collector altogether, or (2) modprobe amdgpu with runpm=0 that's going to waste some power. If this certain use case (running node_exporter on machines with GPUs but seldom used via GUI) is worthwhile to support, it could be useful to allow excluding certain hwmon sources based on name (e.g. amdgpu) or device ID (e.g. hwmon1).

Adding awareness for devices' power states could be useful too, but as not all device classes provide such information in their sysfs entries, this might or might not be easy. PCI devices all have power_state though, and we could choose to skip collection if power state is not D0 i.e. fully powered on.

@discordianfish
Copy link
Member

Yeah I think include/exclude flags like in other collectors would be reasonable to add for this.

@discordianfish
Copy link
Member

Feel free to submit a PR. Have a look at the include/exclude flag implementation in other collectors for guidance

@conallob
Copy link
Contributor

@xen0n, if you don't want to implement this, i'm happy to take a look at this either

@xen0n
Copy link
Author

xen0n commented May 17, 2023

@xen0n, if you don't want to implement this, i'm happy to take a look at this either

Thanks for offering help! Indeed it may take a long time for me to implement this myself: for one thing I'm way too occupied already (I take on many upstream roles and that's not counting the day job), and for another thing I'm not very familiar with node_exporter internals so it may be better UX-wise for someone else to design the API and wire up things. I'm happy to help reviewing the design and code though!

@conallob
Copy link
Contributor

Happy to be corrected, but since the include/exclude flags are a suggested way to implement this, probably doesn't need a design.

I'll take a look at making a PR

@xen0n
Copy link
Author

xen0n commented May 18, 2023

Happy to be corrected, but since the include/exclude flags are a suggested way to implement this, probably doesn't need a design.

I was initially referring to the decision one has to make regarding which kind of identifier the include/exclude patterns are going to match against, e.g. IDs like hwmon1 or device names like coretemp. But looking at the implementation of hwMonCollector.hwmonName it seems device names is the way to go (sequential hwmon IDs are already documented to be unstable and not relied upon).

@conallob
Copy link
Contributor

conallob commented Jul 7, 2023

This issue can be closed out now. The next release will include #2699

@xen0n
Copy link
Author

xen0n commented Jul 7, 2023

This issue can be closed out now. The next release will include #2699

Thanks very much for your contribution!

However I just noticed the README changes and a corresponding CHANGELOG entry weren't added along with the PR. I'll add them shortly.

@xen0n xen0n closed this as completed Jul 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants