Netclass Collector Performs Slowly on Nodes with Heavy Workload #2477
Comments
I see two potential ways to improve the performance of the NetClass collector. The first is to cache the metrics that rarely change, such as addr_assign_type, addr_len, and dev_id. Non-carrier-related metrics are mostly stable and only change as a result of planned configuration adjustments. The second is to use netlink instead of sysfs, issuing fewer syscalls and avoiding contention on the RTNL lock at less fortunate moments. Although an RTM_GETLINK request over netlink returns most of the metrics sysfs provides, some hardware-related information is missing:
The missing metrics are:
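To illustrate the second method, here is a minimal, hypothetical sketch of a single RTM_GETLINK dump using the github.com/jsimonetti/rtnetlink package (the same library the netdev collector uses). The fields printed are illustrative only and do not cover everything the sysfs netclass collector exposes.

```go
package main

import (
	"fmt"
	"log"

	"github.com/jsimonetti/rtnetlink"
)

func main() {
	// Open an rtnetlink socket; a single connection serves the whole scrape.
	conn, err := rtnetlink.Dial(nil)
	if err != nil {
		log.Fatalf("dialing rtnetlink: %v", err)
	}
	defer conn.Close()

	// One RTM_GETLINK dump returns every interface at once, instead of one
	// read() per attribute file under /sys/class/net/<iface>/.
	links, err := conn.Link.List()
	if err != nil {
		log.Fatalf("listing links: %v", err)
	}

	for _, l := range links {
		if l.Attributes == nil {
			continue
		}
		fmt.Printf("%s: mtu=%d operstate=%d\n",
			l.Attributes.Name, l.Attributes.MTU, l.Attributes.OperationalState)
	}
}
```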
netlink seems like a good solution. We recently refactored the netdev collector to use netlink with good results. Related question: what does your system configuration look like?
For example, we have some 24xlarge nodes where we configure node_exporter with:
Hello @SuperQ, we have tested on two setups:
The resource request on the node_exporter DaemonSet is 8m CPU and 32MB memory, with no limit set. I have tested setting GOMAXPROCS=2 and raising the CPU request to 100m for node_exporter on the smaller Kubernetes cluster (4 CPUs x 6 nodes), which uses OVN as its CNI. CPU usage and scrape time with these settings are higher than without them, as shown in the diagram below:
Shall we add a new collector that uses netlink to gather the metrics the netclass collector currently collects? As RTM_GETLINK does not return all of the metrics sysfs can provide, it may be safer to leave the original netclass collector untouched and add a faster collector with slightly fewer metrics. A rough sketch of what that could look like follows.
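To make this concrete, here is a hypothetical sketch of how a separate netlink-based collector could plug into node_exporter's collector registry. The collector name, the default-disabled registration, and the single node_network_mtu_bytes metric are illustrative assumptions, not the actual implementation.

```go
package collector

import (
	"fmt"

	"github.com/go-kit/log"
	"github.com/jsimonetti/rtnetlink"
	"github.com/prometheus/client_golang/prometheus"
)

// netClassNetlinkCollector is an illustrative, separate collector that leaves
// the existing sysfs-based netclass collector untouched.
type netClassNetlinkCollector struct {
	logger log.Logger
}

func init() {
	// Registering as default-disabled keeps current behaviour unchanged.
	registerCollector("netclass_netlink", defaultDisabled, newNetClassNetlinkCollector)
}

func newNetClassNetlinkCollector(logger log.Logger) (Collector, error) {
	return &netClassNetlinkCollector{logger: logger}, nil
}

func (c *netClassNetlinkCollector) Update(ch chan<- prometheus.Metric) error {
	conn, err := rtnetlink.Dial(nil)
	if err != nil {
		return fmt.Errorf("could not dial rtnetlink: %w", err)
	}
	defer conn.Close()

	// A single RTM_GETLINK dump replaces many per-interface sysfs reads.
	links, err := conn.Link.List()
	if err != nil {
		return fmt.Errorf("could not list links: %w", err)
	}

	for _, l := range links {
		if l.Attributes == nil {
			continue
		}
		ch <- prometheus.MustNewConstMetric(
			prometheus.NewDesc("node_network_mtu_bytes", "MTU of the network device.",
				[]string{"device"}, nil),
			prometheus.GaugeValue, float64(l.Attributes.MTU), l.Attributes.Name,
		)
	}
	return nil
}
```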
Hrmm, good question. But yeah, it feels like a new collector might make the most sense.
A pull request has been created to add a new collector that gathers most of the metrics the netclass collector does: #2492. It would also be possible to merge the netclass collector into the netdev collector, because the response to RTM_GETLINK already contains these metrics (please refer to the netdev collector code for details).
Some test results comparing the performance of the netclass collector using sysfs and netlink are posted on PR #2492. Scrape time with the netlink implementation is much lower than with the sysfs implementation in most cases.
All done, issue closed.
The NetClass collector in Node Exporter performs slowly in some Kubernetes clusters serving heavy workloads (high CPU usage, frequent network configuration changes). In the worst cases, the NetClass collector blocks the process for several seconds, leading to a timeout when Prometheus scrapes Node Exporter and losing the metrics from other collectors.
When Node Exporter slows down in these clusters, it spends most of its CPU time in the NetClass collector. A typical profile is shown below.
Strace output from Node Exporter on overloaded worker nodes shows the worst-performing syscall is read() on files under /sys/class/net/*/. Several read() calls looped on ERESTARTNOINTR for more than 5 seconds.
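For reference, the access pattern behind those reads looks roughly like the illustrative snippet below: one read() per attribute file per interface, each of which can stall on a busy node. The attribute names are examples only.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	// Mimic the sysfs netclass access pattern: many small reads per interface.
	ifaces, _ := filepath.Glob("/sys/class/net/*")
	for _, dir := range ifaces {
		for _, attr := range []string{"mtu", "carrier", "speed", "operstate"} {
			// Each ReadFile is a separate open/read/close of a sysfs attribute,
			// and each read may block for a long time under heavy load.
			b, err := os.ReadFile(filepath.Join(dir, attr))
			if err != nil {
				continue
			}
			fmt.Printf("%s/%s = %s", dir, attr, b)
		}
	}
}
```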
From strace we found that read() returns ERESTARTNOINTR on the following files in /sys/class/net/*/:
Host operating system: output of
uname -a
4.18.0-372.26.1.el8_6.x86_64
node_exporter version: output of
node_exporter --version
Node Exporter 1.1.2 + Go 1.14
Node Exporter 1.3.1 + Go 1.18
node_exporter command line flags
Are you running node_exporter in Docker?
No, it is running in a Kubernetes cluster.
What did you do that produced an error?
/
What did you expect to see?
/
What did you see instead?
/