Netclass Collector Performs Slowly on Nodes with Heavy Workload #2477

Closed
raptorsun opened this issue Sep 22, 2022 · 9 comments

raptorsun (Contributor) commented Sep 22, 2022

The NetClass collector in Node Exporter performs slowly in some Kubernetes clusters serving heavy workloads (high CPU usage, frequent network configuration changes). In the worst cases, the NetClass collector blocks the process for several seconds, causing a timeout when Prometheus scrapes Node Exporter and losing metrics from the other collectors as well.

When Node Exporter slows down in these clusters, it spends most of its CPU time in the NetClass collector. A typical profile is shown below.
(CPU profile screenshot: profile1)

An strace of Node Exporter on overloaded worker nodes shows that the worst-performing syscall is read() on files under /sys/class/net/*/. Several read() calls looped on ERESTARTNOINTR for more than 5 seconds.

13:51:10.581 openat(AT_FDCWD, "/host/sys/class/net/lo/phys_port_name", O_RDONLY|O_CLOEXEC) = 58
13:51:10.598 futex(0xc0013ec948, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
13:51:10.742 fcntl(58, F_GETFL)         = 0x8000 (flags O_RDONLY|O_LARGEFILE)
13:51:10.742 fcntl(58, F_SETFL, O_RDONLY|O_NONBLOCK|O_LARGEFILE) = 0
13:51:10.743 fcntl(58, F_GETFL)         = 0x8800 (flags O_RDONLY|O_NONBLOCK|O_LARGEFILE)
13:51:10.743 fcntl(58, F_SETFL, O_RDONLY|O_LARGEFILE) = 0
13:51:10.744 read(58, 0xc001485710, 128) = ? ERESTARTNOINTR (To be restarted)
13:51:10.744 read(58, 0xc001485710, 128) = ? ERESTARTNOINTR (To be restarted)
….
13:51:11.644 read(58, 0xc001485710, 128) = ? ERESTARTNOINTR (To be restarted)
13:51:11.652 read(58, "0\n", 128)       = 2

From strace we found that read() returns ERESTARTNOINTR on the following files in /sys/class/net/*/:

  • threaded (most frequent)
  • netdev_group
  • type
  • addr_len
  • link_mode
  • testing
  • proto_down
  • speed
  • duplex
  • phys_port_name
  • gro_flush_timeout
  • broadcast

Host operating system: output of uname -a

4.18.0-372.26.1.el8_6.x86_64

node_exporter version: output of node_exporter --version

Node Exporter 1.1.2 + Golang 1.14
Node Exporter 1.3.1 + Golang 1.18

node_exporter command line flags

  • --web.listen-address=127.0.0.1:9100
  • --path.sysfs=/host/sys
  • --path.rootfs=/host/root
  • --no-collector.wifi
  • --collector.filesystem.mount-points-exclude=^/(dev|proc|sys|run/k3s/containerd/.+|var/lib/docker/.+|var/lib/kubelet/pods/.+)($|/)
  • --collector.netclass.ignored-devices=^(veth.*|[a-f0-9]{15})$
  • --collector.netdev.device-exclude=^(veth.*|[a-f0-9]{15})$
  • --collector.cpu.info
  • --collector.textfile.directory=/var/node_exporter/textfile
  • --no-collector.cpufreq

Are you running node_exporter in Docker?

No, it is running in a Kubernetes cluster.

What did you do that produced an error?

/

What did you expect to see?

/

What did you see instead?

/

raptorsun (Contributor, Author) commented Sep 22, 2022

I see two potential ways to improve the performance of the NetClass collector.

The first method is to cache the metrics that rarely change, such as addr_assign_type, addr_len, and dev_id. Non-carrier-related metrics are mostly stable, and changes to them usually come from planned configuration adjustments.
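
A minimal sketch of what such a cache could look like, assuming a simple TTL-based approach; the type names and the helper below are illustrative and not part of node_exporter:

```go
// Illustrative only: re-read rarely changing sysfs attributes at most once
// per TTL instead of on every scrape.
package main

import (
	"os"
	"path/filepath"
	"strings"
	"sync"
	"time"
)

type cachedAttrs struct {
	values  map[string]string // attribute name -> raw sysfs value
	fetched time.Time
}

type attrCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	byIface map[string]cachedAttrs
}

// get returns the cached attributes for an interface, re-reading sysfs only
// when the cached entry is older than the TTL.
func (c *attrCache) get(sysfs, iface string, attrs []string) (map[string]string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.byIface[iface]; ok && time.Since(e.fetched) < c.ttl {
		return e.values, nil
	}
	vals := make(map[string]string, len(attrs))
	for _, a := range attrs {
		b, err := os.ReadFile(filepath.Join(sysfs, "class/net", iface, a))
		if err != nil {
			return nil, err
		}
		vals[a] = strings.TrimSpace(string(b))
	}
	c.byIface[iface] = cachedAttrs{values: vals, fetched: time.Now()}
	return vals, nil
}

func main() {
	c := &attrCache{ttl: 5 * time.Minute, byIface: map[string]cachedAttrs{}}
	// Rarely changing attributes mentioned above.
	_, _ = c.get("/sys", "lo", []string{"addr_assign_type", "addr_len", "dev_id"})
}
```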

The second method is to use netlink instead of sysfs, which issues far fewer syscalls and avoids waiting on the RTNL lock at unfortunate moments. Although an RTM_GETLINK request over netlink returns most of the metrics that sysfs provides, some hardware-related information is missing (a sketch of the netlink approach follows the lists below):

| sysfs file | netlink RTM_GETLINK reply field |
| --- | --- |
| addr_assign_type | not available |
| addr_len | IFLA_ADDRESS |
| address | IFLA_ADDRESS |
| broadcast | IFLA_BROADCAST |
| carrier | IFLA_CARRIER |
| carrier_changes | IFLA_CARRIER_CHANGES |
| carrier_up_count | IFLA_CARRIER_UP_COUNT |
| carrier_down_count | IFLA_CARRIER_DOWN_COUNT |
| dev_id | not available |
| dormant | IFLA_LINKMODE |
| duplex | not available |
| flags | header bytes[8:12] |
| ifalias | IFLA_IFALIAS |
| ifindex | header bytes[4:8] |
| iflink | IFLA_LINK |
| link_mode | IFLA_LINKMODE |
| mtu | IFLA_MTU |
| name_assign_type | not available |
| netdev_group | IFLA_GROUP |
| operstate | IFLA_OPERSTATE |
| phys_port_id | IFLA_PHYS_PORT_ID |
| phys_port_name | IFLA_PHYS_PORT_NAME |
| phys_switch_id | IFLA_PHYS_SWITCH_ID |
| speed | not available |
| tx_queue_len | IFLA_TXQLEN |
| type | IFLA_LINK |

The missing metrics are:

  • addr_assign_type
  • duplex
  • name_assign_type
  • speed
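
Not a final design, but a minimal sketch of the netlink approach; it assumes the github.com/jsimonetti/rtnetlink package, and the printed fields are just an illustrative subset:

```go
// Illustrative only: fetch link attributes with a single RTM_GETLINK dump
// instead of reading many small files under /sys/class/net/<iface>/.
package main

import (
	"fmt"
	"log"

	"github.com/jsimonetti/rtnetlink"
)

func main() {
	conn, err := rtnetlink.Dial(nil)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// One netlink dump returns an RTM_NEWLINK message per interface,
	// replacing the per-file sysfs reads shown in the strace above.
	links, err := conn.Link.List()
	if err != nil {
		log.Fatal(err)
	}
	for _, msg := range links {
		attrs := msg.Attributes
		if attrs == nil {
			continue
		}
		fmt.Printf("ifindex=%d name=%s mtu=%d address=%s operstate=%v\n",
			msg.Index, attrs.Name, attrs.MTU, attrs.Address, attrs.OperationalState)
	}
}
```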

SuperQ (Member) commented Sep 22, 2022

netlink seems like a good solution. We recently refactored the netdev collector to use netlink with good results.

Related question: what is your system configuration like?

  • How many CPUs on the node?
  • What CPU request/limits are on the node_exporter?
  • Are you configuring GOMAXPROCS?

For example, we have some 24xlarge nodes where we configure node_exporter with GOMAXPROCS=2 and a CPU request of 100m. This greatly improved the reliability of our node_exporter in our deployments.
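
For illustration, this is roughly what that could look like in a DaemonSet container spec; only GOMAXPROCS=2 and the 100m CPU request come from the comment above, the image tag and other details are assumptions:

```yaml
containers:
  - name: node-exporter
    image: quay.io/prometheus/node-exporter:v1.3.1  # assumed tag
    env:
      - name: GOMAXPROCS   # caps the number of OS threads running Go code
        value: "2"
    resources:
      requests:
        cpu: 100m          # small but guaranteed CPU share
```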

raptorsun (Contributor, Author) commented:

Hello @SuperQ, we have tested on two setups:

  • 4 CPU x 6 nodes
  • 96 CPU x 27 nodes

The resource request on the node_exporter DaemonSet is 8m CPU and 32MB memory; no limit is set.

I have tested GOMAXPROCS=2 and raising the CPU request to 100m for Node Exporter on the smaller Kubernetes cluster (4 CPU x 6 nodes), which uses OVN as its CNI. CPU usage and scrape time with these settings are higher than without them, as shown in the diagrams below:

  • GOMAXPROCS=2, Node Exporter CPU usage (screenshot: test_cpu)
  • GOMAXPROCS=2, scrape time (screenshot: test_scrapetime)
  • GOMAXPROCS unset, Node Exporter CPU usage (screenshot: ref_cpu)
  • GOMAXPROCS unset, scrape time (screenshot: ref_scrapetime)

raptorsun (Contributor, Author) commented:

Shall we add a new collector that uses netlink to gather the metrics currently collected by the netclass collector? Since RTM_GETLINK does not return all the metrics sysfs can provide, it may be safer to leave the original netclass collector untouched and add a faster collector with slightly fewer metrics.

discordianfish (Member) commented:

Hrmm, good question... But yeah, it feels like a new collector might make the most sense.

raptorsun (Contributor, Author) commented:

A pull request has been created to add a new collector that gathers most of the metrics the netclass collector does: #2492

It is also possible to merge the netclass collector into the netdev collector, because the response to RTM_GETLINK already contains these metrics (see the netdev collector code for details).

raptorsun (Contributor, Author) commented:

Some test results comparing the performance of the netclass collector using sysfs and netlink are posted on PR #2492.

Scrape time with the netlink implementation is much lower than with the sysfs implementation in most cases.

raptorsun (Contributor, Author) commented Nov 21, 2022

PR #2492 has been merged.
PR #2528 is in progress, incorporating the netlink implementation into the existing netclass collector.

raptorsun (Contributor, Author) commented:

All done, issue closed.
Thank you for the help @SuperQ @discordianfish :D
