
Add perf exporter #1274

Merged: 1 commit from hodgesds:perf-exporter into prometheus:master on May 7, 2019
Conversation

@hodgesds (Contributor) commented Mar 2, 2019

This implements #1238 by adding perf based profiling metrics. It's still a work in progress but so far it seems to be working locally. I'd like to instrument more of the already available metrics but figured I'd put this out there for others to see.

@hodgesds force-pushed the perf-exporter branch 2 times, most recently from f17994f to 0468a0c on March 2, 2019 at 03:38
@hodgesds (Contributor, Author) commented Mar 2, 2019

Example of L1 data cache hit rate:
[image: graph of L1 data cache hit rate]

@SuperQ (Member) commented Mar 2, 2019

I'm going to have to dig deeper into the new included libraries, but it looks like so far everything is being returned as gauge values.

We prefer to expose raw underlying counters in Prometheus, rather than try to use pre-calculated rates. Is this possible with perf?

@hodgesds (Contributor, Author) commented Mar 2, 2019

Yeah, most of the underlying metrics are counters; I just copy-pasted some of the other exporter code to get this started. I think most of the metrics will map to counters. The one thing I still need to figure out is how overflows are normally handled in other exporters.

@SuperQ (Member) commented Mar 2, 2019

For counters, they should all be named _total and use prometheus.CounterValue.

If you have concerns about 2^53 uint64 overflows, take a look at how we handle it in the snmp_exporter.
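For illustration, a minimal sketch of that convention (not code from this PR; reportPageFaults is a hypothetical helper):

```go
package collector

import "github.com/prometheus/client_golang/prometheus"

// Counters expose the raw value, carry the _total suffix, and use
// prometheus.CounterValue rather than a pre-calculated rate.
var pageFaultsDesc = prometheus.NewDesc(
	"node_perf_page_faults_total",
	"Number of page faults",
	[]string{"cpu"}, nil,
)

// reportPageFaults is a hypothetical helper, not part of this PR.
func reportPageFaults(ch chan<- prometheus.Metric, cpu string, faults uint64) {
	// Metric values travel as float64, so uint64 counters lose precision
	// above 2^53; snmp_exporter shows one way of handling that.
	ch <- prometheus.MustNewConstMetric(
		pageFaultsDesc, prometheus.CounterValue, float64(faults), cpu,
	)
}
```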

@hodgesds changed the title from "[WIP] Add perf exporter" to "Add perf exporter" on Mar 5, 2019
@hodgesds (Contributor, Author) commented Mar 5, 2019

Pushed up changes, I think this should be good if you want to test it locally.

@SuperQ (Member) commented Mar 5, 2019

I started with echo 2 | sudo tee /proc/sys/kernel/perf_event_paranoid.

Then I ran a couple collections and it seems to be resetting counters after each scrape:

$ curl -s 'http://localhost:9100/metrics?collect\[\]=perf' | grep node_perf_page_faults_total
# HELP node_perf_page_faults_total Number of page faults
# TYPE node_perf_page_faults_total counter
node_perf_page_faults_total{cpu="0"} 16
node_perf_page_faults_total{cpu="1"} 0
node_perf_page_faults_total{cpu="2"} 60
node_perf_page_faults_total{cpu="3"} 1796
node_perf_page_faults_total{cpu="4"} 31
node_perf_page_faults_total{cpu="5"} 0
node_perf_page_faults_total{cpu="6"} 42
node_perf_page_faults_total{cpu="7"} 5
$ curl -s 'http://localhost:9100/metrics?collect\[\]=perf' | grep node_perf_page_faults_total
# HELP node_perf_page_faults_total Number of page faults
# TYPE node_perf_page_faults_total counter
node_perf_page_faults_total{cpu="0"} 2
node_perf_page_faults_total{cpu="1"} 3
node_perf_page_faults_total{cpu="2"} 10
node_perf_page_faults_total{cpu="3"} 0
node_perf_page_faults_total{cpu="4"} 0
node_perf_page_faults_total{cpu="5"} 10
node_perf_page_faults_total{cpu="6"} 4
node_perf_page_faults_total{cpu="7"} 62

Sadly, this seems to persist across restarts of the exporter as well.

@hodgesds (Contributor, Author) commented Mar 5, 2019

I'm not sure which capabilities you're running with, so that may be a factor. One thing to note from the perf_event_open man page:

> pid == -1 and cpu >= 0
> This measures all processes/threads on the specified CPU. This requires CAP_SYS_ADMIN capability or a /proc/sys/kernel/perf_event_paranoid value of less than 1.

Since it's currently configured to trace all processes on the specified CPU, I doubt a paranoid value of 2 (allow only user-space measurements) would work.
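For concreteness, a hedged sketch of that man-page case using golang.org/x/sys/unix (the PR itself uses a different perf library):

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// openCPUWidePageFaults opens a CPU-wide software counter: pid == -1 with
// cpu >= 0 measures all processes/threads on that CPU, which is exactly
// the case that needs CAP_SYS_ADMIN or perf_event_paranoid < 1.
func openCPUWidePageFaults(cpu int) (int, error) {
	attr := unix.PerfEventAttr{
		Type:   unix.PERF_TYPE_SOFTWARE,
		Config: unix.PERF_COUNT_SW_PAGE_FAULTS,
		// Size left as 0: the kernel treats that as the v0 ABI size.
	}
	fd, err := unix.PerfEventOpen(&attr, -1, cpu, -1, 0)
	if err != nil {
		return -1, fmt.Errorf("perf_event_open on cpu %d: %v", cpu, err)
	}
	return fd, nil
}
```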

@SuperQ (Member) commented Mar 5, 2019

Even with perf_event_paranoid set to -1, I still get reset counters.

I'm also seeing this error after a few scrapes:

2019/03/05 17:40:03 http: Accept error: accept tcp [::]:9100: accept4: too many open files; retrying in 5ms

Looks like there may be a leak.

@hodgesds (Contributor, Author) commented Mar 6, 2019

Real dumb mistake: I was assuming that NewPerfCollector was only called once, and that is certainly not the case. I was able to replicate your results and changed the code so that profiler initialization is wrapped in a sync.Once (a sketch of the pattern follows the output below). From there I was able to see things incrementing properly without leaking FDs:

~ daniel@p50 ✔ lsof -p $(pgrep -f node_exporter) | wc -l
305
~ daniel@p50 ✔ curl -s 'http://localhost:9100/metrics?collect\[\]=perf' | grep node_perf_page_faults_total
# HELP node_perf_page_faults_total Number of page faults
# TYPE node_perf_page_faults_total counter
node_perf_page_faults_total{cpu="0"} 27447
node_perf_page_faults_total{cpu="1"} 27768
node_perf_page_faults_total{cpu="2"} 30826
node_perf_page_faults_total{cpu="3"} 27146
node_perf_page_faults_total{cpu="4"} 20667
node_perf_page_faults_total{cpu="5"} 17671
node_perf_page_faults_total{cpu="6"} 21991
node_perf_page_faults_total{cpu="7"} 27716
~ daniel@p50 ✔ lsof -p $(pgrep -f node_exporter) | wc -l
305
~ daniel@p50 ✔ curl -s 'http://localhost:9100/metrics?collect\[\]=perf' | grep node_perf_page_faults_total
# HELP node_perf_page_faults_total Number of page faults
# TYPE node_perf_page_faults_total counter
node_perf_page_faults_total{cpu="0"} 27953
node_perf_page_faults_total{cpu="1"} 28097
node_perf_page_faults_total{cpu="2"} 31501
node_perf_page_faults_total{cpu="3"} 28217
node_perf_page_faults_total{cpu="4"} 21789
node_perf_page_faults_total{cpu="5"} 18465
node_perf_page_faults_total{cpu="6"} 22425
node_perf_page_faults_total{cpu="7"} 28083

@SuperQ (Member) commented Mar 8, 2019

A couple of documentation items:

  • Please add a [FEATURE] to the CHANGELOG.md.
  • We should document the correct minimum permissions needed to enable this. I'm doing some testing to see what that is.

I would add something like this to the README.md:

The perf collector may not work by default on all Linux systems due to kernel security settings. To allow access, set the following kernel sysctl.

sysctl -w kernel.perf_event_paranoid=X

See the [upstream docs](link here).

@hodgesds (Contributor, Author) commented Mar 8, 2019

👍 I pushed up some doc changes and cleaned up the commit history.

(Review thread on README.md, resolved.)
@hodgesds force-pushed the perf-exporter branch 3 times, most recently from 6893f47 to 24c8c5b on March 14, 2019 at 15:07
Review thread on collector/perf_linux.go (outdated), quoting:
continue
}

if hwProfile.CPUCycles != nil {
A maintainer (Member) commented:

Would be nice to refactor this a bit, maybe by using a map and looping over it like we do in similar cases in other collectors.

@hodgesds (Contributor, Author) replied:

I made a couple of attempts at this, but creating the map becomes rather difficult without using reflection, because not all struct fields may be present. I can give it another attempt; it might save a few of the redundant checks, but it would probably be slower. What do you think?

@hodgesds (Contributor, Author) replied:

I think I got it working pretty well now.
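The shape of that refactor, roughly (illustrative only; c.descs, cpuStr, and the field names other than CPUCycles are assumed):

```go
// List the optional *uint64 fields explicitly so no reflection is needed.
counters := map[string]*uint64{
	"cpucycles_total":    hwProfile.CPUCycles,
	"instructions_total": hwProfile.Instructions, // assumed field name
	"cache_refs_total":   hwProfile.CacheRefs,    // assumed field name
}
for name, value := range counters {
	if value == nil {
		continue // counter not supported on this CPU/kernel
	}
	ch <- prometheus.MustNewConstMetric(
		c.descs[name], prometheus.CounterValue, float64(*value), cpuStr,
	)
}
```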

@discordianfish (Member) commented:
@hodgesds Can you address the remaining comments?

@hodgesds force-pushed the perf-exporter branch 3 times, most recently from c37234b to 9b368d5 on April 4, 2019 at 02:09
@hodgesds (Contributor, Author) commented Apr 4, 2019

Updated, let me know what you think. In the future I'd like to add support for kprobes, but that requires more thought about configuration. Let me know if you have any ideas (does it make sense to have a config file?).

@discordianfish (Member) left a review:

Some changes requested. Besides that, I still think we could do better to avoid repetition, but I think it's fine for now.

(Two review threads on collector/perf_linux.go, outdated and resolved.)
@SuperQ (Member) commented Apr 15, 2019

Ping, please rebase this.

@hodgesds force-pushed the perf-exporter branch 10 times, most recently from 906ba8d to f121e09 on April 22, 2019 at 12:34
@hodgesds (Contributor, Author) commented:
Added a test that will skip if perf_event_paranoid is not properly set.
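A sketch of what such a guard might look like (illustrative, not the PR's exact test):

```go
package collector

import (
	"io/ioutil"
	"strconv"
	"strings"
	"testing"
)

func TestPerfCollector(t *testing.T) {
	buf, err := ioutil.ReadFile("/proc/sys/kernel/perf_event_paranoid")
	if err != nil {
		t.Skipf("could not read perf_event_paranoid: %v", err)
	}
	paranoid, err := strconv.Atoi(strings.TrimSpace(string(buf)))
	if err != nil {
		t.Fatalf("unexpected perf_event_paranoid contents: %v", err)
	}
	// CPU-wide events need perf_event_paranoid < 1 (or CAP_SYS_ADMIN).
	if paranoid >= 1 {
		t.Skipf("perf_event_paranoid is %d; need < 1", paranoid)
	}
	// ... construct the collector and run a scrape here ...
}
```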

Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>
@discordianfish (Member) commented:
Okay, looks good to me. We have worse cases of repetition and I know it's tricky.

@discordianfish discordianfish requested a review from SuperQ May 3, 2019 11:28
@SuperQ (Member) left a review:

LGTM

@SuperQ SuperQ merged commit 7882009 into prometheus:master May 7, 2019
@agolomoodysaada commented:
Are these new metrics enabled by default? are there docs written somewhere for this new feature?

@hodgesds (Contributor, Author) commented May 7, 2019

> Are these new metrics enabled by default? Are there docs written somewhere for this new feature?

See the README.

oblitorum pushed a commit to shatteredsilicon/node_exporter that referenced this pull request on Apr 9, 2024
Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>