Node-Exporter : memory usage too high (OOME) #1008
Comments
The easiest thing to do is get a pprof memory profile. See #859 for some more tips. It would also be useful to look at the node_exporter's own metrics.
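For reference, a heap profile can typically be pulled straight from the exporter's pprof endpoint; a sketch, assuming the default port 9100 and that the binary exposes net/http/pprof (the `-top` mode avoids needing Graphviz):

```shell
# Assumes node_exporter is reachable on the default port 9100.
# -top prints the largest in-use allocations as plain text, so
# Graphviz is not required.
go tool pprof -top http://localhost:9100/debug/pprof/heap

# Alternatively, save the raw profile to share or analyze later:
curl -s -o heap.pb.gz http://localhost:9100/debug/pprof/heap
```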
Yes, that looks like a goroutine leak. Can you post any pod logs so we can see if there are any errors?
Below are the pod logs:
Thanks
Thanks, hopefully the pprof dump will give us the information we need.
Hi SuperQ, I could run the command, but Graphviz is not installed and I can't install it in the pod:
I have to run the command in the pod because I have an OCP auth-proxy for Prometheus. But I have the raw data. Thanks,
That is a profile for Prometheus, not the node exporter. You need to use port 9100.
Oh, sorry. Thanks
It would help to take a pprof sample when it's using over 200MB of memory; that way we can more easily find the memory leak.
Strange. The pprof only shows 9.6MiB of memory in use. I would probably eliminate the CPU limit; I have seen CPU limits in K8s cause bad performance stalls. But I haven't looked at the current implementation to see if that's improved or not. /cc @juliusv Do you know if there is something specific we can do to trace this goroutine leak?
Hi, Thanks,
One additional debugging option: get a dump of the goroutines.
@SuperQ Yeah, I would recommend listing the active goroutines, as we already know it's connected to a goroutine leak.
Hi, I can't execute the command.
Do you have any idea? For the moment, please find attached a dump of the goroutines (debug=1 option) and a new pprof heap sample for node-exporter. For information, the node-exporter pod currently uses 137MiB. Thanks,
The goroutine link doesn't require pprof; you can fetch it directly.
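As a sketch (again assuming the default port 9100), the goroutine dump can be fetched with a plain HTTP client:

```shell
# debug=1: goroutine counts grouped by identical stack traces (compact)
curl -s 'http://localhost:9100/debug/pprof/goroutine?debug=1'

# debug=2: a full stack trace for every goroutine (verbose)
curl -s 'http://localhost:9100/debug/pprof/goroutine?debug=2'
```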
Ok, thanks for the clarification. Please find attached a dump of the goroutines. Thanks,
Looks like it's getting stuck on netlink. This is used by the wifi collector; try running the node_exporter with the wifi collector disabled. /cc @mdlayher 😁
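For reference, in node_exporter 0.16.x collectors are toggled with kingpin-style negative boolean flags; a sketch, reusing the procfs/sysfs paths from the issue report below:

```shell
# Disable the wifi collector (the suspected source of the leaked goroutines)
node_exporter --no-collector.wifi \
  --path.procfs=/host/proc \
  --path.sysfs=/host/sys
```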
@SpeedBlack also as you're using OpenShift, please note that the wifi collector should now be disabled by default in Ansible-based installations. |
Interesting. It appears to be stuck in the locked-syscall goroutine I set up. I probably don't have time to look at this immediately, but could OP please clarify whether they have WiFi devices on this system?
Great! Thanks for your help! @mdlayher, no, I don't have WiFi devices on this system. Thanks,
Hi, today I can confirm it's OK! Thanks for your help!
I'm leaving this open for tracking. |
@mdlayher Maybe we should file a new issue with a cleaned up summary. |
@mdlayher Did we ever get the upstream issue solved for this? |
I don't believe so. I haven't touched that code in quite a while and wasn't able to easily reproduce these results. |
The same, hello from 2021 |
Anything new on this? First reported in 2018! |
@tdudgeon Do you have the wifi collector enabled? Can you provide the pprof details as described above? |
@discordianfish No, looks like it wasn't disabled. I've done so now.
@discordianfish So after monitoring for a while, there are still pod restarts due to OOM, and memory usage does continually increase until the pod is restarted, but the restarts are less frequent than they were before disabling the wifi collector.
@discordianfish I'm struggling to work out how to generate the pprof details.
@tdudgeon You can just run go tool pprof on your dev laptop, as long as it can reach the node-exporter port. You could also use kubectl port-forward to forward the port.
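A minimal sketch of that workflow; the namespace and pod name here are placeholders for your deployment:

```shell
# Forward the exporter port from a (hypothetical) pod to the laptop
kubectl -n monitoring port-forward pod/node-exporter-xxxxx 9100:9100 &

# Then profile it locally
go tool pprof -top http://localhost:9100/debug/pprof/heap
```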
@discordianfish Here's the data. Hope I did it right! pprof.node_exporter.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz |
This shows that node-exporter was using 3515.72kB of memory in your case. So I don't know what restarts your node-exporter, but memory usage doesn't seem to be the problem, at least not when you created the profile.
@discordianfish Here are dumps from just before the pod was restarted. pprof.node_exporter.alloc_objects.alloc_space.inuse_objects.inuse_space.002.pb.gz |
This is again only using 2064.26kB in the Go process. There must be something else in the container using that memory. But are you sure it's actually getting OOM-killed, and it's not just the Go GC or the Linux fs cache leading to this memory usage pattern?
Yes, definitely being killed:
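One way to confirm the kernel OOM killer is responsible (pod name and namespace are again placeholders):

```shell
# Show the container's last termination reason; "OOMKilled" confirms it
kubectl -n monitoring get pod node-exporter-xxxxx \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

# On the node itself, the kernel log also records the kill
dmesg | grep -i 'oom'
```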
What is strange is that the more RAM the node has, the worse the problem is.
So either there is something else in the container causing this, or it's the node-exporter itself, which seems unlikely but of course is possible. If the latter is the case, it should show up in pprof, and when you captured it, it wasn't present. So I'd suggest you keep trying to pprof it to catch the condition, or look at what else in the container might cause it.
Amazing that node-exporter can report on itself while it's being OOM-killed.
There have been a number of releases since this issue was reported. The only root cause, the wifi collector, has been disabled by default for quite a while. I think we can close this. |
* Disable wifi collector by default
Disable the wifi collector by default due to suspected crashing issues and goroutine leaks.
* prometheus#870
* prometheus#1008
Signed-off-by: Ben Kochie <superq@gmail.com>
Host operating system: output of uname -a:
3.10.0-862.3.2.el7.x86_64
node_exporter version: output of node_exporter --version:
sh-4.2$ node_exporter --version
node_exporter, version 0.16.0 (branch: HEAD, revision: d42bd70)
build user: root@node-exporter-binary-3-build
build date: 20180606-16:48:15
go version: go1.10
node_exporter command line flags:
--path.procfs=/host/proc --path.sysfs=/host/sys
Are you running node_exporter in Docker?
Yes, in OpenShift
Hi,
I use node-exporter (openshift/prometheus-node-exporter:v0.16.0) in OpenShift with Prometheus and Grafana.
I have a problem with memory reclamation.
The pod is killed each time by an OOME (OutOfMemory). Memory usage increases continuously without being reclaimed.
By default, the limits were (template here):
I tested several configurations without success.
Today, the configuration is:
Do you have any idea?
An adjustment to make?
Do you have recommended resource limits?
Thanks in advance for your help !