Node-Exporter : memory usage too high (OOME) #1008
Comments
The easiest thing to do is get a pprof memory profile. See #859 for some more tips. It would also be useful to look at the node_exporter's own metrics.
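For reference, a heap profile can typically be pulled straight from the exporter's pprof endpoint; a sketch, assuming the default port 9100 and that the binary exposes net/http/pprof (the `-top` mode avoids needing Graphviz):

```shell
# Assumes node_exporter is reachable on the default port 9100.
# -top prints the largest in-use allocations as plain text, so
# Graphviz is not required.
go tool pprof -top http://localhost:9100/debug/pprof/heap

# Alternatively, save the raw profile to share or analyze later:
curl -s -o heap.pb.gz http://localhost:9100/debug/pprof/heap
```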
Yes, that looks like a goroutine leak. Can you post any pod logs so we can see if there are any errors?
Below are the pod logs:
Thanks
Thanks, hopefully the pprof dump will give us the information we need.
Hi SuperQ, I could run the command, but Graphviz is not installed and I can't install it in the pod:
I have to run the command in the pod because I have an OCP auth-proxy for Prometheus. But I have the raw data. Thanks,
That is a profile for Prometheus, not the node exporter. You need to use port 9100.
Oh, sorry. Thanks
It would help to take a pprof sample when it's using over 200MB of memory; that way we can more easily find the memory leak.
Strange. The pprof only shows 9.6MiB of memory in use. I would probably eliminate the CPU limit; I have seen CPU limits in K8s cause bad performance stalls. But I haven't looked at the current implementation to see if that's improved or not. /cc @juliusv Do you know if there is something specific we can do to trace this goroutine leak?
Hi, Thanks,
One additional debugging option: get a dump of the goroutines.
@SuperQ Yeah, I would recommend listing the active goroutines, as we already know it's connected to a goroutine leak.
Hi, I can't execute the command.
Do you have any idea? For the moment, please find attached a dump of the goroutines (debug=1 option) and a new pprof heap sample for node-exporter. For information, the node-exporter pod currently uses 137MiB. Thanks,
The goroutine link doesn't require pprof; you can fetch it directly.
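As a sketch (again assuming the default port 9100), the goroutine dump can be fetched with a plain HTTP client:

```shell
# debug=1: goroutine counts grouped by identical stack traces (compact)
curl -s 'http://localhost:9100/debug/pprof/goroutine?debug=1'

# debug=2: a full stack trace for every goroutine (verbose)
curl -s 'http://localhost:9100/debug/pprof/goroutine?debug=2'
```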
Ok, thanks for the clarification. Please find attached a dump of the goroutines. Thanks,
Looks like it's getting stuck on netlink. This is used by the wifi collector; try running the node_exporter with the wifi collector disabled. /cc @mdlayher 😁
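For reference, in node_exporter 0.16.x collectors are toggled with kingpin-style negative boolean flags; a sketch, reusing the procfs/sysfs paths from the issue report below:

```shell
# Disable the wifi collector (the suspected source of the leaked goroutines)
node_exporter --no-collector.wifi \
  --path.procfs=/host/proc \
  --path.sysfs=/host/sys
```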
@SpeedBlack also as you're using OpenShift, please note that the wifi collector should now be disabled by default in Ansible-based installations. |
Interesting. It appears to be stuck in the locked-syscall goroutine I set up. I probably don't have time to look at this immediately, but could OP please clarify whether they have WiFi devices on this system?
Great! Thanks for your help! @mdlayher, no, I don't have WiFi devices on this system. Thanks,
Hi, today I can confirm it's OK! Thanks for your help!
I'm leaving this open for tracking. |
@mdlayher Maybe we should file a new issue with a cleaned up summary. |
@mdlayher Did we ever get the upstream issue solved for this? |
I don't believe so. I haven't touched that code in quite a while and wasn't able to easily reproduce these results. |
The same, hello from 2021 |
Anything new on this? First reported in 2018! |
@tdudgeon Do you have the wifi collector enabled? Can you provide the pprof details as described above? |
@discordianfish No, looks like it wasn't disabled. I've done so now.
@discordianfish So after monitoring for a while, there are still pod restarts due to OOM, and memory usage does continually increase until the pod is restarted, but the restarts are less frequent than they were before disabling the wifi collector.
@discordianfish I'm struggling to work out how to generate the pprof details.
@tdudgeon You can just run go tool pprof on your dev laptop, as long as it can reach the node-exporter port. You could also use kubectl port-forward to forward the port.
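A minimal sketch of that workflow; the namespace and pod name here are placeholders for your deployment:

```shell
# Forward the exporter port from a (hypothetical) pod to the laptop
kubectl -n monitoring port-forward pod/node-exporter-xxxxx 9100:9100 &

# Then profile it locally
go tool pprof -top http://localhost:9100/debug/pprof/heap
```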
@discordianfish Here's the data. Hope I did it right! pprof.node_exporter.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz |
This shows that node-exporter was using 3515.72kB of memory in your case. So I don't know what restarts your node-exporter, but memory usage doesn't seem to be the problem, at least not when you created the profile.
@discordianfish Here are dumps from just before the pod was restarted. pprof.node_exporter.alloc_objects.alloc_space.inuse_objects.inuse_space.002.pb.gz |
This is again only using 2064.26kB in the Go process. There must be something else in the container using that memory. But are you sure it's actually getting OOM-killed, and it's not just the Go GC or the Linux fs cache leading to this memory usage pattern?
Yes, definitely being killed:
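One way to confirm the kernel OOM killer is responsible (pod name and namespace are again placeholders):

```shell
# Show the container's last termination reason; "OOMKilled" confirms it
kubectl -n monitoring get pod node-exporter-xxxxx \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

# On the node itself, the kernel log also records the kill
dmesg | grep -i 'oom'
```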
What is strange is that the more RAM the node has, the worse the problem is.
So either there is something else in the container causing this, or it's the node-exporter itself, which seems unlikely but of course is possible. If the latter is the case, it should show up in pprof, and when you captured it, it wasn't present. So I'd suggest you keep trying to pprof it to catch the condition, or look at what else in the container might cause it.
Amazing that node-exporter can report on itself while it's being OOM-killed.
There have been a number of releases since this issue was reported. The only root cause, the wifi collector, has been disabled by default for quite a while. I think we can close this. |
* Disable wifi collector by default
Disable the wifi collector by default due to suspected crashing issues and goroutine leaks.
* prometheus#870
* prometheus#1008
Signed-off-by: Ben Kochie <superq@gmail.com>
Host operating system: output of uname -a:
3.10.0-862.3.2.el7.x86_64
node_exporter version: output of node_exporter --version:
sh-4.2$ node_exporter --version
node_exporter, version 0.16.0 (branch: HEAD, revision: d42bd70)
build user: root@node-exporter-binary-3-build
build date: 20180606-16:48:15
go version: go1.10
node_exporter command line flags:
--path.procfs=/host/proc --path.sysfs=/host/sys
Are you running node_exporter in Docker?
Yes, in OpenShift
Hi,
I use node-exporter (openshift/prometheus-node-exporter:v0.16.0) in OpenShift with Prometheus and Grafana.
I have a problem with memory reclamation.
The pod is killed each time by an OOME (OutOfMemory). Memory usage increases continuously without being reclaimed.
By default, the limits were (template here):
I tested several configurations without success.
Today, the configuration is:
Do you have any idea?
An adjustment to make?
Do you have recommended resource limits?
Thanks in advance for your help !