fatal error: runtime.unlock: lock count with Go >= 1.10 #870
I have the same issue as @daenney. Host operating system: output of uname -a:
If it's reproducible, could you try bisecting which collectors are enabled to see which one it is?
Also, running with debug-level logs would help identify whether there is a problem with a specific collector.
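To make that bisection concrete, a minimal sketch using the 0.16-era flag convention (the exact flag names are assumptions here; confirm against --help on your build):

```sh
# Run with debug logging so a collector-specific failure shows up in the journal,
# and disable suspect collectors one at a time via the --no-collector.<name> flags.
./node_exporter \
  --log.level=debug \
  --no-collector.wifi \
  --no-collector.zfs

# List the collector flags this particular build understands:
./node_exporter --help 2>&1 | grep -- '--collector\.'
```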
@mschroeder223 That's interesting. In my case the machine it blows up on is bare metal. I've set systemd to restart the node_exporter for now, but looking at the logs a new error surfaced too: I've updated my unit file to run with debug logs; I'll check in on it in 24 hours and see what gives.
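A restart-on-failure unit along those lines might look like the sketch below; the install path, user, and flags are assumptions, not copied from the thread:

```ini
# /etc/systemd/system/node_exporter.service (sketch; adjust the binary path to your install)
[Unit]
Description=Prometheus node_exporter

[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter --log.level=debug
# Bring the exporter back up automatically when it dies with the fatal runtime error.
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```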
It took about 4.5 hrs for it to crash this morning; this is the output that was captured: https://gist.github.com/mschroeder223/16e48f64a8b17888eae0e1d597a83504 I do see several collectors which are running and do not need to be enabled, so I am going to disable them and see if it makes any difference.
There is quite a lot going on here. I believe the "lock count" panic is a red herring, and is the result of the Go runtime raising a panic while handling another panic (see "panic during panic" in the log). This issue appears to be closely related to golang/go#24059. A fix was included in Go 1.10.1. As far as I understand, an actual panic is masked by a bug in the panic-handling code in the Go runtime. I suggest building a new release candidate with Go 1.10.1 and letting @mschroeder223 and @daenney run that release candidate for a while. Hopefully we'll be able to catch the actual panic afterwards.
I'll work on building a new RC release.
I don't know if this will help isolate the problem at all, but between the two servers that are continuously crashing, I updated one to exclude some unused collectors: and the other server, which continued to crash all weekend: Also, one more crash log from the weekend, but with no real trace history:
I've published v0.16.0-rc.1. This is now built with Go 1.10.1. Please give it a try and see if it fixes the crash.
Pushed it out today; I'll check back in 24 hrs and see if any crashes have shown up. @grobie's line of thought sounds rather plausible though, so let's hope that is it.
I'm still getting this error even with v0.16.0-rc.1. It runs for a while and then it crashes.
uname -a:
node_exporter --version:
journalctl -u node_exporter:
We really need a stack trace here to debug this further. Please check your journalctl output at that time without the filter.
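A sketch of pulling the unfiltered journal around the crash; the unit name and time window below are placeholder assumptions, not values from the thread:

```sh
# Dump everything the unit logged around the crash, without piping through grep,
# so any goroutine dump or stack trace is preserved intact.
journalctl -u node_exporter --since "2018-04-16 19:40" --until "2018-04-16 19:55" --no-pager
```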
Without the filter, it doesn't show anything more useful, I think.
I can confirm this is still happening with rc.0:
That happened 95268 times within two seconds. No more details/stack trace :-/
Oh, I didn't know somebody ran into this a long time ago already and filed an upstream issue: golang/go#15438
There is also an issue here: #228
I can confirm I'm still experiencing this. Node Exporter crashes about once every two days now with that error. I'm running in debug mode but not getting any stack traces in the journal either.
Actually... this seems to have gotten a bit worse for me. Before, once the
Although we don't make much use of cgo on linux, it would make sense to rule this out.
Alright. Built one with Go 1.10.1, with:
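The exact build invocation was elided above; a cgo-free build of that era would presumably look roughly like the sketch below (the path and plain go build invocation are assumptions, and the project's own Makefile may differ):

```sh
# Build node_exporter with cgo disabled to rule out cgo-related crashes.
cd "$GOPATH/src/github.com/prometheus/node_exporter"
CGO_ENABLED=0 go build -o node_exporter .
./node_exporter --version
```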
It's deployed now; let's see what happens. It might take some time for the bug to manifest, though.
So far so good. It's been running for roughly 3 days now, no errors, it hasn't been restarted by systemd since I deployed it, and there's no missing data in any graphs (or complaints by Prometheus that the target is unreachable). This seems to suggest it is indeed related to cgo somehow. What can I do to help narrow this down? Can we do some kind of a build matrix, building a node_exporter with only one extension that uses cgo at a time, to try and narrow this down? Or build it with cgo, disable all cgo-powered extensions, and then enable one at a time? Would it help to build it with cgo but on an older Go version (say 1.8)?
The only collector enabled on linux that uses any cgo reference itself is timex, but this only reads some constants and I doubt that's the problem. But you could try running the official release and disable that collector. Though I suspect it's rather related to some stdlib code that uses a C implementation instead of a native Go implementation.
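Disabling that one collector on the official release would be a matter of something like the following (flag name assumed from the --no-collector.<name> convention):

```sh
# Run the official cgo-enabled release, but with the timex collector switched off.
./node_exporter --no-collector.timex
```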
@daenney Let's try Go 1.9.5 + CGO first. Go 1.9 seemed stable enough in node_exporter 0.15.x, so it should be the same with 0.16. If that still crashes, we can narrow it down to code changes in the node_exporter.
A note from our conversation on that thread:
I am curious if we have a Cgo bug somewhere in one of the collectors and the interaction with the WiFi collector is causing it all to explode.
The fact that it crashes with just the wifi collector suggests this isn't an issue with another collector. Does the node_exporter core rely on Cgo for anything?
Kind of an aside here, but given the node exporter seems to primarily target backend systems, should we consider disabling the WiFi collector by default, at least until the issue is resolved?
While I would like to turn it off, I think it would make too many eyes stop looking into it and it would never get fixed. I'll consider it.
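If it does end up disabled by default, opting back in should just be the usual enable flag (form assumed from the existing --[no-]collector.<name> convention):

```sh
# Explicitly re-enable the WiFi collector once it is off by default.
./node_exporter --collector.wifi
```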
… prometheus-node-exporter: work around prometheus/node_exporter#870
Reading up on golang/go#25128, it seems @mdlayher has applied a fix to netlink in the form of mdlayher/netlink@3d8cc9a that causes the error to disappear. Even though the WiFi collector is now disabled by default, it seems worthwhile to upgrade to at least that commit so that those who do need the WiFi collector and enable it can have a more stable experience.
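For what it's worth, with a modules-aware Go toolchain (newer than the vendoring setup the project used at the time), pinning the dependency to at least that commit would look roughly like this; the short hash is the one referenced above:

```sh
# Bump the mdlayher/netlink dependency to (at least) the fixing commit.
go get github.com/mdlayher/netlink@3d8cc9a
go mod tidy
```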
@daenney Now that we have isolated this issue a bit further, maybe it makes sense to update the title and description of this issue, since it seems like it might take a while to get this fixed upstream and released.
@discordianfish Mmm, indeed. I updated it to "when WiFi collector is enabled", only to realise that this is no longer the case: with the changes @mdlayher has made, the issue appears resolved. Since there is an upstream issue about the particulars of this bug, should we consider closing this? Is there anything left to do from a Prometheus/node_exporter side of things?
@daenney Agree, it's probably fine to close this.
Changes since v0.16.0-rc.2:
* Remove gmond collector prometheus#852
* Build with Go 1.9 [0]
* Fix /proc/net/dev/ interface name handling prometheus#910
[0]: prometheus#870
Signed-off-by: Ben Kochie <superq@gmail.com>
* Disable wifi collector by default
Disable the wifi collector by default due to suspected crashing issues and goroutine leaks.
* prometheus#870
* prometheus#1008
Signed-off-by: Ben Kochie <superq@gmail.com>
Host operating system: output of uname -a
node_exporter version: output of node_exporter --version
Used the release artifact at: https://github.com/prometheus/node_exporter/releases/download/v0.16.0-rc.0/node_exporter-0.16.0-rc.0.linux-amd64.tar.gz
node_exporter command line flags: none, the defaults for 0.16 match my needs
Are you running node_exporter in Docker? No
What did you do that produced an error? Just ran it for a couple of days.
What did you expect to see? It not to crash.
What did you see instead? That fatal error line got spewed about 1,000 times, all logged at 19:47:54 according to systemd.