fatal error: runtime.unlock: lock count with Go >= 1.10 #870
I have the same issue as @daenney. Host operating system: output of uname -a:
If it's reproducible, could you try bisecting which collectors are enabled to see which one it is?
Also, running with debug-level logs would help identify whether there is a problem with a specific collector.
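To make that bisection concrete, a minimal sketch using the 0.16-era flag convention (the exact flag names are assumptions here; confirm against --help on your build):

```sh
# Run with debug logging so a collector-specific failure shows up in the journal,
# and disable suspect collectors one at a time via the --no-collector.<name> flags.
./node_exporter \
  --log.level=debug \
  --no-collector.wifi \
  --no-collector.zfs

# List the collector flags this particular build understands:
./node_exporter --help 2>&1 | grep -- '--collector\.'
```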
@mschroeder223 That's interesting. In my case the machine it blows up on is bare metal. I've set systemd to restart the node_exporter for now, but looking at the logs a new error surfaced too: I've updated my unit file to run with debug logs; I'll check in on it in 24 hours and see what gives.
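A restart-on-failure unit along those lines might look like the sketch below; the install path, user, and flags are assumptions, not copied from the thread:

```ini
# /etc/systemd/system/node_exporter.service (sketch; adjust the binary path to your install)
[Unit]
Description=Prometheus node_exporter

[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter --log.level=debug
# Bring the exporter back up automatically when it dies with the fatal runtime error.
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```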
It took about 4.5 hrs for it to crash this morning; this is the output that was captured: https://gist.github.com/mschroeder223/16e48f64a8b17888eae0e1d597a83504 I do see several collectors which are running and do not need to be enabled, so I am going to disable them and see if it makes any difference.
There is quite a lot going on here. I believe the "lock count" panic is a red herring, and is the result of the Go runtime raising a panic while handling another panic (see "panic during panic" in the log). This issue appears to be closely related to golang/go#24059. A fix was included in Go 1.10.1. As far as I understand, an actual panic is masked by a bug in the panic-handling code in the Go runtime. I suggest building a new release candidate with Go 1.10.1 and letting @mschroeder223 and @daenney run that release candidate for a while. Hopefully we'll be able to catch the actual panic afterwards.
I'll work on building a new RC release.
I don't know if this will help isolate the problem at all, but between the two servers that are continuously crashing, I updated one to exclude some unused collectors: and the other server, which continued to crash all weekend: Also, one more crash log from the weekend, but with no real trace history:
I've published v0.16.0-rc.1. This is now built with Go 1.10.1. Please give it a try and see if it fixes the crash.
Pushed it out today; I'll check back in 24 hrs and see if any crashes have shown up. @grobie's line of thought sounds rather plausible though, so let's hope that is it.
I'm still getting this error even with v0.16.0-rc.1. It runs for a while and then it crashes.
uname -a:
node_exporter --version:
journalctl -u node_exporter:
We really need a stack trace here to debug this further. Please check your journalctl output at that time without the filter.
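A sketch of pulling the unfiltered journal around the crash; the unit name and time window below are placeholder assumptions, not values from the thread:

```sh
# Dump everything the unit logged around the crash, without piping through grep,
# so any goroutine dump or stack trace is preserved intact.
journalctl -u node_exporter --since "2018-04-16 19:40" --until "2018-04-16 19:55" --no-pager
```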
Without the filter, it doesn't show anything more useful, I think.
I can confirm this is still happening with rc.0:
That happened 95268 times within two seconds. No more details/stack trace :-/
Oh, I didn't know somebody ran into this a long time ago already and filed an upstream issue: golang/go#15438
There is also an issue here: #228
I can confirm I'm still experiencing this. Node Exporter crashes about once every two days now with that error. I'm running in debug mode but not getting any stack traces in the journal either.
Actually... this seems to have gotten a bit worse for me. Before, once the
Although we don't make much use of cgo on linux, it would make sense to rule this out.
Alright. Built one with Go 1.10.1, with:
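The exact build invocation was elided above; a cgo-free build of that era would presumably look roughly like the sketch below (the path and plain go build invocation are assumptions, and the project's own Makefile may differ):

```sh
# Build node_exporter with cgo disabled to rule out cgo-related crashes.
cd "$GOPATH/src/github.com/prometheus/node_exporter"
CGO_ENABLED=0 go build -o node_exporter .
./node_exporter --version
```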
It's deployed now; let's see what happens. It might take some time for the bug to manifest, though.
So far so good. It's been running for roughly 3 days now, no errors, it hasn't been restarted by systemd since I deployed it, and there's no missing data in any graphs (or complaints by Prometheus that the target is unreachable). This seems to suggest it is indeed related to cgo somehow. What can I do to help narrow this down? Can we do some kind of a build matrix, building a node_exporter with only one extension that uses cgo at a time, to try and narrow this down? Or build it with cgo, disable all cgo-powered extensions, and then enable one at a time? Would it help to build it with cgo but on an older Go version (say 1.8)?
The only collector enabled on linux that uses any cgo reference itself is timex, but this only reads some constants and I doubt that's the problem. But you could try running the official release and disable that collector. Though I suspect it's rather related to some stdlib code that uses a C implementation instead of a native Go implementation.
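Disabling that one collector on the official release would be a matter of something like the following (flag name assumed from the --no-collector.<name> convention):

```sh
# Run the official cgo-enabled release, but with the timex collector switched off.
./node_exporter --no-collector.timex
```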
@daenney Let's try Go 1.9.5 + CGO first. Go 1.9 seemed stable enough in node_exporter 0.15.x, so it should be the same with 0.16. If that still crashes, we can narrow it down to code changes in the node_exporter.
A note from our conversation on that thread:
I am curious if we have a Cgo bug somewhere in one of the collectors and the interaction with the WiFi collector is causing it all to explode.
The fact that it crashes with just the wifi collector suggests this isn't an issue with another collector. Does the node_exporter core rely on Cgo for anything?
Kind of an aside here, but given the node exporter seems to primarily target backend systems, should we consider disabling the WiFi collector by default, at least until the issue is resolved?
While I would like to turn it off, I think it would make too many eyes stop looking into it and it would never get fixed. I'll consider it.
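If it does end up disabled by default, opting back in should just be the usual enable flag (form assumed from the existing --[no-]collector.<name> convention):

```sh
# Explicitly re-enable the WiFi collector once it is off by default.
./node_exporter --collector.wifi
```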
… prometheus-node-exporter: work around prometheus/node_exporter#870
Reading up on golang/go#25128, it seems @mdlayher has applied a fix to netlink in the form of mdlayher/netlink@3d8cc9a that causes the error to disappear. Even though the WiFi collector is now disabled by default, it seems worthwhile to upgrade to at least that commit so that those who do need the WiFi collector and enable it can have a more stable experience.
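For what it's worth, with a modules-aware Go toolchain (newer than the vendoring setup the project used at the time), pinning the dependency to at least that commit would look roughly like this; the short hash is the one referenced above:

```sh
# Bump the mdlayher/netlink dependency to (at least) the fixing commit.
go get github.com/mdlayher/netlink@3d8cc9a
go mod tidy
```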
@daenney Now that we have isolated this issue a bit further, maybe it makes sense to update the title and description of this issue, since it seems like it might take a while to get this fixed upstream and released.
@discordianfish Mmm, indeed. I updated it to "when WiFi collector is enabled", only to realise that this is no longer the case: with the changes @mdlayher has made, the issue appears resolved. Since there is an upstream issue about the particulars of this bug, should we consider closing this? Is there anything left to do from a Prometheus/node_exporter side of things?
@daenney Agree, it's probably fine to close this.
Changes since v0.16.0-rc.2:
* Remove gmond collector prometheus#852
* Build with Go 1.9 [0]
* Fix /proc/net/dev/ interface name handling prometheus#910
[0]: prometheus#870
Signed-off-by: Ben Kochie <superq@gmail.com>
* Disable wifi collector by default
Disable the wifi collector by default due to suspected crashing issues and goroutine leaks.
* prometheus#870
* prometheus#1008
Signed-off-by: Ben Kochie <superq@gmail.com>
Host operating system: output of uname -a
node_exporter version: output of node_exporter --version
Used the release artifact at: https://github.com/prometheus/node_exporter/releases/download/v0.16.0-rc.0/node_exporter-0.16.0-rc.0.linux-amd64.tar.gz
node_exporter command line flags: none, the defaults for 0.16 match my needs
Are you running node_exporter in Docker? No
What did you do that produced an error? Just ran it for a couple of days.
What did you expect to see? It not to crash.
What did you see instead? That fatal error line got spewed about 1,000 times, all logged at 19:47:54 according to systemd.