Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Telegraf service continuously restarting, fatal error: MHeap_AllocLocked - MSpan not free #3299

Closed
emexelem opened this issue Oct 3, 2017 · 4 comments
Labels
bug unexpected problem or unintended behavior panic issue that results in panics from Telegraf

Comments

@emexelem
Copy link

emexelem commented Oct 3, 2017

Bug report

Relevant telegraf.conf:

Almost the default config, interval + flush is 20s, inputs.net is enabled.
single output is an influxdb on a remote server

System info:

Telegraf 1.4.1 installed from deb package on Ubuntu Server 16.04
The host is an Azure D4s-v3 instance running Docker and a single container, gitlab-ce

Steps to reproduce:

Just start Telegraf service and wait less than 5 minutes to see the service crashing with the stack trace shared below.

Expected behavior:

The service should not crash

Actual behavior:

The service crashes every 2 minutes approximately

Additional info:

There is plenty of free memory on the system. The host has 16GB and only 4GB is used
I have 2 other similar instances running with the same setup (but different services in docker containers) and the problem can be reproduced only with this instance.

free -m
              total        used        free      shared  buff/cache   available
Mem:          16028        4103        7250          39        4673       11341
Swap:             0           0           0

journalctl
Oct 03 11:35:20 gitlab telegraf[27405]: fatal error: MHeap_AllocLocked - MSpan not free
Oct 03 11:35:20 gitlab telegraf[27405]: runtime stack:
Oct 03 11:35:20 gitlab telegraf[27405]: runtime.throw(0x14690f2, 0x22)
Oct 03 11:35:20 gitlab telegraf[27405]:         /usr/local/go/src/runtime/panic.go:596 +0x95
Oct 03 11:35:20 gitlab telegraf[27405]: runtime.(*mheap).allocSpanLocked(0x1d1a860, 0x1, 0x7f1157961c58)
Oct 03 11:35:20 gitlab telegraf[27405]:         /usr/local/go/src/runtime/mheap.go:690 +0x435
Oct 03 11:35:20 gitlab telegraf[27405]: runtime.(*mheap).alloc_m(0x1d1a860, 0x1, 0x2b, 0x7f115786d000)
Oct 03 11:35:20 gitlab telegraf[27405]:         /usr/local/go/src/runtime/mheap.go:562 +0xe2
Oct 03 11:35:20 gitlab telegraf[27405]: runtime.(*mheap).alloc.func1()
Oct 03 11:35:20 gitlab telegraf[27405]:         /usr/local/go/src/runtime/mheap.go:627 +0x4b
Oct 03 11:35:20 gitlab telegraf[27405]: runtime.systemstack(0x7f114dffad98)
Oct 03 11:35:20 gitlab telegraf[27405]:         /usr/local/go/src/runtime/asm_amd64.s:343 +0xab
Oct 03 11:35:20 gitlab telegraf[27405]: runtime.(*mheap).alloc(0x1d1a860, 0x1, 0x1000000002b, 0x417a6d)
Oct 03 11:35:20 gitlab telegraf[27405]:         /usr/local/go/src/runtime/mheap.go:628 +0xa0
Oct 03 11:35:20 gitlab telegraf[27405]: runtime.(*mcentral).grow(0x1d1ce60, 0x0)
Oct 03 11:35:20 gitlab telegraf[27405]:         /usr/local/go/src/runtime/mcentral.go:212 +0x8e
Oct 03 11:35:20 gitlab telegraf[27405]: runtime.(*mcentral).cacheSpan(0x1d1ce60, 0x7f114dffaeb0)
Oct 03 11:35:20 gitlab telegraf[27405]:         /usr/local/go/src/runtime/mcentral.go:93 +0x12a
Oct 03 11:35:20 gitlab telegraf[27405]: runtime.(*mcache).refill(0x7f11579ad000, 0x2b, 0xc4203e7901)
Oct 03 11:35:20 gitlab telegraf[27405]:         /usr/local/go/src/runtime/mcache.go:122 +0xa0
Oct 03 11:35:20 gitlab telegraf[27405]: runtime.(*mcache).nextFree.func1()
Oct 03 11:35:20 gitlab telegraf[27405]:         /usr/local/go/src/runtime/malloc.go:538 +0x32
Oct 03 11:35:20 gitlab telegraf[27405]: runtime.systemstack(0xc420018600)
Oct 03 11:35:20 gitlab telegraf[27405]:         /usr/local/go/src/runtime/asm_amd64.s:327 +0x79
Oct 03 11:35:20 gitlab telegraf[27405]: runtime.mstart()
Oct 03 11:35:20 gitlab telegraf[27405]:         /usr/local/go/src/runtime/proc.go:1132
Oct 03 11:35:20 gitlab telegraf[27405]: goroutine 546 [running]:
Oct 03 11:35:20 gitlab telegraf[27405]: runtime.systemstack_switch()
Oct 03 11:35:20 gitlab telegraf[27405]:         /usr/local/go/src/runtime/asm_amd64.s:281 fp=0xc4208fa978 sp=0xc4208fa970
Oct 03 11:35:20 gitlab telegraf[27405]: runtime.(*mcache).nextFree(0x7f11579ad000, 0x41732b, 0xc4208faa80, 0x4112c2, 0xc4202232c0)
Oct 03 11:35:20 gitlab telegraf[27405]:         /usr/local/go/src/runtime/malloc.go:539 +0xb9 fp=0xc4208fa9d0 sp=0xc4208fa978
Oct 03 11:35:20 gitlab telegraf[27405]: runtime.mallocgc(0x1000, 0x11f2880, 0xc4208faa01, 0x411b18)
Oct 03 11:35:20 gitlab telegraf[27405]:         /usr/local/go/src/runtime/malloc.go:691 +0x827 fp=0xc4208faa70 sp=0xc4208fa9d0
Oct 03 11:35:20 gitlab telegraf[27405]: runtime.makeslice(0x11f2880, 0x1000, 0x1000, 0x1210d60, 0x14baf50, 0xc420a443f8)
Oct 03 11:35:20 gitlab telegraf[27405]:         /usr/local/go/src/runtime/slice.go:54 +0x7b fp=0xc4208faac0 sp=0xc4208faa70
Oct 03 11:35:20 gitlab telegraf[27405]: os.(*File).readdirnames(0xc420a443f8, 0xffffffffffffffff, 0x18, 0x54e003, 0xc4208fabb8, 0xc4208fabb0, 0xc42030c0a0)
Oct 03 11:35:20 gitlab telegraf[27405]:         /usr/local/go/src/os/dir_unix.go:50 +0x437 fp=0xc4208fab68 sp=0xc4208faac0
Oct 03 11:35:20 gitlab telegraf[27405]: os.(*File).Readdirnames(0xc420a443f8, 0xffffffffffffffff, 0xc420a443f8, 0x0, 0x0, 0x0, 0x0)
Oct 03 11:35:20 gitlab telegraf[27405]:         /usr/local/go/src/os/dir.go:45 +0x3e fp=0xc4208fabb0 sp=0xc4208fab68
Oct 03 11:35:20 gitlab telegraf[27405]: path/filepath.glob(0xc4202ad030, 0x9, 0xc4203e782d, 0x4, 0xc4207687c0, 0x3, 0x4, 0xc4207687c0, 0x3, 0x4, ...)
Oct 03 11:35:20 gitlab telegraf[27405]:         /usr/local/go/src/path/filepath/match.go:324 +0x172 fp=0xc4208fac68 sp=0xc4208fabb0
Oct 03 11:35:20 gitlab telegraf[27405]: path/filepath.Glob(0xc4203e7820, 0x11, 0x5, 0x1444a80, 0xc, 0xc4203e7820, 0x11)
Oct 03 11:35:20 gitlab telegraf[27405]:         /usr/local/go/src/path/filepath/match.go:265 +0x181 fp=0xc4208fad18 sp=0xc4208fac68
Oct 03 11:35:20 gitlab telegraf[27405]: github.com/influxdata/telegraf/plugins/inputs/system.(*Processes).gatherFromProc(0xc420134540, 0xc4202231d0, 0x0, 0x0)
Oct 03 11:35:20 gitlab telegraf[27405]:         /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/plugins/inputs/system/processes.go:135 +0x84 fp=0xc4208faf28 sp=0xc4208fad18
Oct 03 11:35:20 gitlab telegraf[27405]: github.com/influxdata/telegraf/plugins/inputs/system.(*Processes).Gather(0xc420134540, 0x1c2b160, 0xc4201e3e80, 0x0, 0x0)
Oct 03 11:35:20 gitlab telegraf[27405]:         /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/plugins/inputs/system/processes.go:56 +0xf2 fp=0xc4208faf80 sp=0xc4208faf28
Oct 03 11:35:20 gitlab telegraf[27405]: github.com/influxdata/telegraf/agent.gatherWithTimeout.func1(0xc4207ccea0, 0xc42015e5c0, 0xc4201e3e80)
Oct 03 11:35:20 gitlab telegraf[27405]:         /home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/agent/agent.go:153 +0x49 fp=0xc4208fafc8 sp=0xc4208faf80
Oct 03 11:35:20 gitlab telegraf[27405]: runtime.goexit()
Oct 03 11:35:20 gitlab telegraf[27405]:         /usr/local/go/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc4208fafd0 sp=0xc4208fafc8
Oct 03 11:35:20 gitlab telegraf[27405]: created by github.com/influxdata/telegraf/agent.gatherWithTimeout
...
Oct 03 11:35:20 gitlab systemd[1]: telegraf.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
...
Oct 03 11:35:20 gitlab systemd[1]: telegraf.service: Service hold-off time over, scheduling restart.
Oct 03 11:35:20 gitlab systemd[1]: Stopped The plugin-driven server agent for reporting metrics into InfluxDB.
Oct 03 11:35:20 gitlab systemd[1]: Started The plugin-driven server agent for reporting metrics into InfluxDB.

tail -n 25 /var/log/telegraf/telegraf.log
2017-10-03T11:56:40Z D! Output [influxdb] wrote batch of 25 metrics in 39.86744ms
2017-10-03T11:57:00Z D! Attempting connection to output: influxdb
2017-10-03T11:57:00Z D! Successfully connected to output: influxdb
2017-10-03T11:57:00Z I! Starting Telegraf v1.4.1
2017-10-03T11:57:00Z I! Loaded outputs: influxdb
2017-10-03T11:57:00Z I! Loaded inputs: inputs.cpu inputs.disk inputs.diskio inputs.swap inputs.system inputs.net inputs.kernel inputs.mem inputs.processes
2017-10-03T11:57:00Z I! Tags enabled: host=gitlab
2017-10-03T11:57:00Z I! Agent Config: Interval:20s, Quiet:false, Hostname:"gitlab", Flush Interval:20s
2017-10-03T11:57:40Z D! Output [influxdb] buffer fullness: 45 / 10000 metrics.
2017-10-03T11:57:40Z D! Output [influxdb] wrote batch of 45 metrics in 110.597542ms
2017-10-03T11:58:00Z D! Output [influxdb] buffer fullness: 25 / 10000 metrics.
2017-10-03T11:58:00Z D! Output [influxdb] wrote batch of 25 metrics in 81.409852ms
2017-10-03T11:58:20Z D! Attempting connection to output: influxdb
2017-10-03T11:58:20Z D! Successfully connected to output: influxdb
2017-10-03T11:58:20Z I! Starting Telegraf v1.4.1
2017-10-03T11:58:20Z I! Loaded outputs: influxdb
2017-10-03T11:58:20Z I! Loaded inputs: inputs.kernel inputs.processes inputs.net inputs.cpu inputs.diskio inputs.swap inputs.system inputs.disk inputs.mem
2017-10-03T11:58:20Z I! Tags enabled: host=gitlab
2017-10-03T11:58:20Z I! Agent Config: Interval:20s, Quiet:false, Hostname:"gitlab", Flush Interval:20s
2017-10-03T11:59:00Z D! Output [influxdb] buffer fullness: 45 / 10000 metrics.
2017-10-03T11:59:00Z D! Output [influxdb] wrote batch of 45 metrics in 87.449265ms
2017-10-03T11:59:20Z D! Output [influxdb] buffer fullness: 25 / 10000 metrics.
2017-10-03T11:59:20Z D! Output [influxdb] wrote batch of 25 metrics in 183.771203ms
2017-10-03T11:59:40Z D! Output [influxdb] buffer fullness: 25 / 10000 metrics.
2017-10-03T11:59:40Z D! Output [influxdb] wrote batch of 25 metrics in 41.447414ms

@emexelem
Copy link
Author

emexelem commented Oct 3, 2017

The service finally crashed with the stack trace below, and isn't restarted automatically by systemd as it did not exited:

Oct 03 12:24:00 gitlab telegraf[38285]: fatal error: workbuf is empty
Oct 03 12:24:00 gitlab telegraf[38285]: runtime: pointer 0xc420337180 to unallocated span idx=0x19b span.base()=0x7fa75a357e08 span.limit=0x7fa75a357e28 span.state=0
Oct 03 12:24:00 gitlab telegraf[38285]: runtime: found in object at *(0xc4202e8d80+0x2e8)
Oct 03 12:24:00 gitlab telegraf[38285]: object=0xc4202e8d80 k=0x6210174 s.base()=0xc4202e8000 s.limit=0xc4202e9f80 s.sizeclass=32 s.elemsize=1152 s.state=_MSpanInUse
Oct 03 12:24:00 gitlab telegraf[38285]:  *(object+0) = 0xc4202e8900
...
Oct 03 12:24:00 gitlab telegraf[38285]:  *(object+1016) = 0x0
Oct 03 12:24:00 gitlab telegraf[38285]:  ...
Oct 03 12:24:00 gitlab telegraf[38285]: fatal error: found bad pointer in Go heap (incorrect use of unsafe or cgo?)

systemctl status telegraf
● telegraf.service - The plugin-driven server agent for reporting metrics into InfluxDB
   Loaded: loaded (/lib/systemd/system/telegraf.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2017-10-03 12:21:20 UTC; 11min ago
     Docs: https://github.com/influxdata/telegraf
  Process: 37563 ExecReload=/bin/kill -HUP $MAINPID (code=exited, status=0/SUCCESS)
 Main PID: 38285 (telegraf)
    Tasks: 12
   Memory: 15.5M
      CPU: 475ms
   CGroup: /system.slice/telegraf.service
           └─38285 /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d

Oct 03 12:24:00 gitlab telegraf[38285]:  *(object+960) = 0x0
Oct 03 12:24:00 gitlab telegraf[38285]:  *(object+968) = 0x0
Oct 03 12:24:00 gitlab telegraf[38285]:  *(object+976) = 0x0
Oct 03 12:24:00 gitlab telegraf[38285]:  *(object+984) = 0x0
Oct 03 12:24:00 gitlab telegraf[38285]:  *(object+992) = 0x0
Oct 03 12:24:00 gitlab telegraf[38285]:  *(object+1000) = 0x0
Oct 03 12:24:00 gitlab telegraf[38285]:  *(object+1008) = 0x0
Oct 03 12:24:00 gitlab telegraf[38285]:  *(object+1016) = 0x0
Oct 03 12:24:00 gitlab telegraf[38285]:  ...
Oct 03 12:24:00 gitlab telegraf[38285]: fatal error: found bad pointer in Go heap (incorrect use of unsafe or cgo?)

@danielnelson
Copy link
Contributor

Can you check if the problem persists with a minimal configuration, only a file output and socket_listener:

[[outputs.file]]
  files = ["stdout"]

[[inputs.socket_listener]]
  service_address = "udp://:8094"

@emexelem
Copy link
Author

emexelem commented Oct 4, 2017

No, the problem does not persist with this minimal config. I'll add inputs and outputs progressively to try to identify which one is causing the problem.

@danielnelson danielnelson added bug unexpected problem or unintended behavior panic issue that results in panics from Telegraf labels Oct 24, 2017
@ssoroka
Copy link
Contributor

ssoroka commented Nov 4, 2020

Haven't heard back and this issue is pretty stale. open up a fresh issue if you're still having problems.

@ssoroka ssoroka closed this as completed Nov 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug unexpected problem or unintended behavior panic issue that results in panics from Telegraf
Projects
None yet
Development

No branches or pull requests

3 participants