Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1.8.0 dies on Paho "nil pointer dereference" exception #4804

Closed
rgitzel opened this issue Oct 4, 2018 · 4 comments
Closed

1.8.0 dies on Paho "nil pointer dereference" exception #4804

rgitzel opened this issue Oct 4, 2018 · 4 comments
Labels
area/mqtt bug unexpected problem or unintended behavior
Milestone

Comments

@rgitzel
Copy link
Contributor

rgitzel commented Oct 4, 2018

Since I upgraded from 1.7.1 to 1.8.0 on Friday I've been having all manner of stability issues, with big gaps in my graphs.

Definitely I'm seeing issues similar to #4594. But I am also occasionally seeing outright crashes. Logs of one of them are below.

Relevant telegraf.conf:

[[inputs.mqtt_consumer]]
  servers = ["tcp://mosquitto:1883"]
  qos = 1
  persistent_session = true
  client_id = "foo"
  topics = [
    "influx/input"
  ]
  data_format = "influx"
  # note this exclude
  tagexclude = [
    "host",
    "topic"
  ]

System info:

  • Telegraf 1.8.0 official Docker image
  • Mosquitto 1.4.12 official Docker image
  • under Docker version 17.05.0-ce, build 9f07f0e-synology
  • on Synology Diskstation

Steps to reproduce:

Not deterministic.

Start Telegraf. Sometimes it runs for a couple minutes and stops receiving messages. Sometimes it runs for hours. And three times now it's crashed on a null pointer.

Expected behavior:

Don't crash. Handle the error gracefully.

Actual behavior:

2018/10/03 16:52:59 I! Using config file: /etc/telegraf/telegraf.conf
2018-10-03T16:52:59Z D! Attempting connection to output: influxdb
2018-10-03T16:53:00Z D! Successfully connected to output: influxdb
2018-10-03T16:53:00Z D! Attempting connection to output: file
2018-10-03T16:53:00Z D! Successfully connected to output: file
2018-10-03T16:53:00Z I! Starting Telegraf 1.8.0
2018-10-03T16:53:00Z I! Loaded inputs: inputs.mqtt_consumer
2018-10-03T16:53:00Z I! Loaded aggregators:
2018-10-03T16:53:00Z I! Loaded processors:
2018-10-03T16:53:00Z I! Loaded outputs: influxdb file
2018-10-03T16:53:00Z I! Tags enabled: host=telegraf18
2018-10-03T16:53:00Z I! Agent Config: Interval:10s, Quiet:false, Hostname:"telegraf18", Flush Interval:10s
2018-10-03T16:53:10Z I! MQTT Client Connected
2018-10-03T16:53:20Z D! Output [file] buffer fullness: 4 / 10000 metrics.
2018-10-03T16:53:20Z D! Output [influxdb] buffer fullness: 4 / 10000 metrics.
2018-10-03T16:53:20Z E! Error in plugin [inputs.mqtt_consumer]: took longer to collect than collection interval (10s)
wifi,Device=sonoff4,Router=60:E3:27:A4:55:C2 Signal=100 1538585593000000000
temperature,Device=sonoff4,Sensor=DS18B20 Temperature=11.1 1538585593000000000
wifi,Device=sonoff-back-porch,Router=98:DE:D0:58:05:27 Signal=90 1538585598000000000
temperature,Device=sonoff-back-porch,Sensor=DS18B20 Temperature=7.8 1538585598000000000
2018-10-03T16:53:20Z D! Output [file] wrote batch of 4 metrics in 35.656621ms
2018-10-03T16:53:20Z D! Output [influxdb] wrote batch of 4 metrics in 84.026562ms
2018-10-03T16:53:30Z D! Output [file] buffer fullness: 0 / 10000 metrics.
2018-10-03T16:53:30Z E! Error in plugin [inputs.mqtt_consumer]: took longer to collect than collection interval (10s)
2018-10-03T16:53:30Z D! Output [influxdb] buffer fullness: 0 / 10000 metrics.
2018-10-03T16:53:40Z D! Output [file] buffer fullness: 0 / 10000 metrics.
2018-10-03T16:53:40Z E! Error in plugin [inputs.mqtt_consumer]: took longer to collect than collection interval (10s)
2018-10-03T16:53:40Z D! Output [influxdb] buffer fullness: 0 / 10000 metrics.
2018-10-03T16:53:43Z D! MQTT Consumer, connection error - Network Error : %!s(<nil>)
2018-10-03T16:53:50Z D! Output [file] buffer fullness: 0 / 10000 metrics.
2018-10-03T16:53:50Z D! Output [influxdb] buffer fullness: 0 / 10000 metrics.
2018-10-03T16:54:00Z D! Output [file] buffer fullness: 0 / 10000 metrics.
2018-10-03T16:54:00Z D! Output [influxdb] buffer fullness: 0 / 10000 metrics.
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x50b694]

goroutine 83 [running]:
bytes.(*Buffer).WriteTo(0xc42034cdc8, 0x0, 0x0, 0x40, 0x0, 0xc0)
	/usr/local/go/src/bytes/buffer.go:240 +0x84
github.com/influxdata/telegraf/vendor/github.com/eclipse/paho.mqtt.golang/packets.(*PingreqPacket).Write(0xc420372610, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/influxdata/telegraf/vendor/github.com/eclipse/paho.mqtt.golang/packets/pingreq.go:21 +0x91
github.com/influxdata/telegraf/vendor/github.com/eclipse/paho%2emqtt%2egolang.keepalive(0xc420770000)
	/go/src/github.com/influxdata/telegraf/vendor/github.com/eclipse/paho.mqtt.golang/ping.go:54 +0x75f
created by github.com/influxdata/telegraf/vendor/github.com/eclipse/paho%2emqtt%2egolang.(*client).Connect.func1
	/go/src/github.com/influxdata/telegraf/vendor/github.com/eclipse/paho.mqtt.golang/client.go:272 +0xa17

Additional info:

All sorts of interesting things in that log:

As mentioned elsewhere probably Paho is the problem -- but Telegraf should still handle the error gracefully.

@danielnelson danielnelson added bug unexpected problem or unintended behavior area/mqtt labels Oct 4, 2018
@danielnelson
Copy link
Contributor

Thanks for the report.

We have had mixed reports of both version 1.1.0 (used in Telegraf 1.7.x) and 1.1.1 (Telegraf 1.8.x) not working properly. What would be the most helpful is if someone could volunteer to do a deep dive on the plugin and open some issues upstream.

I also think we should look into removing the AutoReconnect feature of mqtt and handle reconnection ourselves, this may allow us to sidestep some of the bugs in the library.

@rgitzel
Copy link
Contributor Author

rgitzel commented Oct 4, 2018

So far 1.7.x has been stable for me, FWIW.

I'm going to do some more diagnosis, and will probably still create an issue upstream for this crash -- if nothing else, that error message isn't helpful. ;-)

My experience suggests that it's the ping that's the problem. In most of my failure cases, Mosquitto has justifiably stopped publishing messages because it's not been recently pinged by Telegraf. I'll see if I can narrow that down.

@danielnelson
Copy link
Contributor

@rgitzel I made some changes in 1.8.2 that should resolve this issue. There are also some fairly large changes for 1.9.0 to support the decoupling of inputs and outputs (#4938), which could impact this plugin, do you think you could test with the latest release candidate (1.9.0-rc2 currently)?

https://github.com/influxdata/telegraf/releases

@danielnelson danielnelson added this to the 1.8.2 milestone Nov 12, 2018
@rgitzel
Copy link
Contributor Author

rgitzel commented Nov 16, 2018

@danielnelson I'll see if I can give it a try this weekend. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/mqtt bug unexpected problem or unintended behavior
Projects
None yet
Development

No branches or pull requests

2 participants