Telegraf fails stop sending metrics - took longer to collect than collection interval #3697

adityask2 · 2018-01-18T17:47:48Z

Bug report

Telegraf seems to be choking after certain amount of time. I'm not using aggregators. Seems like there other issues resported on the same lines - #3629

Relevant telegraf.conf:

telegraf.txt

System info:

Telegraf v1.6.0~6e240567 (git: master 6e24056)
Using it on Centos 7

Steps to reproduce:

service start telegraf
wait for some time (varies from system to system) for it reproduce.

Expected behavior:

Telegraf publishing metrics to mqtt.

Actual behavior:

Telegraf stops publishing metrics seemingly randomly, all input plugins start to fail with:
2018-01-18T16:10:00Z E! Error in plugin [inputs.mem]: took longer to collect than collection interval (10s)
2018-01-18T16:10:00Z E! Error in plugin [inputs.mysql]: took longer to collect than collection interval (10s)
2018-01-18T16:10:00Z E! Error in plugin [inputs.cpu]: took longer to collect than collection interval (10s)
2018-01-18T16:10:10Z E! Error in plugin [inputs.mysql]: took longer to collect than collection interval (10s)
2018-01-18T16:10:10Z E! Error in plugin [inputs.cpu]: took longer to collect than collection interval (10s)
2018-01-18T16:10:10Z E! Error in plugin [inputs.mem]: took longer to collect than collection interval (10s)
2018-01-18T16:10:20Z E! Error in plugin [inputs.mysql]: took longer to collect than collection interval (10s)
2018-01-18T16:10:20Z E! Error in plugin [inputs.cpu]: took longer to collect than collection interval (10s)
2018-01-18T16:10:20Z E! Error in plugin [inputs.mem]: took longer to collect than collection interval (10s)
2018-01-18T16:10:30Z E! Error in plugin [inputs.mysql]: took longer to collect than collection interval (10s)
2018-01-18T16:10:30Z E! Error in plugin [inputs.cpu]: took longer to collect than collection interval (10s)
2018-01-18T16:10:30Z E! Error in plugin [inputs.mem]: took longer to collect than collection interval (10s)
2018-01-18T16:10:40Z E! Error in plugin [inputs.mysql]: took longer to collect than collection interval (10s)
2018-01-18T16:10:40Z E! Error in plugin [inputs.cpu]: took longer to collect than collection interval (10s)
2018-01-18T16:10:40Z E! Error in plugin [inputs.mem]: took longer to collect than collection interval

Additional info:

telegraf-dump.txt

The text was updated successfully, but these errors were encountered:

danielnelson · 2018-01-19T00:02:13Z

This was caused by the mqtt output, I believe this issue I opened upstream is the cause: eclipse-paho/paho.mqtt.golang#185

I should be able to improve the use of timeouts in the output to avoid the hang.

adityask2 · 2018-02-01T10:44:58Z

Hi,
I tested with timeout fix. Telegraf was able to send data successfully for 8 days. However, after 8 days 44 of machines choked. Attaching the telegraf conf.
telegraf.conf.txt

danielnelson · 2018-02-01T18:54:35Z

Do you think you can try to get another stack dump by killing the process with SIGQUIT?

adityask2 · 2018-02-01T19:25:40Z

I'm running the process in background and had closed the session. Used kill -3 command. However not getting the stack dump. Is there any other way to get it?

danielnelson · 2018-02-01T19:37:20Z

I don't know of any other way, any log messages when it started or just the took longer to collect than collection interval?

adityask2 · 2018-02-02T06:14:49Z

Attaching telegraf log
telegraf.log

Will see if I can get telegraf dump.

adityask2 · 2018-02-02T13:14:35Z

Does taking core dump through gcore be useful?

danielnelson · 2018-02-02T18:55:37Z

I haven't tried it, but according to this blog it should work.

adityask2 · 2018-02-05T06:49:30Z

Sharing the dump.
telegraf_coredump_131.txt

adityask2 · 2018-02-06T16:05:15Z

Was able to catch the dump using
kill -SIGABRT
This dumps the stack Trace to journal

danielnelson · 2018-02-06T19:45:15Z

This stack is trickier than the last one, it doesn't look like we are blocked on a channel like before, I do see the MQTT library is trying to reconnect though.

There has been some fixes to the MQTT library, so I updated to the latest release, it is currently available in the nightly builds, do you think you could retry with this?

adityask2 · 2018-02-07T02:36:33Z

We are building custom telegraf. It would be very helpful if you can tell the commits which can be taken so that we can build it. I can retry with it.

danielnelson · 2018-02-07T02:47:43Z

Here is the commit, don't forget to run gdm restore

5639d56

adityask2 · 2018-02-08T10:40:30Z

@danielnelson I tested it with latest changes. It didnt even sustain one day. Attaching the dump and conf file.
telegraf-conf.txt
telegraf-dump-1.txt

Used the changes you suggested.

bigonese · 2018-02-09T15:44:13Z

I get some of the same "took longer to collect" messages. We are also sending data upstream with MQTT. But I'm also getting what looks to be some sort of communication error...

2018-02-04T20:39:34Z E! Failed to connect to output mqtt, retrying in 15s, error was 'Network Error : dial tcp: lookup mqtt.redacted.com on [::1]:53: read udp [::1]:39369->[::1]:53: read: connection refused'
2018-02-04T20:38:08Z E! Network Error : dial tcp: lookup mqtt.redacted.com on [::1]:53: read udp [::1]:34352->[::1]:53: read: connection refused

@adityask2 I'm curious if I have the same issue as you... do you see things like this in the logs?

danielnelson · 2018-02-09T19:54:45Z

@adityask2 Thank you for doing this testing, the issue is in the paho libary which I am not very familiar with the details of but here is my analysis:

We are blocked sending a message, there is not currently a way in the MQTT library to timeout this send:

goroutine 2376 [chan send, 1286 minutes]:
	github.com/eclipse/paho%2emqtt%2egolang.(*client).Publish(0xc42019d880, 0xc42026c930, 0x61, 0xc420680001, 0x12b2d80, 0xc420682b20, 0x1346a20, 0xc4206bb4d0)
/build/mts/release/sb-13687633/cayman_telegraf/telegraf/src/github.com/eclipse/paho.mqtt.golang/client.go:515 +0x4f8
	github.com/influxdata/telegraf/plugins/outputs/mqtt.(*MQTT).publish(0xc4200f0000, 0xc42026c930, 0x61, 0xc42026f800, 0x175, 0x180, 0x0, 0x0)
/build/mts/release/sb-13687633/cayman_telegraf/telegraf/src/github.com/influxdata/telegraf/plugins/outputs/mqtt/mqtt.go:155 +0xe9

The channel in question is read in func outgoing(c *client), I'm assuming a network interruption has caused this function to return.

This should call reconnect, but is stuck on the c.workers.Wait() WaitGroup, I believe because keepalive has not returned yet:

goroutine 1878 [semacquire, 1337 minutes]:
sync.runtime_Semacquire(0xc42019dba8)
	/build/mts/release/sb-13687633/compcache/cayman_go/ob-6134445/linux64/lin64/src/runtime/sema.go:47 +0x34
sync.(*WaitGroup).Wait(0xc42019db9c)
	/build/mts/release/sb-13687633/compcache/cayman_go/ob-6134445/linux64/lin64/src/sync/waitgroup.go:131 +0x7a
github.com/eclipse/paho%2emqtt%2egolang.(*client).internalConnLost(0xc42019d880, 0x1d88d60, 0xc4200140b0)
	/build/mts/release/sb-13687633/cayman_telegraf/telegraf/src/github.com/eclipse/paho.mqtt.golang/client.go:435 +0x7e

Can't sendPing because we can't acquire the lock in SetRead:

goroutine 14 [semacquire, 1356 minutes]:
sync.runtime_SemacquireMutex(0xc4202b7de4)
	/build/mts/release/sb-13687633/compcache/cayman_go/ob-6134445/linux64/lin64/src/runtime/sema.go:62 +0x34
sync.(*Mutex).Lock(0xc4202b7de0)
	/build/mts/release/sb-13687633/compcache/cayman_go/ob-6134445/linux64/lin64/src/sync/mutex.go:87 +0x9d
github.com/eclipse/paho%2emqtt%2egolang.(*timer).SetRead(0xc4202b7de0, 0x2540be401)
	/build/mts/release/sb-13687633/cayman_telegraf/telegraf/src/github.com/eclipse/paho.mqtt.golang/ping.go:126 +0x2d
github.com/eclipse/paho%2emqtt%2egolang.sendPing(0xc4202b7de0, 0xc4202b7e60, 0xc42019d880)
	/build/mts/release/sb-13687633/cayman_telegraf/telegraf/src/github.com/eclipse/paho.mqtt.golang/ping.go:155 +0x37
github.com/eclipse/paho%2emqtt%2egolang.keepalive(0xc42019d880)
	/build/mts/release/sb-13687633/cayman_telegraf/telegraf/src/github.com/eclipse/paho.mqtt.golang/ping.go:104 +0x54d
created by github.com/eclipse/paho%2emqtt%2egolang.(*client).Connect.func1

The lock is taken here in Reset(), and is waiting to read from t.C:

goroutine 27 [chan receive, 1356 minutes]:
	github.com/eclipse/paho%2emqtt%2egolang.(*timer).Reset(0xc4202b7de0, 0x6fc23ac00, 0x0)
/build/mts/release/sb-13687633/cayman_telegraf/telegraf/src/github.com/eclipse/paho.mqtt.golang/ping.go:148 +0x113
	github.com/eclipse/paho%2emqtt%2egolang.keepalive.func4(0xc420306730, 0xc42019d880, 0xc420013500, 0xc4202b7de0)
/build/mts/release/sb-13687633/cayman_telegraf/telegraf/src/github.com/eclipse/paho.mqtt.golang/ping.go:94 +0x1c5

These last two functions seem to violating this comment from the time.Timer:

// To prevent a timer created with NewTimer from firing after a call to Stop,
// check the return value and drain the channel.
// For example, assuming the program has not received from t.C already:
//
// 	if !t.Stop() {
// 		<-t.C
// 	}
//
// This cannot be done concurrent to other receives from the Timer's
// channel.

So, the keepalive function in the paho library needs corrected to avoid deadlocking, but it looks like we can also skip this function by setting the right options. I will prepare a build with keepalive turned off.

adityask2 · 2018-02-09T19:59:52Z

Thanks @danielnelson. Very much appreciated. I will wait for the build.

danielnelson · 2018-02-09T20:02:29Z

It also appears that the keepalive functionality has been reworked upstream but is currently unreleased, and the comment on the latest commit makes me nervous about moving to a development version: eclipse-paho/paho.mqtt.golang@d896be2#diff-1ce7d97ddd43423f168d3cad79898551

danielnelson · 2018-02-09T20:19:23Z

Can you retest with 2a718df?

adityask2 · 2018-02-10T03:30:59Z

Ok. To consolidate - I will be testing with 3 changes. Timeout fix, paho library changes, keep alive fix.

adityask2 · 2018-02-10T03:37:11Z

Keep alive interval - this is the amount of time after which a ping request is sent by server to client. If the connection between server and client breaks, server wouldn't know it. Is setting it 0 a good option? If not we can probably increase the timeout.

danielnelson · 2018-02-13T00:36:28Z

My reading of the client code was that the keep alive interval is the time for the client to ping the server, so it would help keep the client connected and potentially detect disconnections faster. Since we are sending data regularly I think even without it we should detect disconnection on the next send.

However, we have to turn it to zero if we want to disable the keepalive routine completely, and this is what looks to be the cause of the deadlocks. I don't think increasing the timeout would help. Another option we could try is the very latest revision in the repo, bab87cde322cf9aec420645d1e9801d0f216f791, it looks like they have rewritten this code so it may work now.

adityask2 · 2018-02-23T05:37:51Z

Running the telergaf build. Running successfully for 7 days.
Planning to use this build for prod.

For the logging part - is log rotate enabled for linux and windows mode?
Planning to run telegraf in quiet mode (need just error msg) in prod

danielnelson · 2018-02-25T09:11:25Z

I added the keep alive change for 1.5.3.

Regarding logging, the Linux package installs a logrotate config file to /etc/logrotate.d/telegraf, but in Windows there is no mechanism for log rotation as part of the package.

adityask2 · 2018-02-25T12:34:28Z

How to prevent log filling up disk space in Windows?
Using quite mode.

danielnelson · 2018-02-25T16:29:23Z

I believe there are versions/clones of logrotate for Windows, but I haven't tried it. Maybe you can get some ideas from the InfluxData Community site.

Windows log rotation docs is issue #3393, if you come up with a good solution I would love to add it to the official documentation.

danielnelson added this to the 1.5.2 milestone Jan 19, 2018

danielnelson added the bug unexpected problem or unintended behavior label Jan 19, 2018

danielnelson mentioned this issue Jan 19, 2018

Limit wait time for writes in mqtt output #3699

Merged

3 tasks

danielnelson closed this as completed in #3699 Jan 22, 2018

danielnelson reopened this Feb 1, 2018

danielnelson mentioned this issue Feb 9, 2018

Disable keepalive in mqtt output. #3779

Merged

3 tasks

danielnelson closed this as completed Feb 25, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Telegraf fails stop sending metrics - took longer to collect than collection interval #3697

Telegraf fails stop sending metrics - took longer to collect than collection interval #3697

adityask2 commented Jan 18, 2018

danielnelson commented Jan 19, 2018

adityask2 commented Feb 1, 2018

danielnelson commented Feb 1, 2018

adityask2 commented Feb 1, 2018

danielnelson commented Feb 1, 2018

adityask2 commented Feb 2, 2018 •

edited

Loading

adityask2 commented Feb 2, 2018

danielnelson commented Feb 2, 2018

adityask2 commented Feb 5, 2018

adityask2 commented Feb 6, 2018

danielnelson commented Feb 6, 2018

adityask2 commented Feb 7, 2018

danielnelson commented Feb 7, 2018

adityask2 commented Feb 8, 2018

bigonese commented Feb 9, 2018

danielnelson commented Feb 9, 2018

adityask2 commented Feb 9, 2018

danielnelson commented Feb 9, 2018

danielnelson commented Feb 9, 2018

adityask2 commented Feb 10, 2018

adityask2 commented Feb 10, 2018

danielnelson commented Feb 13, 2018

adityask2 commented Feb 23, 2018

danielnelson commented Feb 25, 2018

adityask2 commented Feb 25, 2018

danielnelson commented Feb 25, 2018

Telegraf fails stop sending metrics - took longer to collect than collection interval #3697

Telegraf fails stop sending metrics - took longer to collect than collection interval #3697

Comments

adityask2 commented Jan 18, 2018

Bug report

Relevant telegraf.conf:

System info:

Steps to reproduce:

Expected behavior:

Actual behavior:

Additional info:

danielnelson commented Jan 19, 2018

adityask2 commented Feb 1, 2018

danielnelson commented Feb 1, 2018

adityask2 commented Feb 1, 2018

danielnelson commented Feb 1, 2018

adityask2 commented Feb 2, 2018 • edited Loading

adityask2 commented Feb 2, 2018

danielnelson commented Feb 2, 2018

adityask2 commented Feb 5, 2018

adityask2 commented Feb 6, 2018

danielnelson commented Feb 6, 2018

adityask2 commented Feb 7, 2018

danielnelson commented Feb 7, 2018

adityask2 commented Feb 8, 2018

bigonese commented Feb 9, 2018

danielnelson commented Feb 9, 2018

adityask2 commented Feb 9, 2018

danielnelson commented Feb 9, 2018

danielnelson commented Feb 9, 2018

adityask2 commented Feb 10, 2018

adityask2 commented Feb 10, 2018

danielnelson commented Feb 13, 2018

adityask2 commented Feb 23, 2018

danielnelson commented Feb 25, 2018

adityask2 commented Feb 25, 2018

danielnelson commented Feb 25, 2018

adityask2 commented Feb 2, 2018 •

edited

Loading