
Telegraf stops sending metrics - took longer to collect than collection interval #3697

Closed
adityask2 opened this issue Jan 18, 2018 · 26 comments · Fixed by #3699
Labels: bug (unexpected problem or unintended behavior)

@adityask2

Bug report

Telegraf seems to be choking after a certain amount of time. I'm not using aggregators. It seems there are other issues reported along the same lines - #3629

Relevant telegraf.conf:

telegraf.txt

System info:

Telegraf v1.6.0~6e240567 (git: master 6e24056)
Using it on CentOS 7

Steps to reproduce:

  1. service telegraf start
  2. Wait for some time (varies from system to system) for it to reproduce.

Expected behavior:

Telegraf publishing metrics to mqtt.

Actual behavior:

Telegraf stops publishing metrics at seemingly random times; all input plugins start to fail with:
2018-01-18T16:10:00Z E! Error in plugin [inputs.mem]: took longer to collect than collection interval (10s)
2018-01-18T16:10:00Z E! Error in plugin [inputs.mysql]: took longer to collect than collection interval (10s)
2018-01-18T16:10:00Z E! Error in plugin [inputs.cpu]: took longer to collect than collection interval (10s)
2018-01-18T16:10:10Z E! Error in plugin [inputs.mysql]: took longer to collect than collection interval (10s)
2018-01-18T16:10:10Z E! Error in plugin [inputs.cpu]: took longer to collect than collection interval (10s)
2018-01-18T16:10:10Z E! Error in plugin [inputs.mem]: took longer to collect than collection interval (10s)
2018-01-18T16:10:20Z E! Error in plugin [inputs.mysql]: took longer to collect than collection interval (10s)
2018-01-18T16:10:20Z E! Error in plugin [inputs.cpu]: took longer to collect than collection interval (10s)
2018-01-18T16:10:20Z E! Error in plugin [inputs.mem]: took longer to collect than collection interval (10s)
2018-01-18T16:10:30Z E! Error in plugin [inputs.mysql]: took longer to collect than collection interval (10s)
2018-01-18T16:10:30Z E! Error in plugin [inputs.cpu]: took longer to collect than collection interval (10s)
2018-01-18T16:10:30Z E! Error in plugin [inputs.mem]: took longer to collect than collection interval (10s)
2018-01-18T16:10:40Z E! Error in plugin [inputs.mysql]: took longer to collect than collection interval (10s)
2018-01-18T16:10:40Z E! Error in plugin [inputs.cpu]: took longer to collect than collection interval (10s)
2018-01-18T16:10:40Z E! Error in plugin [inputs.mem]: took longer to collect than collection interval

Additional info:

telegraf-dump.txt

@danielnelson
Contributor

This was caused by the mqtt output; I believe this issue I opened upstream is the root cause: eclipse-paho/paho.mqtt.golang#185

I should be able to improve the use of timeouts in the output to avoid the hang.
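
For reference, a minimal sketch of what a bounded publish could look like with the paho token API (the helper name and the 5-second timeout are illustrative, not the actual Telegraf change):

package mqttout // illustrative package name

import (
	"fmt"
	"time"

	mqtt "github.com/eclipse/paho.mqtt.golang"
)

// publishWithTimeout is a hypothetical helper: it bounds a publish with
// WaitTimeout so a stuck send surfaces as an error instead of blocking
// the output goroutine indefinitely.
func publishWithTimeout(client mqtt.Client, topic string, payload []byte) error {
	token := client.Publish(topic, 0, false, payload)
	if !token.WaitTimeout(5 * time.Second) {
		return fmt.Errorf("mqtt: publish to %q timed out", topic)
	}
	return token.Error()
}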

danielnelson added this to the 1.5.2 milestone on Jan 19, 2018
danielnelson added the bug (unexpected problem or unintended behavior) label on Jan 19, 2018
@adityask2
Author

Hi,
I tested with the timeout fix. Telegraf was able to send data successfully for 8 days. However, after 8 days, 44 of the machines choked. Attaching the telegraf conf.
telegraf.conf.txt

@danielnelson
Contributor

Do you think you can try to get another stack dump by killing the process with SIGQUIT?

danielnelson reopened this on Feb 1, 2018
@adityask2
Author

I'm running the process in the background and had closed the session. I used the kill -3 command; however, I'm not getting the stack dump. Is there any other way to get it?

@danielnelson
Contributor

I don't know of any other way. Were there any log messages when it started, or just the "took longer to collect than collection interval" errors?

@adityask2
Author

adityask2 commented Feb 2, 2018

Attaching telegraf log
telegraf.log

Will see if I can get telegraf dump.

@adityask2
Author

Would taking a core dump with gcore be useful?

@danielnelson
Contributor

I haven't tried it, but according to this blog it should work.

@adityask2
Author

Sharing the dump.
telegraf_coredump_131.txt

@adityask2
Author

Was able to catch the dump using kill -SIGABRT. This dumps the stack trace to the journal.
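
For reference, another generic way to capture goroutine stacks when the process is detached from a terminal is to have the program write them itself with runtime/pprof. This is only a sketch (the signal choice and file path are illustrative; Telegraf did not ship this):

package main

import (
	"os"
	"os/signal"
	"runtime/pprof"
	"syscall"
)

func main() {
	// Dump all goroutine stacks to a file on SIGUSR1, so the dump is not
	// lost when stderr is detached. Path and signal are illustrative.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGUSR1)
	go func() {
		for range sigs {
			f, err := os.Create("/tmp/goroutine-dump.txt")
			if err != nil {
				continue
			}
			pprof.Lookup("goroutine").WriteTo(f, 2) // debug=2 prints full stacks
			f.Close()
		}
	}()
	select {} // stands in for the program's real work
}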

@danielnelson
Contributor

This stack is trickier than the last one. It doesn't look like we are blocked on a channel like before, though I do see the MQTT library trying to reconnect.

There have been some fixes to the MQTT library, so I updated to the latest release; it is currently available in the nightly builds. Do you think you could retry with it?

@adityask2
Author

We are building a custom telegraf. It would be very helpful if you could tell me which commits to take so that we can build it. I can retry with that.

@danielnelson
Contributor

Here is the commit; don't forget to run gdm restore:

5639d56

@adityask2
Author

@danielnelson I tested it with the latest changes. It didn't even sustain one day. Attaching the dump and conf file.
telegraf-conf.txt
telegraf-dump-1.txt

Used the changes you suggested.

@bigonese

bigonese commented Feb 9, 2018

I get some of the same "took longer to collect" messages. We are also sending data upstream with MQTT. But I'm also getting what looks to be some sort of communication error...

2018-02-04T20:39:34Z E! Failed to connect to output mqtt, retrying in 15s, error was 'Network Error : dial tcp: lookup mqtt.redacted.com on [::1]:53: read udp [::1]:39369->[::1]:53: read: connection refused'
2018-02-04T20:38:08Z E! Network Error : dial tcp: lookup mqtt.redacted.com on [::1]:53: read udp [::1]:34352->[::1]:53: read: connection refused

@adityask2 I'm curious if I have the same issue as you... do you see things like this in the logs?

@danielnelson
Contributor

@adityask2 Thank you for doing this testing. The issue is in the paho library; I'm not very familiar with its details, but here is my analysis:

We are blocked sending a message; there is currently no way in the MQTT library to time out this send:

goroutine 2376 [chan send, 1286 minutes]:
github.com/eclipse/paho%2emqtt%2egolang.(*client).Publish(0xc42019d880, 0xc42026c930, 0x61, 0xc420680001, 0x12b2d80, 0xc420682b20, 0x1346a20, 0xc4206bb4d0)
	/build/mts/release/sb-13687633/cayman_telegraf/telegraf/src/github.com/eclipse/paho.mqtt.golang/client.go:515 +0x4f8
github.com/influxdata/telegraf/plugins/outputs/mqtt.(*MQTT).publish(0xc4200f0000, 0xc42026c930, 0x61, 0xc42026f800, 0x175, 0x180, 0x0, 0x0)
	/build/mts/release/sb-13687633/cayman_telegraf/telegraf/src/github.com/influxdata/telegraf/plugins/outputs/mqtt/mqtt.go:155 +0xe9

The channel in question is read in func outgoing(c *client); I'm assuming a network interruption has caused that function to return.
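
As a minimal illustration of that failure mode (generic Go, not the paho code): once the goroutine reading an unbuffered channel has returned, every further send on that channel blocks forever.

package main

import (
	"fmt"
	"time"
)

func main() {
	out := make(chan string) // stands in for the client's outbound packet channel

	go func() { // stands in for outgoing(c *client)
		fmt.Println("reader got:", <-out)
		// a simulated network error makes the reader return here
	}()

	out <- "first message" // delivered while the reader is alive
	time.Sleep(100 * time.Millisecond)

	// The second send can never complete; the timeout is only here so the
	// example terminates instead of hanging like the stuck Publish.
	select {
	case out <- "second message":
		fmt.Println("sent") // never reached: nobody is reading anymore
	case <-time.After(500 * time.Millisecond):
		fmt.Println("send blocked: the reader goroutine is gone")
	}
}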

This should call reconnect, but is stuck on the c.workers.Wait() WaitGroup, I believe because keepalive has not returned yet:

goroutine 1878 [semacquire, 1337 minutes]:
sync.runtime_Semacquire(0xc42019dba8)
	/build/mts/release/sb-13687633/compcache/cayman_go/ob-6134445/linux64/lin64/src/runtime/sema.go:47 +0x34
sync.(*WaitGroup).Wait(0xc42019db9c)
	/build/mts/release/sb-13687633/compcache/cayman_go/ob-6134445/linux64/lin64/src/sync/waitgroup.go:131 +0x7a
github.com/eclipse/paho%2emqtt%2egolang.(*client).internalConnLost(0xc42019d880, 0x1d88d60, 0xc4200140b0)
	/build/mts/release/sb-13687633/cayman_telegraf/telegraf/src/github.com/eclipse/paho.mqtt.golang/client.go:435 +0x7e

We can't call sendPing because we can't acquire the lock in SetRead:

goroutine 14 [semacquire, 1356 minutes]:
sync.runtime_SemacquireMutex(0xc4202b7de4)
	/build/mts/release/sb-13687633/compcache/cayman_go/ob-6134445/linux64/lin64/src/runtime/sema.go:62 +0x34
sync.(*Mutex).Lock(0xc4202b7de0)
	/build/mts/release/sb-13687633/compcache/cayman_go/ob-6134445/linux64/lin64/src/sync/mutex.go:87 +0x9d
github.com/eclipse/paho%2emqtt%2egolang.(*timer).SetRead(0xc4202b7de0, 0x2540be401)
	/build/mts/release/sb-13687633/cayman_telegraf/telegraf/src/github.com/eclipse/paho.mqtt.golang/ping.go:126 +0x2d
github.com/eclipse/paho%2emqtt%2egolang.sendPing(0xc4202b7de0, 0xc4202b7e60, 0xc42019d880)
	/build/mts/release/sb-13687633/cayman_telegraf/telegraf/src/github.com/eclipse/paho.mqtt.golang/ping.go:155 +0x37
github.com/eclipse/paho%2emqtt%2egolang.keepalive(0xc42019d880)
	/build/mts/release/sb-13687633/cayman_telegraf/telegraf/src/github.com/eclipse/paho.mqtt.golang/ping.go:104 +0x54d
created by github.com/eclipse/paho%2emqtt%2egolang.(*client).Connect.func1

The lock is taken here in Reset(), which is waiting to read from t.C:

goroutine 27 [chan receive, 1356 minutes]:
github.com/eclipse/paho%2emqtt%2egolang.(*timer).Reset(0xc4202b7de0, 0x6fc23ac00, 0x0)
	/build/mts/release/sb-13687633/cayman_telegraf/telegraf/src/github.com/eclipse/paho.mqtt.golang/ping.go:148 +0x113
github.com/eclipse/paho%2emqtt%2egolang.keepalive.func4(0xc420306730, 0xc42019d880, 0xc420013500, 0xc4202b7de0)
	/build/mts/release/sb-13687633/cayman_telegraf/telegraf/src/github.com/eclipse/paho.mqtt.golang/ping.go:94 +0x1c5

These last two functions seem to be violating this comment from the time.Timer documentation:

// To prevent a timer created with NewTimer from firing after a call to Stop,
// check the return value and drain the channel.
// For example, assuming the program has not received from t.C already:
//
// 	if !t.Stop() {
// 		<-t.C
// 	}
//
// This cannot be done concurrent to other receives from the Timer's
// channel.
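
In concrete form, the documented single-receiver pattern looks like this (generic Go, not the paho code):

package main

import "time"

// safeReset shows the pattern from the time.Timer documentation: Stop,
// drain t.C only if the timer already fired, then Reset. It assumes no
// other goroutine receives from t.C concurrently, which is exactly the
// rule the keepalive code above appears to break.
func safeReset(t *time.Timer, d time.Duration) {
	if !t.Stop() {
		<-t.C // timer already fired; drain the pending value
	}
	t.Reset(d)
}

func main() {
	t := time.NewTimer(time.Second)
	safeReset(t, 2*time.Second)
	<-t.C // fires after the reset duration
}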

So, the keepalive function in the paho library needs to be corrected to avoid deadlocking, but it looks like we can also skip this function entirely by setting the right options. I will prepare a build with keepalive turned off.

@adityask2
Author

Thanks @danielnelson. Very much appreciated. I will wait for the build.

@danielnelson
Contributor

It also appears that the keepalive functionality has been reworked upstream but is currently unreleased, and the comment on the latest commit makes me nervous about moving to a development version: eclipse-paho/paho.mqtt.golang@d896be2#diff-1ce7d97ddd43423f168d3cad79898551

@danielnelson
Contributor

Can you retest with 2a718df?

@adityask2
Author

OK. To consolidate: I will be testing with 3 changes - the timeout fix, the paho library changes, and the keep alive fix.

@adityask2
Author

Keep alive interval - this is the amount of time after which a ping request is sent by the server to the client. If the connection between server and client breaks, the server wouldn't know it. Is setting it to 0 a good option? If not, we could probably increase the timeout instead.

@danielnelson
Contributor

My reading of the client code was that the keep alive interval is the time for the client to ping the server, so it would help keep the client connected and potentially detect disconnections faster. Since we are sending data regularly I think even without it we should detect disconnection on the next send.

However, we have to set it to zero if we want to disable the keepalive routine completely, and that routine is what looks to be causing the deadlocks. I don't think increasing the timeout would help. Another option we could try is the very latest revision in the repo, bab87cde322cf9aec420645d1e9801d0f216f791; it looks like they have rewritten this code, so it may work now.
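
For reference, a minimal sketch of what disabling the client-side keepalive looks like with the paho options API (the broker URL, client ID, and connect timeout below are placeholders):

package main

import (
	"fmt"
	"time"

	mqtt "github.com/eclipse/paho.mqtt.golang"
)

func main() {
	// SetKeepAlive(0) prevents the client from starting its keepalive
	// goroutine at all, sidestepping the deadlock described above.
	opts := mqtt.NewClientOptions().
		AddBroker("tcp://broker.example.com:1883"). // placeholder broker
		SetClientID("telegraf-example").            // placeholder client ID
		SetKeepAlive(0)
	client := mqtt.NewClient(opts)
	token := client.Connect()
	if token.WaitTimeout(10*time.Second) && token.Error() == nil {
		fmt.Println("connected without keepalive")
	}
}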

@adityask2
Author

Running the telegraf build. It has been running successfully for 7 days.
Planning to use this build for prod.

For the logging part - is log rotation enabled for Linux and Windows?
Planning to run telegraf in quiet mode (we need just the error messages) in prod.

@danielnelson
Contributor

I added the keep alive change for 1.5.3.

Regarding logging, the Linux package installs a logrotate config file to /etc/logrotate.d/telegraf, but on Windows there is no log rotation mechanism as part of the package.

@adityask2
Author

How can we prevent the log from filling up disk space on Windows? We are using quiet mode.

@danielnelson
Contributor

I believe there are versions/clones of logrotate for Windows, but I haven't tried it. Maybe you can get some ideas from the InfluxData Community site.

Windows log rotation documentation is tracked in issue #3393; if you come up with a good solution, I would love to add it to the official documentation.
