Telegraf stops sending metrics - took longer to collect than collection interval #3697
Comments
This was caused by the mqtt output. I believe this issue I opened upstream is the cause: eclipse-paho/paho.mqtt.golang#185. I should be able to improve the use of timeouts in the output to avoid the hang.
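For context, a rough sketch of the kind of bounded publish I have in mind, using the paho client's token API; the helper name, the 5 second timeout, and the error handling are illustrative, not the actual Telegraf output code:

```go
package example

import (
	"fmt"
	"time"

	mqtt "github.com/eclipse/paho.mqtt.golang"
)

// publishWithTimeout is a hypothetical helper: it bounds the wait on the
// publish token so a wedged connection cannot hang the output forever.
func publishWithTimeout(client mqtt.Client, topic string, payload []byte) error {
	token := client.Publish(topic, 0, false, payload)
	// WaitTimeout returns false if the token did not complete in time.
	if !token.WaitTimeout(5 * time.Second) {
		return fmt.Errorf("mqtt publish to %q timed out", topic)
	}
	return token.Error()
}
```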
Hi,
Do you think you can try to get another stack dump by killing the process with SIGQUIT?
I'm running the process in the background and had closed the session. I used the kill -3 command, but I'm not getting the stack dump. Is there any other way to get it?
I don't know of any other way. Any log messages when it started, or just the
Attaching the telegraf log. Will see if I can get the telegraf dump.
Would taking a core dump through gcore be useful?
I haven't tried it, but according to this blog it should work.
Sharing the dump.
Was able to catch the dump using
This stack is trickier than the last one; it doesn't look like we are blocked on a channel like before, though I do see the MQTT library is trying to reconnect. There have been some fixes to the MQTT library, so I updated to the latest release. It is currently available in the nightly builds; do you think you could retry with this?
We are building a custom telegraf. It would be very helpful if you could tell us which commits to take so that we can build it. I can retry with that.
Here is the commit; don't forget to run
@danielnelson I tested it with the latest changes. It didn't even sustain one day. Attaching the dump and conf file. Used the changes you suggested.
I get some of the same "took longer to collect" messages. We are also sending data upstream with MQTT. But I'm also getting what looks to be some sort of communication error...
@adityask2 I'm curious if I have the same issue as you... do you see things like this in the logs?
@adityask2 Thank you for doing this testing. The issue is in the paho library, which I am not very familiar with, but here is my analysis: we are blocked sending a message, and there is currently no way in the MQTT library to time out this send.
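To illustrate the general pattern in plain Go (not the paho internals): a bare channel send blocks forever if the reader never comes back, while a select with a timer puts an upper bound on it; something along these lines is what the library is missing at that point:

```go
package example

import (
	"errors"
	"time"
)

// sendWithTimeout sketches the pattern in general terms: instead of a bare
// `ch <- msg`, which blocks indefinitely when nothing reads the channel,
// the select bounds how long the sender can be stuck.
func sendWithTimeout(ch chan<- []byte, msg []byte, timeout time.Duration) error {
	select {
	case ch <- msg:
		return nil
	case <-time.After(timeout):
		return errors.New("send timed out: no reader on the channel")
	}
}
```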
The channel in question is read in
This should call reconnect, but is stuck on the
Can't sendPing because we can't acquire the lock in
The lock is taken here in
These last two functions seem to violate this comment from the
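To make the shape of the problem concrete, here is a minimal, self-contained sketch of this kind of mutex/channel deadlock; it is not the paho code, just the same pattern reduced to two goroutines:

```go
package example

import "sync"

// deadlockShape demonstrates the pattern described above: one goroutine
// blocks sending on an unbuffered channel while holding a mutex, and the
// only reader of that channel needs the same mutex before it can receive.
// Whichever goroutine grabs the lock first, the other can never make
// progress, so both hang forever.
func deadlockShape() {
	var mu sync.Mutex
	ch := make(chan struct{}) // unbuffered: a send blocks until someone receives

	go func() { // the "keepalive"-style reader
		mu.Lock() // needs the lock before it can service the channel
		defer mu.Unlock()
		<-ch
	}()

	mu.Lock()        // the "sender" side takes the lock...
	ch <- struct{}{} // ...and blocks here, still holding it
	mu.Unlock()      // never reached
}
```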
So, the keepalive function in the paho library needs to be corrected to avoid deadlocking, but it looks like we can also skip this function by setting the right options. I will prepare a build with keepalive turned off.
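If it helps to see what that build will change, disabling the keepalive routine through the client options might look roughly like this; the broker URL and client ID are placeholders, and the assumption is that a keep alive of 0 stops the paho client from starting its keepalive goroutine at all:

```go
package example

import (
	mqtt "github.com/eclipse/paho.mqtt.golang"
)

// newClientWithoutKeepalive builds a client with the keepalive/ping
// routine disabled, sidestepping the deadlock described above.
func newClientWithoutKeepalive() mqtt.Client {
	opts := mqtt.NewClientOptions().
		AddBroker("tcp://localhost:1883"). // placeholder broker address
		SetClientID("telegraf").           // placeholder client ID
		SetKeepAlive(0)                    // 0 disables the keepalive routine
	return mqtt.NewClient(opts)
}
```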
Thanks @danielnelson. Very much appreciated. I will wait for the build.
It also appears that the keepalive functionality has been reworked upstream but is currently unreleased, and the comment on the latest commit makes me nervous about moving to a development version: eclipse-paho/paho.mqtt.golang@d896be2#diff-1ce7d97ddd43423f168d3cad79898551
Can you retest with 2a718df?
OK. To consolidate - I will be testing with 3 changes: the timeout fix, the paho library changes, and the keep alive fix.
Keep alive interval - this is the amount of time after which a ping request is sent by the server to the client. If the connection between the server and client breaks, the server wouldn't know it. Is setting it to 0 a good option? If not, we can probably increase the timeout.
My reading of the client code was that the keep alive interval is the time for the client to ping the server, so it would help keep the client connected and potentially detect disconnections faster. Since we are sending data regularly, I think even without it we should detect disconnection on the next send. However, we have to set it to zero if we want to disable the keepalive routine completely, and that routine is what looks to be the cause of the deadlocks. I don't think increasing the timeout would help. Another option we could try is the very latest revision in the repo, bab87cde322cf9aec420645d1e9801d0f216f791; it looks like they have rewritten this code, so it may work now.
Running the telegraf build. It has been running successfully for 7 days. For the logging part - is log rotation enabled on Linux and Windows?
I added the keep alive change for 1.5.3. Regarding logging, the Linux package installs a logrotate config file to
How can I prevent the log from filling up disk space on Windows?
I believe there are versions/clones of logrotate for Windows, but I haven't tried them. Maybe you can get some ideas from the InfluxData Community site. Windows log rotation documentation is tracked in issue #3393; if you come up with a good solution I would love to add it to the official documentation.
Bug report
Telegraf seems to be choking after a certain amount of time. I'm not using aggregators. It seems like there are other issues reported along the same lines - #3629
Relevant telegraf.conf:
telegraf.txt
System info:
Telegraf v1.6.0~6e240567 (git: master 6e24056)
Using it on CentOS 7
Steps to reproduce:
Expected behavior:
Telegraf publishing metrics to mqtt.
Actual behavior:
Telegraf stops publishing metrics seemingly at random, and all input plugins start to fail with:
2018-01-18T16:10:00Z E! Error in plugin [inputs.mem]: took longer to collect than collection interval (10s)
2018-01-18T16:10:00Z E! Error in plugin [inputs.mysql]: took longer to collect than collection interval (10s)
2018-01-18T16:10:00Z E! Error in plugin [inputs.cpu]: took longer to collect than collection interval (10s)
2018-01-18T16:10:10Z E! Error in plugin [inputs.mysql]: took longer to collect than collection interval (10s)
2018-01-18T16:10:10Z E! Error in plugin [inputs.cpu]: took longer to collect than collection interval (10s)
2018-01-18T16:10:10Z E! Error in plugin [inputs.mem]: took longer to collect than collection interval (10s)
2018-01-18T16:10:20Z E! Error in plugin [inputs.mysql]: took longer to collect than collection interval (10s)
2018-01-18T16:10:20Z E! Error in plugin [inputs.cpu]: took longer to collect than collection interval (10s)
2018-01-18T16:10:20Z E! Error in plugin [inputs.mem]: took longer to collect than collection interval (10s)
2018-01-18T16:10:30Z E! Error in plugin [inputs.mysql]: took longer to collect than collection interval (10s)
2018-01-18T16:10:30Z E! Error in plugin [inputs.cpu]: took longer to collect than collection interval (10s)
2018-01-18T16:10:30Z E! Error in plugin [inputs.mem]: took longer to collect than collection interval (10s)
2018-01-18T16:10:40Z E! Error in plugin [inputs.mysql]: took longer to collect than collection interval (10s)
2018-01-18T16:10:40Z E! Error in plugin [inputs.cpu]: took longer to collect than collection interval (10s)
2018-01-18T16:10:40Z E! Error in plugin [inputs.mem]: took longer to collect than collection interval (10s)
Additional info:
telegraf-dump.txt