Steps to reproduce:
Somehow get the requestJSON call to fail when making a request to /api/healthchecks/node/<nodename>
Watch the log file
I achieved step 2 by modifying the RabbitMQ source file such that on the 3rd gather only it changes the username, and then made requestJSON return an error when it saw that username (to mimic a network failure).
Expected behavior:
A single gather times out but future gathers succeed without error (as soon as the broker has recovered from whatever caused the original failure).
Actual behavior:
After a single failure collecting node stats, the gather call to the RabbitMQ plugin times out every time.
Additional info:
Here is an example of what my log file looks like:
2019-12-18T05:48:38Z I! Loaded inputs: win_perf_counters rabbitmq internal
2019-12-18T05:48:38Z I! Loaded aggregators:
2019-12-18T05:48:38Z I! Loaded processors:
2019-12-18T05:48:38Z I! Loaded outputs: influxdb
2019-12-18T05:48:38Z I! Tags enabled: dc=London host=sms-main-1
2019-12-18T05:48:38Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"sms-main-1", Flush Interval:10s
2019-12-19T00:03:03Z E! [inputs.rabbitmq] Error in plugin: Get http://localhost:15672/api/healthchecks/node/rabbit@sms-main-1: net/http: timeout awaiting response headers
2019-12-19T00:03:10Z W! [agent] [inputs.rabbitmq] did not complete within its interval
2019-12-19T00:03:20Z W! [agent] [inputs.rabbitmq] did not complete within its interval
2019-12-19T00:03:30Z W! [agent] [inputs.rabbitmq] did not complete within its interval
2019-12-19T00:03:40Z W! [agent] [inputs.rabbitmq] did not complete within its interval
2019-12-19T00:03:50Z W! [agent] [inputs.rabbitmq] did not complete within its interval
2019-12-19T00:04:00Z W! [agent] [inputs.rabbitmq] did not complete within its interval
2019-12-19T00:04:10Z W! [agent] [inputs.rabbitmq] did not complete within its interval
2019-12-19T00:04:20Z W! [agent] [inputs.rabbitmq] did not complete within its interval
2019-12-19T00:04:30Z W! [agent] [inputs.rabbitmq] did not complete within its interval
2019-12-19T00:04:40Z W! [agent] [inputs.rabbitmq] did not complete within its interval
2019-12-19T00:04:50Z W! [agent] [inputs.rabbitmq] did not complete within its interval
<etc>
Looking at the source for the Rabbit input, the gatherNodes function starts a goroutine for each node that it needs to run health checks against and then waits on a channel for the same number of results. If that health check call fails, it returns an error from inside the goroutine but never sends anything to the channel, so I expect the problem is that the initial failing call to gatherNodes() is stuck blocking forever waiting on that channel for a health check that it will never receive.
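To illustrate the suspected cause, here is a minimal, self-contained sketch of the fan-out pattern in which every goroutine sends exactly one result, whether the health check succeeds or fails, so the receive loop always gets as many values as it waits for. This is not the plugin's actual code: nodeResult, healthCheck, simulatedCheck, and this version of gatherNodes are hypothetical names used only for illustration, and the failing check merely mimics a requestJSON error.

```go
package main

import (
	"errors"
	"fmt"
)

// nodeResult pairs a node name with the outcome of its health check.
// Hypothetical type for this sketch; it is not part of the Telegraf plugin.
type nodeResult struct {
	node string
	err  error
}

// healthCheck stands in for the request to
// /api/healthchecks/node/<nodename>. Hypothetical signature.
type healthCheck func(node string) error

// errSimulated mimics a network failure from the health check call.
var errSimulated = errors.New("simulated network failure")

// simulatedCheck fails for one made-up node name and succeeds otherwise.
func simulatedCheck(node string) error {
	if node == "rabbit@bad" {
		return errSimulated
	}
	return nil
}

// gatherNodes starts one goroutine per node and then receives exactly
// len(nodes) results. Each goroutine sends once on success and once on
// failure, so the receive loop cannot wait for a value that never arrives.
func gatherNodes(nodes []string, check healthCheck) []nodeResult {
	// Buffered to len(nodes) so no goroutine blocks on its send.
	results := make(chan nodeResult, len(nodes))

	for _, node := range nodes {
		go func(n string) {
			// The error travels through the channel instead of being
			// returned from the goroutine, where it would be lost.
			results <- nodeResult{node: n, err: check(n)}
		}(node)
	}

	out := make([]nodeResult, 0, len(nodes))
	for range nodes {
		out = append(out, <-results)
	}
	return out
}

func main() {
	// Results arrive in whatever order the goroutines finish.
	for _, r := range gatherNodes([]string{"rabbit@good", "rabbit@bad"}, simulatedCheck) {
		fmt.Printf("%s: err=%v\n", r.node, r.err)
	}
}
```

With this shape a failed health check shows up as an error for that one gather, and the next gather starts from a clean state. A sync.WaitGroup that is marked done on every exit path from the goroutine would be another way to get the same guarantee.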
I have the same issue, on Ubuntu Linux 18.04 and Telegraf 1.13.0.
Only restarting Telegraf helps, but after a few hours it freezes again.
.....
2019-12-23T14:11:20Z W! [agent] [inputs.rabbitmq] did not complete within its interval
2019-12-23T14:11:30Z W! [agent] [inputs.rabbitmq] did not complete within its interval
2019-12-23T14:11:40Z W! [agent] [inputs.rabbitmq] did not complete within its interval
2019-12-23T14:11:50Z W! [agent] [inputs.rabbitmq] did not complete within its interval
2019-12-23T14:12:00Z W! [agent] [inputs.rabbitmq] did not complete within its interval
Relevant telegraf.conf:
System info (original report):
Windows Server 2016 64-bit, Telegraf 1.13.0