RabbitMQ input reports forever that it did not complete in time after a single timeout #6814

jacquesh · 2019-12-19T07:48:20Z

Relevant telegraf.conf:

[agent]
  interval = "10s"
  logfile = "Z:/tick/telegraf/telegraf.log"
[[outputs.influxdb]]
  urls = ["http://127.0.0.1:8086"]
[[inputs.rabbitmq]]
  username = "guest"
  password = "guest"

System info:

Windows Server 2016 64-bit, Telegraf 1.13.0

Steps to reproduce:

Run telegraf and collect RabbitMQ node stats.
Somehow get the requestJSON call to fail when making a request to /api/healthchecks/node/<nodename>
Watch the log file

I achieved step 2 by modifying the RabbitMQ source file such that on the 3rd gather only it changes the username, and then made requestJSON return an error when it saw that username (to mimic a network failure).

Expected behavior:

A single gather times out but future gathers succeed without error (as soon as the broker has recovered from whatever caused the original failure).

Actual behavior:

After a single failure collecting node stats, the gather call to the RabbitMQ plugin times out every time.

Additional info:

Here is an example of what my log file looks like:

2019-12-18T05:48:38Z I! Loaded inputs: win_perf_counters rabbitmq internal
2019-12-18T05:48:38Z I! Loaded aggregators: 
2019-12-18T05:48:38Z I! Loaded processors: 
2019-12-18T05:48:38Z I! Loaded outputs: influxdb
2019-12-18T05:48:38Z I! Tags enabled: dc=London host=sms-main-1
2019-12-18T05:48:38Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"sms-main-1", Flush Interval:10s
2019-12-19T00:03:03Z E! [inputs.rabbitmq] Error in plugin: Get http://localhost:15672/api/healthchecks/node/rabbit@sms-main-1: net/http: timeout awaiting response headers
2019-12-19T00:03:10Z W! [agent] [inputs.rabbitmq] did not complete within its interval
2019-12-19T00:03:20Z W! [agent] [inputs.rabbitmq] did not complete within its interval
2019-12-19T00:03:30Z W! [agent] [inputs.rabbitmq] did not complete within its interval
2019-12-19T00:03:40Z W! [agent] [inputs.rabbitmq] did not complete within its interval
2019-12-19T00:03:50Z W! [agent] [inputs.rabbitmq] did not complete within its interval
2019-12-19T00:04:00Z W! [agent] [inputs.rabbitmq] did not complete within its interval
2019-12-19T00:04:10Z W! [agent] [inputs.rabbitmq] did not complete within its interval
2019-12-19T00:04:20Z W! [agent] [inputs.rabbitmq] did not complete within its interval
2019-12-19T00:04:30Z W! [agent] [inputs.rabbitmq] did not complete within its interval
2019-12-19T00:04:40Z W! [agent] [inputs.rabbitmq] did not complete within its interval
2019-12-19T00:04:50Z W! [agent] [inputs.rabbitmq] did not complete within its interval
<etc>

Looking at the source for the Rabbit input, the gatherNodes function starts a goroutine for each node that it needs to run health checks against and then waits on a channel for the same number of results. If that health check call fails, it returns an error from inside the goroutine but never sends anything to the channel, so I expect the problem is that the initial failing call to gatherNodes() is stuck blocking forever waiting on that channel for a health check that it will never receive.

The text was updated successfully, but these errors were encountered:

thesimj · 2019-12-23T14:13:19Z

I have the same issue, on Ubuntu Linux 18.04 and Telegraf 1.13.0.
Only telegraf restart help, but after some few hours it's freeze again.

.....
2019-12-23T14:11:20Z W! [agent] [inputs.rabbitmq] did not complete within its interval
2019-12-23T14:11:30Z W! [agent] [inputs.rabbitmq] did not complete within its interval
2019-12-23T14:11:40Z W! [agent] [inputs.rabbitmq] did not complete within its interval
2019-12-23T14:11:50Z W! [agent] [inputs.rabbitmq] did not complete within its interval
2019-12-23T14:12:00Z W! [agent] [inputs.rabbitmq] did not complete within its interval

danielnelson added area/rabbitmq bug unexpected problem or unintended behavior labels Dec 23, 2019

danielnelson added this to the 1.13.1 milestone Dec 23, 2019

danielnelson mentioned this issue Dec 23, 2019

Report rabbitmq_node measurement and return on gather error #6819

Merged

3 tasks

danielnelson closed this as completed in #6819 Jan 3, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RabbitMQ input reports forever that it did not complete in time after a single timeout #6814

RabbitMQ input reports forever that it did not complete in time after a single timeout #6814

jacquesh commented Dec 19, 2019

thesimj commented Dec 23, 2019

RabbitMQ input reports forever that it did not complete in time after a single timeout #6814

RabbitMQ input reports forever that it did not complete in time after a single timeout #6814

Comments

jacquesh commented Dec 19, 2019

Relevant telegraf.conf:

System info:

Steps to reproduce:

Expected behavior:

Actual behavior:

Additional info:

thesimj commented Dec 23, 2019