wavefront output plugin: idle connection timeout causes lost metrics #7160

randallt · 2020-03-12T13:32:36Z

I ran into a very confusing issue that I'd like to see documented and, hopefully, fixed so that others don't hit the same problem. Running Telegraf 1.13.4 on an AWS EC2 instance and sending metrics through an AWS Internal ELB, I was seeing missing metrics and connection errors in the Telegraf logs.

Relevant telegraf.conf:

[[outputs.wavefront]]
  host = "wfproxy.example.net"
  port = 2878
  metric_separator = "."
  convert_paths = true

System info:

Telegraf 1.13.4
Amazon Linux 2
Connection through AWS Internal ELB with default 60-sec idle connection timeout

Steps to reproduce:

Create EC2 instance
Install telegraf 1.13.4
Configure to use WF Proxy endpoint through AWS Internal ELB
Configure telegraf interval to 60 sec
Start telegraf

Expected behavior:

Metrics are reported normally, once every 60 sec.

Actual behavior:

The first 3-4 minutes of metrics are missed, followed by a single instance of metrics, followed by another 3-4 minutes of missed data.

Additional info:

The telegraf log shows the following connection reset message once every 3-4 minutes:
2020-02-28T21:47:13Z I! resetting wavefront proxy connection
2020-02-28T21:47:13Z I! write tcp 10.234.11.217:36870->10.234.245.107:2878: write: broken pipe
2020-02-28T21:48:10Z I! connected to Wavefront proxy at address: wfproxy.example.net:2878

Workarounds:

If I raise the AWS Internal ELB idle connection timeout above 60 sec, then things seem to work normally.
If I change the Wavefront output plugin to use 'http' mode by specifying the 'url' setting instead of 'host' and 'port', then it also seems to work normally (perhaps an http keep-alive is sent).

danielnelson · 2020-03-12T22:15:56Z

cc @puckpuck

puckpuck · 2020-03-13T14:18:42Z

I need to defer this one to @vikramraman and @prydin, who will be taking over the Wavefront plugins.

On the surface, this could be an issue in the wavefront-sdk. Perhaps a version update is needed?

KarthikAthisamy · 2020-08-27T13:41:27Z

Is there any update on this open issue? I too see the same kind of problem:

2020-08-25T14:12:08Z E! [agent] Error writing to outputs.wavefront: Wavefront sending error: write tcp 10.x.xxx.xx:35042->172.20.56.248:2878: write: broken pipe
2020-08-25T14:12:12Z E! [agent] Error writing to outputs.wavefront: Wavefront sending error: write tcp 10.x.xxx.xx:35042->172.20.56.248:2878: write: broken pipe
2020-08-25T14:12:13Z I! resetting wavefront proxy connection
2020-08-25T14:12:13Z I! write tcp 10.x.xxx.xx:35042->172.xx.xx.xxx:2878: write: broken pipe

prydin · 2020-08-27T14:00:40Z

Running the Wavefront output in "plain socket mode" assumes a stable connection between Telegraf and the Wavefront proxy. That requirement is not satisfied if the load balancer resets the connection after 60 seconds. I would recommend using the HTTP protocol instead, as it is a lot more load balancer friendly and doesn't require a stable connection.

randallt · 2020-08-27T14:46:45Z

@prydin, can you give the exact format we'd need to use for http mode, and since which telegraf version it has been supported?

randallt · 2020-12-15T12:41:12Z

See my workaround #2 in the description above for how to use http mode. Essentially, you use a url field instead of host and port. I believe the wavefront output plugin doc has an example.
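
For reference, a minimal sketch of what that change might look like, based on the configuration in the original report. The hostname and port are the ones from that report; the "http://" scheme and the assumption that the proxy accepts HTTP on the same port are not confirmed anywhere in this thread, so check them against the wavefront output plugin doc and your proxy setup.

```toml
[[outputs.wavefront]]
  ## http mode: set url and leave out host and port.
  ## Assumption: the proxy behind the ELB accepts HTTP on port 2878.
  url = "http://wfproxy.example.net:2878"
  metric_separator = "."
  convert_paths = true
```

The other two settings are carried over unchanged from the original report; only host and port are replaced by url.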

sspaink · 2022-11-02T16:58:31Z

I think this was resolved in: #11560

danielnelson added the area/wavefront and docs labels Mar 12, 2020

LukeWinikates mentioned this issue Jul 28, 2022

fix(outputs.wavefront): update wavefront sdk and use non-deprecated APIs #11560

Merged

sspaink closed this as completed Nov 2, 2022
