StatsD plugin: Error parsing incoming stats under load #543

Closed
njawalkar opened this issue Jan 18, 2016 · 4 comments
Labels
bug unexpected problem or unintended behavior

Comments

@njawalkar

I'm sending ~1000 stats per second in statsd format to a telegraf instance. I see errors like the following every time it flushes data to the backing InfluxDB instance:

```
statsd_cmd,host=vm1,metric_type=counter,op=CMD1,rcode=1300,source=metricsPrototype value=37i 1453134040013670914
statsd_cmd,host=vm3,metric_type=counter,op=CMD1,rcode=1301,source=metricsPrototype value=2i 1453134040013673140
statsd_cmd,host=vm2,metric_type=counter,op=CMD2,rcode=1000,source=metricsPrototype value=126i 1453134040013735233
statsd_cmd,host=vm6,metric_type=counter,op=CMD3,rcode=1000,source=metricsPrototype value=6i 1453134040013745935
2016/01/18 17:20:40 Gathered metrics, ({2s} interval), from 1 plugins in 10.769936ms
2016/01/18 17:20:40 Error: splitting ':', Unable to parse metric: cmd,source=metricsProto
2016/01/18 17:20:40 Error: splitting ':', Unable to parse metric: rtime.vm7,source=metricsProt
2016/01/18 17:20:40 Flushed 634 metrics to output influxdb in 20.330449ms
2016/01/18 17:20:41 Error: splitting ':', Unable to parse metric: cmd,source=metricsPrototype,rcode=1000,host=vm
2016/01/18 17:20:41 Error: splitting ':', Unable to parse metric: rtime,source=metricsPrototype,op
2016/01/18 17:20:41 Error: splitting ':', Unable to parse metric: cmd,source=me
2016/01/18 17:20:41 Error: splitting ':', Unable to parse metric: cmd,source=metricsPrototype,r
statsd_rtime_vm5_mean,host=vm,metric_type=timing,op=CMD1,source=metricsPrototype value=2 1453134042004054296
statsd_rtime_vm5_stddev,host=vm,metric_type=timing,op=CMD1,source=metricsPrototype value=0 1453134042004461904
statsd_rtime_vm5_upper,host=vm,metric_type=timing,op=CMD1,source=metricsPrototype value=2 1453134042004830888
statsd_rtime_vm5_lower,host=vm,metric_type=timing,op=CMD1,source=metricsPrototype value=2 1453134042004882409
```

If I send the same stats at a much slower rate (~20 per second), everything works fine. It looks like the plugin is grabbing incomplete lines during the flush event and trying to parse them. Note that under the same load, the errors never show up between flush intervals; they appear only when the flush actually happens.

Additionally, these stats are coming from multiple sources (as you can see from the "host" tag). Not sure if that makes a difference.
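For reference, the incoming stats are statsd counters and timers with influx-style tags embedded in the metric name. A reconstructed example of one such line (tag and field values are illustrative, inferred from the log output above):

```
cmd,source=metricsPrototype,host=vm1,op=CMD1,rcode=1300:37|c
```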

@sparrc
Contributor

sparrc commented Jan 18, 2016

@njawalkar What are you using to send the statsd packets (i.e., which client library)?

In theory the flushing and parsing of UDP stats happen in their own threads and shouldn't affect one another, but it's possible that the system is being stressed too much. Do you happen to have load, CPU, or memory stats for the machine the statsd server is running on?

Another possible solution would be to make the UDP listen buffer configurable: https://github.com/influxdata/telegraf/blob/master/plugins/inputs/statsd/statsd.go#L223. I'd be curious to see whether increasing that buffer helps your situation (I can put that option into telegraf 0.10.1).

@njawalkar
Author

I'm using this: https://github.com/njawalkar/java-influx-statsd-client
It is a modification of the dogstatsd client that follows the InfluxDB line protocol (dogstatsd formats tags a little differently from what telegraf expects).

The test is deterministic: it reads a pre-existing file of events and sends them, and I've verified that the file contents are valid. During this test, the CPU stays below 10% and RAM is about 50% free (Core i7, 8 GB).

What happens if ReadFromUDP returns a buffer that has a partial metric at the end?

Like:

```
cmd,host=vm,op=a:1|c\ncmd,host=vm,op=b:2|c\ncmd,host=...
^----------buffer-----------^^------next buffer------...
```

When you split by "\n", you get two strings:

```
cmd,host=vm,op=a:1|c  (valid metric)
cmd,hos               (fails when parsed)
```

Is this what is happening?
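A minimal sketch of that scenario in Go, with the buffer contents made up to match the diagram above (assuming a fixed-size read really does end mid-metric):

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	// Pretend this is what one fixed-size read captured: the second
	// metric was cut off because the buffer ended mid-line.
	buf := "cmd,host=vm,op=a:1|c\ncmd,hos"

	// Splitting on "\n" yields one valid metric and one fragment.
	for _, line := range strings.Split(buf, "\n") {
		if strings.Contains(line, ":") {
			fmt.Printf("parsed ok:   %q\n", line)
		} else {
			// The fragment has no ':' separator, which matches the
			// "Error: splitting ':'" messages in the telegraf log.
			fmt.Printf("parse error: %q\n", line)
		}
	}
}
```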

@sparrc
Contributor

sparrc commented Jan 19, 2016

I'm fairly certain this is the issue: https://github.com/njawalkar/java-influx-statsd-client/blob/master/src/main/java/com/timgroup/statsd/NonBlockingStatsDClient.java#L49

The client library is sending 1500-byte packets, while the telegraf statsd plugin is accepting only 1024-byte packets: https://github.com/influxdata/telegraf/blob/master/plugins/inputs/statsd/statsd.go#L223. So, as you said, the packets are getting split, which results in some invalid metrics being received.

To be clear, the issue is with telegraf, not your client. I will change telegraf to accept 1500-byte packets by default (this is more standard than 1024 anyway), and I'll also make that buffer size configurable.

Thanks for the detailed report, @njawalkar!
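A hedged illustration of the mismatch (port and buffer sizes are only for the sketch, not the actual telegraf code): when the buffer handed to ReadFromUDP is smaller than the incoming datagram, only the first len(buf) bytes come back.

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// Listen on the usual statsd port (address and port are illustrative).
	conn, err := net.ListenUDP("udp", &net.UDPAddr{IP: net.ParseIP("127.0.0.1"), Port: 8125})
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// A 1024-byte buffer cannot hold a 1500-byte statsd packet.
	// ReadFromUDP returns at most len(buf) bytes, and on typical Linux
	// setups the remainder of that datagram is simply dropped, leaving
	// a partial final metric in the data that does get parsed.
	buf := make([]byte, 1024)
	n, addr, err := conn.ReadFromUDP(buf)
	if err != nil {
		panic(err)
	}
	fmt.Printf("read %d bytes from %v; anything past byte %d of that packet is lost\n", n, addr, n)
}
```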

@sparrc sparrc added the bug unexpected problem or unintended behavior label Jan 19, 2016
sparrc added a commit that referenced this issue Jan 19, 2016
Also modifying the internal UDP listener/parser code to make it able to
handle higher load. The udp listener will no longer do any parsing or
string conversion. It will simply read UDP packets as bytes and put them
into a channel. The parser thread will now deal with splitting the UDP
metrics into separate strings.

This could probably be made even better by leaving everything as byte
arrays.

fixes #543
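A minimal sketch of the listener/parser split that commit describes, under the assumption of a buffered channel between two goroutines (names and sizes are hypothetical, not the actual telegraf code):

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// listener only reads raw UDP packets and hands the bytes to a channel;
// no parsing or string conversion happens on this goroutine.
func listener(conn *net.UDPConn, packets chan<- []byte) {
	buf := make([]byte, 1500)
	for {
		n, _, err := conn.ReadFromUDP(buf)
		if err != nil {
			close(packets)
			return
		}
		pkt := make([]byte, n)
		copy(pkt, buf[:n]) // copy, since buf is reused on the next read
		packets <- pkt
	}
}

// parser drains the channel and splits each packet into metric lines.
func parser(packets <-chan []byte) {
	for pkt := range packets {
		for _, line := range strings.Split(string(pkt), "\n") {
			if line != "" {
				fmt.Println("metric line:", line)
			}
		}
	}
}

func main() {
	conn, err := net.ListenUDP("udp", &net.UDPAddr{Port: 8125})
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Buffered channel so a slow parse doesn't block the UDP reads.
	packets := make(chan []byte, 1000)
	go listener(conn, packets)
	parser(packets)
}
```

The buffered channel decouples the read loop from parsing, so a burst of packets arriving during a flush doesn't stall the UDP reads.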
@sparrc sparrc closed this as completed in 839651f Jan 19, 2016
@njawalkar
Author

Thanks! I should have mentioned that the client's packet size is 1500, but it slipped my mind. Thanks for making the input buffer configurable.
