StatsD plugin: Error parsing incoming stats under load #543
Comments
@njawalkar What are you using to send the statsd packets (i.e., which client library)? In theory, the flushing and parsing of UDP stats run in their own threads and shouldn't affect one another, but it's possible that the system is being stressed too much. Do you happen to have load, CPU, or memory stats for the machine the statsd server is running on? Another possible solution would be to make the UDP listen buffer configurable: https://github.com/influxdata/telegraf/blob/master/plugins/inputs/statsd/statsd.go#L223. I'd be curious to see whether increasing that buffer helps your situation (I can put that option into telegraf 0.10.1).
I'm using this: https://github.com/njawalkar/java-influx-statsd-client. The test is deterministic: it reads a pre-existing file of events and sends them. I've verified that the file contents are valid. During this test, CPU usage stays below 10% and about 50% of RAM is free (Core i7, 8 GB). What happens if ReadFromUDP returns a buffer that has a partial metric at the end? Like:
When you split by "\n" you get two strings
Is this what is happening?
I'm fairly certain this is the issue: https://github.com/njawalkar/java-influx-statsd-client/blob/master/src/main/java/com/timgroup/statsd/NonBlockingStatsDClient.java#L49. The client library is sending 1500-byte packets, and the telegraf statsd input is accepting only 1024-byte packets: https://github.com/influxdata/telegraf/blob/master/plugins/inputs/statsd/statsd.go#L223. So, as you said, the packets are getting split, which results in some invalid metrics being received. To be clear, the issue is with telegraf, not your client. I will change telegraf to accept 1500-byte packets by default (this is more standard than 1024 anyway), and I'll also make that buffer size configurable. Thanks for the detailed report, @njawalkar!
Also modifying the internal UDP listener/parser code so that it can handle higher load. The UDP listener will no longer do any parsing or string conversion. It will simply read UDP packets as bytes and put them into a channel. The parser thread will now deal with splitting the UDP metrics into separate strings. This could probably be made even better by leaving everything as byte arrays. Fixes #543
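A rough sketch of the listener/parser split described in that change, under stated assumptions: the function names `listen` and `parse`, the channel sizes, and the loopback demo in `main` are hypothetical and are not taken from telegraf's source. The listener only reads bytes and forwards them; the parser goroutine does the splitting.

```go
package main

import (
	"bytes"
	"fmt"
	"net"
)

// listen reads raw datagrams and hands them to the parser untouched.
// bufSize plays the role of the configurable read buffer.
func listen(conn net.PacketConn, bufSize int, packets chan<- []byte) {
	buf := make([]byte, bufSize)
	for {
		n, _, err := conn.ReadFrom(buf)
		if err != nil {
			close(packets) // connection closed: let the parser finish
			return
		}
		// Copy before sending, because buf is reused by the next read.
		pkt := make([]byte, n)
		copy(pkt, buf[:n])
		packets <- pkt
	}
}

// parse splits each packet into newline-separated metric lines and
// skips empty lines (for example after a trailing newline).
func parse(packets <-chan []byte, lines chan<- string) {
	for pkt := range packets {
		for _, l := range bytes.Split(pkt, []byte("\n")) {
			if len(l) > 0 {
				lines <- string(l)
			}
		}
	}
	close(lines)
}

func main() {
	conn, err := net.ListenPacket("udp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	packets := make(chan []byte, 1000)
	lines := make(chan string, 1000)
	go listen(conn, 1500, packets)
	go parse(packets, lines)

	// Send one datagram with two hypothetical metrics to ourselves.
	client, err := net.Dial("udp", conn.LocalAddr().String())
	if err != nil {
		panic(err)
	}
	defer client.Close()
	if _, err := client.Write([]byte("a:1|c\nb:2|g")); err != nil {
		panic(err)
	}

	fmt.Println(<-lines)
	fmt.Println(<-lines)
}
```

The buffered channel lets the listener return to reading quickly while parsing happens elsewhere. The per-packet copy matters in this sketch: sending `buf[:n]` directly would let the next read overwrite data the parser has not processed yet.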
Thanks! I should have mentioned that the packet size in the client is 1500, but it slipped my mind. Thanks for making the input buffer configurable.
I'm sending ~1000 stats per second to a telegraf instance in statsd format. I see errors like the following every time it flushes data to the backing influxdb instance:
If I send the same stats at a much slower rate (~20 per second), everything works fine. It looks like the plugin is grabbing incomplete lines during the flush event and trying to parse them. Note that under the same load, the errors never show up between flush intervals; they appear only when the flush actually happens.
Additionally, these stats are coming from multiple sources (as you can see from the "host" tag). Not sure if that makes a difference.