Add support for retrying output writes, using independent threads #298
Conversation
@daviesalex this change will add support for retrying failed writes. Initially I've decided to implement it a bit more simply than with a circular shared buffer: it simply spins off a thread for each "batch" of points, retrying each batch independently. |
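A rough sketch of what a per-batch retry thread might look like in Go (the names `Point`, `Output`, `flushWithRetry`, and the retry parameters are hypothetical placeholders, not Telegraf's actual internals):

```go
package retrywrite

import (
	"log"
	"time"
)

// Point and Output stand in for Telegraf's internal metric and output types.
type Point struct {
	Name  string
	Value float64
}

type Output interface {
	Write(points []Point) error
}

// flushWithRetry launches an independent goroutine for a single batch of
// points. The goroutine retries the write on failure, sleeping between
// attempts, and drops the batch once the retry limit is reached.
func flushWithRetry(out Output, batch []Point, maxRetries int, backoff time.Duration) {
	go func() {
		for attempt := 1; attempt <= maxRetries; attempt++ {
			err := out.Write(batch)
			if err == nil {
				return
			}
			log.Printf("write of %d points failed (attempt %d/%d): %v",
				len(batch), attempt, maxRetries, err)
			time.Sleep(backoff)
		}
		log.Printf("dropping batch of %d points after %d failed attempts",
			len(batch), maxRetries)
	}()
}
```

Because each batch gets its own goroutine, failed batches retry independently of one another, which is what the later discussion about retry backoff refers to.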
@sparrc looks great, we will test this out at scale this week and report back. |
I'm not 100% sure which issue is the best one to comment on; apologies if I have picked the wrong one. But, as promised, we tested this. Summary:
We deleted all our data and pushed a change to 1,000 hosts with flush_interval at 60s and jitter_interval at 30. The graph below shows the moment we brought this config up (before the obvious change, the old config; after the change, 1,000 of our several thousand hosts had moved to the new config).

These 1,000 hosts have pretty accurate timesync (<25us) and are only a small number of cut-through (i.e. very fast) switch hops from the InfluxDB test nodes, so this is an extreme example of the problem, but we saw extreme microbursts (causing drops on Intel 10G NICs). We have not yet done network inspection of capture files, but I strongly suspect we are going to see a <100us microburst once per second. This isn't affecting InfluxDB too much, but it is making the network hardware in the middle have a bad day.

We have two further suggestions (as well as making it a fixed-size buffer in #285):
We are going to leave this configuration running for the weekend and will do more testing on Monday. Happy to do any specific testing you can suggest. Thanks! -Alex |
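As a point of reference for the settings mentioned above, the flush scheduling being discussed amounts roughly to the following (a simplified sketch that assumes the agent computes a fresh random jitter before each flush; this is not Telegraf's actual flush loop):

```go
package agent

import (
	"math/rand"
	"time"
)

// jitteredFlushLoop flushes once per interval plus a random jitter of up to
// maxJitter. The jitter is what is supposed to spread writes from many
// agents across the interval instead of landing them on the same instant.
func jitteredFlushLoop(interval, maxJitter time.Duration, flush func()) {
	for {
		var jitter time.Duration
		if maxJitter > 0 {
			jitter = time.Duration(rand.Int63n(int64(maxJitter)))
		}
		time.Sleep(interval + jitter)
		flush()
	}
}
```

With flush_interval = 60s and jitter_interval = 30 (presumably seconds) across 1,000 hosts, the jitter should spread writes out across each flush window; the microbursts in the graph suggest that was not happening in practice, which the next comment explains.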
So, a further graph of real-world usage: this shows Rx traffic for 3 nodes (all writes going to one, the yellow).

Box 1 - old client

The spikiness in boxes 3/4 we believe is caused by clients bursting at the same time on a specific second. Our metric data is collected once per 10 seconds on the server, so we need to do more analysis to really dig into this, which will happen on Monday. |
Thank you @daviesalex, this is very good information. My first reaction was that we must be choosing a random number at 1s resolution, but we're actually choosing it at 1ns resolution. Most likely the problem is that each Telegraf binary is using the same seed for its random number generator.

From what I can tell, adding another random sleep between 0 and 1 seconds will only result in all Telegraf instances sleeping for the same "random" amount, and the microbursts will remain.

I also see what you mean about the backoff on retries. The current implementation has the batches of points retrying independently. This means that if you have an InfluxDB server down for more than 2 flush intervals, then each Telegraf instance will have 3 batches of points backed up, and will be trying to flush those 3 batches on the same interval. Having the persistent buffer will fix this. |
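To illustrate the seeding point: in the Go versions of that era, the global `math/rand` source behaves as if seeded with 1 unless `rand.Seed` is called, so every agent process draws the identical sequence of "random" jitters and flushes on the same schedule. A minimal sketch of the kind of per-process seeding that avoids this (an illustration, not necessarily the fix that landed):

```go
package agent

import (
	"math/rand"
	"time"
)

// Seed the global math/rand source once per process. Without this, every
// process uses the default seed and computes the same "random" jitter
// sequence; seeding from the startup time makes the jitter actually differ
// across a fleet of agents started at roughly the same moment.
func init() {
	rand.Seed(time.Now().UnixNano())
}
```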
PS @daviesalex can we send you some InfluxDB swag to say "thanks" for this? If you send your t-shirt size and address to cameron@influxdb.com we can get a package out to you :-) |
Fixes #285