Add support for retrying output writes, using independent threads #298
Conversation
@daviesalex this change will add support for retrying failed writes. Initially I've decided to implement it a bit more simply than with a circular shared buffer: it simply spins off a thread for each "batch" of points, retrying each batch independently. |
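A rough sketch of what a per-batch retry thread might look like in Go (the names `Point`, `Output`, `flushWithRetry`, and the retry parameters are hypothetical placeholders, not Telegraf's actual internals):

```go
package retrywrite

import (
	"log"
	"time"
)

// Point and Output stand in for Telegraf's internal metric and output types.
type Point struct {
	Name  string
	Value float64
}

type Output interface {
	Write(points []Point) error
}

// flushWithRetry launches an independent goroutine for a single batch of
// points. The goroutine retries the write on failure, sleeping between
// attempts, and drops the batch once the retry limit is reached.
func flushWithRetry(out Output, batch []Point, maxRetries int, backoff time.Duration) {
	go func() {
		for attempt := 1; attempt <= maxRetries; attempt++ {
			err := out.Write(batch)
			if err == nil {
				return
			}
			log.Printf("write of %d points failed (attempt %d/%d): %v",
				len(batch), attempt, maxRetries, err)
			time.Sleep(backoff)
		}
		log.Printf("dropping batch of %d points after %d failed attempts",
			len(batch), maxRetries)
	}()
}
```

Because each batch gets its own goroutine, failed batches retry independently of one another, which is what the later discussion about retry backoff refers to.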
@sparrc looks great, we will test this out at scale this week and report back. |
I'm not 100% sure which issue is the best one to comment on; apologies if I have picked the wrong one. But, as promised, we tested this. Summary:
We deleted all our data and pushed a change to 1,000 hosts with flush_interval at 60s and jitter_interval at 30. The graph below shows the moment we brought this config up (before the obvious change, the old config; after the change, 1,000 of our several thousand hosts had moved to the new config).

These 1,000 hosts have pretty accurate timesync (<25us) and are only a small number of cut-through (i.e. very fast) switch hops from the InfluxDB test nodes, so this is an extreme example of the problem, but we saw extreme microbursts (causing drops on Intel 10G NICs). We have not yet done network inspection of capture files, but I strongly suspect we are going to see a <100us microburst once per second. This isn't affecting InfluxDB too much, but it is making the network hardware in the middle have a bad day.

We have two further suggestions (as well as making it a fixed-size buffer in #285):
We are going to leave this configuration running for the weekend and will do more testing on Monday. Happy to do any specific testing you can suggest. Thanks! -Alex |
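As a point of reference for the settings mentioned above, the flush scheduling being discussed amounts roughly to the following (a simplified sketch that assumes the agent computes a fresh random jitter before each flush; this is not Telegraf's actual flush loop):

```go
package agent

import (
	"math/rand"
	"time"
)

// jitteredFlushLoop flushes once per interval plus a random jitter of up to
// maxJitter. The jitter is what is supposed to spread writes from many
// agents across the interval instead of landing them on the same instant.
func jitteredFlushLoop(interval, maxJitter time.Duration, flush func()) {
	for {
		var jitter time.Duration
		if maxJitter > 0 {
			jitter = time.Duration(rand.Int63n(int64(maxJitter)))
		}
		time.Sleep(interval + jitter)
		flush()
	}
}
```

With flush_interval = 60s and jitter_interval = 30 (presumably seconds) across 1,000 hosts, the jitter should spread writes out across each flush window; the microbursts in the graph suggest that was not happening in practice, which the next comment explains.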
So, a further graph of real-world usage: this shows Rx traffic for 3 nodes (all writes going to one, the yellow).

Box 1 - old client

The spikiness in boxes 3/4 we believe is caused by clients bursting at the same time on a specific second. Our metric data is collected once per 10 seconds on the server, so we need to do more analysis to really dig into this, which will happen on Monday. |
Thank you @daviesalex, this is very good information. My first reaction was that we must be choosing a random number at 1s resolution, but we're actually choosing it at 1ns resolution. Most likely the problem is that each Telegraf binary is using the same seed for its random number generator.

From what I can tell, adding another random sleep between 0 and 1 seconds will only result in all Telegraf instances sleeping for the same "random" amount, and the microbursts will remain.

I also see what you mean about the backoff on retries. The current implementation has the batches of points retrying independently. This means that if you have an InfluxDB server down for more than 2 flush intervals, then each Telegraf instance will have 3 batches of points backed up, and will be trying to flush those 3 batches on the same interval. Having the persistent buffer will fix this. |
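To illustrate the seeding point: in the Go versions of that era, the global `math/rand` source behaves as if seeded with 1 unless `rand.Seed` is called, so every agent process draws the identical sequence of "random" jitters and flushes on the same schedule. A minimal sketch of the kind of per-process seeding that avoids this (an illustration, not necessarily the fix that landed):

```go
package agent

import (
	"math/rand"
	"time"
)

// Seed the global math/rand source once per process. Without this, every
// process uses the default seed and computes the same "random" jitter
// sequence; seeding from the startup time makes the jitter actually differ
// across a fleet of agents started at roughly the same moment.
func init() {
	rand.Seed(time.Now().UnixNano())
}
```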
PS @daviesalex can we send you some InfluxDB swag to say "thanks" for this? If you send your t-shirt size and address to cameron@influxdb.com we can get a package out to you :-) |
Fixes #285