Repair behaviour - AAE fullsync #773

martinsumner · 2017-10-23T10:13:10Z

There are two potentially curious aspects of repair behaviour with AAE full-sync. This may be a misreading of the code, but:

1. If two clusters are found to be out of sync, with a large number of objects being more up-to-date on Cluster A than Cluster B, bi-directional full-syncing between A and B will lead to B sending all its out-of-date objects to A (perhaps before A sends all the up-to-date objects to B).
2. If there are a significant number of differences between the two clusters, the source side will try to resolve this by building a bloom filter of mismatched keys and performing a full object fold over the vnode using the bloom filter as an input. However, it only starts this fold after it has already sent 5% of the keyspace through random reads.

For the first part, this appears to be a consequence of the AAE store holding only the hashes, not the clocks, so AAE has no way of determining which side is up-to-date. Resolving this may require significant change, so it is a design issue rather than an implementation issue.
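To illustrate why hashes alone cannot give a direction, here is a minimal Python sketch. It is not the riak_repl or riak_kv code: the object layout, the `digest` scheme, and the function names are all assumptions made for illustration. It only shows that a hash comparison is symmetric, while a clock comparison can say which side is ahead.

```python
import hashlib


def digest(obj):
    """Placeholder hash of an object (hypothetical scheme, not Riak's)."""
    clock_bytes = repr(sorted(obj["clock"].items())).encode()
    return hashlib.sha256(clock_bytes + obj["value"]).hexdigest()


def compare_hashes(obj_a, obj_b):
    """All a hash-only store can report: the same, or different."""
    return "in_sync" if digest(obj_a) == digest(obj_b) else "different"


def descends(a, b):
    """True if clock a has seen every event recorded in clock b."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())


def compare_clocks(obj_a, obj_b):
    """With clocks available, the comparison can also give a direction."""
    a_ge = descends(obj_a["clock"], obj_b["clock"])
    b_ge = descends(obj_b["clock"], obj_a["clock"])
    if a_ge and b_ge:
        return "in_sync"
    if a_ge:
        return "a_newer"
    if b_ge:
        return "b_newer"
    return "concurrent"


# Cluster A holds the newer version, Cluster B the older one.
on_a = {"value": b"v2", "clock": {"n1": 2}}
on_b = {"value": b"v1", "clock": {"n1": 1}}

print(compare_hashes(on_a, on_b))  # same answer whichever side asks
print(compare_clocks(on_a, on_b))  # identifies A as the newer side
```

With only the first comparison available, each side sees a mismatch and sends its own copy, which matches the behaviour described in the first point.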

For the second part, this is where the bloom is generated: https://github.com/basho/riak_repl/blob/develop/src/riak_repl_aae_source.erl#L379-L386

The 5% limit is defined here: https://github.com/basho/riak_repl/blob/develop/src/riak_repl_aae_source.erl#L292

The actual transition between using random reads and a fold is defined here: https://github.com/basho/riak_repl/blob/develop/src/riak_repl_aae_source.erl#L543-L582

So if you have 1M keys in the vnode and 50,001 differences, I think it will fix 50K differences through random reads, and then resolve the last difference by creating a bloom and folding over all the objects. Since the differences would be expected to be randomly distributed across the segments of the AAE tree, it seems plausible that the decision could be made earlier (perhaps after a sample of 1000 random reads) that the 5% limit is likely to be breached, and the bloom approach invoked at that point.
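The arithmetic and the suggested earlier decision can be sketched in Python as below. This is a rough model of my reading of the behaviour, not the riak_repl implementation: `split_repairs`, `should_switch_early`, the segment counts, and the `min_sample` parameter are hypothetical names and values, and the projection assumes differences are spread evenly across segments.

```python
def split_repairs(total_keys, differences, limit_percent=5):
    """Return (random_reads, bloom_fold_keys) under the reading above.

    Differences are repaired by random reads until the limit is reached;
    only the remainder goes into the bloom and the full object fold.
    """
    limit = total_keys * limit_percent // 100
    direct = min(differences, limit)
    return direct, differences - direct


def should_switch_early(diffs_seen, segments_compared, total_segments,
                        total_keys, limit_percent=5, min_sample=1000):
    """Project the final difference count from the segments compared so far.

    Returns True once the sample is large enough and the projection
    exceeds the limit, so the bloom and fold could be used straight away.
    """
    if diffs_seen < min_sample or segments_compared == 0:
        return False
    projected = diffs_seen * total_segments / segments_compared
    return projected > total_keys * limit_percent / 100


# The example from the issue: 1M keys and 50,001 differences.
print(split_repairs(1_000_000, 50_001))

# Hypothetical: 1000 differences found after comparing 10,000 of
# 1,048,576 segments projects to roughly 105K differences overall.
print(should_switch_early(1000, 10_000, 1_048_576, 1_000_000))
```

The point of the second function is that the fold is going to be paid for anyway once the limit is breached, so spotting that early would avoid the 50K random reads that precede it.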

martinsumner changed the title ~~Repair behaviour~~ Repair behaviour - AAE fullsync Oct 23, 2017

martinsumner mentioned this issue Dec 4, 2017

Does ebloom always fail? #774

Closed
