Only retain reasonable history for peer recoveries (#45208) #45355

original-brownbear · 2019-08-08T20:44:56Z

Today if a shard is not fully allocated we maintain a retention lease for a
lost peer for up to 12 hours, retaining all operations that occur in that time
period so that we can recover this replica using an operations-based recovery
if it returns. However it is not always reasonable to perform an
operations-based recovery on such a replica: if the replica is a very long way
behind the rest of the replication group then it can be much quicker to perform
a file-based recovery instead.

This commit introduces a notion of "reasonable" recoveries. If an
operations-based recovery would involve copying only a small number of
operations, but the index is large, then an operations-based recovery is
reasonable; on the other hand if there are many operations to copy across and
the index itself is relatively small then it makes more sense to perform a
file-based recovery. We measure the size of the index by computing its number
of documents (including deleted documents) in all segments belonging to the
current safe commit, and compare this to the number of operations a lease is
retaining below the local checkpoint of the safe commit. We consider an
operations-based recovery to be reasonable iff it would involve replaying at
most 10% of the documents in the index.
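
The 10% rule above can be sketched as follows. This is only an illustration of the arithmetic, not the Elasticsearch code: the function names are hypothetical, and it assumes "below the local checkpoint" means every sequence number from the lease's retaining sequence number up to and including the safe commit's local checkpoint.

```python
def retained_ops_below_checkpoint(retaining_seq_no: int, safe_commit_local_checkpoint: int) -> int:
    """Count the operations a lease retains at or below the safe commit's local checkpoint.

    Assumption: the lease retains every operation with a sequence number
    >= retaining_seq_no, and the checkpoint itself is included in the count.
    """
    return max(0, safe_commit_local_checkpoint + 1 - retaining_seq_no)


def is_reasonable_ops_based_recovery(retained_ops: int, safe_commit_docs: int,
                                     threshold_percent: int = 10) -> bool:
    """Return True if replaying retained_ops touches at most threshold_percent of the docs.

    safe_commit_docs is the document count (deleted documents included) across
    the segments of the safe commit. Integer arithmetic avoids rounding
    surprises at the boundary.
    """
    return retained_ops * 100 <= threshold_percent * safe_commit_docs


# A lease retaining seq nos 901..1000 against a 1000-document safe commit
# replays exactly 10% of the index, which is still reasonable.
ops = retained_ops_below_checkpoint(retaining_seq_no=901, safe_commit_local_checkpoint=1000)
print(ops, is_reasonable_ops_based_recovery(ops, safe_commit_docs=1000))
```

With one more retained operation (101 against 1000 documents) the check fails and a file-based recovery is preferred.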

The mechanism for this feature is to expire peer-recovery retention leases
early if they are retaining so much history that an operations-based recovery
using that lease would be unreasonable.
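
A minimal sketch of the early-expiry idea, under stated assumptions: the `Lease` type, the lease IDs, and `split_leases` are hypothetical names for illustration, and the real change also has to respect the existing time-based expiry, which is omitted here.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Lease:
    """Hypothetical stand-in for a peer-recovery retention lease."""
    lease_id: str
    retaining_seq_no: int  # lowest sequence number the lease keeps


def split_leases(leases: Iterable[Lease],
                 safe_commit_local_checkpoint: int,
                 safe_commit_docs: int,
                 threshold_percent: int = 10) -> Tuple[List[Lease], List[Lease]]:
    """Partition leases into (kept, expired_early).

    A lease is expired early when an operations-based recovery using it would
    replay more than threshold_percent of the documents in the safe commit.
    """
    kept: List[Lease] = []
    expired: List[Lease] = []
    for lease in leases:
        # Assumption: operations retained below the checkpoint are counted inclusively.
        retained = max(0, safe_commit_local_checkpoint + 1 - lease.retaining_seq_no)
        if retained * 100 <= threshold_percent * safe_commit_docs:
            kept.append(lease)
        else:
            expired.append(lease)
    return kept, expired


leases = [Lease("peer_recovery/node-a", 951), Lease("peer_recovery/node-b", 200)]
kept, expired = split_leases(leases, safe_commit_local_checkpoint=1000, safe_commit_docs=1000)
print([l.lease_id for l in kept], [l.lease_id for l in expired])
```

Once a lease is expired, the history it was holding can be discarded, so a returning peer that depended on it falls back to a file-based recovery.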

Relates #41536

Backport of #45208

elasticmachine · 2019-08-08T20:44:58Z

Pinging @elastic/es-distributed

original-brownbear · 2019-08-08T22:11:02Z

Jenkins run elasticsearch-ci/bwc

original-brownbear added the ":Distributed Indexing/Recovery" and "backport" labels Aug 8, 2019

original-brownbear merged commit 12ed6dc into elastic:7.x Aug 8, 2019

original-brownbear deleted the 45208-7.x branch August 8, 2019 23:56
