Retain history for peer recovery using leases #41536
Labels: :Distributed Indexing/Recovery (Anything around constructing a new shard, either from a local or a remote source), >enhancement, Meta, v7.4.0

Comments
DaveCTurner added the >enhancement, :Distributed Indexing/Recovery, Meta, and 7x labels on Apr 25, 2019.
Pinging @elastic/es-distributed
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue on Jun 13, 2019:
This creates a peer-recovery retention lease for every shard during recovery, ensuring that the replication group retains history for future peer recoveries. It also ensures that leases for active shard copies do not expire, and leases for inactive shard copies expire immediately if the shard is fully-allocated. Relates elastic#41536
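As a rough sketch of the expiry rule described in this commit (all names here are hypothetical, not taken from the Elasticsearch code base): a lease held for a tracked, active copy never expires; a lease for an untracked copy is dropped immediately once the shard is fully allocated, and otherwise survives until the normal retention period elapses.

```java
// Hypothetical sketch of the peer-recovery retention lease (PRRL) expiry rule.
// None of these names come from the Elasticsearch code base.
final class PeerRecoveryLeaseExpiry {

    /**
     * @param copyIsTracked       whether the shard copy holding the lease is an active, tracked copy
     * @param shardFullyAllocated whether every copy of the shard is currently assigned
     * @param leaseAgeMillis      time since the lease was last renewed
     * @param retentionMillis     the configured retention lease period (12h by default)
     */
    static boolean shouldExpire(boolean copyIsTracked,
                                boolean shardFullyAllocated,
                                long leaseAgeMillis,
                                long retentionMillis) {
        if (copyIsTracked) {
            return false;                        // active copies keep their lease indefinitely
        }
        if (shardFullyAllocated) {
            return true;                         // nothing is missing, so drop the stale lease now
        }
        return leaseAgeMillis > retentionMillis; // otherwise wait for the normal expiry time
    }
}
```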
DaveCTurner added a commit that referenced this issue on Jun 19, 2019:
This creates a peer-recovery retention lease for every shard during recovery, ensuring that the replication group retains history for future peer recoveries. It also ensures that leases for active shard copies do not expire, and leases for inactive shard copies expire immediately if the shard is fully-allocated. Relates #41536
DaveCTurner added a commit that referenced this issue on Jun 19, 2019:
This creates a peer-recovery retention lease for every shard during recovery, ensuring that the replication group retains history for future peer recoveries. It also ensures that leases for active shard copies do not expire, and leases for inactive shard copies expire immediately if the shard is fully-allocated. Relates #41536
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue on Jun 28, 2019:
This commit adjusts the behaviour of the retention lease sync to first renew any peer-recovery retention leases where either: - the corresponding shard's global checkpoint has advanced, or - the lease is older than half of its expiry time Relates elastic#41536
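The renewal condition can be sketched roughly as follows (hypothetical names, a sketch rather than the real sync code): renew a lease only if doing so would retain less history, or if the lease is old enough that skipping another cycle would risk expiry.

```java
// Hypothetical sketch of the renewal rule: renew a peer-recovery retention lease if the
// tracked copy's global checkpoint has advanced past the currently retained sequence
// number, or if the lease is older than half of its expiry time.
final class LeaseRenewalPolicy {

    static boolean shouldRenew(long retainedSeqNo,                // seq no currently retained by the lease
                               long persistedGlobalCheckpoint,    // persisted GCP of the corresponding copy
                               long leaseAgeMillis,
                               long expiryMillis) {
        final boolean checkpointAdvanced = persistedGlobalCheckpoint + 1 > retainedSeqNo;
        final boolean gettingOld = leaseAgeMillis > expiryMillis / 2;
        return checkpointAdvanced || gettingOld;
    }
}
```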
DaveCTurner added a commit that referenced this issue on Jul 1, 2019:
This commit adjusts the behaviour of the retention lease sync to first renew any peer-recovery retention leases where either: - the corresponding shard's global checkpoint has advanced, or - the lease is older than half of its expiry time Relates #41536
DaveCTurner added a commit that referenced this issue on Jul 1, 2019:
This commit adjusts the behaviour of the retention lease sync to first renew any peer-recovery retention leases where either: - the corresponding shard's global checkpoint has advanced, or - the lease is older than half of its expiry time Relates #41536
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue on Jul 3, 2019:
Today when a PRRL is created during peer recovery it retains history from the sequence number provided by the recovering peer. However this sequence number may be greater than the primary's knowledge of that peer's persisted global checkpoint. Subsequent renewals of this lease will attempt to set the retained sequence number back to the primary's knowledge of that peer's persisted global checkpoint tripping an assertion that retention leases must only advance. This commit accounts for this. Caught by [a failure of `RecoveryWhileUnderLoadIT.testRecoverWhileRelocating`](https://scans.gradle.com/s/wxccfrtfgjj3g/console-log?task=:server:integTest#L14) Relates elastic#41536
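The underlying constraint, that a retention lease may only ever advance, can be pictured with a minimal sketch (hypothetical class, assuming a renewal simply keeps the larger of the old and proposed values; the actual fix in the commit may differ).

```java
// Hypothetical illustration of the "retention leases must only advance" rule: a renewal
// never moves the retained sequence number backwards, even if the primary's current view
// of the peer's persisted global checkpoint is behind the value used at lease creation.
final class MonotonicLease {
    private long retainedSeqNo;

    MonotonicLease(long initialRetainedSeqNo) {
        this.retainedSeqNo = initialRetainedSeqNo;
    }

    void renew(long proposedRetainedSeqNo) {
        // clamp: the lease may advance but must never retreat
        retainedSeqNo = Math.max(retainedSeqNo, proposedRetainedSeqNo);
    }

    long retainedSeqNo() {
        return retainedSeqNo;
    }
}
```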
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue on Jul 3, 2019:
If the primary performs a file-based recovery to a node that has (or recently had) a copy of the shard then it is possible that the persisted global checkpoint of the new copy is behind that of the old copy since file-based recoveries are somewhat destructive operations. Today we leave that node's PRRL in place during the recovery with the expectation that it can be used by the new copy. However this isn't the case if the new copy needs more history to be retained, because retention leases may only advance and never retreat. This commit addresses this by removing any existing PRRL during a file-based recovery: since we are performing a file-based recovery we have already determined that there isn't enough history for an ops-based recovery, so there is little point in keeping the old lease in place. Caught by [a failure of `RecoveryWhileUnderLoadIT.testRecoverWhileRelocating`](https://scans.gradle.com/s/wxccfrtfgjj3g/console-log?task=:server:integTest#L14) Relates elastic#41536
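A hedged sketch of that decision (hypothetical names, not the real recovery planner): if the target's existing lease does not retain all the operations the new copy is missing, an ops-based recovery is impossible anyway, so there is little point keeping the stale lease around during the file-based recovery.

```java
// Hypothetical sketch of choosing the recovery mode and handling any pre-existing
// peer-recovery retention lease for the target node.
enum RecoveryMode { OPS_BASED, FILE_BASED }

final class RecoveryPlanner {

    /**
     * @param requiredStartingSeqNo      first operation the recovering copy is missing
     * @param existingLeaseRetainedSeqNo retained seq no of the target's old lease, or -1 if none
     */
    static RecoveryMode plan(long requiredStartingSeqNo, long existingLeaseRetainedSeqNo) {
        final boolean leaseCoversHistory =
            existingLeaseRetainedSeqNo >= 0 && existingLeaseRetainedSeqNo <= requiredStartingSeqNo;
        if (leaseCoversHistory) {
            return RecoveryMode.OPS_BASED;   // reuse the existing lease and replay operations
        }
        // Not enough retained history: drop the old lease (it can only advance, never retreat
        // to cover more history) and fall back to copying files.
        return RecoveryMode.FILE_BASED;
    }
}
```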
DaveCTurner added a commit that referenced this issue on Jul 4, 2019:
If the primary performs a file-based recovery to a node that has (or recently had) a copy of the shard then it is possible that the persisted global checkpoint of the new copy is behind that of the old copy since file-based recoveries are somewhat destructive operations. Today we leave that node's PRRL in place during the recovery with the expectation that it can be used by the new copy. However this isn't the case if the new copy needs more history to be retained, because retention leases may only advance and never retreat. This commit addresses this by removing any existing PRRL during a file-based recovery: since we are performing a file-based recovery we have already determined that there isn't enough history for an ops-based recovery, so there is little point in keeping the old lease in place. Caught by [a failure of `RecoveryWhileUnderLoadIT.testRecoverWhileRelocating`](https://scans.gradle.com/s/wxccfrtfgjj3g/console-log?task=:server:integTest#L14) Relates #41536
DaveCTurner added a commit that referenced this issue on Jul 4, 2019:
If the primary performs a file-based recovery to a node that has (or recently had) a copy of the shard then it is possible that the persisted global checkpoint of the new copy is behind that of the old copy since file-based recoveries are somewhat destructive operations. Today we leave that node's PRRL in place during the recovery with the expectation that it can be used by the new copy. However this isn't the case if the new copy needs more history to be retained, because retention leases may only advance and never retreat. This commit addresses this by removing any existing PRRL during a file-based recovery: since we are performing a file-based recovery we have already determined that there isn't enough history for an ops-based recovery, so there is little point in keeping the old lease in place. Caught by [a failure of `RecoveryWhileUnderLoadIT.testRecoverWhileRelocating`](https://scans.gradle.com/s/wxccfrtfgjj3g/console-log?task=:server:integTest#L14) Relates #41536
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue on Jul 5, 2019:
Today peer recovery retention leases (PRRLs) are created when starting a replication group from scratch and during peer recovery. However, if the replication group was migrated from nodes running a version which does not create PRRLs (e.g. 7.3 and earlier) then it's possible that the primary was relocated or promoted without first establishing all the expected leases. It's not possible to establish these leases before or during primary activation, so we must create them as soon as possible afterwards. This gives weaker guarantees about history retention, since there's a possibility that history will be discarded before it can be used. In practice such situations are expected to occur only rarely. This commit adds the machinery to create missing leases after primary activation, and strengthens the assertions about the existence of such leases in order to ensure that once all the leases do exist we never again enter a state where there's a missing lease. Relates elastic#41536
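A hedged sketch of that machinery (hypothetical names, not the actual Elasticsearch code): once the primary is activated, any tracked copy that has no peer-recovery retention lease gets one, retaining history from the primary's current knowledge of that copy's global checkpoint.

```java
// Hypothetical sketch: after primary activation, create any peer-recovery retention
// leases that are missing because the group was migrated from a pre-PRRL version.
import java.util.Map;
import java.util.Set;
import java.util.function.BiConsumer;

final class MissingLeaseCreator {

    /**
     * @param trackedCopies   global checkpoint known for each tracked shard copy (by allocation id)
     * @param copiesWithLease allocation ids that already hold a peer-recovery retention lease
     * @param createLease     callback establishing a lease retaining ops from the given seq no
     */
    static void createMissingLeases(Map<String, Long> trackedCopies,
                                    Set<String> copiesWithLease,
                                    BiConsumer<String, Long> createLease) {
        trackedCopies.forEach((allocationId, globalCheckpoint) -> {
            if (copiesWithLease.contains(allocationId) == false) {
                // retain everything above the copy's last-known global checkpoint
                createLease.accept(allocationId, globalCheckpoint + 1);
            }
        });
    }
}
```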
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue on Jul 5, 2019:
Today when renewing PRRLs we assert that any invalid "backwards" renewals must be because we are recovering the shard. In fact it's also possible to have `checkpointState.globalCheckpoint == SequenceNumbers.UNASSIGNED_SEQ_NO` on a tracked shard copy if the primary was just promoted and hasn't received checkpoints from all of its peers too. This commit weakens the assertion to match. Caught by a [failure of the full cluster restart tests](https://scans.gradle.com/s/5lllzgqtuegty/console-log#L8605) Relates elastic#41536
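The weakened assertion amounts to something like the following sketch (hypothetical helper; only the value of `SequenceNumbers.UNASSIGNED_SEQ_NO` is taken from the real code): a "backwards" renewal is only tolerated while the copy is recovering, or while the freshly promoted primary has not yet learned that copy's checkpoint.

```java
// Hypothetical sketch of the weakened assertion: a renewal that does not advance the lease
// is acceptable only if the corresponding copy is still recovering, or if the new primary
// has not yet received a global checkpoint from that copy.
final class RenewalAssertion {
    static final long UNASSIGNED_SEQ_NO = -2; // mirrors SequenceNumbers.UNASSIGNED_SEQ_NO

    static boolean backwardsRenewalIsExplained(boolean shardIsRecovering,
                                               long trackedGlobalCheckpoint) {
        return shardIsRecovering || trackedGlobalCheckpoint == UNASSIGNED_SEQ_NO;
    }
}
```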
DaveCTurner added a commit that referenced this issue on Jul 5, 2019:
Today when renewing PRRLs we assert that any invalid "backwards" renewals must be because we are recovering the shard. In fact it's also possible to have `checkpointState.globalCheckpoint == SequenceNumbers.UNASSIGNED_SEQ_NO` on a tracked shard copy if the primary was just promoted and hasn't received checkpoints from all of its peers too. This commit weakens the assertion to match. Caught by a [failure of the full cluster restart tests](https://scans.gradle.com/s/5lllzgqtuegty/console-log#L8605) Relates #41536
DaveCTurner added a commit that referenced this issue on Jul 5, 2019:
Today when renewing PRRLs we assert that any invalid "backwards" renewals must be because we are recovering the shard. In fact it's also possible to have `checkpointState.globalCheckpoint == SequenceNumbers.UNASSIGNED_SEQ_NO` on a tracked shard copy if the primary was just promoted and hasn't received checkpoints from all of its peers too. This commit weakens the assertion to match. Caught by a [failure of the full cluster restart tests](https://scans.gradle.com/s/5lllzgqtuegty/console-log#L8605) Relates #41536
DaveCTurner added a commit that referenced this issue on Jul 8, 2019:
Today peer recovery retention leases (PRRLs) are created when starting a replication group from scratch and during peer recovery. However, if the replication group was migrated from nodes running a version which does not create PRRLs (e.g. 7.3 and earlier) then it's possible that the primary was relocated or promoted without first establishing all the expected leases. It's not possible to establish these leases before or during primary activation, so we must create them as soon as possible afterwards. This gives weaker guarantees about history retention, since there's a possibility that history will be discarded before it can be used. In practice such situations are expected to occur only rarely. This commit adds the machinery to create missing leases after primary activation, and strengthens the assertions about the existence of such leases in order to ensure that once all the leases do exist we never again enter a state where there's a missing lease. Relates #41536
DaveCTurner added a commit that referenced this issue on Jul 8, 2019:
Today peer recovery retention leases (PRRLs) are created when starting a replication group from scratch and during peer recovery. However, if the replication group was migrated from nodes running a version which does not create PRRLs (e.g. 7.3 and earlier) then it's possible that the primary was relocated or promoted without first establishing all the expected leases. It's not possible to establish these leases before or during primary activation, so we must create them as soon as possible afterwards. This gives weaker guarantees about history retention, since there's a possibility that history will be discarded before it can be used. In practice such situations are expected to occur only rarely. This commit adds the machinery to create missing leases after primary activation, and strengthens the assertions about the existence of such leases in order to ensure that once all the leases do exist we never again enter a state where there's a missing lease. Relates #41536
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue on Jul 25, 2019:
Thanks to peer recovery retention leases we now retain the history needed to perform peer recoveries from the index instead of from the translog. This commit adjusts the peer recovery process to do so, and also adjusts it to use the existence of a retention lease to decide whether or not to attempt an operations-based recovery. Reverts elastic#38904 and elastic#42211 Relates elastic#41536
This was referenced Jul 25, 2019
DaveCTurner added a commit that referenced this issue on Aug 1, 2019:
Thanks to peer recovery retention leases we now retain the history needed to perform peer recoveries from the index instead of from the translog. This commit adjusts the peer recovery process to do so, and also adjusts it to use the existence of a retention lease to decide whether or not to attempt an operations-based recovery. Reverts #38904 and #42211 Relates #41536
DaveCTurner added a commit that referenced this issue on Aug 1, 2019:
Thanks to peer recovery retention leases we now retain the history needed to perform peer recoveries from the index instead of from the translog. This commit adjusts the peer recovery process to do so, and also adjusts it to use the existence of a retention lease to decide whether or not to attempt an operations-based recovery. Reverts #38904 and #42211 Relates #41536
This was referenced Aug 2, 2019
DaveCTurner added a commit that referenced this issue on Aug 2, 2019:
Today we recover a replica by copying operations from the primary's translog. However we also retain some historical operations in the index itself, as long as soft-deletes are enabled. This commit adjusts peer recovery to use the operations in the index for recovery rather than those in the translog, and ensures that the replication group retains enough history for use in peer recovery by means of retention leases. Reverts #38904 and #42211 Relates #41536 Backport of #45136 to 7.x.
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue on Aug 5, 2019:
Today if a shard is not fully allocated we maintain a retention lease for a lost peer for up to 12 hours, retaining all operations that occur in that time period so that we can recover this replica using an operations-based recovery if it returns. However it is not always reasonable to perform an operations-based recovery on such a replica: if the replica is a very long way behind the rest of the replication group then it can be much quicker to perform a file-based recovery instead. This commit introduces a notion of "reasonable" recoveries. If an operations-based recovery would involve copying only a small number of operations, but the index is large, then an operations-based recovery is reasonable; on the other hand if there are many operations to copy across and the index itself is relatively small then it makes more sense to perform a file-based recovery. We measure the size of the index by computing its number of documents (including deleted documents) in all segments belonging to the current safe commit, and compare this to the number of operations a lease is retaining below the local checkpoint of the safe commit. We consider an operations-based recovery to be reasonable iff it would involve replaying at most 10% of the documents in the index. The mechanism for this feature is to expire peer-recovery retention leases early if they are retaining so much history that an operations-based recovery using that lease would be unreasonable. Relates elastic#41536
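The heuristic can be sketched like this (hypothetical names; the 10% threshold is the figure stated above): compare the number of operations a lease would force us to replay against the total number of documents, including deletes, in the safe commit, and treat the ops-based recovery as reasonable only if the ratio is at most 10%.

```java
// Hypothetical sketch of the "reasonable recovery" heuristic: expire a peer-recovery
// retention lease early if replaying the operations it retains would cost more than a
// fixed fraction of a full file-based recovery, estimated via the safe commit's doc count.
final class ReasonableRecovery {
    static final double FILE_BASED_RECOVERY_THRESHOLD = 0.10; // replay at most 10% of the docs

    /**
     * @param opsRetainedBelowSafeCommit operations the lease retains below the safe commit's
     *                                   local checkpoint, i.e. what an ops-based recovery replays
     * @param docsInSafeCommit           documents (including deleted docs) in the safe commit
     */
    static boolean opsBasedRecoveryIsReasonable(long opsRetainedBelowSafeCommit,
                                                long docsInSafeCommit) {
        return opsRetainedBelowSafeCommit <= FILE_BASED_RECOVERY_THRESHOLD * docsInSafeCommit;
    }
}
```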
This was referenced Aug 5, 2019
original-brownbear pushed a commit to original-brownbear/elasticsearch that referenced this issue on Aug 8, 2019:
Today if a shard is not fully allocated we maintain a retention lease for a lost peer for up to 12 hours, retaining all operations that occur in that time period so that we can recover this replica using an operations-based recovery if it returns. However it is not always reasonable to perform an operations-based recovery on such a replica: if the replica is a very long way behind the rest of the replication group then it can be much quicker to perform a file-based recovery instead. This commit introduces a notion of "reasonable" recoveries. If an operations-based recovery would involve copying only a small number of operations, but the index is large, then an operations-based recovery is reasonable; on the other hand if there are many operations to copy across and the index itself is relatively small then it makes more sense to perform a file-based recovery. We measure the size of the index by computing its number of documents (including deleted documents) in all segments belonging to the current safe commit, and compare this to the number of operations a lease is retaining below the local checkpoint of the safe commit. We consider an operations-based recovery to be reasonable iff it would involve replaying at most 10% of the documents in the index. The mechanism for this feature is to expire peer-recovery retention leases early if they are retaining so much history that an operations-based recovery using that lease would be unreasonable. Relates elastic#41536
original-brownbear pushed a commit that referenced this issue on Aug 8, 2019:
Today if a shard is not fully allocated we maintain a retention lease for a lost peer for up to 12 hours, retaining all operations that occur in that time period so that we can recover this replica using an operations-based recovery if it returns. However it is not always reasonable to perform an operations-based recovery on such a replica: if the replica is a very long way behind the rest of the replication group then it can be much quicker to perform a file-based recovery instead. This commit introduces a notion of "reasonable" recoveries. If an operations-based recovery would involve copying only a small number of operations, but the index is large, then an operations-based recovery is reasonable; on the other hand if there are many operations to copy across and the index itself is relatively small then it makes more sense to perform a file-based recovery. We measure the size of the index by computing its number of documents (including deleted documents) in all segments belonging to the current safe commit, and compare this to the number of operations a lease is retaining below the local checkpoint of the safe commit. We consider an operations-based recovery to be reasonable iff it would involve replaying at most 10% of the documents in the index. The mechanism for this feature is to expire peer-recovery retention leases early if they are retaining so much history that an operations-based recovery using that lease would be unreasonable. Relates #41536
original-brownbear added a commit that referenced this issue on Aug 8, 2019:
Today if a shard is not fully allocated we maintain a retention lease for a lost peer for up to 12 hours, retaining all operations that occur in that time period so that we can recover this replica using an operations-based recovery if it returns. However it is not always reasonable to perform an operations-based recovery on such a replica: if the replica is a very long way behind the rest of the replication group then it can be much quicker to perform a file-based recovery instead. This commit introduces a notion of "reasonable" recoveries. If an operations-based recovery would involve copying only a small number of operations, but the index is large, then an operations-based recovery is reasonable; on the other hand if there are many operations to copy across and the index itself is relatively small then it makes more sense to perform a file-based recovery. We measure the size of the index by computing its number of documents (including deleted documents) in all segments belonging to the current safe commit, and compare this to the number of operations a lease is retaining below the local checkpoint of the safe commit. We consider an operations-based recovery to be reasonable iff it would involve replaying at most 10% of the documents in the index. The mechanism for this feature is to expire peer-recovery retention leases early if they are retaining so much history that an operations-based recovery using that lease would be unreasonable. Relates #41536
The only remaining item here is to update the
The goal is that we can perform an operations-based recovery for all "reasonable" shard copies C.

Reasonable shard copies comprise all the copies that are currently being tracked, as well as all the copies that "might be a recovery target": if the shard is not fully allocated then any copy that has been tracked in the last `index.soft_deletes.retention_lease.period` (i.e. `12h`) might reasonably be a recovery target.

We also require that history is eventually released: in a stable cluster, for every operation with seqno s below the MSN of a replication group, eventually there are no leases that retain s.
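One way to write down these two requirements (a hedged restatement, using assumed notation rather than anything from the issue itself: `retained(L)` is the lowest sequence number retained by lease L, `GCP(C)` is the persisted global checkpoint of copy C, and `MSN` is the replication group's maximum sequence number):

```latex
% Assumed notation, see the lead-in above.
\forall \text{ reasonable copy } C \;\; \exists \text{ lease } L :\;
    \mathrm{retained}(L) \le \mathrm{GCP}(C) + 1
\qquad\text{and}\qquad
\forall s < \mathrm{MSN} :\; \text{eventually } \nexists \text{ lease } L :\;
    \mathrm{retained}(L) \le s
```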
Concretely, this should ensure that operations-based recoveries are possible in the following cases (subject to the copy being allocated back to the same node), in particular for a copy whose node rejoins within the retention lease period (i.e. `12h`).

This breaks into a few conceptually-separate pieces:
- Adjust peer recovery to start by recovering the target using the local translog as far as (the local copy of) the global checkpoint (Use global checkpoint as starting seq in ops-based recovery #43463)
- Create peer recovery retention leases to retain the history needed by each shard (Create peer-recovery retention leases #43190, Add missing GCP update #43632)
- Lazily create retention leases for tracked shards whose leases don't exist because the primary was relocated from an older version. (Create missing PRRLs after primary activation #44009)
- Advance existing peer recovery retention leases according to the history information exposed by each shard copy. (Advance PRRLs to match GCP of tracked shards #43751, Prevent invalid renewals of PRRLs #43898)
- Make peer recovery work together with retention leases (Recover peers using history from Lucene #44853)
- Tests should randomly set the lease expiry time very low sometimes to ensure that everything still works if leases are expiring. (Randomise retention lease expiry time #44067)
- Discard translog more enthusiastically now that we don't need to retain it any more (Ignore translog retention policy if soft-deletes enabled #45473)
- Expire leases based on more than time: if a file-based recovery would clearly be cheaper than an ops-based recovery then we may as well throw a lease away (Only retain reasonable history for peer recoveries #45208)
Followup work, out of scope for the feature branches:

- Adjust translog retention
- Make the `ReplicaShardAllocator` sensitive to leases, so that it prefers to select a location for each replica that only needs an ops-based recovery. (relates Replica allocation consider no-op #42518)
- Seqno-based synced flush: if a copy has LCP == MSN then it needs no recovery. (relates Replica allocation consider no-op #42518; see the sketch after this list)
- BWC issues: during a rolling upgrade, we may migrate a primary onto a new node without first establishing the appropriate leases. They can't be established before or during this promotion, so we must weaken the assertions so that they only apply to sufficiently-newly-created indices. We will still establish leases properly during peer recovery, and can establish them lazily on older indices, but they may not retain all the right history when first created.
- Closed replicated indices issues: a closed index permits no replicated actions, but should not need any history to be retained. We cannot replay history into a closed index, so all recoveries must be file-based and there is no real need for leases; moreover any existing PRRLs will not be retaining any history. We cannot assert that all the copies of a replicated closed index have a corresponding lease without performing replicated write actions to create such leases as we create new replicas, nor can we assert that there are no leases on a replicated closed index, since again this would require replicated write actions. We elect to ignore PRRLs on closed indices: they might exist, but they might not, and either way is fine.
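As a rough illustration of the `ReplicaShardAllocator` and seqno-based synced flush items above (hypothetical names, not the real allocator API): a copy whose local checkpoint equals the max sequence number already holds every operation and needs no recovery at all, and otherwise a copy covered by a peer-recovery retention lease is preferable because it only needs an ops-based recovery.

```java
// Hypothetical sketch: rank candidate locations for a replica. A copy that has every
// operation (local checkpoint == max seq no) needs no recovery; a copy covered by a
// peer-recovery retention lease needs only a cheap ops-based recovery; anything else
// needs a full file-based recovery.
final class ReplicaTargetPreference {

    static int score(long localCheckpoint, long maxSeqNo, boolean coveredByLease) {
        if (localCheckpoint == maxSeqNo) {
            return 2;                  // nothing to replay: the seqno-based "synced flush" case
        }
        return coveredByLease ? 1 : 0; // lease present: ops-based recovery; else file-based
    }
}
```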