Retain history for peer recovery using leases #41536
Labels: :Distributed Indexing/Recovery (Anything around constructing a new shard, either from a local or a remote source), >enhancement, Meta, v7.4.0

Comments
DaveCTurner added the >enhancement, :Distributed Indexing/Recovery, Meta, and 7x labels on Apr 25, 2019.
Pinging @elastic/es-distributed
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue on Jun 13, 2019:
This creates a peer-recovery retention lease for every shard during recovery, ensuring that the replication group retains history for future peer recoveries. It also ensures that leases for active shard copies do not expire, and leases for inactive shard copies expire immediately if the shard is fully-allocated. Relates elastic#41536
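As a rough sketch of the expiry rule described in this commit (all names here are hypothetical, not taken from the Elasticsearch code base): a lease held for a tracked, active copy never expires; a lease for an untracked copy is dropped immediately once the shard is fully allocated, and otherwise survives until the normal retention period elapses.

```java
// Hypothetical sketch of the peer-recovery retention lease (PRRL) expiry rule.
// None of these names come from the Elasticsearch code base.
final class PeerRecoveryLeaseExpiry {

    /**
     * @param copyIsTracked       whether the shard copy holding the lease is an active, tracked copy
     * @param shardFullyAllocated whether every copy of the shard is currently assigned
     * @param leaseAgeMillis      time since the lease was last renewed
     * @param retentionMillis     the configured retention lease period (12h by default)
     */
    static boolean shouldExpire(boolean copyIsTracked,
                                boolean shardFullyAllocated,
                                long leaseAgeMillis,
                                long retentionMillis) {
        if (copyIsTracked) {
            return false;                        // active copies keep their lease indefinitely
        }
        if (shardFullyAllocated) {
            return true;                         // nothing is missing, so drop the stale lease now
        }
        return leaseAgeMillis > retentionMillis; // otherwise wait for the normal expiry time
    }
}
```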
DaveCTurner added a commit that referenced this issue on Jun 19, 2019:
This creates a peer-recovery retention lease for every shard during recovery, ensuring that the replication group retains history for future peer recoveries. It also ensures that leases for active shard copies do not expire, and leases for inactive shard copies expire immediately if the shard is fully-allocated. Relates #41536
DaveCTurner added a commit that referenced this issue on Jun 19, 2019:
This creates a peer-recovery retention lease for every shard during recovery, ensuring that the replication group retains history for future peer recoveries. It also ensures that leases for active shard copies do not expire, and leases for inactive shard copies expire immediately if the shard is fully-allocated. Relates #41536
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue on Jun 28, 2019:
This commit adjusts the behaviour of the retention lease sync to first renew any peer-recovery retention leases where either: - the corresponding shard's global checkpoint has advanced, or - the lease is older than half of its expiry time Relates elastic#41536
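The renewal condition can be sketched roughly as follows (hypothetical names, a sketch rather than the real sync code): renew a lease only if doing so would retain less history, or if the lease is old enough that skipping another cycle would risk expiry.

```java
// Hypothetical sketch of the renewal rule: renew a peer-recovery retention lease if the
// tracked copy's global checkpoint has advanced past the currently retained sequence
// number, or if the lease is older than half of its expiry time.
final class LeaseRenewalPolicy {

    static boolean shouldRenew(long retainedSeqNo,                // seq no currently retained by the lease
                               long persistedGlobalCheckpoint,    // persisted GCP of the corresponding copy
                               long leaseAgeMillis,
                               long expiryMillis) {
        final boolean checkpointAdvanced = persistedGlobalCheckpoint + 1 > retainedSeqNo;
        final boolean gettingOld = leaseAgeMillis > expiryMillis / 2;
        return checkpointAdvanced || gettingOld;
    }
}
```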
DaveCTurner added a commit that referenced this issue on Jul 1, 2019:
This commit adjusts the behaviour of the retention lease sync to first renew any peer-recovery retention leases where either: - the corresponding shard's global checkpoint has advanced, or - the lease is older than half of its expiry time Relates #41536
DaveCTurner added a commit that referenced this issue on Jul 1, 2019:
This commit adjusts the behaviour of the retention lease sync to first renew any peer-recovery retention leases where either: - the corresponding shard's global checkpoint has advanced, or - the lease is older than half of its expiry time Relates #41536
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue on Jul 3, 2019:
Today when a PRRL is created during peer recovery it retains history from the sequence number provided by the recovering peer. However this sequence number may be greater than the primary's knowledge of that peer's persisted global checkpoint. Subsequent renewals of this lease will attempt to set the retained sequence number back to the primary's knowledge of that peer's persisted global checkpoint tripping an assertion that retention leases must only advance. This commit accounts for this. Caught by [a failure of `RecoveryWhileUnderLoadIT.testRecoverWhileRelocating`](https://scans.gradle.com/s/wxccfrtfgjj3g/console-log?task=:server:integTest#L14) Relates elastic#41536
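The underlying constraint, that a retention lease may only ever advance, can be pictured with a minimal sketch (hypothetical class, assuming a renewal simply keeps the larger of the old and proposed values; the actual fix in the commit may differ).

```java
// Hypothetical illustration of the "retention leases must only advance" rule: a renewal
// never moves the retained sequence number backwards, even if the primary's current view
// of the peer's persisted global checkpoint is behind the value used at lease creation.
final class MonotonicLease {
    private long retainedSeqNo;

    MonotonicLease(long initialRetainedSeqNo) {
        this.retainedSeqNo = initialRetainedSeqNo;
    }

    void renew(long proposedRetainedSeqNo) {
        // clamp: the lease may advance but must never retreat
        retainedSeqNo = Math.max(retainedSeqNo, proposedRetainedSeqNo);
    }

    long retainedSeqNo() {
        return retainedSeqNo;
    }
}
```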
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue on Jul 3, 2019:
If the primary performs a file-based recovery to a node that has (or recently had) a copy of the shard then it is possible that the persisted global checkpoint of the new copy is behind that of the old copy since file-based recoveries are somewhat destructive operations. Today we leave that node's PRRL in place during the recovery with the expectation that it can be used by the new copy. However this isn't the case if the new copy needs more history to be retained, because retention leases may only advance and never retreat. This commit addresses this by removing any existing PRRL during a file-based recovery: since we are performing a file-based recovery we have already determined that there isn't enough history for an ops-based recovery, so there is little point in keeping the old lease in place. Caught by [a failure of `RecoveryWhileUnderLoadIT.testRecoverWhileRelocating`](https://scans.gradle.com/s/wxccfrtfgjj3g/console-log?task=:server:integTest#L14) Relates elastic#41536
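A hedged sketch of that decision (hypothetical names, not the real recovery planner): if the target's existing lease does not retain all the operations the new copy is missing, an ops-based recovery is impossible anyway, so there is little point keeping the stale lease around during the file-based recovery.

```java
// Hypothetical sketch of choosing the recovery mode and handling any pre-existing
// peer-recovery retention lease for the target node.
enum RecoveryMode { OPS_BASED, FILE_BASED }

final class RecoveryPlanner {

    /**
     * @param requiredStartingSeqNo      first operation the recovering copy is missing
     * @param existingLeaseRetainedSeqNo retained seq no of the target's old lease, or -1 if none
     */
    static RecoveryMode plan(long requiredStartingSeqNo, long existingLeaseRetainedSeqNo) {
        final boolean leaseCoversHistory =
            existingLeaseRetainedSeqNo >= 0 && existingLeaseRetainedSeqNo <= requiredStartingSeqNo;
        if (leaseCoversHistory) {
            return RecoveryMode.OPS_BASED;   // reuse the existing lease and replay operations
        }
        // Not enough retained history: drop the old lease (it can only advance, never retreat
        // to cover more history) and fall back to copying files.
        return RecoveryMode.FILE_BASED;
    }
}
```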
DaveCTurner added a commit that referenced this issue on Jul 4, 2019:
If the primary performs a file-based recovery to a node that has (or recently had) a copy of the shard then it is possible that the persisted global checkpoint of the new copy is behind that of the old copy since file-based recoveries are somewhat destructive operations. Today we leave that node's PRRL in place during the recovery with the expectation that it can be used by the new copy. However this isn't the case if the new copy needs more history to be retained, because retention leases may only advance and never retreat. This commit addresses this by removing any existing PRRL during a file-based recovery: since we are performing a file-based recovery we have already determined that there isn't enough history for an ops-based recovery, so there is little point in keeping the old lease in place. Caught by [a failure of `RecoveryWhileUnderLoadIT.testRecoverWhileRelocating`](https://scans.gradle.com/s/wxccfrtfgjj3g/console-log?task=:server:integTest#L14) Relates #41536
DaveCTurner added a commit that referenced this issue on Jul 4, 2019:
If the primary performs a file-based recovery to a node that has (or recently had) a copy of the shard then it is possible that the persisted global checkpoint of the new copy is behind that of the old copy since file-based recoveries are somewhat destructive operations. Today we leave that node's PRRL in place during the recovery with the expectation that it can be used by the new copy. However this isn't the case if the new copy needs more history to be retained, because retention leases may only advance and never retreat. This commit addresses this by removing any existing PRRL during a file-based recovery: since we are performing a file-based recovery we have already determined that there isn't enough history for an ops-based recovery, so there is little point in keeping the old lease in place. Caught by [a failure of `RecoveryWhileUnderLoadIT.testRecoverWhileRelocating`](https://scans.gradle.com/s/wxccfrtfgjj3g/console-log?task=:server:integTest#L14) Relates #41536
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue on Jul 5, 2019:
Today peer recovery retention leases (PRRLs) are created when starting a replication group from scratch and during peer recovery. However, if the replication group was migrated from nodes running a version which does not create PRRLs (e.g. 7.3 and earlier) then it's possible that the primary was relocated or promoted without first establishing all the expected leases. It's not possible to establish these leases before or during primary activation, so we must create them as soon as possible afterwards. This gives weaker guarantees about history retention, since there's a possibility that history will be discarded before it can be used. In practice such situations are expected to occur only rarely. This commit adds the machinery to create missing leases after primary activation, and strengthens the assertions about the existence of such leases in order to ensure that once all the leases do exist we never again enter a state where there's a missing lease. Relates elastic#41536
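A hedged sketch of that machinery (hypothetical names, not the actual Elasticsearch code): once the primary is activated, any tracked copy that has no peer-recovery retention lease gets one, retaining history from the primary's current knowledge of that copy's global checkpoint.

```java
// Hypothetical sketch: after primary activation, create any peer-recovery retention
// leases that are missing because the group was migrated from a pre-PRRL version.
import java.util.Map;
import java.util.Set;
import java.util.function.BiConsumer;

final class MissingLeaseCreator {

    /**
     * @param trackedCopies   global checkpoint known for each tracked shard copy (by allocation id)
     * @param copiesWithLease allocation ids that already hold a peer-recovery retention lease
     * @param createLease     callback establishing a lease retaining ops from the given seq no
     */
    static void createMissingLeases(Map<String, Long> trackedCopies,
                                    Set<String> copiesWithLease,
                                    BiConsumer<String, Long> createLease) {
        trackedCopies.forEach((allocationId, globalCheckpoint) -> {
            if (copiesWithLease.contains(allocationId) == false) {
                // retain everything above the copy's last-known global checkpoint
                createLease.accept(allocationId, globalCheckpoint + 1);
            }
        });
    }
}
```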
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue on Jul 5, 2019:
Today when renewing PRRLs we assert that any invalid "backwards" renewals must be because we are recovering the shard. In fact it's also possible to have `checkpointState.globalCheckpoint == SequenceNumbers.UNASSIGNED_SEQ_NO` on a tracked shard copy if the primary was just promoted and hasn't received checkpoints from all of its peers too. This commit weakens the assertion to match. Caught by a [failure of the full cluster restart tests](https://scans.gradle.com/s/5lllzgqtuegty/console-log#L8605) Relates elastic#41536
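The weakened assertion amounts to something like the following sketch (hypothetical helper; only the value of `SequenceNumbers.UNASSIGNED_SEQ_NO` is taken from the real code): a "backwards" renewal is only tolerated while the copy is recovering, or while the freshly promoted primary has not yet learned that copy's checkpoint.

```java
// Hypothetical sketch of the weakened assertion: a renewal that does not advance the lease
// is acceptable only if the corresponding copy is still recovering, or if the new primary
// has not yet received a global checkpoint from that copy.
final class RenewalAssertion {
    static final long UNASSIGNED_SEQ_NO = -2; // mirrors SequenceNumbers.UNASSIGNED_SEQ_NO

    static boolean backwardsRenewalIsExplained(boolean shardIsRecovering,
                                               long trackedGlobalCheckpoint) {
        return shardIsRecovering || trackedGlobalCheckpoint == UNASSIGNED_SEQ_NO;
    }
}
```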
DaveCTurner added a commit that referenced this issue on Jul 5, 2019:
Today when renewing PRRLs we assert that any invalid "backwards" renewals must be because we are recovering the shard. In fact it's also possible to have `checkpointState.globalCheckpoint == SequenceNumbers.UNASSIGNED_SEQ_NO` on a tracked shard copy if the primary was just promoted and hasn't received checkpoints from all of its peers too. This commit weakens the assertion to match. Caught by a [failure of the full cluster restart tests](https://scans.gradle.com/s/5lllzgqtuegty/console-log#L8605) Relates #41536
DaveCTurner added a commit that referenced this issue on Jul 5, 2019:
Today when renewing PRRLs we assert that any invalid "backwards" renewals must be because we are recovering the shard. In fact it's also possible to have `checkpointState.globalCheckpoint == SequenceNumbers.UNASSIGNED_SEQ_NO` on a tracked shard copy if the primary was just promoted and hasn't received checkpoints from all of its peers too. This commit weakens the assertion to match. Caught by a [failure of the full cluster restart tests](https://scans.gradle.com/s/5lllzgqtuegty/console-log#L8605) Relates #41536
DaveCTurner added a commit that referenced this issue on Jul 8, 2019:
Today peer recovery retention leases (PRRLs) are created when starting a replication group from scratch and during peer recovery. However, if the replication group was migrated from nodes running a version which does not create PRRLs (e.g. 7.3 and earlier) then it's possible that the primary was relocated or promoted without first establishing all the expected leases. It's not possible to establish these leases before or during primary activation, so we must create them as soon as possible afterwards. This gives weaker guarantees about history retention, since there's a possibility that history will be discarded before it can be used. In practice such situations are expected to occur only rarely. This commit adds the machinery to create missing leases after primary activation, and strengthens the assertions about the existence of such leases in order to ensure that once all the leases do exist we never again enter a state where there's a missing lease. Relates #41536
DaveCTurner added a commit that referenced this issue on Jul 8, 2019:
Today peer recovery retention leases (PRRLs) are created when starting a replication group from scratch and during peer recovery. However, if the replication group was migrated from nodes running a version which does not create PRRLs (e.g. 7.3 and earlier) then it's possible that the primary was relocated or promoted without first establishing all the expected leases. It's not possible to establish these leases before or during primary activation, so we must create them as soon as possible afterwards. This gives weaker guarantees about history retention, since there's a possibility that history will be discarded before it can be used. In practice such situations are expected to occur only rarely. This commit adds the machinery to create missing leases after primary activation, and strengthens the assertions about the existence of such leases in order to ensure that once all the leases do exist we never again enter a state where there's a missing lease. Relates #41536
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue on Jul 25, 2019:
Thanks to peer recovery retention leases we now retain the history needed to perform peer recoveries from the index instead of from the translog. This commit adjusts the peer recovery process to do so, and also adjusts it to use the existence of a retention lease to decide whether or not to attempt an operations-based recovery. Reverts elastic#38904 and elastic#42211 Relates elastic#41536
This was referenced Jul 25, 2019
DaveCTurner added a commit that referenced this issue on Aug 1, 2019:
Thanks to peer recovery retention leases we now retain the history needed to perform peer recoveries from the index instead of from the translog. This commit adjusts the peer recovery process to do so, and also adjusts it to use the existence of a retention lease to decide whether or not to attempt an operations-based recovery. Reverts #38904 and #42211 Relates #41536
DaveCTurner added a commit that referenced this issue on Aug 1, 2019:
Thanks to peer recovery retention leases we now retain the history needed to perform peer recoveries from the index instead of from the translog. This commit adjusts the peer recovery process to do so, and also adjusts it to use the existence of a retention lease to decide whether or not to attempt an operations-based recovery. Reverts #38904 and #42211 Relates #41536
This was referenced Aug 2, 2019
DaveCTurner added a commit that referenced this issue on Aug 2, 2019:
Today we recover a replica by copying operations from the primary's translog. However we also retain some historical operations in the index itself, as long as soft-deletes are enabled. This commit adjusts peer recovery to use the operations in the index for recovery rather than those in the translog, and ensures that the replication group retains enough history for use in peer recovery by means of retention leases. Reverts #38904 and #42211 Relates #41536 Backport of #45136 to 7.x.
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue on Aug 5, 2019:
Today if a shard is not fully allocated we maintain a retention lease for a lost peer for up to 12 hours, retaining all operations that occur in that time period so that we can recover this replica using an operations-based recovery if it returns. However it is not always reasonable to perform an operations-based recovery on such a replica: if the replica is a very long way behind the rest of the replication group then it can be much quicker to perform a file-based recovery instead. This commit introduces a notion of "reasonable" recoveries. If an operations-based recovery would involve copying only a small number of operations, but the index is large, then an operations-based recovery is reasonable; on the other hand if there are many operations to copy across and the index itself is relatively small then it makes more sense to perform a file-based recovery. We measure the size of the index by computing its number of documents (including deleted documents) in all segments belonging to the current safe commit, and compare this to the number of operations a lease is retaining below the local checkpoint of the safe commit. We consider an operations-based recovery to be reasonable iff it would involve replaying at most 10% of the documents in the index. The mechanism for this feature is to expire peer-recovery retention leases early if they are retaining so much history that an operations-based recovery using that lease would be unreasonable. Relates elastic#41536
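The heuristic can be sketched like this (hypothetical names; the 10% threshold is the figure stated above): compare the number of operations a lease would force us to replay against the total number of documents, including deletes, in the safe commit, and treat the ops-based recovery as reasonable only if the ratio is at most 10%.

```java
// Hypothetical sketch of the "reasonable recovery" heuristic: expire a peer-recovery
// retention lease early if replaying the operations it retains would cost more than a
// fixed fraction of a full file-based recovery, estimated via the safe commit's doc count.
final class ReasonableRecovery {
    static final double FILE_BASED_RECOVERY_THRESHOLD = 0.10; // replay at most 10% of the docs

    /**
     * @param opsRetainedBelowSafeCommit operations the lease retains below the safe commit's
     *                                   local checkpoint, i.e. what an ops-based recovery replays
     * @param docsInSafeCommit           documents (including deleted docs) in the safe commit
     */
    static boolean opsBasedRecoveryIsReasonable(long opsRetainedBelowSafeCommit,
                                                long docsInSafeCommit) {
        return opsRetainedBelowSafeCommit <= FILE_BASED_RECOVERY_THRESHOLD * docsInSafeCommit;
    }
}
```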
This was referenced Aug 5, 2019
original-brownbear pushed a commit to original-brownbear/elasticsearch that referenced this issue on Aug 8, 2019:
Today if a shard is not fully allocated we maintain a retention lease for a lost peer for up to 12 hours, retaining all operations that occur in that time period so that we can recover this replica using an operations-based recovery if it returns. However it is not always reasonable to perform an operations-based recovery on such a replica: if the replica is a very long way behind the rest of the replication group then it can be much quicker to perform a file-based recovery instead. This commit introduces a notion of "reasonable" recoveries. If an operations-based recovery would involve copying only a small number of operations, but the index is large, then an operations-based recovery is reasonable; on the other hand if there are many operations to copy across and the index itself is relatively small then it makes more sense to perform a file-based recovery. We measure the size of the index by computing its number of documents (including deleted documents) in all segments belonging to the current safe commit, and compare this to the number of operations a lease is retaining below the local checkpoint of the safe commit. We consider an operations-based recovery to be reasonable iff it would involve replaying at most 10% of the documents in the index. The mechanism for this feature is to expire peer-recovery retention leases early if they are retaining so much history that an operations-based recovery using that lease would be unreasonable. Relates elastic#41536
original-brownbear pushed a commit that referenced this issue on Aug 8, 2019:
Today if a shard is not fully allocated we maintain a retention lease for a lost peer for up to 12 hours, retaining all operations that occur in that time period so that we can recover this replica using an operations-based recovery if it returns. However it is not always reasonable to perform an operations-based recovery on such a replica: if the replica is a very long way behind the rest of the replication group then it can be much quicker to perform a file-based recovery instead. This commit introduces a notion of "reasonable" recoveries. If an operations-based recovery would involve copying only a small number of operations, but the index is large, then an operations-based recovery is reasonable; on the other hand if there are many operations to copy across and the index itself is relatively small then it makes more sense to perform a file-based recovery. We measure the size of the index by computing its number of documents (including deleted documents) in all segments belonging to the current safe commit, and compare this to the number of operations a lease is retaining below the local checkpoint of the safe commit. We consider an operations-based recovery to be reasonable iff it would involve replaying at most 10% of the documents in the index. The mechanism for this feature is to expire peer-recovery retention leases early if they are retaining so much history that an operations-based recovery using that lease would be unreasonable. Relates #41536
original-brownbear added a commit that referenced this issue on Aug 8, 2019:
Today if a shard is not fully allocated we maintain a retention lease for a lost peer for up to 12 hours, retaining all operations that occur in that time period so that we can recover this replica using an operations-based recovery if it returns. However it is not always reasonable to perform an operations-based recovery on such a replica: if the replica is a very long way behind the rest of the replication group then it can be much quicker to perform a file-based recovery instead. This commit introduces a notion of "reasonable" recoveries. If an operations-based recovery would involve copying only a small number of operations, but the index is large, then an operations-based recovery is reasonable; on the other hand if there are many operations to copy across and the index itself is relatively small then it makes more sense to perform a file-based recovery. We measure the size of the index by computing its number of documents (including deleted documents) in all segments belonging to the current safe commit, and compare this to the number of operations a lease is retaining below the local checkpoint of the safe commit. We consider an operations-based recovery to be reasonable iff it would involve replaying at most 10% of the documents in the index. The mechanism for this feature is to expire peer-recovery retention leases early if they are retaining so much history that an operations-based recovery using that lease would be unreasonable. Relates #41536
The only remaining item here is to update the
The goal is that we can perform an operations-based recovery for all "reasonable" shard copies C.

Reasonable shard copies comprise all the copies that are currently being tracked, as well as all the copies that "might be a recovery target": if the shard is not fully allocated then any copy that has been tracked in the last `index.soft_deletes.retention_lease.period` (i.e. `12h`) might reasonably be a recovery target.

We also require that history is eventually released: in a stable cluster, for every operation with seqno s below the MSN of a replication group, eventually there are no leases that retain s.
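One way to write down these two requirements (a hedged restatement, using assumed notation rather than anything from the issue itself: `retained(L)` is the lowest sequence number retained by lease L, `GCP(C)` is the persisted global checkpoint of copy C, and `MSN` is the replication group's maximum sequence number):

```latex
% Assumed notation, see the lead-in above.
\forall \text{ reasonable copy } C \;\; \exists \text{ lease } L :\;
    \mathrm{retained}(L) \le \mathrm{GCP}(C) + 1
\qquad\text{and}\qquad
\forall s < \mathrm{MSN} :\; \text{eventually } \nexists \text{ lease } L :\;
    \mathrm{retained}(L) \le s
```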
Concretely, this should ensure that operations-based recoveries are possible in the following cases (subject to the copy being allocated back to the same node), in particular for a copy whose node rejoins within the retention lease period (i.e. `12h`).

This breaks into a few conceptually-separate pieces:
- Adjust peer recovery to start by recovering the target using the local translog as far as (the local copy of) the global checkpoint (Use global checkpoint as starting seq in ops-based recovery #43463)
- Create peer recovery retention leases to retain the history needed by each shard (Create peer-recovery retention leases #43190, Add missing GCP update #43632)
- Lazily create retention leases for tracked shards whose leases don't exist because the primary was relocated from an older version. (Create missing PRRLs after primary activation #44009)
- Advance existing peer recovery retention leases according to the history information exposed by each shard copy. (Advance PRRLs to match GCP of tracked shards #43751, Prevent invalid renewals of PRRLs #43898)
- Make peer recovery work together with retention leases (Recover peers using history from Lucene #44853)
- Tests should randomly set the lease expiry time very low sometimes to ensure that everything still works if leases are expiring. (Randomise retention lease expiry time #44067)
- Discard translog more enthusiastically now that we don't need to retain it any more (Ignore translog retention policy if soft-deletes enabled #45473)
- Expire leases based on more than time: if a file-based recovery would clearly be cheaper than an ops-based recovery then we may as well throw a lease away (Only retain reasonable history for peer recoveries #45208)
Followup work, out of scope for the feature branches:

- Adjust translog retention
- Make the `ReplicaShardAllocator` sensitive to leases, so that it prefers to select a location for each replica that only needs an ops-based recovery. (relates Replica allocation consider no-op #42518)
- Seqno-based synced flush: if a copy has LCP == MSN then it needs no recovery. (relates Replica allocation consider no-op #42518; see the sketch after this list)
- BWC issues: during a rolling upgrade, we may migrate a primary onto a new node without first establishing the appropriate leases. They can't be established before or during this promotion, so we must weaken the assertions so that they only apply to sufficiently-newly-created indices. We will still establish leases properly during peer recovery, and can establish them lazily on older indices, but they may not retain all the right history when first created.
- Closed replicated indices issues: a closed index permits no replicated actions, but should not need any history to be retained. We cannot replay history into a closed index, so all recoveries must be file-based and there is no real need for leases; moreover any existing PRRLs will not be retaining any history. We cannot assert that all the copies of a replicated closed index have a corresponding lease without performing replicated write actions to create such leases as we create new replicas, nor can we assert that there are no leases on a replicated closed index, since again this would require replicated write actions. We elect to ignore PRRLs on closed indices: they might exist, but they might not, and either way is fine.
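As a rough illustration of the `ReplicaShardAllocator` and seqno-based synced flush items above (hypothetical names, not the real allocator API): a copy whose local checkpoint equals the max sequence number already holds every operation and needs no recovery at all, and otherwise a copy covered by a peer-recovery retention lease is preferable because it only needs an ops-based recovery.

```java
// Hypothetical sketch: rank candidate locations for a replica. A copy that has every
// operation (local checkpoint == max seq no) needs no recovery; a copy covered by a
// peer-recovery retention lease needs only a cheap ops-based recovery; anything else
// needs a full file-based recovery.
final class ReplicaTargetPreference {

    static int score(long localCheckpoint, long maxSeqNo, boolean coveredByLease) {
        if (localCheckpoint == maxSeqNo) {
            return 2;                  // nothing to replay: the seqno-based "synced flush" case
        }
        return coveredByLease ? 1 : 0; // lease present: ops-based recovery; else file-based
    }
}
```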