Replica recovery on follower can fail after recovery from remote #39000

jasontedor · 2019-02-16T16:17:43Z

Today during peer recovery, we replay operations from the translog of the primary to the replica. This is to build a history of operations on the replica, in case that it would become a primary. Unfortunately, this is currently fatal for peer recovery of a follower shard after the primary shard has recovered remotely.

With recovery from remote, we copy over the index files and there is no translog replay phase. Assume the simple case of a newly created follower doing its initial recovery from the leader. If the leader does a flush before the follower initiates recovery from remote, after recovery from remote the follower will be fully caught up all primary shards of the follower will have empty translogs. When a replica shard of the follower attempts to recover from the primary shard of the follower, we want to replay translog from the local checkpoint of the commit, to bake a history of operations for it. Since the primary shards of the follower have empty translogs, this replay can not happen and recovery will fail.

Immediately I see two possibilities:

InternalEngine#readHistoryOperations could serve operations from the Lucene history instead of the translog; however, this interplays with the need for shard history retention leases to interplay with recovery to ensure that these operations exist
the primary shards of the follower could build their own translog from Lucene history after remotely recovering files

The purpose of this issue is to tracker this as a blocker for all versions of CCR that support recovery from remote, and to initiate discussion on these options (and others).

Relates #38633

The text was updated successfully, but these errors were encountered:

elasticmachine · 2019-02-16T16:17:44Z

Pinging @elastic/es-distributed

dnhatn · 2019-02-16T17:55:38Z

Relates #38949

ywelsch · 2019-02-17T02:55:21Z

Our replication model (and peer recovery) relies on a few concepts, among which:

there is always a safe commit (i.e. global checkpoint >= max seq no of the Lucene commit)
local checkpoint equals max sequence number on primary if all ongoing replication tasks complete in the replication group (typically bounded in time and under control of the primary).

Follower indices currently violate these two assumptions:

recovery from remote does not create a safe commit (the global checkpoint in the follower index's replication group == local checkpoint of the index commit that it copied over, which can be < max seq no of the index commit that was copied over). This is a problem for peer recovery (and possibly replica rollbacks) which rely on the existence of a safe commit.
primaries of follower indices are not in control of bringing their local checkpoint up to the max sequence number due to out-of-order replication from the leader index and the possibility of following being paused or leader index being unavailable. This is again a problem for peer recovery which assumes that the local checkpoint can catch up to the max sequence number by just waiting for ongoing ops to complete (This issue also existed already before recovery from remote AFAICS). So what's the reason for peer recovery to wait on this condition (local checkpoint = max seq no)? Peer recovery adds the new replica as tracking and waits for local checkpoint of primary to catch up to max seq no (captured in endingSeqNo) in order to ensure that all ongoing operations up to max seq no have been replicated to the replica. This is to ensure that all future replication requests (>endingSeqNo) will go to the new replica (which is now tracked). Similarly, all past operations (<= endingSeqNo) are then replayed in phase 2.
Note that this is the same type of issue we were having with the new close index API, that was built on the same assumption (max seq no = local checkpoint = global checkpoint once indexing activity in replication group is stopped) which no longer holds true for follower indices.

I think that we should try to keep assumption 1, even for follower indices, but relax assumption 2 in our replication model (at least for nor now, as long as we have out-of-order replication):

We should ensure that follower indices have a safe commit: recovery from remote should at least replay operations from the local checkpoint of the "to be safe commit" up to max seq of the "to be safe commit" in order to make it a safe commit before starting the primary shard. Note that we can decide on streaming more history here (above max seq no of the commit) if that helps with retention leases. The above is the minimum to adhere to the cluster-local replication group semantics, however.
adapt peer recovery so that it does not need the "local checkpoint = max seq no" condition and can therefore accomodate follower indices. This is a bit more tricky AFAICS. It requires having the notion of "all pending ops that have not taken the new replica into account have completed". One way of doing this is to introduce the notion of a replication group version (local to primary) in order to track whether there are outstanding replication requests for a previous version. Initiating tracking (ReplicationTracker.initiateTracking) would increment the version, and all replication operations would have to register with the replication tracker first (with a releasable) before writing to the primary so that we can track outstanding replication requests for older replication group versions and wait for their completion. Peer recovery would then use the following steps after phase 1:
- initiate tracking (returns incremented replication group version)
- wait for all outstanding requests associated with older replication group versions to complete. This means that all newly indexed operations from that point on will be sent to replica. Note that this here would replace the older condition where we waited on local checkpoint to move up to the previous max seq no.
- initiate phase 2, and capture a snapshot that contains all ops (starting from >= startingSeqNo) that have so far been indexed into primary and replay those operations to the replica. This range of operations CAN contain gaps for the parts above the primary's local checkpoint. It's still important that we replay the ops above the primary's local checkpoint in order to fully align primary and replica.
- mark the new shard copy as in-sync (just as before), which stops the global checkpoint from advancing until the local checkpoint of the new replica has caught up to the current global checkpoint. Even if max seq != local checkpoint, this can be guaranteed to complete as long as replica has the same writes as primary (ensured by previous steps).

DaveCTurner · 2019-02-18T05:16:27Z

One way of doing this is to introduce the notion of a replication group version (local to primary) in order to track whether there are outstanding replication requests for a previous version.

Could we also do this by briefly obtaining an operation block between initiating tracking and capturing the phase 2 snapshot, if the primary is a CCR follower(*)? The versioned replication group proposal has the advantage of not blocking operations, but I'm not sure we need to avoid this.

(*) Edit: I think we could also wait for the checkpoints to align for a reasonable length of time first and only obtain this block on timeout.

dnhatn · 2019-02-19T01:28:48Z

We should ensure that follower indices have a safe commit: recovery from remote should at least replay operations from the local checkpoint of the "to be safe commit" up to max seq of the "to be safe commit" in order to make it a safe commit before starting the primary shard.

I am working on this.

ywelsch · 2019-02-19T06:48:13Z

Could we also do this by briefly obtaining an operation block between initiating tracking and capturing the phase 2 snapshot, if the primary is a CCR follower(*)? The versioned replication group proposal has the advantage of not blocking operations, but I'm not sure we need to avoid this.

yes, that would also work. Blocking indexing should be avoided as much as possible though, and I would also like to avoid having CCR specific recovery code. I have given this a few more thoughts and think that another simpler solution is possible.

The replication group to be used is currently sampled after indexing into the primary (see ReplicationOperation class). This means that when initiating tracking of a new replica, we have to consider the following two cases:

there are operations for which the replication group has not been sampled yet. As we initiated the new replica as tracking, we know that those operations will be replicated to the new replica and follow the typical replication group semantics (e.g. marked as stale when unavailable).
there are operations for which the replication group has already been sampled. These operations will not be sent to the new replica. However, we know that those operations are already indexed into Lucene and the translog on the primary, as the sampling is happening after that. This means that by taking a snapshot of Lucene or the translog, we will be getting those ops as well. What we cannot guarantee anymore is that all ops up to endingSeqNo are available in the snapshot (i.e. also see comment in RecoverySourceHandler saying We need to wait for all operations up to the current max to complete, otherwise we can not guarantee that all operations in the required range will be available for replaying from the translog of the source.). This is not needed, though, as we can no longer guarantee that max seq no == local checkpoint.

I think we can therefore just remove cancellableThreads.execute(() -> shard.waitForOpsToComplete(endingSeqNo)); and the range completeness check (if (requiredOpsTracker.getCheckpoint() < endingSeqNo) {) in RecoverySourceHandler. These were added in #27580 as an extra safeguard for dealing with 5.x BWC challenges. We could in theory keep them around if the primary shard is fully in charge of the seq no generator (on follower shards, it's not) but I'm not sure it's worth the complexity. Our tests should already check that shard copies are fully aligned after recovering (this was not the case previously because of missing primary-replica resync).

DaveCTurner · 2019-02-19T12:13:10Z

That's a pretty subtle interplay of different components, but it does make sense.

dnhatn · 2019-02-19T22:17:28Z

I think we can therefore just remove cancellableThreads.execute(() -> shard.waitForOpsToComplete(endingSeqNo)); and the range completeness check (if (requiredOpsTracker.getCheckpoint() < endingSeqNo) {) in RecoverySourceHandler

@ywelsch This is brilliant 🎉

With this change, we won't wait for the local checkpoint to advance to the max_seq_no before starting phase2 of peer-recovery. We also remove the sequence number range check in peer-recovery. We can safely do these thanks to Yannick's finding. The replication group to be used is currently sampled after indexing into the primary (see `ReplicationOperation` class). This means that when initiating tracking of a new replica, we have to consider the following two cases: - There are operations for which the replication group has not been sampled yet. As we initiated the new replica as tracking, we know that those operations will be replicated to the new replica and follow the typical replication group semantics (e.g. marked as stale when unavailable). - There are operations for which the replication group has already been sampled. These operations will not be sent to the new replica. However, we know that those operations are already indexed into Lucene and the translog on the primary, as the sampling is happening after that. This means that by taking a snapshot of Lucene or the translog, we will be getting those ops as well. What we cannot guarantee anymore is that all ops up to `endingSeqNo` are available in the snapshot (i.e. also see comment in `RecoverySourceHandler` saying `We need to wait for all operations up to the current max to complete, otherwise we can not guarantee that all operations in the required range will be available for replaying from the translog of the source.`). This is not needed, though, as we can no longer guarantee that max seq no == local checkpoint. Relates #39000 Closes #38949 Co-authored-by: Yannick Welsch <yannick@welsch.lu>

dnhatn · 2019-02-25T19:47:49Z

I am closing this issue for it was resolved by #39006. The safe-commit part is considered as an enhancement and will be handled by #39153.

jasontedor added blocker v7.0.0 :Distributed Indexing/CCR Issues around the Cross Cluster State Replication features v6.7.0 v8.0.0 v7.2.0 labels Feb 16, 2019

dnhatn self-assigned this Feb 16, 2019

dnhatn mentioned this issue Feb 17, 2019

Do not wait for advancement of checkpoint in recovery #39006

Merged

dnhatn mentioned this issue Feb 19, 2019

Replay history of operations in remote recovery #39153

Closed

danielmitterdorfer added the >bug label Feb 21, 2019

dnhatn mentioned this issue Feb 22, 2019

FsBlobStoreRepositoryIT#testSnapshotRestore fails reproducibly #39299

Closed

dnhatn closed this as completed Feb 25, 2019

jakelandis added v7.0.0-rc2 and removed v7.0.0 labels Apr 3, 2019

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replica recovery on follower can fail after recovery from remote #39000

Replica recovery on follower can fail after recovery from remote #39000

jasontedor commented Feb 16, 2019

elasticmachine commented Feb 16, 2019

dnhatn commented Feb 16, 2019

ywelsch commented Feb 17, 2019

DaveCTurner commented Feb 18, 2019 •

edited

Loading

dnhatn commented Feb 19, 2019

ywelsch commented Feb 19, 2019

DaveCTurner commented Feb 19, 2019

dnhatn commented Feb 19, 2019

dnhatn commented Feb 25, 2019

Replica recovery on follower can fail after recovery from remote #39000

Replica recovery on follower can fail after recovery from remote #39000

Comments

jasontedor commented Feb 16, 2019

elasticmachine commented Feb 16, 2019

dnhatn commented Feb 16, 2019

ywelsch commented Feb 17, 2019

DaveCTurner commented Feb 18, 2019 • edited Loading

dnhatn commented Feb 19, 2019

ywelsch commented Feb 19, 2019

DaveCTurner commented Feb 19, 2019

dnhatn commented Feb 19, 2019

dnhatn commented Feb 25, 2019

DaveCTurner commented Feb 18, 2019 •

edited

Loading