Indexing during primary relocation with ongoing replica recoveries can lead to documents not being properly replicated #19248
Labels: :Distributed Indexing/Recovery, resiliency
Primary relocation violates two invariants that ensure proper interaction between document replication and peer recoveries, ultimately leading to documents not being properly replicated. As this is quite a tricky issue to understand, I'll first give a short summary of how document replication and peer recoveries integrate:
The following two invariants are (among others) required for data replication to properly integrate with peer recoveries:
Invariant 1: Document writes must be replicated based on the routing table of a cluster state that includes all shards which have ongoing or finished recoveries. This is ensured by the fact that we do not start a recovery that is not reflected in the cluster state available on the primary node, and that we always sample a fresh cluster state before starting to replicate write operations.
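To make invariant 1 concrete, here is a minimal, self-contained sketch; the types and names are hypothetical stand-ins, not the actual Elasticsearch classes. It models a primary that samples the newest known cluster state for every replicated write, under the assumption that a recovery only starts after it is visible in that state:

```java
import java.util.List;

public class Invariant1Sketch {
    // Hypothetical stand-in for a cluster state with its routing table.
    record ClusterState(long version, List<String> replicas) {}

    private volatile ClusterState latest; // updated by cluster state processing

    Invariant1Sketch(ClusterState initial) {
        latest = initial;
    }

    void applyClusterState(ClusterState state) {
        if (state.version() > latest.version()) {
            latest = state;
        }
    }

    // Sample the freshest cluster state per write, so every replica whose
    // recovery has started (and is therefore in the state) is included.
    void replicate(String op) {
        ClusterState sampled = latest;
        for (String replica : sampled.replicas()) {
            System.out.println("op [" + op + "] -> " + replica + " (state v" + sampled.version() + ")");
        }
    }

    public static void main(String[] args) {
        Invariant1Sketch primary = new Invariant1Sketch(new ClusterState(1, List.of("replica-0")));
        primary.replicate("doc-1");   // only replica-0 is addressed
        // The new recovery becomes visible in the cluster state before it starts:
        primary.applyClusterState(new ClusterState(2, List.of("replica-0", "replica-1")));
        primary.replicate("doc-2");   // replica-1 now receives the op as well
    }
}
```

The point of the per-operation sample is that the routing table used for replication can never be older than the state in which a recovery first appeared.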
Invariant 2: Every operation that is not part of the snapshot taken for phase 2 must be successfully indexed on the target replica (pending shard-level errors, which will cause the target shard to be failed). To ensure this, we start replicating to the target shard as soon as the recovery starts and open its engine before we take the snapshot. All operations that are indexed after the snapshot was taken are guaranteed to arrive at the shard when it is ready to index them. Note that this also means that replication does not fail a shard that is not yet ready to receive operations; that is a normal state for a recovering shard.
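The ordering that invariant 2 relies on can be sketched as follows; all names are hypothetical, and this is a simplification built on the assumption that every operation is captured either by the translog snapshot or by live replication:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class Invariant2Sketch {
    static final List<Long> sourceTranslog = new ArrayList<>();
    static final TreeSet<Long> targetOps = new TreeSet<>();
    static boolean replicatingToTarget = false;
    static boolean targetEngineOpen = false;

    // Hypothetical primary-side indexing: the op always goes into the source
    // translog; it is only indexed on the target if the target's engine is
    // open (an op hitting a closed engine is tolerated, not a shard failure).
    static void indexOnPrimary(long op) {
        sourceTranslog.add(op);
        if (replicatingToTarget && targetEngineOpen) {
            targetOps.add(op);
        }
    }

    public static void main(String[] args) {
        indexOnPrimary(1);                                  // before recovery: covered by the snapshot
        replicatingToTarget = true;                         // 1. start replicating to the target
        targetEngineOpen = true;                            // 2. open the target engine
        List<Long> snapshot = List.copyOf(sourceTranslog);  // 3. take the translog snapshot
        indexOnPrimary(2);                                  // after snapshot: covered by live replication
        targetOps.addAll(snapshot);                         // 4. replay the snapshot on the target
        System.out.println("target has every op: " + targetOps.containsAll(sourceTranslog));
    }
}
```

Because the engine is opened before the snapshot is taken, any operation indexed after the snapshot finds an open engine and is applied live, so the two delivery paths leave no gap.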
With primary relocation, both invariants can be violated. To illustrate the issues, consider a primary that is relocating while another replica shard is recovering from it.
Invariant 1 can be violated if the target of the primary relocation lags so far behind in cluster state processing that it does not even know about the new initializing replica. This is very rare in practice, as replica recoveries take time to copy all the index files, but it is a theoretical gap that surfaces in testing scenarios.
Invariant 2 can be violated even if the target primary knows about the initializing replica. This can happen if the target primary replicates an operation to the initializing shard and that operation arrives at the initializing shard before it opens its engine, but arrives at the primary relocation source only after it has taken the snapshot of the translog. Such operations are currently missed on the new initializing replica.
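A minimal sketch of this race, again with hypothetical names: op 2 reaches the initializing replica while its engine is still closed (so it is dropped), yet lands in the relocation source's translog only after the snapshot was taken, so neither delivery path covers it:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RelocationRaceSketch {
    public static void main(String[] args) {
        List<Long> sourceTranslog = new ArrayList<>();
        Set<Long> replicaOps = new HashSet<>();
        boolean replicaEngineOpen = false;

        sourceTranslog.add(1L);                             // indexed before the snapshot
        List<Long> snapshot = List.copyOf(sourceTranslog);  // recovery takes the snapshot

        // Op 2 is replicated by the *target* primary: it reaches the
        // initializing replica before its engine is open (dropped) ...
        long op = 2L;
        if (replicaEngineOpen) {
            replicaOps.add(op);
        }
        // ... and reaches the relocation source only after the snapshot.
        sourceTranslog.add(op);

        replicaEngineOpen = true;
        replicaOps.addAll(snapshot);                        // replay: op 2 is absent

        System.out.println("ops missing on the replica: "
                + sourceTranslog.stream().filter(o -> !replicaOps.contains(o)).toList());
    }
}
```

With a single primary, this interleaving is impossible (the engine opens before the snapshot is taken); it is the split between relocation source and target that opens the window.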
The obvious easy fix would be to forbid any replica recovery while the primary is relocating. However, since primary relocation can take a long time (we do it in the background and throttle it), this would leave a large time window in which the cluster cannot recover from a potential replica loss (whether due to a network hiccup or a true node loss).
We are currently working on a fix along two directions. One of them is to route all operations via the target primary once the relocation source has been marked as RELOCATED. Since we now guarantee that no operations are in flight while the hand-off happens (Primary relocation handoff #15900), we know that from the moment operations are routed via the target primary, no new snapshots will be taken, which removes the premise for violating invariant 2.
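As a rough illustration of that guarantee, here is a hypothetical sketch (not the actual implementation) of a hand-off that blocks new operations and drains in-flight ones before marking the source as RELOCATED:

```java
import java.util.concurrent.Semaphore;

public class HandoffSketch {
    private static final int MAX_IN_FLIGHT = 1000;
    private final Semaphore inFlight = new Semaphore(MAX_IN_FLIGHT);
    private volatile boolean relocated = false;

    void indexOperation(Runnable op) throws InterruptedException {
        inFlight.acquire(); // one permit per in-flight operation
        try {
            if (relocated) {
                System.out.println("rerouted: operation goes via the target primary");
                return;
            }
            op.run(); // executed on the relocation source
        } finally {
            inFlight.release();
        }
    }

    void handOff() throws InterruptedException {
        // Acquiring all permits blocks until no operation is in flight and
        // keeps new operations from starting on the source during the hand-off.
        inFlight.acquire(MAX_IN_FLIGHT);
        try {
            relocated = true; // the source is now RELOCATED
        } finally {
            inFlight.release(MAX_IN_FLIGHT);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        HandoffSketch shard = new HandoffSketch();
        shard.indexOperation(() -> System.out.println("indexed on the source"));
        shard.handOff();
        shard.indexOperation(() -> System.out.println("never runs on the source"));
    }
}
```

The real hand-off is more involved, but the key property is the same: once handOff() returns, no operation can execute on the source anymore, so no translog snapshot taken there can race with writes routed via the target.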