Indexing during primary relocation with ongoing replica recoveries can lead to documents not being properly replicated #19248

Closed
ywelsch opened this issue Jul 4, 2016 · 0 comments

ywelsch commented Jul 4, 2016

Primary relocation violates two invariants that ensure proper interaction between document replication and peer recoveries, ultimately leading to documents not being properly replicated. As this is quite a tricky issue to understand, I'll first give a short summary of how document replication and peer recoveries integrate:

  1. Peer recoveries are triggered by the recovery target node (the one that wants to recover) but are only successfully started on the recovery source node (which holds the primary shard) once the source node knows that the recovery target shard exists. This check is realized by looking at the current cluster state on the recovery source node and checking in its routing table whether a corresponding initializing shard exists on the target node.
  2. Once this check passes, the source node remembers the current position in the translog and syncs the Lucene files to the target node (this is called phase 1). At the end of phase 1, the engine is started on the target recovery shard. From this moment on the shard accepts document writes.
  3. In a second phase the source shard takes a snapshot of the translog, containing all writes that have been added since the saved translog position, i.e. while the Lucene files were being copied to the target shard. The source shard then sends all the operations in the snapshot to the target shard. New operations that happen after the snapshot was taken are replicated to the target shard using the normal replication logic (see the sketch after this list).
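
To make the flow concrete, here is a toy, self-contained sketch of the two phases. Every name in it is a simplified stand-in invented for this illustration, not an actual Elasticsearch class:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the two-phase peer recovery described above.
// Every name here is an illustrative stand-in, not a real Elasticsearch class.
class PeerRecoverySketch {
    static final List<String> translog = new ArrayList<>();  // source primary's translog
    static final List<String> targetOps = new ArrayList<>(); // ops applied on the recovery target
    static boolean targetEngineOpen = false;

    static void recoverToTarget() {
        // Phase 1: remember the current translog position, then copy the Lucene files.
        int translogStart = translog.size();
        copyLuceneFiles();
        targetEngineOpen = true;  // from now on the target accepts document writes

        // Phase 2: snapshot the translog from the remembered position and replay it.
        // Writes indexed after this snapshot reach the target via normal replication.
        List<String> snapshot = new ArrayList<>(translog.subList(translogStart, translog.size()));
        targetOps.addAll(snapshot);
    }

    static void copyLuceneFiles() {
        // Copying files can take a while; writes keep arriving in the translog meanwhile.
        translog.add("doc-indexed-during-phase-1");
    }

    public static void main(String[] args) {
        translog.add("doc-indexed-before-recovery");
        recoverToTarget();
        System.out.println("replayed in phase 2: " + targetOps);
        System.out.println("target engine open: " + targetEngineOpen);
    }
}
```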

The following two invariants are (among others) required for data replication to properly integrate with peer recoveries:

Invariant 1: Document writes must be replicated based on the routing table of a cluster state that includes all shards which have ongoing or finished recoveries. This is ensured by the fact that we do not start a recovery that is not reflected by the cluster state available on the primary node, and that we always sample a fresh cluster state before starting to replicate write operations.
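
A small sketch of what this invariant buys us, using made-up types rather than the real cluster state and routing table classes: a freshly sampled routing table already lists every copy whose recovery the primary has started, so the write reaches recovering copies as well.

```java
import java.util.List;

// Illustration of invariant 1 with made-up types: a write is replicated
// using a freshly sampled routing table, which already lists every shard
// copy whose recovery has been started, including initializing ones.
class ReplicateWriteSketch {
    record ShardCopy(String node, boolean initializing) {}

    static void replicateWrite(String operation, List<ShardCopy> freshRoutingTable) {
        for (ShardCopy copy : freshRoutingTable) {
            // Initializing (recovering) copies get the operation too; a recovery
            // is only ever started once the copy shows up in the cluster state.
            System.out.println("send " + operation + " to " + copy.node()
                    + (copy.initializing() ? " (recovering)" : " (started)"));
        }
    }

    public static void main(String[] args) {
        List<ShardCopy> routingTable = List.of(
                new ShardCopy("node-a", false),
                new ShardCopy("node-b", true));  // replica still recovering
        replicateWrite("index doc-1", routingTable);
    }
}
```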

Invariant 2: Every operation that is not part of the snapshot taken for phase 2 must be successfully indexed on the target replica (pending shard-level errors, which will cause the target shard to be failed). To ensure this, we start replicating to the target shard as soon as the recovery starts, and we open its engine before we take the snapshot. All operations that are indexed after the snapshot was taken are guaranteed to arrive at the shard when it's ready to index them. Note that this also means that replication doesn't fail a shard if it's not yet ready to receive operations - that is a normal part of a recovering shard.
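
The last point can be sketched as follows (hypothetical names): an operation that reaches the replica before its engine is open is rejected, and the caller treats that as the normal state of a recovering shard rather than a reason to fail it, because the operation is still covered by the upcoming phase-2 snapshot.

```java
// Hypothetical sketch of the tolerance described above: a replica whose
// engine is not open yet rejects the operation, and the caller treats this
// as normal for a recovering shard instead of failing the shard.
class RecoveringReplicaSketch {
    static class EngineNotOpenException extends RuntimeException {}

    static boolean replicaEngineOpen = false;

    static void indexOnReplica(String operation) {
        if (!replicaEngineOpen) {
            throw new EngineNotOpenException();
        }
        System.out.println("indexed on replica: " + operation);
    }

    public static void main(String[] args) {
        try {
            indexOnReplica("doc-1");
        } catch (EngineNotOpenException e) {
            // Expected while the replica recovers: this operation precedes the
            // phase-2 snapshot, so it will be delivered by the snapshot replay.
            System.out.println("replica not ready; shard is NOT failed");
        }
    }
}
```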

With primary relocations, both invariants can be violated. To illustrate the issues, let's consider a primary that is relocating while another replica shard is recovering from that primary.

Invariant 1 can be violated if the target of the primary relocation lags so far behind on cluster state processing that it doesn't even know about the new initializing replica. This is very rare in practice, as replica recoveries take time to copy all the index files, but it is a theoretical gap that surfaces in testing scenarios.

Invariant 2 can be violated even if the target primary knows about the initializing replica. This can happen if the target primary replicates an operation to the initializing shard and that operation arrives at the initializing shard before it opens its engine, but arrives at the primary relocation source after it has taken the snapshot of the translog. Such operations are currently missed on the new initializing replica.
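
The race can be shown as a toy timeline (everything here is a hypothetical simplification, not real code from the recovery path): the operation is ignored on the replica because the engine is not open, and it misses the source's translog snapshot, so the later snapshot replay never delivers it.

```java
import java.util.ArrayList;
import java.util.List;

// Toy timeline of the invariant 2 violation (all names are hypothetical).
class LostOperationSketch {
    public static void main(String[] args) {
        List<String> sourceTranslog = new ArrayList<>(List.of("doc-1"));
        List<String> replicaDocs = new ArrayList<>();
        boolean replicaEngineOpen = false;

        // t1: the relocation source takes the phase-2 translog snapshot.
        List<String> snapshot = new ArrayList<>(sourceTranslog);

        // t2: the relocation target primary replicates "doc-2". The replica's
        // engine is not open, so the operation is ignored there (see above),
        // and it reaches the source only after the snapshot was taken.
        String op = "doc-2";
        if (!replicaEngineOpen) {
            System.out.println(op + " ignored by replica: engine not open");
        }
        sourceTranslog.add(op);

        // t3: the replica opens its engine and the snapshot is replayed.
        replicaEngineOpen = true;
        if (replicaEngineOpen) {
            replicaDocs.addAll(snapshot);
        }

        System.out.println("replica ends up with " + replicaDocs + " - doc-2 was lost");
    }
}
```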

The obvious easy fix would be to forbid any replica recovery while the primary is relocating. However, since primary relocation can take a long time (we do it in the background and throttle it), this would result in a large time window during which the cluster would not be able to recover from a potential replica loss (caused either by a network hiccup or a true node loss).

We are currently working on a fix along the following two directions:

  1. As part of the primary hand-off between source and target, the source will make sure the target knows about all ongoing recoveries. This ensures that invariant 1 cannot be violated.
  2. We will not start phase 2 of a recovery (where the snapshot is taken) after the hand-off has taken place (i.e. after the source shard state is RELOCATED). Since we now guarantee that no operations are in flight while the hand-off happens (Primary relocation handoff #15900), we know that from the moment operations are routed via the target primary, no new snapshots will be taken, which removes the precondition for violating invariant 2 (sketched below).
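
A sketch of that second direction, with invented names: before taking the phase-2 snapshot, the recovery source checks whether the hand-off has already happened.

```java
// Sketch of direction 2 with invented names: phase 2 refuses to start once
// the source shard has handed off to the relocation target (state RELOCATED).
class Phase2GuardSketch {
    enum SourceShardState { STARTED, RELOCATED }

    static void startPhase2(SourceShardState sourceState) {
        if (sourceState == SourceShardState.RELOCATED) {
            // After hand-off, new operations are routed via the relocation target,
            // so a snapshot taken now could miss operations (invariant 2).
            throw new IllegalStateException("primary already relocated; refusing to start phase 2");
        }
        System.out.println("taking translog snapshot and replaying it to the replica");
    }

    public static void main(String[] args) {
        startPhase2(SourceShardState.STARTED);   // allowed
        try {
            startPhase2(SourceShardState.RELOCATED);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```
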
ywelsch added the resiliency and :Distributed Indexing/Recovery labels Jul 4, 2016
ywelsch self-assigned this Jul 4, 2016
bleskes changed the title from "Indexing during primary relocation can lead to documents not being properly replicated" to "Indexing during primary relocation with ongoing replica recoveries can lead to documents not being properly replicated" Jul 4, 2016
ywelsch added a commit that referenced this issue Jul 19, 2016
…cation with ongoing replica recoveries (#19287)

The fix to reestablish invariant 1 is to ensure that the primary relocation target has a cluster state with all replica recoveries that were successfully started on the primary relocation source. The fix to reestablish invariant 2 is to check, after opening the engine on the replica, whether the primary has been relocated in the meantime, and to fail the recovery if so.
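
A minimal sketch of that second check, again with invented names: after the replica's engine is opened, the recovery verifies that the primary did not relocate in the meantime and fails the recovery otherwise.

```java
// Invented-name sketch of the invariant 2 fix from this commit: after the
// replica opens its engine, the recovery fails if the primary has relocated
// in the meantime, instead of continuing against a stale relocation source.
class PostEngineOpenCheckSketch {
    static void afterReplicaEngineOpened(boolean primaryRelocatedMeanwhile) {
        if (primaryRelocatedMeanwhile) {
            throw new IllegalStateException("primary relocated during recovery; failing the recovery");
        }
        System.out.println("primary unchanged; continuing with the translog snapshot replay");
    }

    public static void main(String[] args) {
        afterReplicaEngineOpened(false);
    }
}
```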

Closes #19248