[Segment Replication] Update shard promotion algorithm to consider replica checkpoints. #3988
Comments
Set of crude steps for this task
Below are use cases related to primary allocation
On shard failure, the master first tries to promote an active replica (identified from the cluster state in RoutingNodes) that has the highest engine version. If there is no available replica, the master waits for cluster state updates to trigger primary assignment via PrimaryShardAllocator. Given this, separate handling is needed for the RoutingNodes.failShard workflow. Failover scenarios:
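A minimal sketch of the promotion decision described above, picking the furthest-ahead active replica. The `ReplicaCheckpoint` record and its fields are hypothetical stand-ins for whatever checkpoint data the cluster state exposes, not the actual OpenSearch types:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical holder for an active replica and its latest checkpoint;
// the class and field names are illustrative, not the OpenSearch API.
record ReplicaCheckpoint(String allocationId, long primaryTerm, long segmentInfosVersion) {}

final class PrimaryPromoter {

    // Pick the furthest-ahead active replica: compare the primary term
    // first, then the segment infos version within the same term. An
    // empty result means no active replica exists, and allocation falls
    // back to PrimaryShardAllocator on a later cluster state update.
    static Optional<ReplicaCheckpoint> selectNewPrimary(List<ReplicaCheckpoint> activeReplicas) {
        return activeReplicas.stream()
            .max(Comparator.comparingLong(ReplicaCheckpoint::primaryTerm)
                .thenComparingLong(ReplicaCheckpoint::segmentInfosVersion));
    }
}
```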
Evaluated the option of ignoring the primary promotion in
Closing this in favour of #4131, which tackles the second part of handling shard failure in RoutingNodes.
With segment replication we would like to avoid situations where replicas contain segments that differ from the primary's version. After a read-only replica is promoted as the new primary, we will need to index the operations that exist in its xlog but not in its index, and make them searchable. The presence of these ops in the replica's xlog means the previous primary had indexed them but had not finished pushing out the latest segments to any/all replicas before failure.
As suggested in #2212, to avoid this situation we would like to implement a best-effort approach that selects the furthest-ahead replica as the new primary and avoids reindexing.
#2212 (comment) suggests that we can accomplish this by extending PrimaryShardAllocator's async fetch, which fetches the set of in-sync shard copies, to also include checkpoint data from each shard when selecting a new primary.
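One way the extended fetch could be consumed, sketched with a hypothetical `NodeShardState` response type; the real PrimaryShardAllocator and its async-fetch types in OpenSearch differ from these names:

```java
import java.util.Comparator;
import java.util.Map;
import java.util.Optional;

// Hypothetical per-node fetch result: whether this shard copy is in the
// in-sync set, plus its latest replication checkpoint. The names are
// illustrative; the real async-fetch types in OpenSearch differ.
record NodeShardState(String nodeId, boolean inSync, long primaryTerm, long segmentInfosVersion) {}

final class CheckpointAwareAllocator {

    // Among the in-sync copies returned by the async fetch, prefer the
    // copy with the highest checkpoint so the promoted primary needs the
    // least (ideally no) reindexing from its xlog.
    static Optional<NodeShardState> choosePrimary(Map<String, NodeShardState> fetchedStates) {
        return fetchedStates.values().stream()
            .filter(NodeShardState::inSync)
            .max(Comparator.comparingLong(NodeShardState::primaryTerm)
                .thenComparingLong(NodeShardState::segmentInfosVersion));
    }
}
```

Ordering by primary term before segment version assumes the usual rule that progress made under a newer primary term supersedes progress made under an older one.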