
[Segment Replication] Update shard promotion algorithm to consider replica checkpoints. #3988

Closed
Tracked by #2212
mch2 opened this issue Jul 22, 2022 · 7 comments
Assignees: dreamer-89
Labels: distributed framework, enhancement

Comments


mch2 commented Jul 22, 2022

With segment replication we would like to avoid situations where replicas contain a segment that differs from the primary's version. After a read-only replica is promoted as the new primary, we will need to index the operations that exist in its xlog but not in its index, and make them searchable. The presence of these ops in the replica's xlog means the previous primary had indexed them but had not finished pushing out the latest segments to any/all replicas before failure.

As suggested in #2212, to avoid this situation we would like to implement a best-effort approach that selects the furthest-ahead replica as the new primary and avoids reindexing.

#2212 (comment) suggests that we can accomplish this by extending PrimaryShardAllocator's async fetch, which fetches which shards are in sync, to include checkpoint data from each shard when selecting a new primary.
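
To make the idea concrete, here is a minimal, self-contained sketch of ordering the fetched in-sync copies by their reported checkpoint and picking the furthest-ahead one. These are hypothetical stand-ins, not OpenSearch's real classes, and the ordering (higher primary term first, then newer segment infos) is an assumption:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch only; these types are stand-ins, not OpenSearch's real classes.
public class CheckpointAwarePromotionSketch {

    /** Minimal stand-in for the checkpoint a shard copy reports during the async fetch. */
    record ReplicationCheckpointInfo(long primaryTerm, long segmentInfosVersion) {}

    /** Minimal stand-in for one node's entry in the async fetch response. */
    record NodeShardState(String nodeId, ReplicationCheckpointInfo checkpoint) {}

    /** Orders copies furthest-ahead first: highest primary term, then newest segment infos. */
    static final Comparator<NodeShardState> FURTHEST_AHEAD_FIRST =
        Comparator.comparingLong((NodeShardState s) -> s.checkpoint().primaryTerm())
            .thenComparingLong(s -> s.checkpoint().segmentInfosVersion())
            .reversed();

    /** Best-effort choice of the new primary among the in-sync copies. */
    static Optional<NodeShardState> selectNewPrimary(List<NodeShardState> inSyncCopies) {
        return inSyncCopies.stream().min(FURTHEST_AHEAD_FIRST);
    }

    public static void main(String[] args) {
        List<NodeShardState> copies = List.of(
            new NodeShardState("node-1", new ReplicationCheckpointInfo(2, 5)),
            new NodeShardState("node-2", new ReplicationCheckpointInfo(2, 7)));
        // node-2 has newer segments at the same primary term, so it should win.
        System.out.println(selectNewPrimary(copies).orElseThrow().nodeId()); // node-2
    }
}
```

The real decision would also have to respect the in-sync allocation IDs so that a stale copy is never chosen; the sketch only shows the ordering step.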

mch2 added the enhancement and distributed framework labels on Jul 22, 2022
dreamer-89 self-assigned this on Jul 25, 2022

dreamer-89 commented Jul 26, 2022

A rough set of steps for this task:

  • Write a unit test with 1 primary and 2 replicas on different checkpoints. Fail the primary and check that a replica is promoted irrespective of its checkpoint state. (A sketch of the desired end state follows this list.)
  • Update PrimaryShardAllocator.makeAllocationDecision to include the checkpoint info
  • Ensure the unit test above passes
  • Milestone 1: primary promotion happy path works with segrep
  • Update PrimaryShardAllocatorTests to add more unit tests around recovery/failover
  • Wait for [Segment Replication] Swap replica to writeable engine during failover. #3989, or add a stub to prevent failures during primary promotion
  • Write a basic integration test mimicking the unit test above.
  • Add more integration tests with different failover scenarios/events
  • Fix new bugs, if any
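
For reference, a rough sketch of what the first unit test could assert once the allocator change is in place. Copy and failPrimaryAndPromote are illustrative names, not the real OpenSearch test fixtures or allocator entry points:

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; all names are illustrative, not the real test fixtures.
public class PrimaryFailoverSketchTest {

    record Copy(String nodeId, boolean primary, long segmentInfosVersion) {}

    /** Fail the primary and promote the furthest-ahead remaining replica. */
    static Copy failPrimaryAndPromote(List<Copy> copies) {
        Copy best = copies.stream()
            .filter(c -> !c.primary())
            .max(Comparator.comparingLong(Copy::segmentInfosVersion))
            .orElseThrow();
        return new Copy(best.nodeId(), true, best.segmentInfosVersion());
    }

    public static void main(String[] args) {
        // 1 primary and 2 replicas on different checkpoints; the primary fails.
        List<Copy> copies = List.of(
            new Copy("n0", true, 9),   // failing primary
            new Copy("n1", false, 7),  // replica, behind
            new Copy("n2", false, 8)); // replica, furthest ahead

        Copy promoted = failPrimaryAndPromote(copies);
        if (!promoted.nodeId().equals("n2") || !promoted.primary()) {
            throw new AssertionError("expected furthest-ahead replica n2 to be promoted");
        }
        System.out.println("promoted: " + promoted.nodeId());
    }
}
```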


dreamer-89 commented Jul 29, 2022

Below are the use cases related to primary allocation:

  1. RoutingNodes.failShard. This workflow comes into the picture when a primary fails on a node. It chooses the replica on the node with the highest version (see the sketch after this list) and is used in the scenarios below:
    • CancelAllocationCommand. REROUTE_CANCELLED. Cancels an existing allocation/recovery.
    • gateway.ReplicaShardAllocator. REALLOCATED_REPLICA. Cancels an existing allocation when a better replica is identified, i.e. one resulting in a no-op recovery.
    • ShardStateAction. Local shard failure update to the cluster manager node. Covers a missing local shard, or failures during index creation/updates.
  2. Cluster reroute. The cluster reroute API allows a user to move shards (including primaries) from node A to B.
  3. Shard balancing. This applies during new index creation.
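
As an illustration of case 1, a small self-contained sketch of the existing "promote the active replica on the highest-version node" behavior. Replica, nodeVersionId, and activeReplicaWithHighestNodeVersion are hypothetical names, not the real RoutingNodes API:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of the behavior described in case 1; stand-in types only.
public class HighestNodeVersionPromotionSketch {

    record Replica(String allocationId, int nodeVersionId, boolean active) {}

    /** Mirrors "choose the active replica on the node with the highest version". */
    static Optional<Replica> activeReplicaWithHighestNodeVersion(List<Replica> replicas) {
        return replicas.stream()
            .filter(Replica::active)
            .max(Comparator.comparingInt(Replica::nodeVersionId));
    }

    public static void main(String[] args) {
        List<Replica> replicas = List.of(
            new Replica("a1", 2_04_00_99, true),   // node on an older version
            new Replica("a2", 2_05_00_99, true),   // node on the newest version -> promoted
            new Replica("a3", 2_05_00_99, false)); // inactive copy, ignored
        System.out.println(activeReplicaWithHighestNodeVersion(replicas).orElseThrow().allocationId()); // a2
    }
}
```

Note that this ordering looks only at node versions, not at how far each copy's segments have advanced, which is exactly the gap this issue is about.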


dreamer-89 commented Jul 31, 2022

It appears that AllocationService orchestrates shard allocation. It handles allocation using RoutingNodes (responsible for maintaining shard routing state) and shard allocators (which perform the actual shard allocation). Checking further with an integration test.


dreamer-89 commented Aug 1, 2022

On shard failure, the cluster manager first tries to promote the active replica (identified from the cluster state in RoutingNodes) that is on the node with the highest version. If no replica is available, the cluster manager waits for cluster updates to trigger primary assignment via PrimaryShardAllocator.

With this info, separate handling is needed for the RoutingNodes.failShard workflow.

Failover scenarios (a sketch of the two paths follows this list):

  1. RoutingNodes.failShard. This is used when a node is marked faulty by the FollowersChecker, leading the coordinator to run NodeRemovalClusterStateTaskExecutor. It removes the dead nodes and fails their shards using RoutingNodes.failShard, followed by a reroute (step 2 below).
  2. PrimaryShardAllocator. Used during cluster reroute actions to assign unassigned shards. This runs on cluster state updates (index create/delete/open/close, shard started/closed, cluster settings update, node join, node leave), delayed allocation routing, and snapshot restore.
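
A minimal sketch of how the two paths relate, assuming illustrative types (ShardCopy, promoteOnFailShard, and allocateUnassignedPrimary are not real OpenSearch names): path 1 promotes synchronously from the cluster state, and only when it finds no active replica does the shard fall through to path 2 on a later reroute:

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of the two failover paths; the real wiring is more involved.
public class FailoverPathsSketch {

    record ShardCopy(String allocationId, boolean active) {}

    /** Path 1: immediate in-place promotion inside RoutingNodes.failShard. */
    static Optional<ShardCopy> promoteOnFailShard(List<ShardCopy> clusterStateReplicas) {
        // Only replicas already active in the cluster state are candidates here.
        return clusterStateReplicas.stream().filter(ShardCopy::active).findFirst();
    }

    /** Path 2: PrimaryShardAllocator assigns the still-unassigned primary on a later reroute. */
    static Optional<ShardCopy> allocateUnassignedPrimary(List<ShardCopy> fetchedInSyncCopies) {
        // This path can consult asynchronously fetched shard metadata (e.g. checkpoints).
        return fetchedInSyncCopies.stream().filter(ShardCopy::active).findFirst();
    }

    public static void main(String[] args) {
        // No active replica in the cluster state, so path 1 yields nothing and the
        // shard stays unassigned until path 2 runs on a cluster state update.
        List<ShardCopy> clusterStateReplicas = List.of(new ShardCopy("a1", false));
        List<ShardCopy> fetchedCopies = List.of(new ShardCopy("a1", true));
        ShardCopy promoted = promoteOnFailShard(clusterStateReplicas)
            .or(() -> allocateUnassignedPrimary(fetchedCopies))
            .orElseThrow();
        System.out.println("assigned primary: " + promoted.allocationId());
    }
}
```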


dreamer-89 commented Aug 1, 2022

Evaluated the option of skipping primary promotion in RoutingNodes.failShard (failure scenario 1 above, i.e. a node leaving the cluster). RoutingNodes#failShard is also used for updating cluster state, cancelling recoveries, etc. Skipping the primary-promotion logic in RoutingNodes.failShard led to multiple assertion failures at different levels. Removing this logic would require multiple changes to the core allocation mechanism and would be a huge effort.


dreamer-89 commented Aug 4, 2022

PR for the PrimaryShardAllocator primary promotion logic: #4041

Taking up RoutingNodes.failShard primary promotion logic in #4131

dreamer-89 commented

Closing this in favour of #4131 which tackles the second part of handling shard failure in RoutingNodes.
