
[Segment Replication] Update shard promotion algorithm to consider replica checkpoints. #3988

Closed
Tracked by #2212
mch2 opened this issue Jul 22, 2022 · 7 comments
Assignees: dreamer-89
Labels: distributed framework, enhancement

Comments


mch2 commented Jul 22, 2022

With segment replication we would like to avoid situations where replicas contain a segment that differs from the primary's version. After a read-only replica is promoted as the new primary, we will need to index the operations that exist in its xlog but not in its index, and make them searchable. The presence of these ops in the replica's xlog means the previous primary had indexed them but had not finished pushing out the latest segments to any/all replicas before failure.

As suggested in #2212, to avoid this situation we would like to implement a best-effort approach that selects the furthest-ahead replica as the new primary and avoids reindexing.

#2212 (comment) suggests that we can accomplish this by extending PrimaryShardAllocator's async fetch, which fetches which shards are in sync, to include checkpoint data from each shard when selecting a new primary.
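
To make the idea concrete, here is a minimal, self-contained sketch of ordering the fetched in-sync copies by their reported checkpoint and picking the furthest-ahead one. These are hypothetical stand-ins, not OpenSearch's real classes, and the ordering (higher primary term first, then newer segment infos) is an assumption:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch only; these types are stand-ins, not OpenSearch's real classes.
public class CheckpointAwarePromotionSketch {

    /** Minimal stand-in for the checkpoint a shard copy reports during the async fetch. */
    record ReplicationCheckpointInfo(long primaryTerm, long segmentInfosVersion) {}

    /** Minimal stand-in for one node's entry in the async fetch response. */
    record NodeShardState(String nodeId, ReplicationCheckpointInfo checkpoint) {}

    /** Orders copies furthest-ahead first: highest primary term, then newest segment infos. */
    static final Comparator<NodeShardState> FURTHEST_AHEAD_FIRST =
        Comparator.comparingLong((NodeShardState s) -> s.checkpoint().primaryTerm())
            .thenComparingLong(s -> s.checkpoint().segmentInfosVersion())
            .reversed();

    /** Best-effort choice of the new primary among the in-sync copies. */
    static Optional<NodeShardState> selectNewPrimary(List<NodeShardState> inSyncCopies) {
        return inSyncCopies.stream().min(FURTHEST_AHEAD_FIRST);
    }

    public static void main(String[] args) {
        List<NodeShardState> copies = List.of(
            new NodeShardState("node-1", new ReplicationCheckpointInfo(2, 5)),
            new NodeShardState("node-2", new ReplicationCheckpointInfo(2, 7)));
        // node-2 has newer segments at the same primary term, so it should win.
        System.out.println(selectNewPrimary(copies).orElseThrow().nodeId()); // node-2
    }
}
```

The real decision would also have to respect the in-sync allocation IDs so that a stale copy is never chosen; the sketch only shows the ordering step.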

mch2 added the enhancement and distributed framework labels on Jul 22, 2022
dreamer-89 self-assigned this on Jul 25, 2022

dreamer-89 commented Jul 26, 2022

A rough set of steps for this task:

  • Write a unit test with 1 primary and 2 replicas on different checkpoints. Fail the primary and check that a replica is promoted irrespective of its checkpoint state. (A sketch of the desired end state follows this list.)
  • Update PrimaryShardAllocator.makeAllocationDecision to include the checkpoint info
  • Ensure the unit test above passes
  • Milestone 1: primary promotion happy path works with segrep
  • Update PrimaryShardAllocatorTests to add more unit tests around recovery/failover
  • Wait for [Segment Replication] Swap replica to writeable engine during failover. #3989, or add a stub to prevent failures during primary promotion
  • Write a basic integration test mimicking the unit test above.
  • Add more integration tests with different failover scenarios/events
  • Fix new bugs, if any
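
For reference, a rough sketch of what the first unit test could assert once the allocator change is in place. Copy and failPrimaryAndPromote are illustrative names, not the real OpenSearch test fixtures or allocator entry points:

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; all names are illustrative, not the real test fixtures.
public class PrimaryFailoverSketchTest {

    record Copy(String nodeId, boolean primary, long segmentInfosVersion) {}

    /** Fail the primary and promote the furthest-ahead remaining replica. */
    static Copy failPrimaryAndPromote(List<Copy> copies) {
        Copy best = copies.stream()
            .filter(c -> !c.primary())
            .max(Comparator.comparingLong(Copy::segmentInfosVersion))
            .orElseThrow();
        return new Copy(best.nodeId(), true, best.segmentInfosVersion());
    }

    public static void main(String[] args) {
        // 1 primary and 2 replicas on different checkpoints; the primary fails.
        List<Copy> copies = List.of(
            new Copy("n0", true, 9),   // failing primary
            new Copy("n1", false, 7),  // replica, behind
            new Copy("n2", false, 8)); // replica, furthest ahead

        Copy promoted = failPrimaryAndPromote(copies);
        if (!promoted.nodeId().equals("n2") || !promoted.primary()) {
            throw new AssertionError("expected furthest-ahead replica n2 to be promoted");
        }
        System.out.println("promoted: " + promoted.nodeId());
    }
}
```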


dreamer-89 commented Jul 29, 2022

Below are the use cases related to primary allocation:

  1. RoutingNodes.failShard. This workflow comes into the picture when a primary fails on a node. It chooses the replica on the node with the highest version (see the sketch after this list) and is used in the scenarios below:
    • CancelAllocationCommand. REROUTE_CANCELLED. Cancels an existing allocation/recovery.
    • gateway.ReplicaShardAllocator. REALLOCATED_REPLICA. Cancels an existing allocation when a better replica is identified, i.e. one resulting in a no-op recovery.
    • ShardStateAction. Local shard failure update to the cluster manager node. Covers a missing local shard, or failures during index creation/updates.
  2. Cluster reroute. The cluster reroute API allows a user to move shards (including primaries) from node A to B.
  3. Shard balancing. This applies during new index creation.
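
As an illustration of case 1, a small self-contained sketch of the existing "promote the active replica on the highest-version node" behavior. Replica, nodeVersionId, and activeReplicaWithHighestNodeVersion are hypothetical names, not the real RoutingNodes API:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of the behavior described in case 1; stand-in types only.
public class HighestNodeVersionPromotionSketch {

    record Replica(String allocationId, int nodeVersionId, boolean active) {}

    /** Mirrors "choose the active replica on the node with the highest version". */
    static Optional<Replica> activeReplicaWithHighestNodeVersion(List<Replica> replicas) {
        return replicas.stream()
            .filter(Replica::active)
            .max(Comparator.comparingInt(Replica::nodeVersionId));
    }

    public static void main(String[] args) {
        List<Replica> replicas = List.of(
            new Replica("a1", 2_04_00_99, true),   // node on an older version
            new Replica("a2", 2_05_00_99, true),   // node on the newest version -> promoted
            new Replica("a3", 2_05_00_99, false)); // inactive copy, ignored
        System.out.println(activeReplicaWithHighestNodeVersion(replicas).orElseThrow().allocationId()); // a2
    }
}
```

Note that this ordering looks only at node versions, not at how far each copy's segments have advanced, which is exactly the gap this issue is about.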


dreamer-89 commented Jul 31, 2022

It appears that AllocationService orchestrates shard allocation. It handles allocation using RoutingNodes (responsible for maintaining shard routing state) and shard allocators (which perform the actual shard allocation). Checking further with an integration test.


dreamer-89 commented Aug 1, 2022

On shard failure, the cluster manager first tries to promote the active replica (identified from the cluster state in RoutingNodes) that is on the node with the highest version. If no replica is available, the cluster manager waits for cluster updates to trigger primary assignment via PrimaryShardAllocator.

With this info, separate handling is needed for the RoutingNodes.failShard workflow.

Failover scenarios (a sketch of the two paths follows this list):

  1. RoutingNodes.failShard. This is used when a node is marked faulty by the FollowersChecker, leading the coordinator to run NodeRemovalClusterStateTaskExecutor. It removes the dead nodes and fails their shards using RoutingNodes.failShard, followed by a reroute (step 2 below).
  2. PrimaryShardAllocator. Used during cluster reroute actions to assign unassigned shards. This runs on cluster state updates (index create/delete/open/close, shard started/closed, cluster settings update, node join, node leave), delayed allocation routing, and snapshot restore.
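
A minimal sketch of how the two paths relate, assuming illustrative types (ShardCopy, promoteOnFailShard, and allocateUnassignedPrimary are not real OpenSearch names): path 1 promotes synchronously from the cluster state, and only when it finds no active replica does the shard fall through to path 2 on a later reroute:

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of the two failover paths; the real wiring is more involved.
public class FailoverPathsSketch {

    record ShardCopy(String allocationId, boolean active) {}

    /** Path 1: immediate in-place promotion inside RoutingNodes.failShard. */
    static Optional<ShardCopy> promoteOnFailShard(List<ShardCopy> clusterStateReplicas) {
        // Only replicas already active in the cluster state are candidates here.
        return clusterStateReplicas.stream().filter(ShardCopy::active).findFirst();
    }

    /** Path 2: PrimaryShardAllocator assigns the still-unassigned primary on a later reroute. */
    static Optional<ShardCopy> allocateUnassignedPrimary(List<ShardCopy> fetchedInSyncCopies) {
        // This path can consult asynchronously fetched shard metadata (e.g. checkpoints).
        return fetchedInSyncCopies.stream().filter(ShardCopy::active).findFirst();
    }

    public static void main(String[] args) {
        // No active replica in the cluster state, so path 1 yields nothing and the
        // shard stays unassigned until path 2 runs on a cluster state update.
        List<ShardCopy> clusterStateReplicas = List.of(new ShardCopy("a1", false));
        List<ShardCopy> fetchedCopies = List.of(new ShardCopy("a1", true));
        ShardCopy promoted = promoteOnFailShard(clusterStateReplicas)
            .or(() -> allocateUnassignedPrimary(fetchedCopies))
            .orElseThrow();
        System.out.println("assigned primary: " + promoted.allocationId());
    }
}
```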


dreamer-89 commented Aug 1, 2022

Evaluated the option of skipping primary promotion in RoutingNodes.failShard (failure scenario 1 above, i.e. a node leaving the cluster). RoutingNodes#failShard is also used for updating cluster state, cancelling recoveries, etc. Skipping the primary-promotion logic in RoutingNodes.failShard led to multiple assertion failures at different levels. Removing this logic would require multiple changes to the core allocation mechanism and would be a huge effort.


dreamer-89 commented Aug 4, 2022

PR for the PrimaryShardAllocator primary promotion logic: #4041

Taking up RoutingNodes.failShard primary promotion logic in #4131

dreamer-89 commented

Closing this in favour of #4131 which tackles the second part of handling shard failure in RoutingNodes.
