[Segment Replication] Should consider using RAFT consensus algorithm for Segment replication #6369
@Jeevananthan-23 Thanks for raising this! Consensus is useful for electing cluster manager nodes, but I don't think it's required on primary failure. #2212 is about handling failover within a replication group with segment replication enabled. In the failover case today, the cluster manager node decides which replica should be elected as the new primary within the replication group, considering only whether the candidate is active and selecting the one furthest ahead in terms of OpenSearch version. With segment replication, we also want to take the candidate's latest SegmentInfos version into account. We want this to ensure that we 1) don't reindex documents that have already been indexed and 2) don't create new segments with names that already exist somewhere within the replication group.
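For illustration, here is a minimal sketch of what candidate selection could look like if the latest SegmentInfos version were added as a tiebreaker to the existing active/version checks. All names here (`ReplicaCandidate`, `electPrimary`, the fields) are hypothetical, not the actual RoutingNodes code:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical model of a failover candidate; not the actual OpenSearch API.
class ReplicaCandidate {
    boolean active;              // shard is active in the routing table
    int nodeVersion;             // OpenSearch version of the hosting node
    long segmentInfosVersion;    // latest SegmentInfos version the replica has synced

    ReplicaCandidate(boolean active, int nodeVersion, long segmentInfosVersion) {
        this.active = active;
        this.nodeVersion = nodeVersion;
        this.segmentInfosVersion = segmentInfosVersion;
    }
}

class PrimaryElection {
    // Pick the active replica that is furthest ahead: first by node version
    // (the existing rule), then by SegmentInfos version (the proposed tiebreaker).
    static Optional<ReplicaCandidate> electPrimary(List<ReplicaCandidate> replicas) {
        return replicas.stream()
            .filter(r -> r.active)
            .max(Comparator.comparingInt((ReplicaCandidate r) -> r.nodeVersion)
                .thenComparingLong(r -> r.segmentInfosVersion));
    }
}
```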
This is the desired case with segment replication. The newly promoted primary would have previously been syncing segments with the old primary, so it will have up to the old primary's latest segments at the time of failover. The new primary will continue indexing and create new segments that no other replica in the group has. The conflict occurs when the newly elected primary is behind the old primary but another replica in the replication group is up to date. In that case, the newly elected primary will replay from its translog after promotion and create new segments with the same names as existing segments on the other replica. #4365 was an attempt to prevent the newly elected primary from creating new segments with a name higher than that of a segment on a pre-existing replica. However, this solution is not foolproof: we only bump the counter (which drives the segment name) by an arbitrary amount, so if the newly elected primary was behind the old one by more than that amount, we could still see conflicts. If this happens, the newly elected primary will continue, yet the replicas would fail and need recovery.
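A hedged sketch of why the counter-bump approach from #4365 isn't foolproof (names and the bump amount are illustrative, not the actual code): the bump only avoids collisions if no replica's counter is further ahead than the bump amount.

```java
// Illustrative sketch of a #4365-style counter bump; not the actual code.
class SegmentCounterBump {
    // Arbitrary headroom added on promotion; segment file names are derived
    // from this counter, so advancing it skips over potentially used names.
    static final long PROMOTION_BUMP = 10_000;

    static long counterAfterPromotion(long newPrimaryCounter) {
        return newPrimaryCounter + PROMOTION_BUMP;
    }

    // The gap: if another replica already holds segments at or past the bumped
    // counter, the new primary will eventually reuse a segment name and that
    // replica will fail and need recovery.
    static boolean conflictPossible(long newPrimaryCounter, long mostAdvancedReplicaCounter) {
        return mostAdvancedReplicaCounter >= counterAfterPromotion(newPrimaryCounter);
    }
}
```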
There was a suggestion to store the former primary's state within cluster state, so that we increase the counter by a known amount instead of some arbitrary long. IMO we should update this logic that executes on cluster managers to fetch the latest checkpoint from all candidate replicas and select the one with the highest value. This would add some latency to fetch from each replica, but I can't imagine it being too expensive in exchange for correctness. Alternatively, we could update cluster state each time a replica syncs to a new set of segments, so that cluster managers already have this state, but that would be a very frequent update.
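The fetch-then-elect idea could look something like the following sketch. The transport round-trip is elided and all names are hypothetical: assume the cluster manager has already collected each active candidate's latest SegmentInfos version into a map keyed by node id.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: given each candidate's latest checkpoint, promote the
// node holding the highest SegmentInfos version.
class CheckpointBasedElection {
    static Optional<String> electByCheckpoint(Map<String, Long> checkpointsByNode) {
        return checkpointsByNode.entrySet().stream()
            .max(Comparator.comparingLong(e -> e.getValue()))
            .map(Map.Entry::getKey);
    }
}
```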
@mch2 Sorry for the late reply; I did some research on existing Elasticsearch consensus solutions, and they also don't rely on Raft consensus.
As you mentioned, the new primary promotion must take the latest SegmentInfos version into account.
My proposal is that we should consider how the latest checkpoint is fetched in sequence-number-based replication. As you mentioned, that fetch has some latency, and this is where a proper implementation of Raft for coordination should help. I know this may be difficult to implement, but I'm looking forward to the benchmarking results in #2583. Thanks!
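For context, Raft's election rule already encodes the "most up-to-date copy wins" property being discussed here: a voter refuses any candidate whose log is behind its own (Raft paper, §5.4.1). A minimal, purely illustrative sketch of that check, with the latest SegmentInfos version playing the role of the last log index:

```java
// Illustrative Raft up-to-date check; the SegmentInfos version stands in for
// Raft's last log index. Not an OpenSearch mechanism.
class RaftVoteCheck {
    // A voter grants its vote only if the candidate's log is at least as
    // up to date as its own: higher last term wins, ties broken by log length.
    static boolean candidateIsUpToDate(long candidateLastTerm, long candidateLastIndex,
                                       long voterLastTerm, long voterLastIndex) {
        if (candidateLastTerm != voterLastTerm) {
            return candidateLastTerm > voterLastTerm;
        }
        return candidateLastIndex >= voterLastIndex;
    }
}
```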
+1 to adding the logic to fetch the latest checkpoint before promoting a replica to primary. Whenever you choose to implement it, a note on the code reference for RoutingNodes: that logic runs while processing new cluster state, which executes on the single-threaded executor for cluster state updates (OpenSearch/server/src/main/java/org/opensearch/cluster/action/shard/ShardStateAction.java, line 527 at commit 7472aa9).
This might turn out to be expensive: if segments are created every few seconds, it could result in too many requests to the ClusterManager, which wouldn't be preferred. The ClusterManager shouldn't be in the direct indexing path.
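If checkpoints were pushed into cluster state anyway, some form of batching would likely be needed so that segment creation every few seconds doesn't translate into one ClusterManager request per checkpoint. A hedged sketch of one possible debounce, assuming a hypothetical `publishToClusterManager` callback; this is not an actual OpenSearch mechanism:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongConsumer;

// Hypothetical debounce: publish at most one checkpoint update per interval,
// always keeping only the newest value seen since the last flush.
class CheckpointPublisher {
    private final AtomicLong latest = new AtomicLong(-1);
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();

    CheckpointPublisher(LongConsumer publishToClusterManager, long intervalSeconds) {
        // Periodically flush only the newest checkpoint, collapsing bursts of
        // segment creation into a single ClusterManager request per interval.
        scheduler.scheduleAtFixedRate(() -> {
            long value = latest.getAndSet(-1);
            if (value >= 0) {
                publishToClusterManager.accept(value);
            }
        }, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
    }

    // Called on every new set of segments; cheap, only records the max version.
    void onNewCheckpoint(long segmentInfosVersion) {
        latest.accumulateAndGet(segmentInfosVersion, Math::max);
    }
}
```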
Hello @mikemccand / @mch2, I would like to understand, in the context of the proposals below, how shard promotion (leader election) works. Why not consider a distributed consensus algorithm like Raft? What happens when the primary shard dies first inside the cluster and the newly promoted primary shard has the same segment number, and how does promotion happen then?
Using a distributed consensus algorithm like Raft could be a great choice for coordinating the copying of merged segments to replicas and for supporting leader election, as @mikemccand mentioned in his blog on segment replication and cluster state.
Originally posted by @Jeevananthan-23 in #2212 (comment)