Update Segment replication stats APIs to support pull based architecture #15534
Hi @mch2, can I take this task?
Documentation for the current CAT SegRep Stats API: I tried to explain below how this API works today. Currently, once replication is complete for the latest received checkpoint, each replica updates its visible checkpoint to the primary shard. The primary keeps track of all the checkpoints replicated at the replicas. Whenever stats are requested, the primary calculates the stats using the tracked replica checkpoints. The algorithm for this API works as follows:
The primary shard publishes the checkpoint, and all the replica shards receive it. Once a replica receives the checkpoint, it starts the replication process for that checkpoint. Once replication is complete, the replica reports latestVisibleCheckpoint back to the primary.
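To make that push-based flow concrete, here is a minimal Java sketch of a primary tracking the visible checkpoints reported back by replicas and deriving a per-replica stat from them. All type and method names (PrimaryCheckpointTracker, ReplicationCheckpoint, checkpointsBehind) are hypothetical and only illustrate the idea, not the actual OpenSearch classes.

```java
// Sketch only: hypothetical types, not the actual OpenSearch implementation.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class PrimaryCheckpointTracker {
    // Latest checkpoint published by the primary shard.
    private volatile ReplicationCheckpoint latestPublished;
    // Visible (fully replicated) checkpoint reported back by each replica, keyed by allocation id.
    private final Map<String, ReplicationCheckpoint> visibleByReplica = new ConcurrentHashMap<>();

    void publish(ReplicationCheckpoint checkpoint) {
        this.latestPublished = checkpoint;
        // ... transport layer pushes the checkpoint to all replicas ...
    }

    // Called when a replica finishes replication and reports back (updateVisibleCheckpoint).
    void updateVisibleCheckpoint(String allocationId, ReplicationCheckpoint visible) {
        visibleByReplica.put(allocationId, visible);
    }

    // Stats are derived on the primary by diffing the published vs. visible checkpoints.
    long checkpointsBehind(String allocationId) {
        ReplicationCheckpoint visible = visibleByReplica.get(allocationId);
        if (latestPublished == null || visible == null) {
            return 0;
        }
        return latestPublished.version() - visible.version();
    }
}

record ReplicationCheckpoint(long version, long lengthInBytes) {}
```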
For search replicas, an async process runs continuously at a fixed interval (default 1s). The job of the async process is to call startReplication against the remote store.
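As a rough illustration of that polling loop, the sketch below schedules a 1-second task that reads the latest checkpoint version from the remote store and starts replication when it is ahead of what the shard has applied. RemoteStoreSource and SegmentReplicationTarget are hypothetical interfaces, not the real OpenSearch types.

```java
// Sketch only: a periodic poller that pulls the latest checkpoint version from the
// remote store and starts replication if the shard has fallen behind it.
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class SearchReplicaRefresher {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final RemoteStoreSource remoteStore;   // hypothetical: reads checkpoints from the remote segment store
    private final SegmentReplicationTarget target; // hypothetical: applies segments on the search replica

    SearchReplicaRefresher(RemoteStoreSource remoteStore, SegmentReplicationTarget target) {
        this.remoteStore = remoteStore;
        this.target = target;
    }

    void start() {
        // Default interval of 1s, matching the behaviour described above.
        scheduler.scheduleWithFixedDelay(this::poll, 1, 1, TimeUnit.SECONDS);
    }

    private void poll() {
        long remote = remoteStore.latestCheckpointVersion();
        if (remote > target.appliedCheckpointVersion()) {
            target.startReplication(remote); // pull based: the primary is not involved
        }
    }
}

interface RemoteStoreSource { long latestCheckpointVersion(); }
interface SegmentReplicationTarget { long appliedCheckpointVersion(); void startReplication(long version); }
```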
With the addition of the search replica, we cannot continue to calculate the SegRep stats the way we do now, because with the reader/writer split we ensure that primaries do not have any direct communication with replicas, i.e. the primary is not aware of the search replica checkpoints. So we have to calculate the stats for the search replica differently.
Currently the calculation happens in
Currently the diff is between
IndexShard has two Listeners for the refresh
After replication completes on the replica, it calls updateVisibleCheckpoint. UpdateVisibleCheckpointRequestHandler is registered with transportService in SegmentReplicationSourceService.
Segment replication on a regular replica works based on the checkpoint received from the primary, whereas on a search replica the segments are actually pulled from the remote store. SegmentReplicator has two startReplication methods.
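Conceptually, the two entry points differ in what triggers them and where the segments come from. The sketch below is only an illustration with hypothetical types; the real SegmentReplicator signatures may differ.

```java
// Sketch only: two replication entry points with hypothetical types;
// the real SegmentReplicator signatures may differ.
class ReplicatorSketch {
    record Checkpoint(long version) {}
    interface RemoteStore { Checkpoint latestCheckpoint(); }

    // Regular replica: triggered by a checkpoint pushed from the primary.
    void startReplication(Checkpoint receivedFromPrimary) {
        // copy segments for this specific checkpoint, then report back via updateVisibleCheckpoint
    }

    // Search replica: triggered locally, pulling whatever is latest in the remote store.
    void startReplication(RemoteStore remoteStore) {
        Checkpoint latest = remoteStore.latestCheckpoint();
        // copy segments up to `latest`; the primary is never contacted
    }
}
```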
Thanks @mch2 for helping me onboard to this issue. As the above section provides some detailed background on this issue, in the following comments I will add the proposed solution details.
Solution 1: No change to the existing calculation; stats for search replicas calculated differently
We will keep the current way of calculation for the regular replicas, but start supporting the search replicas in the stats as well. This can be achieved by calculating and returning the stats from the search replica shard and using them in the coordinator to enrich the overall stats to include the search replica stats.
Pros:
Cons:
Deep dive on implementation:
We need to calculate the
So for CheckPointBehindCount and BytesBehindCount calculation we can use
For CurrentReplicationLag and LastCompletedReplicationLag calculation we can use the timer available in the
In the coordinator node, we can use the
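A minimal sketch of how Solution 1 could fit together at the coordinator, assuming hypothetical types (ReplicaSegRepStats, PrimaryStatsResponse): the primary-computed stats for regular replicas are kept as they are today, and the stats returned directly by the search replica shards are merged into the overall response.

```java
// Sketch only: Solution 1 with hypothetical types. Regular-replica stats still come
// from the primary; search replica stats are computed on the search replicas and
// merged into the overall response at the coordinator.
import java.util.ArrayList;
import java.util.List;

record ReplicaSegRepStats(String allocationId, long checkpointsBehind, long bytesBehind,
                          long currentReplicationLagMillis, long lastCompletedReplicationLagMillis) {}

interface PrimaryStatsResponse {
    List<ReplicaSegRepStats> regularReplicaStats(); // computed on the primary, as today
}

class SegRepStatsCoordinator {
    List<ReplicaSegRepStats> merge(PrimaryStatsResponse fromPrimary,
                                   List<ReplicaSegRepStats> fromSearchReplicas) {
        List<ReplicaSegRepStats> all = new ArrayList<>(fromPrimary.regularReplicaStats());
        all.addAll(fromSearchReplicas); // enrich with the search replica stats
        return all;
    }
}
```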
Solution 2: Calculate the stats for each replica at the replica shard level
We can calculate the stats at the replicas and return those stats to the coordinator. The coordinator will combine all the stats received from the replicas.
Pros:
Cons:
Deep dive on implementation:
We need to calculate the
So for CheckPointBehindCount and BytesBehindCount calculation we can use
For CurrentReplicationLag and LastCompletedReplicationLag calculation we can use the timer available in the
Once every replica shard returns the
rejectionCount cannot be computed as it comes from the pressure service; alternatively, we can return this data from the primary shard and use it in the coordinator. rejectionCount is not applicable for the search replicas.
Proposed metrics definition:
CheckPointBehindCount: Number of checkpoints by which the replica is behind with respect to the latest received checkpoint on the replica.
BytesBehindCount: Number of bytes by which the replica is behind with respect to the latest received checkpoint on the replica.
CurrentReplicationLag: Total time elapsed for the replica shard to perform the current segment replication.
LastCompletedReplicationLag: Total time elapsed for the replica shard to complete the last replication.
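As a sketch of the per-replica calculation described in Solution 2, assuming hypothetical field and type names: each replica derives the counts from the latest checkpoint it has received versus the one it last finished applying, and the lag values from its own replication timer.

```java
// Sketch only: Solution 2 with hypothetical names. Each replica derives its own stats
// from the latest received checkpoint, the last applied checkpoint, and its replication timer.
class ReplicaStatsCalculator {
    private volatile Checkpoint latestReceived = new Checkpoint(0, 0); // newest checkpoint seen by this replica
    private volatile Checkpoint latestApplied = new Checkpoint(0, 0);  // checkpoint of the last completed replication
    private volatile long currentReplicationStartMillis = -1;          // -1 when no replication is running
    private volatile long lastCompletedReplicationMillis = 0;

    ReplicaSegRepStats stats(String allocationId) {
        long checkpointsBehind = latestReceived.version() - latestApplied.version();
        long bytesBehind = Math.max(0, latestReceived.lengthInBytes() - latestApplied.lengthInBytes());
        long currentLag = currentReplicationStartMillis < 0
            ? 0
            : System.currentTimeMillis() - currentReplicationStartMillis;
        return new ReplicaSegRepStats(allocationId, checkpointsBehind, bytesBehind,
            currentLag, lastCompletedReplicationMillis);
    }
}

record Checkpoint(long version, long lengthInBytes) {}
record ReplicaSegRepStats(String allocationId, long checkpointsBehind, long bytesBehind,
                          long currentReplicationLagMillis, long lastCompletedReplicationLagMillis) {}
```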
Solution 3: Return checkpoints from replica shards; the coordinator will compute the stats
We can fetch the latest visible checkpoints from each replica and compute the required stats at the coordinator. We need to compare the
Pros:
Cons:
Deep dive on implementation:
Every replica shard can keep track of the
In the coordinator node, we have latestCheckPoint from the primary and latestVisibleCheckPoint from other shards.
Build
Return the overall response from the coordinator.
Proposed metrics definition:
CheckPointBehindCount: Number of checkpoints by which the replica is behind the primary shard.
BytesBehindCount: Number of bytes by which the replica is behind the primary shard.
CurrentReplicationLag: Total time elapsed for the replica shard to perform the current segment replication.
LastCompletedReplicationLag: Total time elapsed for the replica shard to complete the last replication.
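A minimal sketch of Solution 3's coordinator-side computation, with hypothetical types: each shard returns only its checkpoint, and the coordinator diffs every replica checkpoint against the primary's.

```java
// Sketch only: Solution 3 with hypothetical types. Shards return just their checkpoints
// and the coordinator computes how far each replica is behind the primary.
import java.util.List;
import java.util.stream.Collectors;

record Checkpoint(long version, long lengthInBytes) {}
record ShardCheckpoint(String allocationId, boolean primary, Checkpoint checkpoint) {}
record BehindStats(String allocationId, long checkpointsBehind, long bytesBehind) {}

class CoordinatorStatsBuilder {
    List<BehindStats> build(List<ShardCheckpoint> responses) {
        Checkpoint primaryCheckpoint = responses.stream()
            .filter(ShardCheckpoint::primary)
            .findFirst()
            .orElseThrow()
            .checkpoint();
        return responses.stream()
            .filter(r -> !r.primary())
            .map(r -> new BehindStats(
                r.allocationId(),
                primaryCheckpoint.version() - r.checkpoint().version(),
                Math.max(0, primaryCheckpoint.lengthInBytes() - r.checkpoint().lengthInBytes())))
            .collect(Collectors.toList());
    }
}
```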
So in summary: choosing between these options really depends on which metrics we really care about and how we want to interpret their meaning.
Thanks @vinaykpud for laying this out, which one do you think is the right approach? We shouldn't be changing the meaning of returned metrics within a minor release, even if it is a cat API. So I'm leaning towards: let's do 1 with the introduction of the feature, and 2 or 3 for the next major? There are really three things at play here that all depend on these primary-collected stats: the cat segrep API, segrep stats returned via node stats, and the segrep backpressure mechanism. I think that we tried to get too precise with the definition of replication lag with the implementation of segrep and it has left us with a lot of unnecessary complexity. So I would be in favor of dramatically simplifying all three by computing replication lag on the fly rather than precomputing it, OR doing away with primary comparison entirely and only showing stats on ongoing syncs... I think that would mean:
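As a rough illustration of computing replication lag on the fly rather than precomputing it, a shard could keep only the start timestamp of an ongoing sync and derive the lag when stats are requested. The sketch below uses hypothetical names.

```java
// Sketch only: lag is derived at request time from a stored start timestamp instead of
// being precomputed and stored on every checkpoint update. Names are hypothetical.
class OngoingSyncLag {
    private volatile long syncStartMillis = -1; // set when a sync starts, reset to -1 when it completes

    void onSyncStart() { syncStartMillis = System.currentTimeMillis(); }
    void onSyncComplete() { syncStartMillis = -1; }

    // Computed on the fly only when stats are requested.
    long currentLagMillis() {
        long start = syncStartMillis;
        return start < 0 ? 0 : System.currentTimeMillis() - start;
    }
}
```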
Thanks @mch2. I think the selection of approach depends on whether we need to compare with the primary or just return the ongoing replication stats. Based on the discussions it looks like we are inclined towards the latter, i.e. Solution 2. Since this involves a definition change of the existing
With #4577 replicas will sync directly with their source of replication rather than the primary pushing updates to them.
Today stats are collected at the primary level to support APIs and enforce backpressure. With the rw split we will ensure primaries do not have any direct communication with replicas and will not be able to collect these stats.
To fix this we can update our stats APIs to fetch the latest checkpoint from each replica directly and compute the required stats at the coordinator, eliminating the need for primaries to capture these stats.
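A rough sketch of that fan-out, using hypothetical types: the coordinator asks every replica copy for its latest checkpoint concurrently and then computes the stats itself, so the primary never has to track replica state.

```java
// Sketch only: the coordinator fans out a checkpoint request to every replica copy
// and computes stats itself, so the primary never tracks replica state. Types are hypothetical.
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

class CheckpointFanOut {
    interface ShardClient {
        CompletableFuture<Long> fetchLatestCheckpointVersion(); // per-replica checkpoint fetch
    }

    CompletableFuture<List<Long>> fetchAll(List<ShardClient> replicaClients) {
        List<CompletableFuture<Long>> futures = replicaClients.stream()
            .map(ShardClient::fetchLatestCheckpointVersion)
            .collect(Collectors.toList());
        // Once all replicas have responded, the coordinator can diff their checkpoints
        // against the primary's latest checkpoint to compute the stats.
        return CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]))
            .thenApply(v -> futures.stream().map(CompletableFuture::join).collect(Collectors.toList()));
    }
}
```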