Implement Segment replication Backpressure #6563

mch2 · 2023-03-07T06:39:05Z

Description

This PR is a re-cut of #6520. It implements backpressure and removes the addition of the replication metrics to the NodeStats API, so that it is easier to see in this PR how the tracking will be used to apply pressure. The metrics additions will be made in a separate change.

This PR adds backpressure for index operations when Segment Replication is enabled, to prevent lagging replicas from falling too far behind. Writes will be rejected under the following conditions:

1. More than half (the default) of the replication group is 'stale'. The threshold is defined by the setting MAX_ALLOWED_STALE_SHARDS.
2. A replica is stale if it is behind by more than MAX_INDEXING_CHECKPOINTS (default 4) AND its current replication lag is over MAX_REPLICATION_TIME_SETTING (default 5 minutes).
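
The two conditions above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the code in this PR: the class `BackpressureSketch`, the `ReplicaStats` holder, and both method names are hypothetical, the constants simply mirror the defaults quoted in the description, and whether the primary counts toward the group size is an assumption (here only the listed shards are counted).

```java
import java.util.Arrays;
import java.util.List;

public class BackpressureSketch {
    // Defaults quoted in the PR description; the constant names mirror the settings,
    // but these fields are illustrative and not the real Setting objects.
    static final int MAX_INDEXING_CHECKPOINTS = 4;
    static final long MAX_REPLICATION_TIME_MILLIS = 5 * 60 * 1000L;
    static final double MAX_ALLOWED_STALE_SHARDS = 0.5;

    /** Hypothetical per-replica snapshot: checkpoints behind the primary and current lag. */
    static final class ReplicaStats {
        final int checkpointsBehind;
        final long lagMillis;

        ReplicaStats(int checkpointsBehind, long lagMillis) {
            this.checkpointsBehind = checkpointsBehind;
            this.lagMillis = lagMillis;
        }
    }

    /** A replica is stale only when BOTH limits are exceeded. */
    static boolean isStale(ReplicaStats replica) {
        return replica.checkpointsBehind > MAX_INDEXING_CHECKPOINTS
            && replica.lagMillis > MAX_REPLICATION_TIME_MILLIS;
    }

    /** Reject writes when the stale fraction of the group exceeds the allowed limit. */
    static boolean shouldRejectWrites(List<ReplicaStats> group) {
        long stale = group.stream().filter(BackpressureSketch::isStale).count();
        return stale > group.size() * MAX_ALLOWED_STALE_SHARDS;
    }

    public static void main(String[] args) {
        List<ReplicaStats> group = Arrays.asList(
            new ReplicaStats(6, 400_000L), // behind on both counts: stale
            new ReplicaStats(5, 310_000L), // behind on both counts: stale
            new ReplicaStats(1, 2_000L)    // healthy
        );
        // Two of three shards are stale, which is more than half.
        System.out.println(shouldRejectWrites(group)); // prints "true"
    }
}
```

Note that a replica that is many checkpoints behind but has only been lagging briefly is not stale in this sketch, because both limits must be exceeded.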

This PR intentionally implements rejections only for index operations,
allowing other TransportWriteActions, such as TransportResyncReplicationAction and RetentionLeaseSyncAction, to succeed.
Blocking these requests would fail recoveries as new nodes are added.
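
To illustrate the scoping decision, here is a small sketch of a gate that rejects only index operations while the group is under pressure. The `WriteGateSketch` class, the `WriteKind` enum, and the `admit` method are hypothetical stand-ins and do not correspond to OpenSearch types; the real check is applied inside the transport action for bulk/index requests.

```java
public class WriteGateSketch {
    /** Hypothetical classification of write actions; not OpenSearch's real types. */
    enum WriteKind { INDEX, RESYNC, RETENTION_LEASE_SYNC }

    /**
     * Only index operations are subject to rejection. Resync and retention lease
     * sync requests are always admitted so that recoveries can still complete.
     */
    static boolean admit(WriteKind kind, boolean groupUnderPressure) {
        return !(kind == WriteKind.INDEX && groupUnderPressure);
    }

    public static void main(String[] args) {
        for (WriteKind kind : WriteKind.values()) {
            System.out.println(kind + " admitted under pressure: " + admit(kind, true));
        }
    }
}
```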

Issues Resolved

#4478

Check List

New functionality includes testing.
- All tests pass
New functionality has been documented.
- New functionality has javadoc added
Commits are signed per the DCO using --signoff
Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

github-actions · 2023-03-12T22:20:58Z

Gradle Check (Jenkins) Run Completed with:

RESULT: SUCCESS ✅
URL: https://build.ci.opensearch.org/job/gradle-check/12322/
CommitID: a33d287

codecov-commenter · 2023-03-12T22:22:25Z

Codecov Report

Merging #6563 (33198a1) into main (73a2279) will decrease coverage by 0.55%.
The diff coverage is 61.50%.

@@             Coverage Diff              @@
##               main    #6563      +/-   ##
============================================
- Coverage     71.22%   70.68%   -0.55%     
+ Complexity    59521    59129     -392     
============================================
  Files          4803     4808       +5     
  Lines        283208   283449     +241     
  Branches      40842    40868      +26     
============================================
- Hits         201712   200348    -1364     
- Misses        65266    66608    +1342     
- Partials      16230    16493     +263

Impacted Files	Coverage Δ
...rg/opensearch/common/settings/ClusterSettings.java	`92.30% <ø> (ø)`
...s/replication/SegmentReplicationTargetService.java	`48.40% <0.00%> (-0.63%)`	⬇️
...eplication/checkpoint/PublishCheckpointAction.java	`23.80% <0.00%> (+0.37%)`	⬆️
.../org/opensearch/index/SegmentReplicationStats.java	`15.38% <15.38%> (ø)`
...nsearch/index/SegmentReplicationPerGroupStats.java	`28.57% <28.57%> (ø)`
...opensearch/index/SegmentReplicationShardStats.java	`32.35% <32.35%> (ø)`
...ensearch/action/bulk/TransportShardBulkAction.java	`76.99% <50.00%> (+0.87%)`	⬆️
...org/opensearch/index/seqno/ReplicationTracker.java	`67.75% <70.68%> (-0.70%)`	⬇️
.../replication/checkpoint/ReplicationCheckpoint.java	`63.04% <75.00%> (+6.63%)`	⬆️
...in/java/org/opensearch/index/shard/IndexShard.java	`69.87% <81.25%> (-0.69%)`	⬇️
... and 7 more

... and 460 files with indirect coverage changes

server/src/internalClusterTest/java/org/opensearch/index/SegmentReplicationPressureIT.java

dreamer-89 · 2023-03-14T05:45:08Z

server/src/main/java/org/opensearch/index/SegmentReplicationPressureService.java

+        4,
+        1,
+        Setting.Property.Dynamic,
+        Setting.Property.NodeScope


Should this (and the other settings) be index-scoped (IndexScope)?

That was my initial thinking, but I felt it would get a bit difficult to manage. I think we can start with node scope and extend to index scope if the need arises.

Maybe we can rename the setting constants to reflect the cluster or node scope, e.g.
index.segrep.pressure.checkpoint.limit -> node.segrep.... ?

Ack, good catch. I've just removed the index. prefix, e.g. segrep.pressure.checkpoint.limit
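
For readers unfamiliar with what the diff fragment above configures, the following self-contained sketch models a dynamically updatable integer setting with a default of 4 and a minimum of 1, keyed by the renamed `segrep.pressure.checkpoint.limit`. It deliberately does not use the OpenSearch `Setting` API: `DynamicIntSettingSketch` and its methods are hypothetical, and the mapping of `4` and `1` to default and minimum is an assumption based on the argument order shown in the diff.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class DynamicIntSettingSketch {
    private final String key;
    private final int minValue;
    private final AtomicInteger value;

    DynamicIntSettingSketch(String key, int defaultValue, int minValue) {
        this.key = key;
        this.minValue = minValue;
        this.value = new AtomicInteger(validate(defaultValue));
    }

    /** Rejects values below the configured minimum. */
    private int validate(int candidate) {
        if (candidate < minValue) {
            throw new IllegalArgumentException(
                "Failed to set [" + key + "]: " + candidate + " is below the minimum " + minValue);
        }
        return candidate;
    }

    /** "Dynamic" here means the value can be changed at runtime without a restart. */
    void update(int newValue) {
        value.set(validate(newValue));
    }

    int get() {
        return value.get();
    }

    public static void main(String[] args) {
        DynamicIntSettingSketch limit =
            new DynamicIntSettingSketch("segrep.pressure.checkpoint.limit", 4, 1);
        System.out.println(limit.get()); // prints "4"
        limit.update(8);
        System.out.println(limit.get()); // prints "8"
    }
}
```

A node-scoped setting like this applies one value to every index on the node, which is the simpler starting point the discussion settles on; an index-scoped variant would need one such holder per index.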

This PR introduces new mechanisms to keep track of the current replicas within a replication group and apply backpressure if they fall too far behind. Writes will be rejected under the following conditions:

1. More than half (default setting) of the replication group is 'stale'. Defined by setting MAX_ALLOWED_STALE_SHARDS.
2. A replica is stale if it is behind more than MAX_INDEXING_CHECKPOINTS, default 4 AND its current replication lag is over MAX_REPLICATION_TIME_SETTING, default 5 minutes.

This PR intentionally implements rejections only for index operations, allowing other TransportWriteActions to succeed, TransportResyncReplicationAction and RetentionLeaseSyncAction. Blocking these requests will fail recoveries as new nodes are added.

Signed-off-by: Marc Handalian <handalm@amazon.com>

Signed-off-by: Marc Handalian <handalm@amazon.com>

mch2 · 2023-03-14T20:21:47Z

force pushed a rebase from main.

github-actions · 2023-03-14T20:40:52Z

Gradle Check (Jenkins) Run Completed with:

RESULT: FAILURE ❌
URL: https://build.ci.opensearch.org/job/gradle-check/12431/
CommitID: f6c861a
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green.
Is the failure a flaky test unrelated to your change?

dreamer-89 · 2023-03-14T20:42:55Z

Gradle Check (Jenkins) Run Completed with:

RESULT: FAILURE ❌

URL: https://build.ci.opensearch.org/job/gradle-check/12431/

CommitID: f6c861a
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green.
Is the failure a flaky test unrelated to your change?

Spotless check failure.

Signed-off-by: Marc Handalian <handalm@amazon.com>

github-actions · 2023-03-14T22:24:34Z

Gradle Check (Jenkins) Run Completed with:

RESULT: TIMEOUT ❌
URL: https://build.ci.opensearch.org/job/gradle-check/12432/
CommitID: 76c7b48

github-actions · 2023-03-14T22:47:13Z

Gradle Check (Jenkins) Run Completed with:

RESULT: TIMEOUT ❌
URL: https://build.ci.opensearch.org/job/gradle-check/12434/
CommitID: c9e630a

github-actions · 2023-03-14T23:28:03Z

Gradle Check (Jenkins) Run Completed with:

RESULT: TIMEOUT ❌
URL: https://build.ci.opensearch.org/job/gradle-check/12436/
CommitID: 33198a1

github-actions · 2023-03-15T00:06:50Z

Gradle Check (Jenkins) Run Completed with:

RESULT: UNSTABLE ❕
TEST FAILURES:

      1 org.opensearch.index.SegmentReplicationPressureIT.testWritesRejected
      1 org.opensearch.index.SegmentReplicationPressureIT.testAddReplicaWhileWritesBlocked

URL: https://build.ci.opensearch.org/job/gradle-check/12439/
CommitID: 33198a1
Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

opensearch-trigger-bot · 2023-03-15T00:15:20Z

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-6563-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 6babc087f7bd5774b97e67f9e386187fe0db3ecb
# Push it to GitHub
git push --set-upstream origin backport/backport-6563-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-6563-to-2.x.

* Add Segment Replication backpressure.

This PR introduces new mechanisms to keep track of the current replicas within a replication group and apply backpressure if they fall too far behind. Writes will be rejected under the following conditions:

1. More than half (default setting) of the replication group is 'stale'. Defined by setting MAX_ALLOWED_STALE_SHARDS.
2. A replica is stale if it is behind more than MAX_INDEXING_CHECKPOINTS, default 4 AND its current replication lag is over MAX_REPLICATION_TIME_SETTING, default 5 minutes.

This PR intentionally implements rejections only for index operations, allowing other TransportWriteActions to succeed, TransportResyncReplicationAction and RetentionLeaseSyncAction. Blocking these requests will fail recoveries as new nodes are added.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Add changelog

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Fix test class to match naming conventions.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* PR feedback.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Change setting keys to remove index scope.

Signed-off-by: Marc Handalian <handalm@amazon.com>

---------

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Implement Segment replication Backpressure (#6563)

* Add Segment Replication backpressure.

This PR introduces new mechanisms to keep track of the current replicas within a replication group and apply backpressure if they fall too far behind. Writes will be rejected under the following conditions:

1. More than half (default setting) of the replication group is 'stale'. Defined by setting MAX_ALLOWED_STALE_SHARDS.
2. A replica is stale if it is behind more than MAX_INDEXING_CHECKPOINTS, default 4 AND its current replication lag is over MAX_REPLICATION_TIME_SETTING, default 5 minutes.

This PR intentionally implements rejections only for index operations, allowing other TransportWriteActions to succeed, TransportResyncReplicationAction and RetentionLeaseSyncAction. Blocking these requests will fail recoveries as new nodes are added.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Add changelog

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Fix test class to match naming conventions.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* PR feedback.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Change setting keys to remove index scope.

Signed-off-by: Marc Handalian <handalm@amazon.com>

---------

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Fix Xcontent imports.

Signed-off-by: Marc Handalian <handalm@amazon.com>

---------

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Add Segment Replication backpressure.

This PR introduces new mechanisms to keep track of the current replicas within a replication group and apply backpressure if they fall too far behind. Writes will be rejected under the following conditions:

1. More than half (default setting) of the replication group is 'stale'. Defined by setting MAX_ALLOWED_STALE_SHARDS.
2. A replica is stale if it is behind more than MAX_INDEXING_CHECKPOINTS, default 4 AND its current replication lag is over MAX_REPLICATION_TIME_SETTING, default 5 minutes.

This PR intentionally implements rejections only for index operations, allowing other TransportWriteActions to succeed, TransportResyncReplicationAction and RetentionLeaseSyncAction. Blocking these requests will fail recoveries as new nodes are added.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Add changelog

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Fix test class to match naming conventions.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* PR feedback.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Change setting keys to remove index scope.

Signed-off-by: Marc Handalian <handalm@amazon.com>

---------

Signed-off-by: Marc Handalian <handalm@amazon.com>
Signed-off-by: Mingshi Liu <mingshl@amazon.com>