
Fix bug where ReplicationListeners would not complete on cancellation. #8478

Merged
merged 4 commits into from
Jul 11, 2023

Conversation

mch2
Member

@mch2 mch2 commented Jul 6, 2023

Description

This change updates Segment Replication to ensure all listeners are cleaned up during cancellation. The bug is a race condition between beforeIndexShardClosed cancelling via RecoveriesCollection#cancelForShard and the target failing. cancelForShard immediately removes the target from the collection and then invokes cancel. When the cancel completes, it relies on a subsequent call to fail to remove the target from the collection and notify its listeners, but the target has already been removed, so the listeners never fire. This PR fixes that by introducing a new method on RecoveriesCollection that only requests cancellation, leaving removal and listener notification to the failure path.
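The race described above can be sketched with a toy model (hypothetical names; not the actual RecoveriesCollection API): the old remove-then-cancel path strands the listener, because the later fail call can no longer find the target, while a request-only cancel leaves removal and notification to the failure path.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of the race: cancelForShard removed the target from the
// collection before cancellation completed, so the later fail(...) call could
// not find it and its listener was never notified. requestCancel leaves the
// target in the collection until fail(...) runs.
public class CancellationSketch {
    interface ReplicationListener { void onFailure(String reason); }

    static class Target {
        final ReplicationListener listener;
        Target(ReplicationListener l) { this.listener = l; }
        void cancel(String reason) { /* async work elided; completion later calls fail(...) */ }
    }

    static class Collection {
        final Map<Long, Target> targets = new ConcurrentHashMap<>();

        // Old behavior: remove first, then cancel -> fail() finds nothing.
        void cancelForShard(long id, String reason) {
            Target t = targets.remove(id);
            if (t != null) t.cancel(reason);
        }

        // New behavior: only request cancellation; removal happens in fail().
        void requestCancel(long id, String reason) {
            Target t = targets.get(id);
            if (t != null) t.cancel(reason);
        }

        // Failure path: removes the target and notifies its listener.
        boolean fail(long id, String reason) {
            Target t = targets.remove(id);
            if (t == null) return false;   // old path ends up here: listener leaks
            t.listener.onFailure(reason);
            return true;
        }
    }

    public static void main(String[] args) {
        AtomicBoolean notified = new AtomicBoolean(false);
        Collection c = new Collection();

        c.targets.put(1L, new Target(r -> notified.set(true)));
        c.cancelForShard(1L, "shard closed");
        boolean oldPathNotified = c.fail(1L, "cancelled");   // false: already removed

        c.targets.put(2L, new Target(r -> notified.set(true)));
        c.requestCancel(2L, "shard closed");
        boolean newPathNotified = c.fail(2L, "cancelled");   // true: listener fires

        if (oldPathNotified || !newPathNotified || !notified.get())
            throw new AssertionError("sketch does not reproduce the race");
        // prints: old path notified=false, new path notified=true
        System.out.println("old path notified=" + oldPathNotified + ", new path notified=" + newPathNotified);
    }
}
```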

This change includes tests using suite scope to catch any open tasks. This caught other locations where listeners could leak:

  1. On a replica during force sync, if the shard was closed while resolving its listeners, it would never call back to the primary. Fixed by refactoring those paths to use a ChannelActionListener, which always replies to the primary.
  2. On the primary during force sync, the synchronous call to forceSync was not wrapped in cancellableThreads, so the primary relied on the replica sending cancellation in order to proceed. Fixed by wrapping the call in cancellableThreads.
  3. During cancellation, when the primary term is greater on an incoming checkpoint, we would remove the target from the collection while it could still be open. Fixed by waiting for the cancelled target to close before proceeding. Also added a method to ReplicationTarget to guarantee that only a single target per shard can be added to the collection.
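Fix 2 above can be illustrated with a minimal stand-in for cancellable execution (a sketch only, not OpenSearch's actual CancellableThreads class): cancel() interrupts the thread running the wrapped blocking call, so shard close no longer has to wait for the replica to respond.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

// Minimal sketch of wrapping a blocking call so that cancellation can
// interrupt it, instead of relying on the remote side to unblock us.
public class ForceSyncSketch {
    static class Cancellable {
        private final AtomicReference<Thread> running = new AtomicReference<>();
        private volatile boolean cancelled;

        void execute(Runnable blockingCall) throws InterruptedException {
            if (cancelled) throw new InterruptedException("already cancelled");
            running.set(Thread.currentThread());
            try {
                blockingCall.run();
                if (cancelled) throw new InterruptedException("cancelled during call");
            } finally {
                running.set(null);
                Thread.interrupted();          // clear any pending interrupt flag
            }
        }

        void cancel() {
            cancelled = true;
            Thread t = running.get();
            if (t != null) t.interrupt();      // unblock the wrapped call
        }
    }

    public static void main(String[] args) throws Exception {
        Cancellable cancellable = new Cancellable();
        CountDownLatch started = new CountDownLatch(1);
        AtomicReference<String> outcome = new AtomicReference<>("not run");

        Thread syncThread = new Thread(() -> {
            try {
                cancellable.execute(() -> {
                    started.countDown();
                    try {
                        Thread.sleep(60_000);  // stands in for the blocking forceSync call
                    } catch (InterruptedException e) {
                        // interrupted by cancel(); fall through
                    }
                });
                outcome.set("completed");
            } catch (InterruptedException e) {
                outcome.set("cancelled");
            }
        });
        syncThread.start();
        started.await();
        cancellable.cancel();                  // cancellation proceeds without the replica
        syncThread.join(5_000);
        if (!"cancelled".equals(outcome.get())) throw new AssertionError(outcome.get());
        // prints: forceSync outcome: cancelled
        System.out.println("forceSync outcome: " + outcome.get());
    }
}
```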

Related Issues

closes #8292

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.


@mch2 mch2 force-pushed the cancel branch 2 times, most recently from 17af199 to b3e72bb Compare July 6, 2023 18:33

@mch2 mch2 force-pushed the cancel branch 2 times, most recently from 745f471 to 72b4f69 Compare July 7, 2023 05:00

@mch2
Member Author

mch2 commented Jul 7, 2023

Waiting on #8463 to rebase changes for store cleanup. These new ITs would occasionally hit cases on shard close where tmp files would get wiped while the multiFileWriter was still writing to them.



@github-actions
Contributor

github-actions bot commented Jul 7, 2023

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreIT.testDropPrimaryDuringReplication
      1 org.opensearch.recovery.ReplicationCollectionTests.testStartMultipleReplicationsForSingleShard

@codecov

codecov bot commented Jul 9, 2023

Codecov Report

Merging #8478 (aab498a) into main (516685d) will increase coverage by 0.00%.
The diff coverage is 72.05%.

@@            Coverage Diff            @@
##               main    #8478   +/-   ##
=========================================
  Coverage     70.85%   70.85%           
+ Complexity    56971    56915   -56     
=========================================
  Files          4758     4758           
  Lines        269361   269372   +11     
  Branches      39408    39407    -1     
=========================================
+ Hits         190849   190862   +13     
- Misses        62438    62466   +28     
+ Partials      16074    16044   -30     
Impacted Files Coverage Δ
...s/replication/SegmentReplicationTargetService.java 70.09% <60.27%> (+2.05%) ⬆️
.../indices/replication/SegmentReplicationTarget.java 83.83% <82.97%> (-0.51%) ⬇️
...ices/replication/common/ReplicationCollection.java 75.20% <92.30%> (+2.05%) ⬆️
...search/indices/recovery/RecoverySourceHandler.java 78.11% <100.00%> (ø)
...h/indices/replication/SegmentReplicationState.java 44.69% <100.00%> (-0.63%) ⬇️
.../indices/replication/common/ReplicationTarget.java 78.94% <100.00%> (-7.72%) ⬇️

... and 470 files with indirect coverage changes


…mplete on target cancellation.

This change updates cancellation with Segment Replication to ensure all listeners are resolved.
It does this by requesting cancellation before shard closure instead of using ReplicationCollection's cancelForShard which immediately removes it from the replicationCollection.  This would cause the underlying ReplicationListener to never get invoked on close.

This change includes new tests using suite scope to catch for any open tasks.
This caught other locations where this was possible:
1. On a replica during force sync if the shard was closed while resolving its listeners, it would never call back to the primary if an exception was caught in the onDone method. - Fixed by refactoring those paths to use a ChannelActionListener and always reply to primary.
2. On the primary during forceSync, the primary would not successfully cancel before shard close during a forceSync, Fixed by wrapping the synchronous recoveryTarget::forceSync call in cancellableThreads.

Signed-off-by: Marc Handalian <handalm@amazon.com>

PR cleanup.

Signed-off-by: Marc Handalian <handalm@amazon.com>

Update log message

Signed-off-by: Marc Handalian <handalm@amazon.com>

@mch2 mch2 marked this pull request as ready for review July 9, 2023 06:42
mch2 and others added 2 commits July 10, 2023 17:00
Signed-off-by: Marc Handalian <handalm@amazon.com>
…tReplicationTargetService.java

Co-authored-by: Suraj Singh <surajrider@gmail.com>
Signed-off-by: Marc Handalian <handalm@amazon.com>
@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.indices.replication.SegmentReplicationIT.testScrollCreatedOnReplica
      1 org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness


@dreamer-89 dreamer-89 added backport 2.x Backport to 2.x branch skip-changelog labels Jul 11, 2023
Member

@dreamer-89 dreamer-89 left a comment


Thanks @mch2 for fixing this and cleaning up the cancellation spaghetti code.

Signed-off-by: Marc Handalian <handalm@amazon.com>

@mch2
Member Author

mch2 commented Jul 11, 2023

Gradle Check (Jenkins) Run Completed with:

Err https://esm.ubuntu.com/ trusty-infra-updates/main amd64 Packages
    HttpError503
  W: Failed to fetch https://esm.ubuntu.com/ubuntu/dists/trusty-infra-security/main/binary-amd64/Packages  HttpError503

  W: Failed to fetch https://esm.ubuntu.com/ubuntu/dists/trusty-infra-updates/main/binary-amd64/Packages  HttpError503

  E: Some index files failed to download. They have been ignored, or old ones used instead.
  Fetched 13.6 MB in 2min 8s (105 kB/s)

2: Task failed with an exception.
-----------
* What went wrong:
Execution failed for task ':client:rest-high-level:test'.

* Try:
> Run with --stacktrace option to get the stack trace.
> Run with --info or --debug option to get more log output.
> Get more help at https://help.gradle.org/.
==============================================================================

Unrelated

@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.action.admin.cluster.node.tasks.ResourceAwareTasksTests.testBasicTaskResourceTracking

@mch2 mch2 merged commit 4ccbf9d into opensearch-project:main Jul 11, 2023
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jul 11, 2023
#8478)

* [Segment Replication] Fix bug where ReplicationListeners would not complete on target cancellation.

This change updates cancellation with Segment Replication to ensure all listeners are resolved.
It does this by requesting cancellation before shard closure instead of using ReplicationCollection's cancelForShard which immediately removes it from the replicationCollection.  This would cause the underlying ReplicationListener to never get invoked on close.

This change includes new tests using suite scope to catch for any open tasks.
This caught other locations where this was possible:
1. On a replica during force sync if the shard was closed while resolving its listeners, it would never call back to the primary if an exception was caught in the onDone method. - Fixed by refactoring those paths to use a ChannelActionListener and always reply to primary.
2. On the primary during forceSync, the primary would not successfully cancel before shard close during a forceSync, Fixed by wrapping the synchronous recoveryTarget::forceSync call in cancellableThreads.

Signed-off-by: Marc Handalian <handalm@amazon.com>

PR cleanup.

Signed-off-by: Marc Handalian <handalm@amazon.com>

Update log message

Signed-off-by: Marc Handalian <handalm@amazon.com>

* PR feedback.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Update server/src/main/java/org/opensearch/indices/replication/SegmentReplicationTargetService.java

Co-authored-by: Suraj Singh <surajrider@gmail.com>
Signed-off-by: Marc Handalian <handalm@amazon.com>

* Add more tests.

Signed-off-by: Marc Handalian <handalm@amazon.com>

---------

Signed-off-by: Marc Handalian <handalm@amazon.com>
Co-authored-by: Suraj Singh <surajrider@gmail.com>
(cherry picked from commit 4ccbf9d)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
mch2 pushed a commit that referenced this pull request Jul 11, 2023
#8478) (#8630)

vikasvb90 pushed a commit to raghuvanshraj/OpenSearch that referenced this pull request Jul 12, 2023
opensearch-project#8478)

raghuvanshraj pushed a commit to raghuvanshraj/OpenSearch that referenced this pull request Jul 12, 2023
opensearch-project#8478)

dzane17 pushed a commit to dzane17/OpenSearch that referenced this pull request Jul 12, 2023
opensearch-project#8478)

buddharajusahil pushed a commit to buddharajusahil/OpenSearch that referenced this pull request Jul 18, 2023
opensearch-project#8478)

baba-devv pushed a commit to baba-devv/OpenSearch that referenced this pull request Jul 29, 2023
opensearch-project#8478)

shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
opensearch-project#8478)

Labels
backport 2.x Backport to 2.x branch skip-changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Segment Replication - Shard close while copying files can leave listeners open
3 participants