Fix bug where ReplicationListeners would not complete on cancellation. #8478
Waiting on #8463 to rebase changes for store cleanup. These new ITs would occasionally hit cases on shard close where tmp files would get wiped while multiFileWriter was still writing to them.
Codecov Report
@@ Coverage Diff @@
##                main     #8478     +/-
=========================================
  Coverage     70.85%    70.85%
+ Complexity    56971     56915     -56
=========================================
  Files          4758      4758
  Lines        269361    269372     +11
  Branches      39408     39407      -1
=========================================
+ Hits         190849    190862     +13
- Misses        62438     62466     +28
+ Partials      16074     16044     -30
…mplete on target cancellation.

This change updates cancellation with Segment Replication to ensure all listeners are resolved. It does this by requesting cancellation before shard closure instead of using ReplicationCollection's cancelForShard which immediately removes it from the replicationCollection. This would cause the underlying ReplicationListener to never get invoked on close.

This change includes new tests using suite scope to catch for any open tasks. This caught other locations where this was possible:
1. On a replica during force sync if the shard was closed while resolving its listeners, it would never call back to the primary if an exception was caught in the onDone method. - Fixed by refactoring those paths to use a ChannelActionListener and always reply to primary.
2. On the primary during forceSync, the primary would not successfully cancel before shard close during a forceSync. Fixed by wrapping the synchronous recoveryTarget::forceSync call in cancellableThreads.

Signed-off-by: Marc Handalian <handalm@amazon.com>

PR cleanup.
Signed-off-by: Marc Handalian <handalm@amazon.com>

Update log message
Signed-off-by: Marc Handalian <handalm@amazon.com>
Resolved review threads on:
server/src/main/java/org/opensearch/indices/replication/common/ReplicationCollection.java
server/src/main/java/org/opensearch/indices/replication/SegmentReplicationState.java
server/src/main/java/org/opensearch/indices/replication/SegmentReplicationTargetService.java
server/src/main/java/org/opensearch/indices/replication/SegmentReplicationTarget.java
...c/internalClusterTest/java/org/opensearch/indices/replication/SegmentReplicationSuiteIT.java
server/src/main/java/org/opensearch/indices/recovery/RecoverySourceHandler.java
Signed-off-by: Marc Handalian <handalm@amazon.com>
…tReplicationTargetService.java Co-authored-by: Suraj Singh <surajrider@gmail.com> Signed-off-by: Marc Handalian <handalm@amazon.com>
Thanks @mch2 for fixing this and cleaning up the cancellation spaghetti code.
Signed-off-by: Marc Handalian <handalm@amazon.com>
Unrelated
#8478)

* [Segment Replication] Fix bug where ReplicationListeners would not complete on target cancellation.

  This change updates cancellation with Segment Replication to ensure all listeners are resolved. It does this by requesting cancellation before shard closure instead of using ReplicationCollection's cancelForShard which immediately removes it from the replicationCollection. This would cause the underlying ReplicationListener to never get invoked on close.

  This change includes new tests using suite scope to catch for any open tasks. This caught other locations where this was possible:
  1. On a replica during force sync if the shard was closed while resolving its listeners, it would never call back to the primary if an exception was caught in the onDone method. - Fixed by refactoring those paths to use a ChannelActionListener and always reply to primary.
  2. On the primary during forceSync, the primary would not successfully cancel before shard close during a forceSync. Fixed by wrapping the synchronous recoveryTarget::forceSync call in cancellableThreads.

  Signed-off-by: Marc Handalian <handalm@amazon.com>

  PR cleanup.
  Signed-off-by: Marc Handalian <handalm@amazon.com>

  Update log message
  Signed-off-by: Marc Handalian <handalm@amazon.com>

* PR feedback.
  Signed-off-by: Marc Handalian <handalm@amazon.com>

* Update server/src/main/java/org/opensearch/indices/replication/SegmentReplicationTargetService.java
  Co-authored-by: Suraj Singh <surajrider@gmail.com>
  Signed-off-by: Marc Handalian <handalm@amazon.com>

* Add more tests.
  Signed-off-by: Marc Handalian <handalm@amazon.com>

---------

Signed-off-by: Marc Handalian <handalm@amazon.com>
Co-authored-by: Suraj Singh <surajrider@gmail.com>
(cherry picked from commit 4ccbf9d)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
#8478) (#8630)

* [Segment Replication] Fix bug where ReplicationListeners would not complete on target cancellation.

  This change updates cancellation with Segment Replication to ensure all listeners are resolved. It does this by requesting cancellation before shard closure instead of using ReplicationCollection's cancelForShard which immediately removes it from the replicationCollection. This would cause the underlying ReplicationListener to never get invoked on close.

  This change includes new tests using suite scope to catch for any open tasks. This caught other locations where this was possible:
  1. On a replica during force sync if the shard was closed while resolving its listeners, it would never call back to the primary if an exception was caught in the onDone method. - Fixed by refactoring those paths to use a ChannelActionListener and always reply to primary.
  2. On the primary during forceSync, the primary would not successfully cancel before shard close during a forceSync. Fixed by wrapping the synchronous recoveryTarget::forceSync call in cancellableThreads.

  PR cleanup.

  Update log message

* PR feedback.

* Update server/src/main/java/org/opensearch/indices/replication/SegmentReplicationTargetService.java

* Add more tests.

---------

(cherry picked from commit 4ccbf9d)
Signed-off-by: Marc Handalian <handalm@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Suraj Singh <surajrider@gmail.com>
Description
This change updates Segment Replication to ensure all listeners are cleaned up during cancellation. Previously, listeners could be left unresolved because of a race between beforeIndexShardClosed, which cancels with ReplicationCollection#cancelForShard, and the target failing. cancelForShard immediately removes the target from the collection and then invokes cancel. When the cancellation completes, the target relies on another call to fail to remove it from the collection and notify the listeners, but by then it has already been removed, so the listeners are never notified. This PR fixes this by introducing a new method on ReplicationCollection that only requests cancellation.
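To make the race easier to follow, here is a minimal sketch of the two cancellation paths. It does not use the OpenSearch classes: `SketchTarget`, `SketchCollection`, `CancellationRaceSketch`, and the method name `requestCancel` are hypothetical stand-ins, and the real code paths are asynchronous and more involved.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a replication target; not the OpenSearch class.
final class SketchTarget {
    private boolean cancelRequested;
    private boolean listenerNotified;

    synchronized void cancel() { cancelRequested = true; }
    synchronized boolean isCancelled() { return cancelRequested; }
    synchronized void notifyListener() { listenerNotified = true; }
    synchronized boolean isListenerNotified() { return listenerNotified; }
}

// Hypothetical stand-in for the collection of ongoing replications.
final class SketchCollection {
    private final Map<Long, SketchTarget> ongoing = new ConcurrentHashMap<>();

    void start(long id, SketchTarget target) { ongoing.put(id, target); }

    // Old path: the entry is removed before cancel is invoked.
    void cancelForShard(long id) {
        SketchTarget target = ongoing.remove(id);
        if (target != null) target.cancel();
    }

    // New path (assumed name): only request cancellation, keep the entry registered.
    void requestCancel(long id) {
        SketchTarget target = ongoing.get(id);
        if (target != null) target.cancel();
    }

    // Invoked once the target observes cancellation. Listeners are only
    // notified if the target is still registered in the collection.
    boolean fail(long id) {
        SketchTarget target = ongoing.remove(id);
        if (target == null) return false;
        target.notifyListener();
        return true;
    }
}

final class CancellationRaceSketch {
    // Returns true if the listener was notified after the shard closed.
    static boolean listenerCompletes(boolean requestCancelOnly) {
        SketchCollection collection = new SketchCollection();
        SketchTarget target = new SketchTarget();
        collection.start(1L, target);

        if (requestCancelOnly) {
            collection.requestCancel(1L);
        } else {
            collection.cancelForShard(1L);
        }

        // The in-flight replication notices the cancellation and fails itself.
        if (target.isCancelled()) collection.fail(1L);
        return target.isListenerNotified();
    }

    public static void main(String[] args) {
        System.out.println("cancelForShard notifies listener: " + listenerCompletes(false));
        System.out.println("requestCancel notifies listener: " + listenerCompletes(true));
    }
}
```

In this sketch, `listenerCompletes(false)` returns `false` because the target is already gone when `fail` runs, while `listenerCompletes(true)` returns `true` because the entry is still registered and `fail` can notify the listener.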
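This change includes tests using suite scope to catch any open tasks. These tests caught other locations where listeners could be left unresolved:
1. On a replica during force sync, if the shard was closed while resolving its listeners, it would never call back to the primary if an exception was caught in the onDone method. Fixed by refactoring those paths to use a ChannelActionListener and always reply to the primary.
2. On the primary during forceSync, the primary would not successfully cancel before shard close. Fixed by wrapping the synchronous recoveryTarget::forceSync call in cancellableThreads.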
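One of the additional cases found by these tests is a replica that never answers the primary when an exception is thrown while its listeners are being resolved during force sync. The sketch below illustrates the "always reply" idea in plain Java. `PrimaryChannel`, `RecordingChannel`, and `ForceSyncReplySketch` are hypothetical names for illustration only; the actual fix uses OpenSearch's ChannelActionListener, which is not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reply channel back to the primary; not an OpenSearch type.
interface PrimaryChannel {
    void sendResponse(String response);
    void sendFailure(Exception cause);
}

// Records every reply so the behaviour can be inspected.
final class RecordingChannel implements PrimaryChannel {
    final List<String> replies = new ArrayList<>();

    @Override
    public void sendResponse(String response) { replies.add("response:" + response); }

    @Override
    public void sendFailure(Exception cause) { replies.add("failure:" + cause.getMessage()); }
}

final class ForceSyncReplySketch {
    // Runs the post-replication step and replies exactly once,
    // whether the step succeeds or throws.
    static void onReplicationDone(PrimaryChannel channel, Runnable postReplicationStep) {
        try {
            postReplicationStep.run();
        } catch (RuntimeException e) {
            // Without this branch the primary would wait forever for a reply.
            channel.sendFailure(e);
            return;
        }
        channel.sendResponse("done");
    }

    public static void main(String[] args) {
        RecordingChannel ok = new RecordingChannel();
        onReplicationDone(ok, () -> {});

        RecordingChannel closed = new RecordingChannel();
        onReplicationDone(closed, () -> { throw new IllegalStateException("shard closed"); });

        System.out.println(ok.replies + " " + closed.replies);
    }
}
```

The point of the design is that every exit path from the completion handler produces exactly one reply, so the primary's pending request is always released.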
Related Issues
closes #8292
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.