
Fix bug where ReplicationListeners would not complete on cancellation. #8478

Merged
merged 4 commits into from
Jul 11, 2023

Conversation

mch2
Member

@mch2 mch2 commented Jul 6, 2023

Description

This change updates Segment Replication to ensure all listeners are cleaned up during cancellation. The bug is a race condition between beforeIndexShardClosed cancelling via RecoveriesCollection#cancelForShard and the target failing. cancelForShard immediately removes the target from the collection and then invokes cancel. When the cancel completes, it relies on a subsequent call to fail to remove the target from the collection and notify its listeners, but the target has already been removed, so the listeners never fire. This PR fixes that by introducing a new method on RecoveriesCollection that only requests cancellation, leaving removal and listener notification to the failure path.
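The race described above can be sketched with a toy model (hypothetical names; not the actual RecoveriesCollection API): the old remove-then-cancel path strands the listener, because the later fail call can no longer find the target, while a request-only cancel leaves removal and notification to the failure path.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of the race: cancelForShard removed the target from the
// collection before cancellation completed, so the later fail(...) call could
// not find it and its listener was never notified. requestCancel leaves the
// target in the collection until fail(...) runs.
public class CancellationSketch {
    interface ReplicationListener { void onFailure(String reason); }

    static class Target {
        final ReplicationListener listener;
        Target(ReplicationListener l) { this.listener = l; }
        void cancel(String reason) { /* async work elided; completion later calls fail(...) */ }
    }

    static class Collection {
        final Map<Long, Target> targets = new ConcurrentHashMap<>();

        // Old behavior: remove first, then cancel -> fail() finds nothing.
        void cancelForShard(long id, String reason) {
            Target t = targets.remove(id);
            if (t != null) t.cancel(reason);
        }

        // New behavior: only request cancellation; removal happens in fail().
        void requestCancel(long id, String reason) {
            Target t = targets.get(id);
            if (t != null) t.cancel(reason);
        }

        // Failure path: removes the target and notifies its listener.
        boolean fail(long id, String reason) {
            Target t = targets.remove(id);
            if (t == null) return false;   // old path ends up here: listener leaks
            t.listener.onFailure(reason);
            return true;
        }
    }

    public static void main(String[] args) {
        AtomicBoolean notified = new AtomicBoolean(false);
        Collection c = new Collection();

        c.targets.put(1L, new Target(r -> notified.set(true)));
        c.cancelForShard(1L, "shard closed");
        boolean oldPathNotified = c.fail(1L, "cancelled");   // false: already removed

        c.targets.put(2L, new Target(r -> notified.set(true)));
        c.requestCancel(2L, "shard closed");
        boolean newPathNotified = c.fail(2L, "cancelled");   // true: listener fires

        if (oldPathNotified || !newPathNotified || !notified.get())
            throw new AssertionError("sketch does not reproduce the race");
        // prints: old path notified=false, new path notified=true
        System.out.println("old path notified=" + oldPathNotified + ", new path notified=" + newPathNotified);
    }
}
```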

This change includes tests using suite scope to catch any open tasks. This caught other locations where listeners could leak:

  1. On a replica during force sync, if the shard was closed while resolving its listeners, it would never call back to the primary. Fixed by refactoring those paths to use a ChannelActionListener, which always replies to the primary.
  2. On the primary during force sync, the synchronous call to forceSync was not wrapped in cancellableThreads, so the primary relied on the replica sending cancellation in order to proceed. Fixed by wrapping the call in cancellableThreads.
  3. During cancellation, when the primary term is greater on an incoming checkpoint, we would remove the target from the collection while it could still be open. Fixed by waiting for the cancelled target to close before proceeding. Also added a method to ReplicationTarget to guarantee that only a single target per shard can be added to the collection.
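Fix 2 above can be illustrated with a minimal stand-in for cancellable execution (a sketch only, not OpenSearch's actual CancellableThreads class): cancel() interrupts the thread running the wrapped blocking call, so shard close no longer has to wait for the replica to respond.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

// Minimal sketch of wrapping a blocking call so that cancellation can
// interrupt it, instead of relying on the remote side to unblock us.
public class ForceSyncSketch {
    static class Cancellable {
        private final AtomicReference<Thread> running = new AtomicReference<>();
        private volatile boolean cancelled;

        void execute(Runnable blockingCall) throws InterruptedException {
            if (cancelled) throw new InterruptedException("already cancelled");
            running.set(Thread.currentThread());
            try {
                blockingCall.run();
                if (cancelled) throw new InterruptedException("cancelled during call");
            } finally {
                running.set(null);
                Thread.interrupted();          // clear any pending interrupt flag
            }
        }

        void cancel() {
            cancelled = true;
            Thread t = running.get();
            if (t != null) t.interrupt();      // unblock the wrapped call
        }
    }

    public static void main(String[] args) throws Exception {
        Cancellable cancellable = new Cancellable();
        CountDownLatch started = new CountDownLatch(1);
        AtomicReference<String> outcome = new AtomicReference<>("not run");

        Thread syncThread = new Thread(() -> {
            try {
                cancellable.execute(() -> {
                    started.countDown();
                    try {
                        Thread.sleep(60_000);  // stands in for the blocking forceSync call
                    } catch (InterruptedException e) {
                        // interrupted by cancel(); fall through
                    }
                });
                outcome.set("completed");
            } catch (InterruptedException e) {
                outcome.set("cancelled");
            }
        });
        syncThread.start();
        started.await();
        cancellable.cancel();                  // cancellation proceeds without the replica
        syncThread.join(5_000);
        if (!"cancelled".equals(outcome.get())) throw new AssertionError(outcome.get());
        // prints: forceSync outcome: cancelled
        System.out.println("forceSync outcome: " + outcome.get());
    }
}
```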

Related Issues

closes #8292

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.


@mch2 mch2 force-pushed the cancel branch 2 times, most recently from 17af199 to b3e72bb Compare July 6, 2023 18:33

@mch2 mch2 force-pushed the cancel branch 2 times, most recently from 745f471 to 72b4f69 Compare July 7, 2023 05:00

@mch2
Member Author

mch2 commented Jul 7, 2023

Waiting on #8463 to rebase changes for store cleanup. These new ITs would occasionally hit cases on shard close where tmp files would get wiped while the multiFileWriter was still writing to them.



@github-actions
Contributor

github-actions bot commented Jul 7, 2023

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreIT.testDropPrimaryDuringReplication
      1 org.opensearch.recovery.ReplicationCollectionTests.testStartMultipleReplicationsForSingleShard

@codecov

codecov bot commented Jul 9, 2023

Codecov Report

Merging #8478 (aab498a) into main (516685d) will increase coverage by 0.00%.
The diff coverage is 72.05%.

@@            Coverage Diff            @@
##               main    #8478   +/-   ##
=========================================
  Coverage     70.85%   70.85%           
+ Complexity    56971    56915   -56     
=========================================
  Files          4758     4758           
  Lines        269361   269372   +11     
  Branches      39408    39407    -1     
=========================================
+ Hits         190849   190862   +13     
- Misses        62438    62466   +28     
+ Partials      16074    16044   -30     
Impacted Files Coverage Δ
...s/replication/SegmentReplicationTargetService.java 70.09% <60.27%> (+2.05%) ⬆️
.../indices/replication/SegmentReplicationTarget.java 83.83% <82.97%> (-0.51%) ⬇️
...ices/replication/common/ReplicationCollection.java 75.20% <92.30%> (+2.05%) ⬆️
...search/indices/recovery/RecoverySourceHandler.java 78.11% <100.00%> (ø)
...h/indices/replication/SegmentReplicationState.java 44.69% <100.00%> (-0.63%) ⬇️
.../indices/replication/common/ReplicationTarget.java 78.94% <100.00%> (-7.72%) ⬇️

... and 470 files with indirect coverage changes


…mplete on target cancellation.

This change updates cancellation with Segment Replication to ensure all listeners are resolved.
It does this by requesting cancellation before shard closure instead of using ReplicationCollection's cancelForShard which immediately removes it from the replicationCollection.  This would cause the underlying ReplicationListener to never get invoked on close.

This change includes new tests using suite scope to catch for any open tasks.
This caught other locations where this was possible:
1. On a replica during force sync if the shard was closed while resolving its listeners, it would never call back to the primary if an exception was caught in the onDone method. - Fixed by refactoring those paths to use a ChannelActionListener and always reply to primary.
2. On the primary during forceSync, the primary would not successfully cancel before shard close during a forceSync, Fixed by wrapping the synchronous recoveryTarget::forceSync call in cancellableThreads.

Signed-off-by: Marc Handalian <handalm@amazon.com>

PR cleanup.

Signed-off-by: Marc Handalian <handalm@amazon.com>

Update log message

Signed-off-by: Marc Handalian <handalm@amazon.com>

@mch2 mch2 marked this pull request as ready for review July 9, 2023 06:42
mch2 and others added 2 commits July 10, 2023 17:00
Signed-off-by: Marc Handalian <handalm@amazon.com>
…tReplicationTargetService.java

Co-authored-by: Suraj Singh <surajrider@gmail.com>
Signed-off-by: Marc Handalian <handalm@amazon.com>
@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.indices.replication.SegmentReplicationIT.testScrollCreatedOnReplica
      1 org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness


@dreamer-89 dreamer-89 added backport 2.x Backport to 2.x branch skip-changelog labels Jul 11, 2023
Member

@dreamer-89 dreamer-89 left a comment


Thanks @mch2 for fixing this and cleaning up the cancellation spaghetti code.

Signed-off-by: Marc Handalian <handalm@amazon.com>

@mch2
Member Author

mch2 commented Jul 11, 2023

Gradle Check (Jenkins) Run Completed with:

Err https://esm.ubuntu.com/ trusty-infra-updates/main amd64 Packages
    HttpError503
  W: Failed to fetch https://esm.ubuntu.com/ubuntu/dists/trusty-infra-security/main/binary-amd64/Packages  HttpError503

  W: Failed to fetch https://esm.ubuntu.com/ubuntu/dists/trusty-infra-updates/main/binary-amd64/Packages  HttpError503

  E: Some index files failed to download. They have been ignored, or old ones used instead.
  Fetched 13.6 MB in 2min 8s (105 kB/s)

2: Task failed with an exception.
-----------
* What went wrong:
Execution failed for task ':client:rest-high-level:test'.

* Try:
> Run with --stacktrace option to get the stack trace.
> Run with --info or --debug option to get more log output.
> Get more help at https://help.gradle.org/.
==============================================================================

Unrelated

@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.action.admin.cluster.node.tasks.ResourceAwareTasksTests.testBasicTaskResourceTracking

@mch2 mch2 merged commit 4ccbf9d into opensearch-project:main Jul 11, 2023
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jul 11, 2023
#8478)

* [Segment Replication] Fix bug where ReplicationListeners would not complete on target cancellation.

This change updates cancellation with Segment Replication to ensure all listeners are resolved.
It does this by requesting cancellation before shard closure instead of using ReplicationCollection's cancelForShard which immediately removes it from the replicationCollection.  This would cause the underlying ReplicationListener to never get invoked on close.

This change includes new tests using suite scope to catch for any open tasks.
This caught other locations where this was possible:
1. On a replica during force sync if the shard was closed while resolving its listeners, it would never call back to the primary if an exception was caught in the onDone method. - Fixed by refactoring those paths to use a ChannelActionListener and always reply to primary.
2. On the primary during forceSync, the primary would not successfully cancel before shard close during a forceSync, Fixed by wrapping the synchronous recoveryTarget::forceSync call in cancellableThreads.

Signed-off-by: Marc Handalian <handalm@amazon.com>

PR cleanup.

Signed-off-by: Marc Handalian <handalm@amazon.com>

Update log message

Signed-off-by: Marc Handalian <handalm@amazon.com>

* PR feedback.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Update server/src/main/java/org/opensearch/indices/replication/SegmentReplicationTargetService.java

Co-authored-by: Suraj Singh <surajrider@gmail.com>
Signed-off-by: Marc Handalian <handalm@amazon.com>

* Add more tests.

Signed-off-by: Marc Handalian <handalm@amazon.com>

---------

Signed-off-by: Marc Handalian <handalm@amazon.com>
Co-authored-by: Suraj Singh <surajrider@gmail.com>
(cherry picked from commit 4ccbf9d)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
mch2 pushed a commit that referenced this pull request Jul 11, 2023
#8478) (#8630)

vikasvb90 pushed a commit to raghuvanshraj/OpenSearch that referenced this pull request Jul 12, 2023
opensearch-project#8478)

raghuvanshraj pushed a commit to raghuvanshraj/OpenSearch that referenced this pull request Jul 12, 2023
opensearch-project#8478)

dzane17 pushed a commit to dzane17/OpenSearch that referenced this pull request Jul 12, 2023
opensearch-project#8478)

buddharajusahil pushed a commit to buddharajusahil/OpenSearch that referenced this pull request Jul 18, 2023
opensearch-project#8478)

baba-devv pushed a commit to baba-devv/OpenSearch that referenced this pull request Jul 29, 2023
opensearch-project#8478)

shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
opensearch-project#8478)

Labels
backport 2.x Backport to 2.x branch skip-changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Segment Replication - Shard close while copying files can leave listeners open
3 participants