[Backport 2.x] Segment Replication - Fix ShardLockObtained error during corruption cases #10370 #10418
…ases (opensearch-project#10370)

* Segment Replication - Fix ShardLockObtained error during corruption cases

  This change fixes a bug where shards could not be recreated locally after corruption. This occurred because the store was not decref'd to 0 when the commit on close failed with a corruption exception.

* Remove extra logs
* Remove flaky assertion on store refcount
* Remove flaky test.
* PR Feedback. Remove hacky handling of corruption when fetching metadata. This will now check for store corruption when replication has failed and fail the shard accordingly. This commit also fixes logging in NRTReplicationEngine.
* Fix unit test.
* Fix test failure testSegRepSucceedsOnPreviousCopiedFiles. This test broke because we invoked target.indexShard on a closed replicationTarget. In these cases we can assume the store is not corrupt.
* spotless
* Revert flaky IT
* Fix flakiness failure by expecting RTE when check index fails.
* Reintroduce ITs and use recoveries API instead of waiting on shard state.
* Fix edge case where flush failures would not get reported as corruption.

Signed-off-by: Marc Handalian <handalm@amazon.com>
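The reference-count leak described in the commit message can be illustrated with a minimal sketch. This is not the code changed in this PR: `RefCountedStore`, `Engine`, `closeLeaky`, and `closeSafely` are hypothetical stand-ins, and the corruption is simulated with a plain `IllegalStateException`. It only shows why a failing commit on close can leave a reference held, and how releasing the reference in a `finally` block avoids that.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class StoreCloseSketch {

    /** Hypothetical stand-in for a ref-counted store; not the OpenSearch Store class. */
    static final class RefCountedStore {
        // The owner (the shard, in this sketch) holds the initial reference.
        private final AtomicInteger refCount = new AtomicInteger(1);

        void incRef() {
            refCount.incrementAndGet();
        }

        void decRef() {
            if (refCount.decrementAndGet() < 0) {
                throw new IllegalStateException("store released too many times");
            }
        }

        int refCount() {
            return refCount.get();
        }
    }

    /** Hypothetical engine that holds one store reference while it is open. */
    static final class Engine {
        private final RefCountedStore store;
        private final boolean commitFails;

        Engine(RefCountedStore store, boolean commitFails) {
            this.store = store;
            this.commitFails = commitFails;
            store.incRef();
        }

        private void commitOnClose() {
            if (commitFails) {
                // Simulated corruption; a real engine would surface a corruption exception.
                throw new IllegalStateException("simulated corruption during commit");
            }
        }

        /** Leaky variant: the reference is never released when the commit throws. */
        void closeLeaky() {
            commitOnClose();
            store.decRef();
        }

        /** Safe variant: the reference is released whether or not the commit throws. */
        void closeSafely() {
            try {
                commitOnClose();
            } finally {
                store.decRef();
            }
        }
    }

    /** Closes an engine whose commit fails, then returns the store's remaining references. */
    static int refsAfterClose(boolean safe) {
        RefCountedStore store = new RefCountedStore();
        Engine engine = new Engine(store, true);
        try {
            if (safe) {
                engine.closeSafely();
            } else {
                engine.closeLeaky();
            }
        } catch (IllegalStateException e) {
            // The close failure is expected in this sketch.
        }
        store.decRef(); // the owner releases its own reference
        return store.refCount();
    }

    public static void main(String[] args) {
        System.out.println("leaky close, remaining refs: " + refsAfterClose(false));
        System.out.println("safe close, remaining refs: " + refsAfterClose(true));
    }
}
```

In this sketch the leaky variant leaves one reference outstanding after the owner has released its own, which models a store that never reaches 0. The safe variant reaches 0.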
Compatibility status: Checks if related components are compatible with change 13299ab

Incompatible components: [https://github.com/opensearch-project/security.git]

Skipped components: (none listed)

Compatible components: [https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/performance-analyzer-rca.git]
Codecov Report
@@             Coverage Diff              @@
##                2.x   #10418      +/-   ##
============================================
+ Coverage     70.75%   70.88%   +0.12%
- Complexity    58397    58544     +147
============================================
  Files          4816     4829      +13
  Lines        276031   276355     +324
  Branches      40565    40576      +11
============================================
+ Hits         195306   195891     +585
+ Misses        64105    63765     -340
- Partials      16620    16699      +79
…ng corruption cases #10370 (#10418)

* Segment Replication - Fix ShardLockObtained error during corruption cases (#10370) (squashed commits as listed in the backport commit message above)
* Fix breaking change only on main.

Signed-off-by: Marc Handalian <handalm@amazon.com>
(cherry picked from commit cdf5e1a)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
…ng corruption cases #10370 (#10418) (#10429)

* Segment Replication - Fix ShardLockObtained error during corruption cases (#10370) (squashed commits as listed in the backport commit message above)
* Fix breaking change only on main.

(cherry picked from commit cdf5e1a)
Signed-off-by: Marc Handalian <handalm@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Manual backport of #10370 to 2.x