Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Backport 2.x] Segment Replication - Fix ShardLockObtained error during corruption cases #10370 #10418

Merged
merged 2 commits into from
Oct 5, 2023

Conversation

mch2
Copy link
Member

@mch2 mch2 commented Oct 5, 2023

Manual backport of #10370 to 2.x

…ases (opensearch-project#10370)

* Segment Replication - Fix ShardLockObtained error during corruption cases

This change fixes a bug where shards could not be recreated locally after corruption.
This occured because the store was not decref'd to 0 if the commit on close would fail
with a corruption exception.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Remove exra logs

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Remove flaky assertion on store refcount

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Remove flaky test.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* PR Feedback.

Remove hacky handling of corruption when fetching metadata.  This will now check for store corruption
when replication has failed and fail the shard accordingly.

This commit also fixes logging in NRTReplicationEngine.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Fix unit test.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Fix test failure testSegRepSucceedsOnPreviousCopiedFiles.

This test broke because we invoked target.indexShard on a closed replicationTarget.
In these cases we can assume the store is not corrupt.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* spotless

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Revert flaky IT

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Fix flakiness failure by expecting RTE when check index fails.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* reintroduce ITs and use recoveries API instead of waiting on shard state.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Fix edge case where flush failures would not get reported as corruption.

Signed-off-by: Marc Handalian <handalm@amazon.com>

---------

Signed-off-by: Marc Handalian <handalm@amazon.com>
Signed-off-by: Marc Handalian <handalm@amazon.com>
@github-actions
Copy link
Contributor

github-actions bot commented Oct 5, 2023

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

github-actions bot commented Oct 5, 2023

Compatibility status:

Checks if related components are compatible with change 13299ab

Incompatible components

Incompatible components: [https://github.com/opensearch-project/security.git]

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/performance-analyzer-rca.git]

@github-actions
Copy link
Contributor

github-actions bot commented Oct 5, 2023

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.test.rest.ClientYamlTestSuiteIT.test {p0=search.aggregation/20_terms/string profiler via global ordinals}
      1 org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness

@codecov
Copy link

codecov bot commented Oct 5, 2023

Codecov Report

Merging #10418 (13299ab) into 2.x (fc7dc20) will increase coverage by 0.12%.
Report is 7 commits behind head on 2.x.
The diff coverage is 77.07%.

@@             Coverage Diff              @@
##                2.x   #10418      +/-   ##
============================================
+ Coverage     70.75%   70.88%   +0.12%     
- Complexity    58397    58544     +147     
============================================
  Files          4816     4829      +13     
  Lines        276031   276355     +324     
  Branches      40565    40576      +11     
============================================
+ Hits         195306   195891     +585     
+ Misses        64105    63765     -340     
- Partials      16620    16699      +79     
Files Coverage Δ
...arch/telemetry/metrics/DefaultMetricsRegistry.java 100.00% <100.00%> (ø)
...ch/telemetry/metrics/noop/NoopMetricsRegistry.java 100.00% <100.00%> (ø)
...va/org/opensearch/telemetry/metrics/tags/Tags.java 100.00% <100.00%> (ø)
...org/opensearch/repositories/url/URLRepository.java 72.00% <100.00%> (ø)
...g/opensearch/transport/netty4/Netty4Transport.java 73.65% <100.00%> (ø)
...opensearch/repositories/azure/AzureRepository.java 68.88% <100.00%> (-1.33%) ⬇️
...g/opensearch/repositories/hdfs/HdfsRepository.java 61.90% <100.00%> (ø)
...java/org/opensearch/repositories/s3/S3Service.java 75.00% <100.00%> (-1.05%) ⬇️
.../org/opensearch/telemetry/OTelTelemetryPlugin.java 100.00% <100.00%> (ø)
...rg/opensearch/telemetry/OTelTelemetrySettings.java 100.00% <100.00%> (ø)
... and 43 more

... and 501 files with indirect coverage changes

@kotwanikunal kotwanikunal merged commit cdf5e1a into opensearch-project:2.x Oct 5, 2023
14 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Oct 5, 2023
…ng corruption cases #10370 (#10418)

* Segment Replication - Fix ShardLockObtained error during corruption cases (#10370)

* Segment Replication - Fix ShardLockObtained error during corruption cases

This change fixes a bug where shards could not be recreated locally after corruption.
This occured because the store was not decref'd to 0 if the commit on close would fail
with a corruption exception.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Remove exra logs

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Remove flaky assertion on store refcount

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Remove flaky test.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* PR Feedback.

Remove hacky handling of corruption when fetching metadata.  This will now check for store corruption
when replication has failed and fail the shard accordingly.

This commit also fixes logging in NRTReplicationEngine.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Fix unit test.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Fix test failure testSegRepSucceedsOnPreviousCopiedFiles.

This test broke because we invoked target.indexShard on a closed replicationTarget.
In these cases we can assume the store is not corrupt.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* spotless

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Revert flaky IT

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Fix flakiness failure by expecting RTE when check index fails.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* reintroduce ITs and use recoveries API instead of waiting on shard state.

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Fix edge case where flush failures would not get reported as corruption.

Signed-off-by: Marc Handalian <handalm@amazon.com>

---------

Signed-off-by: Marc Handalian <handalm@amazon.com>

* Fix breaking change only on main.

Signed-off-by: Marc Handalian <handalm@amazon.com>

---------

Signed-off-by: Marc Handalian <handalm@amazon.com>
(cherry picked from commit cdf5e1a)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
kotwanikunal pushed a commit that referenced this pull request Oct 5, 2023
…ng corruption cases #10370 (#10418) (#10429)

* Segment Replication - Fix ShardLockObtained error during corruption cases (#10370)

* Segment Replication - Fix ShardLockObtained error during corruption cases

This change fixes a bug where shards could not be recreated locally after corruption.
This occured because the store was not decref'd to 0 if the commit on close would fail
with a corruption exception.



* Remove exra logs



* Remove flaky assertion on store refcount



* Remove flaky test.



* PR Feedback.

Remove hacky handling of corruption when fetching metadata.  This will now check for store corruption
when replication has failed and fail the shard accordingly.

This commit also fixes logging in NRTReplicationEngine.



* Fix unit test.



* Fix test failure testSegRepSucceedsOnPreviousCopiedFiles.

This test broke because we invoked target.indexShard on a closed replicationTarget.
In these cases we can assume the store is not corrupt.



* spotless



* Revert flaky IT



* Fix flakiness failure by expecting RTE when check index fails.



* reintroduce ITs and use recoveries API instead of waiting on shard state.



* Fix edge case where flush failures would not get reported as corruption.



---------



* Fix breaking change only on main.



---------


(cherry picked from commit cdf5e1a)

Signed-off-by: Marc Handalian <handalm@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants