[Segment Replication] Fix Flaky in Test SegmentReplicationRelocationIT #6637

Merged
merged 10 commits into from
Mar 15, 2023

Conversation

Rishikesh1159
Member

@Rishikesh1159 Rishikesh1159 commented Mar 11, 2023

Description

This PR triggers refreshes on shards that use the NRT engine. We decided not to support the wait_until behavior with segrep, so this PR also reverts the changes made to the PublishCheckpointAction class by #6366.

This PR fixes relocation bugs and many of the flaky test failures in SegmentReplicationIT, mainly in testRelocateWhileContinuouslyIndexingAndWaitingForRefresh() and testPrimaryRelocation().

Issues Resolved

#6531 and #6665

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>
@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.testIndexCreateBlockIsRemovedWhenAnyNodesNotExceedHighWatermarkWithAutoReleaseEnabled
      1 org.opensearch.backwards.MixedClusterClientYamlTestSuiteIT.test {p0=search.aggregation/20_terms/string profiler via global ordinals}

@codecov-commenter

codecov-commenter commented Mar 11, 2023

Codecov Report

Merging #6637 (ffd94b1) into main (73a2279) will decrease coverage by 0.51%.
The diff coverage is 37.50%.

❗ Current head ffd94b1 differs from pull request most recent head f3fa803. Consider uploading reports for the commit f3fa803 to get more accurate results

@@             Coverage Diff              @@
##               main    #6637      +/-   ##
============================================
- Coverage     71.22%   70.71%   -0.51%     
+ Complexity    59521    59132     -389     
============================================
  Files          4803     4803              
  Lines        283208   283190      -18     
  Branches      40842    40836       -6     
============================================
- Hits         201712   200258    -1454     
- Misses        65266    66549    +1283     
- Partials      16230    16383     +153     
Impacted Files                                         Coverage Δ
...eplication/checkpoint/PublishCheckpointAction.java  28.57% <10.00%> (+5.13%) ⬆️
.../opensearch/index/engine/NRTReplicationEngine.java  70.86% <33.33%> (-2.06%) ⬇️
...arch/index/engine/NRTReplicationReaderManager.java  88.46% <100.00%> (+0.96%) ⬆️
...nsearch/index/shard/CheckpointRefreshListener.java  100.00% <100.00%> (+11.11%) ⬆️
...in/java/org/opensearch/index/shard/IndexShard.java  69.71% <100.00%> (-0.85%) ⬇️

... and 490 files with indirect coverage changes


…6366

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>
@dreamer-89
Member

Gradle Check (Jenkins) Run Completed with:

Looks like a legit failure @Rishikesh1159 which needs unit test updates.

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.indices.replication.checkpoint.PublishCheckpointActionTests.testPublishCheckpointActionOnPrimary" -Dtests.seed=D77C43ABBEE6E324 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=ms -Dtests.timezone=Etc/GMT+12 -Druntime.java=19

org.opensearch.indices.replication.checkpoint.PublishCheckpointActionTests > testPublishCheckpointActionOnPrimary FAILED
    junit.framework.AssertionFailedError: Expected exception OpenSearchException but no exception was thrown
        at __randomizedtesting.SeedInfo.seed([D77C43ABBEE6E324:CD7AA1203C0F125B]:0)
        at org.apache.lucene.tests.util.LuceneTestCase.expectThrows(LuceneTestCase.java:2864)
        at org.apache.lucene.tests.util.LuceneTestCase.expectThrows(LuceneTestCase.java:2850)
        at org.opensearch.indices.replication.checkpoint.PublishCheckpointActionTests.testPublishCheckpointActionOnPrimary(PublishCheckpointActionTests.java:109)

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>
@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.http.SearchRestCancellationIT.testAutomaticCancellationMultiSearchDuringFetchPhase

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>
Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>
@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.testIndexCreateBlockIsRemovedWhenAnyNodesNotExceedHighWatermarkWithAutoReleaseEnabled

@Rishikesh1159 Rishikesh1159 changed the title from "[Segment Replication] Trigger Refresh on NRT Engine" to "[Segment Replication] Fix Flaky Test SegmentReplicationRelocationIT.testRelocateWhileContinuouslyIndexingAndWaitingForRefresh" on Mar 14, 2023
@@ -566,6 +566,9 @@ public void updateShardState(
: "a primary relocation is completed by the cluster-managerr, but primary mode is not active " + currentRouting;

changeState(IndexShardState.STARTED, "global state is [" + newRouting.state() + "]");
if (indexSettings.isSegRepEnabled()) {
Member

Can we add a comment here on why this is required? I think this is so new primary shards will push to replicas after relocation rather than waiting for more docs to get indexed?

Member

+1
Also, it would be better to add a test verifying segrep is forced on replica copies.

Member Author

Yes, new primary shards will push to replicas after relocation rather than waiting for more docs to get indexed.
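
The intent confirmed in this reply can be illustrated with a minimal, self-contained Java sketch. It is not the IndexShard code: `ToyPrimary`, `ToyReplica`, `markStarted`, and `flushAndPublish` are hypothetical names, and the model simply assumes that a flush on the new primary is what makes its documents visible to replicas.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: these types are hypothetical and do not mirror OpenSearch classes.
final class RelocationFlushSketch {

    /** A replica that only sees documents the primary has published. */
    static final class ToyReplica {
        private int visibleDocs = 0;

        void onCheckpoint(int docCount) {
            visibleDocs = docCount;
        }

        int visibleDocs() {
            return visibleDocs;
        }
    }

    /** A primary that holds indexed documents until it flushes and publishes. */
    static final class ToyPrimary {
        private final boolean segRepEnabled;
        private final List<ToyReplica> replicas = new ArrayList<>();
        private int indexedDocs = 0;
        private boolean started = false;

        ToyPrimary(boolean segRepEnabled) {
            this.segRepEnabled = segRepEnabled;
        }

        void addReplica(ToyReplica replica) {
            replicas.add(replica);
        }

        void index(int docs) {
            indexedDocs += docs;
        }

        boolean isStarted() {
            return started;
        }

        /** Publishes the current document count to every replica. */
        void flushAndPublish() {
            for (ToyReplica replica : replicas) {
                replica.onCheckpoint(indexedDocs);
            }
        }

        /** Called when relocation completes and the shard moves to STARTED. */
        void markStarted() {
            started = true;
            if (segRepEnabled) {
                // Push right away instead of waiting for more docs to be indexed.
                flushAndPublish();
            }
        }
    }

    public static void main(String[] args) {
        ToyReplica replica = new ToyReplica();
        ToyPrimary relocated = new ToyPrimary(true);
        relocated.addReplica(replica);
        relocated.index(10); // docs carried over, not yet visible to the replica
        System.out.println("before STARTED: " + replica.visibleDocs());
        relocated.markStarted();
        System.out.println("after STARTED: " + replica.visibleDocs());
    }
}
```

In this toy model, a primary created with `segRepEnabled` set to false never publishes on `markStarted()`, which corresponds to the "waiting for more docs to get indexed" case raised in the question.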


shards.promoteReplicaToPrimary(replica_2).get();
primary.close("demoted", false);
primary.store().close();
IndexShard oldPrimary = shards.addReplicaWithExistingPath(primary.shardPath(), primary.routingEntry().currentNodeId());
shards.recoverReplica(oldPrimary);
assertLatestCommitGen(5, oldPrimary);
assertLatestCommitGen(5, replica_2);
Member

Thanks for doing this. This test is flaky, and we don't really care about the generation, only that the doc counts are equal.

@@ -51,6 +51,10 @@ public class NRTReplicationReaderManager extends OpenSearchReaderManager {
@Override
protected OpenSearchDirectoryReader refreshIfNeeded(OpenSearchDirectoryReader referenceToRefresh) throws IOException {
Objects.requireNonNull(referenceToRefresh);
// checks if an actual refresh (change in segments) happened
if (unwrapStandardReader(referenceToRefresh).getSegmentInfos().version == currentInfos.version) {
Member

Can you please add a unit test for this change in NRTReplicationEngineTests?
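
For readers following the hunk above, here is a minimal, self-contained Java sketch of the version check and of what a unit test for it might exercise. `ToyReader` and `ToyReaderManager` are hypothetical stand-ins, not the OpenSearch or Lucene classes. The diff does not show the body of the `if`, so the sketch assumes it returns null, which is how Lucene's `ReferenceManager.refreshIfNeeded` signals that no refresh was needed.

```java
import java.util.Objects;

// Toy model only: hypothetical stand-ins for the reader and reader manager.
final class VersionCheckSketch {

    /** A reader pinned to one version of the segment infos. */
    static final class ToyReader {
        final long version;

        ToyReader(long version) {
            this.version = version;
        }
    }

    /** Opens a new reader only when the segment infos version has changed. */
    static final class ToyReaderManager {
        private long currentVersion;
        private int readersOpened = 0;

        ToyReaderManager(long initialVersion) {
            this.currentVersion = initialVersion;
        }

        /** Simulates receiving new segments from the primary. */
        void updateSegments(long newVersion) {
            currentVersion = newVersion;
        }

        int readersOpened() {
            return readersOpened;
        }

        ToyReader refreshIfNeeded(ToyReader referenceToRefresh) {
            Objects.requireNonNull(referenceToRefresh);
            // Assumed behavior: same version means no change in segments, so skip the refresh.
            if (referenceToRefresh.version == currentVersion) {
                return null;
            }
            readersOpened++;
            return new ToyReader(currentVersion);
        }
    }

    public static void main(String[] args) {
        ToyReaderManager manager = new ToyReaderManager(1L);
        ToyReader reader = new ToyReader(1L);
        System.out.println("refresh skipped: " + (manager.refreshIfNeeded(reader) == null));
        manager.updateSegments(2L);
        ToyReader refreshed = manager.refreshIfNeeded(reader);
        System.out.println("new reader version: " + refreshed.version);
    }
}
```

A unit test along these lines would check both branches: no new reader while the version is unchanged, and exactly one new reader after the segments change.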

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>
Comment on lines 552 to 553
// Verify if all docs are present in replica after flush, if new relocated primary doesn't flush after relocation the below assert
// will fail
Member

Without waiting for all indexing operations to complete, this test will be flaky. I think a better way to test the changes would be:

  1. Block operations from the old primary to the replica.
  2. Index all docs after that.
  3. Relocate the primary shard.

Without automatic refreshes (refresh set to -1), the only way the replica will have the doc count is via the forced flush from the new primary.

Member

@dreamer-89 dreamer-89 Mar 14, 2023

Basically, we need not perform any refresh, and we should turn all async calls into sync calls here. Below is the set of steps we can follow (we don't need to block segment replication to the replica).

  1. Index docs synchronously without a refresh so that only the primary contains them.
  2. Assert that the doc count on the primary matches the ingested doc count, while the replica has a doc count of 0 (note: we are not performing any refresh).
  3. Relocate the primary to the new primary node and wait for the relocation to complete.
  4. Assert that the replica gets all the docs. This should only be possible through the flush operation before the state switches to STARTED.

Member Author

Sure, makes sense. This way it will fail more definitively without the flush. Thanks @dreamer-89, I will update the test.

Member Author

Asserting the doc count on the primary before relocation is not possible: without a refresh, even the primary shard reports 0 docs, because we open a searcher only after a refresh. So instead I am asserting on segrep stats that no replication event has taken place on the replica shard before relocation.
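
The revised test flow from this thread can be sketched as a small, self-contained Java simulation. It does not use the OpenSearch test framework: `ToyPrimary`, `ToyReplica`, `runScenario`, and the `replicationEvents` counter (a stand-in for the segrep stats mentioned above) are all hypothetical, and the model assumes refresh is disabled so that documents reach the replica only through a flush.

```java
// Toy model only: all types and method names here are hypothetical.
final class RelocationTestFlowSketch {

    /** Replica that counts replication events, similar in spirit to segrep stats. */
    static final class ToyReplica {
        private int replicationEvents = 0;
        private int docCount = 0;

        void onCheckpoint(int primaryDocs) {
            replicationEvents++;
            docCount = primaryDocs;
        }

        int replicationEvents() {
            return replicationEvents;
        }

        int docCount() {
            return docCount;
        }
    }

    /** Primary with refresh disabled: docs reach the replica only on flush. */
    static final class ToyPrimary {
        private final ToyReplica replica;
        private int indexedDocs;

        ToyPrimary(ToyReplica replica, int indexedDocs) {
            this.replica = replica;
            this.indexedDocs = indexedDocs;
        }

        /** Indexes synchronously; nothing is published to the replica. */
        void indexSync(int docs) {
            indexedDocs += docs;
        }

        void flush() {
            replica.onCheckpoint(indexedDocs);
        }

        /** Hands the shard to a new primary, optionally flushing before STARTED. */
        ToyPrimary relocate(boolean flushBeforeStarted) {
            ToyPrimary target = new ToyPrimary(replica, indexedDocs);
            if (flushBeforeStarted) {
                target.flush();
            }
            return target;
        }
    }

    /** Runs the steps in order and returns the replica's final doc count. */
    static int runScenario(int docs, boolean flushBeforeStarted) {
        ToyReplica replica = new ToyReplica();
        ToyPrimary primary = new ToyPrimary(replica, 0);
        // Step 1: index synchronously without any refresh.
        primary.indexSync(docs);
        // Step 2 (as adjusted): no replication event may have happened yet.
        if (replica.replicationEvents() != 0) {
            throw new IllegalStateException("replica replicated before relocation");
        }
        // Step 3: relocate the primary and wait for completion (synchronous here).
        primary.relocate(flushBeforeStarted);
        // Step 4: the caller asserts on the replica's doc count.
        return replica.docCount();
    }

    public static void main(String[] args) {
        System.out.println("with flush: " + runScenario(10, true));
        System.out.println("without flush: " + runScenario(10, false));
    }
}
```

Because nothing else can publish to the replica in this model, the final doc count differs deterministically between the two runs, which is the property the reviewers were after.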

Rishikesh1159 and others added 2 commits March 15, 2023 00:18
Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>
@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.index.SegmentReplicationPressureIT.testWritesRejected
      1 org.opensearch.index.SegmentReplicationPressureIT.testAddReplicaWhileWritesBlocked

Member

@dreamer-89 dreamer-89 left a comment

LGTM!
Let's also wait for @mch2 review.

Comment on lines 557 to 558
// assert
// will fail
Member

nit: can be accommodated in a single line.

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>
@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testIndexDeletionDuringSnapshotCreationInQueue
      1 org.opensearch.search.SearchWeightedRoutingIT.testSearchAggregationWithNetworkDisruption_FailOpenEnabled
      1 org.opensearch.indices.replication.SegmentReplicationRelocationIT.testFlushAfterRelocation
      1 org.opensearch.index.SegmentReplicationPressureIT.testWritesRejected
      1 org.opensearch.index.SegmentReplicationPressureIT.testAddReplicaWhileWritesBlocked

@Rishikesh1159 Rishikesh1159 added the backport 2.x Backport to 2.x branch label Mar 15, 2023
@Rishikesh1159 Rishikesh1159 merged commit 1e5d913 into opensearch-project:main Mar 15, 2023
opensearch-trigger-bot bot pushed a commit that referenced this pull request Mar 15, 2023
…estRelocateWhileContinuouslyIndexingAndWaitingForRefresh (#6637)

* Trigger Refresh on NRT Engine.

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Revert changes made to PublishCheckpointAction in #6366

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Fix failing unit test

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Force flush on new elected primary after relocation.

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Fix failing unit test.

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Remove unnecessary assertions

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Adding tests.

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Address comments

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

* Fix indentation.

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>

---------

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>
(cherry picked from commit 1e5d913)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
@Rishikesh1159 Rishikesh1159 changed the title from "[Segment Replication] Fix Flaky Test SegmentReplicationRelocationIT.testRelocateWhileContinuouslyIndexingAndWaitingForRefresh" to "[Segment Replication] Fix Flaky in Test SegmentReplicationRelocationIT" on Mar 15, 2023
Rishikesh1159 pushed a commit that referenced this pull request Mar 16, 2023
…estRelocateWhileContinuouslyIndexingAndWaitingForRefresh (#6637) (#6675)

(cherry picked from commit 1e5d913)

Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
mingshl pushed a commit to mingshl/OpenSearch-Mingshl that referenced this pull request Mar 24, 2023
…estRelocateWhileContinuouslyIndexingAndWaitingForRefresh (opensearch-project#6637)

Signed-off-by: Mingshi Liu <mingshl@amazon.com>
Labels
backport 2.x Backport to 2.x branch skip-changelog

4 participants