-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Segment Replication] Fix Flaky in Test SegmentReplicationRelocationIT #6637
[Segment Replication] Fix Flaky in Test SegmentReplicationRelocationIT #6637
Conversation
Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>
Gradle Check (Jenkins) Run Completed with:
|
Codecov Report
📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more @@ Coverage Diff @@
## main #6637 +/- ##
============================================
- Coverage 71.22% 70.71% -0.51%
+ Complexity 59521 59132 -389
============================================
Files 4803 4803
Lines 283208 283190 -18
Branches 40842 40836 -6
============================================
- Hits 201712 200258 -1454
- Misses 65266 66549 +1283
- Partials 16230 16383 +153
... and 490 files with indirect coverage changes Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
…6366 Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>
Gradle Check (Jenkins) Run Completed with:
|
Looks like a legit failure @Rishikesh1159 which needs unit test updates.
|
Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>
Gradle Check (Jenkins) Run Completed with:
|
Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>
Gradle Check (Jenkins) Run Completed with:
|
@@ -566,6 +566,9 @@ public void updateShardState( | |||
: "a primary relocation is completed by the cluster-managerr, but primary mode is not active " + currentRouting; | |||
|
|||
changeState(IndexShardState.STARTED, "global state is [" + newRouting.state() + "]"); | |||
if (indexSettings.isSegRepEnabled()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a comment here on why this is required? I think this is so new primary shards will push to replicas after relocation rather than waiting for more docs to get indexed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Also, it would be better to add a test verifying segrep is forced on replica copies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, new primary shards will push to replicas after relocation rather than waiting for more docs to get indexed
|
||
shards.promoteReplicaToPrimary(replica_2).get(); | ||
primary.close("demoted", false); | ||
primary.store().close(); | ||
IndexShard oldPrimary = shards.addReplicaWithExistingPath(primary.shardPath(), primary.routingEntry().currentNodeId()); | ||
shards.recoverReplica(oldPrimary); | ||
assertLatestCommitGen(5, oldPrimary); | ||
assertLatestCommitGen(5, replica_2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this, this test is flaky and we really don't care about the generation only that the doc counts are equal.
@@ -51,6 +51,10 @@ public class NRTReplicationReaderManager extends OpenSearchReaderManager { | |||
@Override | |||
protected OpenSearchDirectoryReader refreshIfNeeded(OpenSearchDirectoryReader referenceToRefresh) throws IOException { | |||
Objects.requireNonNull(referenceToRefresh); | |||
// checks if an actual refresh (change in segments) happened | |||
if (unwrapStandardReader(referenceToRefresh).getSegmentInfos().version == currentInfos.version) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you pls add a unit test for this change in NRTReplicationEngineTests
?
Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>
// Verify if all docs are present in replica after flush, if new relocated primary doesn't flush after relocation the below assert | ||
// will fail |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without waiting for all indexing operations to complete, this test will flaky. I think a better way to test changes would be:
- Block operations from older primary to replica.
- Insert all docs after above.
- Relocate the primary shard.
Without auto refreshes (refresh -1), only way replica will have doc count is via the force flush from new primary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically, we need not perform any refresh and turn all aync calls to sync here. Below is the set of steps we can follow (we don't need to block segment replication to replica).
- Index docs in sync without refresh so that only primary contains.
- Assert doc count on primary is matching ingested doc count but replica should have 0 doc count (note: we are not performing any refresh).
- Relocate primary to new primary node and wait for it to complete.
- Assert replica gets all the docs. This should only be possible from flush operation before state switch to
STARTED
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, makes sense. This way it will fail more definitively without flush. Thanks @dreamer-89 , I will update the test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Asserting doc count on primary before relocation is not possible as without a refresh even on primary shard, it will have 0 docs, as we open searcher only after a refresh. So instead I am asserting on segrep stats that no replication event has taken place on replica shard before relocation
Gradle Check (Jenkins) Run Completed with:
|
Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Let's also wait for @mch2 review.
// assert | ||
// will fail |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can be accomodated in a single line.
Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com>
Gradle Check (Jenkins) Run Completed with:
|
…estRelocateWhileContinuouslyIndexingAndWaitingForRefresh (#6637) * Trigger Refresh on NRT Engine. Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * Revert changes made to PublishCheckpointAction in #6366 Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * Fix failing unit test Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * Force flush on new elected primary after relocation. Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * Fix failing unit test. Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * Remove unnecessary assertions Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * Adding tests. Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * Address comments Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * Fix indentation. Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> --------- Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> (cherry picked from commit 1e5d913) Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
…estRelocateWhileContinuouslyIndexingAndWaitingForRefresh (#6637) (#6675) * Trigger Refresh on NRT Engine. * Revert changes made to PublishCheckpointAction in #6366 * Fix failing unit test * Force flush on new elected primary after relocation. * Fix failing unit test. * Remove unnecessary assertions * Adding tests. * Address comments * Fix indentation. --------- (cherry picked from commit 1e5d913) Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
…estRelocateWhileContinuouslyIndexingAndWaitingForRefresh (opensearch-project#6637) * Trigger Refresh on NRT Engine. Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * Revert changes made to PublishCheckpointAction in opensearch-project#6366 Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * Fix failing unit test Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * Force flush on new elected primary after relocation. Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * Fix failing unit test. Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * Remove unnecessary assertions Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * Adding tests. Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * Address comments Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> * Fix indentation. Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> --------- Signed-off-by: Rishikesh1159 <rishireddy1159@gmail.com> Signed-off-by: Mingshi Liu <mingshl@amazon.com>
Description
This PR triggers refreshes on shards using NRT Engine. We made the decision to not support that behavior wait_until with segrep. So this PR reverts changes made to PublishCheckpointAction class by #6366.
This PR fixes relocation bugs and also fixes many of the flaky tests failures in SegmentReplicationIT. Mainly tests:
testRelocateWhileContinuouslyIndexingAndWaitingForRefresh()
andtestPrimaryRelocation()
Issues Resolved
#6531 and #6665
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.