-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Segment Replication] Fix Flaky in Test SegmentReplicationRelocationIT #6637
Changes from 7 commits
f25361f
5407a68
a5711a1
c8e5154
d4379e3
4e3272b
f7dcfb1
ffd94b1
f3fa803
fe5e715
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -51,6 +51,10 @@ public class NRTReplicationReaderManager extends OpenSearchReaderManager { | |
@Override | ||
protected OpenSearchDirectoryReader refreshIfNeeded(OpenSearchDirectoryReader referenceToRefresh) throws IOException { | ||
Objects.requireNonNull(referenceToRefresh); | ||
// checks if an actual refresh (change in segments) happened | ||
if (unwrapStandardReader(referenceToRefresh).getSegmentInfos().version == currentInfos.version) { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you pls add a unit test for this change in |
||
return null; | ||
} | ||
final List<LeafReader> subs = new ArrayList<>(); | ||
final StandardDirectoryReader standardDirectoryReader = unwrapStandardReader(referenceToRefresh); | ||
for (LeafReaderContext ctx : standardDirectoryReader.leaves()) { | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -566,6 +566,12 @@ public void updateShardState( | |
: "a primary relocation is completed by the cluster-managerr, but primary mode is not active " + currentRouting; | ||
|
||
changeState(IndexShardState.STARTED, "global state is [" + newRouting.state() + "]"); | ||
|
||
// Flush here after relocation of primary, so that replica get all changes from new primary rather than waiting for more | ||
// docs to get indexed. | ||
if (indexSettings.isSegRepEnabled()) { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can we add a comment here on why this is required? I think this is so new primary shards will push to replicas after relocation rather than waiting for more docs to get indexed? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. +1 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, new primary shards will push to replicas after relocation rather than waiting for more docs to get indexed |
||
flush(new FlushRequest().waitIfOngoing(true).force(true)); | ||
} | ||
} else if (currentRouting.primary() | ||
&& currentRouting.relocating() | ||
&& replicationTracker.isRelocated() | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -8,7 +8,6 @@ | |
|
||
package org.opensearch.index.shard; | ||
|
||
import org.apache.lucene.index.IndexCommit; | ||
import org.apache.lucene.index.SegmentInfos; | ||
import org.junit.Assert; | ||
import org.opensearch.ExceptionsHelper; | ||
|
@@ -569,21 +568,15 @@ public void testReplicaReceivesLowerGeneration() throws Exception { | |
numDocs = randomIntBetween(numDocs + 1, numDocs + 10); | ||
shards.indexDocs(numDocs); | ||
flushShard(primary, false); | ||
assertLatestCommitGen(4, primary); | ||
replicateSegments(primary, List.of(replica_1)); | ||
|
||
assertEqualCommittedSegments(primary, replica_1); | ||
assertLatestCommitGen(4, primary); | ||
assertLatestCommitGen(5, replica_1); | ||
assertLatestCommitGen(3, replica_2); | ||
|
||
shards.promoteReplicaToPrimary(replica_2).get(); | ||
primary.close("demoted", false); | ||
primary.store().close(); | ||
IndexShard oldPrimary = shards.addReplicaWithExistingPath(primary.shardPath(), primary.routingEntry().currentNodeId()); | ||
shards.recoverReplica(oldPrimary); | ||
assertLatestCommitGen(5, oldPrimary); | ||
assertLatestCommitGen(5, replica_2); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for doing this, this test is flaky and we really don't care about the generation only that the doc counts are equal. |
||
|
||
numDocs = randomIntBetween(numDocs + 1, numDocs + 10); | ||
shards.indexDocs(numDocs); | ||
|
@@ -1078,14 +1071,6 @@ private IndexShard failAndPromoteRandomReplica(ReplicationGroup shards) throws I | |
return newPrimary; | ||
} | ||
|
||
private void assertLatestCommitGen(long expected, IndexShard... shards) throws IOException { | ||
for (IndexShard indexShard : shards) { | ||
try (final GatedCloseable<IndexCommit> commit = indexShard.acquireLastIndexCommit(false)) { | ||
assertEquals(expected, commit.get().getGeneration()); | ||
} | ||
} | ||
} | ||
|
||
private void assertEqualCommittedSegments(IndexShard primary, IndexShard... replicas) throws IOException { | ||
for (IndexShard replica : replicas) { | ||
final SegmentInfos replicaInfos = replica.store().readLastCommittedSegmentsInfo(); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without waiting for all indexing operations to complete, this test will flaky. I think a better way to test changes would be:
Without auto refreshes (refresh -1), only way replica will have doc count is via the force flush from new primary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically, we need not perform any refresh and turn all aync calls to sync here. Below is the set of steps we can follow (we don't need to block segment replication to replica).
STARTED
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, makes sense. This way it will fail more definitively without flush. Thanks @dreamer-89 , I will update the test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Asserting doc count on primary before relocation is not possible as without a refresh even on primary shard, it will have 0 docs, as we open searcher only after a refresh. So instead I am asserting on segrep stats that no replication event has taken place on replica shard before relocation