Fix flaky SegmentReplicationITs. #6015
This change fixes flakiness with segment replication ITs. It does this by updating the wait condition used to ensure replicas are up to date to wait until a searched docCount is reached instead of output of the Segments API that can change if there are concurrent refreshes. It also does this by updating the method used to assert segment stats to wait until the assertion holds true rather than at a point in time. This method is also updated to assert store metadata directly over API output.
Signed-off-by: Marc Handalian <handalm@amazon.com>
Codecov Report
@@             Coverage Diff              @@
##               main    #6015     +/-   ##
============================================
+ Coverage     70.73%   71.30%   +0.56%
- Complexity    58738    59132     +394
============================================
  Files          4771     4771
  Lines        280820   280818       -2
  Branches      40568    40568
============================================
+ Hits         198645   200243    +1598
+ Misses        65865    64461    -1404
+ Partials      16310    16114     -196
org.opensearch.indices.replication.SegmentReplicationIT.testDropPrimaryDuringReplication failed. Looking at this one...
Signed-off-by: Marc Handalian <handalm@amazon.com>
The last few run failures are related to DiskThresholdDeciderIT failures (#5956). I ran the SR test testDropPrimaryDuringReplication 1,150 times locally overnight, running the entire test class each time, and was not able to reproduce this. I will try some more today. In the meantime I have improved the error message and am kicking off a few more check runs on this PR.
- final Store.RecoveryDiff diff = Store.segmentReplicationDiff(checkpointInfo.getMetadataMap(), getMetadataMap());
- logger.trace("Replication diff {}", diff);
+ final Store.RecoveryDiff diff = Store.segmentReplicationDiff(checkpointInfo.getMetadataMap(), indexShard.getSegmentMetadataMap());
+ logger.trace("Replication diff for checkpoint {} {}", checkpointInfo.getCheckpoint(), diff);
Thanks for doing this.
final ShardRouting replicaShardRouting = shardSegments.getShardRouting();
ClusterState state = client(internalCluster().getClusterManagerName()).admin().cluster().prepareState().get().getState();
final DiscoveryNode replicaNode = state.nodes().resolveNode(replicaShardRouting.currentNodeId());
return getIndexShard(replicaNode.getName());
How about:
return getIndexShard(shardSegments.getShardRouting().currentNodeId());
This may not be needed based on #6015 (comment)
assertBusy(() -> {
    final IndicesSegmentResponse indicesSegmentResponse = client().admin()
        .indices()
        .segments(new IndicesSegmentsRequest())
        .actionGet();
    List<ShardSegments[]> segmentsByIndex = getShardSegments(indicesSegmentResponse);

    // Fetch the IndexShard for this replica and try and build its SegmentInfos from the previous commit point.
    // This ensures the previous commit point is not wiped.
    final ShardRouting replicaShardRouting = shardSegment.getShardRouting();
    ClusterState state = client(internalCluster().getMasterName()).admin().cluster().prepareState().get().getState();
    final DiscoveryNode replicaNode = state.nodes().resolveNode(replicaShardRouting.currentNodeId());
    IndexShard indexShard = getIndexShard(replicaNode.getName());
    // calls to readCommit will fail if a valid commit point and all its segments are not in the store.
    indexShard.store().readLastCommittedSegmentsInfo();
    // There will be an entry in the list for each index.
    assertEquals("Expected a different number of shards in the index", numberOfShards, segmentsByIndex.size());
    for (ShardSegments[] replicationGroupSegments : segmentsByIndex) {
        // Separate Primary & replica shards ShardSegments.
        final Map<Boolean, List<ShardSegments>> segmentListMap = segmentsByShardType(replicationGroupSegments);
        final List<ShardSegments> primaryShardSegmentsList = segmentListMap.get(true);
        final List<ShardSegments> replicaShardSegmentsList = segmentListMap.get(false);
        assertEquals("There should only be one primary in the replicationGroup", 1, primaryShardSegmentsList.size());
        assertEquals(
            "There should be a ShardSegment entry for each replica in the replicationGroup",
            numberOfReplicas,
            replicaShardSegmentsList.size()
        );
        final ShardSegments primaryShardSegments = primaryShardSegmentsList.stream().findFirst().get();
        final IndexShard primaryShard = getIndexShard(primaryShardSegments);
        final Map<String, StoreFileMetadata> primarySegmentMetadata = primaryShard.getSegmentMetadataMap();
        for (ShardSegments replicaShardSegments : replicaShardSegmentsList) {
            final IndexShard replicaShard = getIndexShard(replicaShardSegments);
I don't see any usage of List<ShardSegments[]> segmentsByIndex other than finding the primary/replica set corresponding to an index and then verifying the store content. I find this a little complicated and feel it can be simplified by iterating over the routing table. With that change, the two methods below can be removed:
private IndexShard getIndexShard(ShardSegments shardSegments)
private Map<Boolean, List<ShardSegments>> segmentsByShardType(ShardSegments[] replicationGroupSegments)
for (IndexRoutingTable indexRoutingTable : clusterState.routingTable()) {
    for (IndexShardRoutingTable shardRoutingTable : indexRoutingTable) {
        final ShardRouting primaryRouting = shardRoutingTable.primaryShard();
        final String indexName = primaryRouting.getIndexName();
        final List<ShardRouting> replicaRouting = shardRoutingTable.replicaShards();
        final IndexShard primaryShard = getIndexShard(shardRoutingTable.primaryShard().currentNodeId(), indexName);
        for (ShardRouting replica : replicaRouting) {
            IndexShard replicaShard = getIndexShard(replica.currentNodeId(), indexName);
            // Compare store content
        }
    }
}
You're right, there's really no reason to use the Segments API anymore just for the node IDs. Will update.
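The "Compare store content" step in the suggestion above could look roughly like the sketch below. It is an illustration only: `StoreContentCheck`, its `Diff` class, and the use of a plain file-name-to-checksum map are hypothetical stand-ins, not the OpenSearch `Store` or `StoreFileMetadata` types. Ignoring files that exist only on the replica is also an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch only: hypothetical stand-in types, not the OpenSearch Store API. */
final class StoreContentCheck {

    /** Primary files grouped by how the replica's copy compares. */
    static final class Diff {
        final List<String> identical = new ArrayList<>();
        final List<String> different = new ArrayList<>();
        final List<String> missing = new ArrayList<>(); // on the primary, absent on the replica
    }

    /** Compares two maps of file name to checksum, walking the primary's files in sorted order. */
    static Diff diff(Map<String, String> primary, Map<String, String> replica) {
        Diff result = new Diff();
        for (Map.Entry<String, String> entry : new TreeMap<>(primary).entrySet()) {
            String replicaChecksum = replica.get(entry.getKey());
            if (replicaChecksum == null) {
                result.missing.add(entry.getKey());
            } else if (replicaChecksum.equals(entry.getValue())) {
                result.identical.add(entry.getKey());
            } else {
                result.different.add(entry.getKey());
            }
        }
        return result;
    }

    /** Fails if the replica lacks, or holds a different copy of, any primary file. */
    static void verifyStoreContent(Map<String, String> primary, Map<String, String> replica) {
        Diff d = diff(primary, replica);
        if (!d.missing.isEmpty() || !d.different.isEmpty()) {
            throw new AssertionError("Replica store differs from primary: missing=" + d.missing + " different=" + d.different);
        }
    }

    public static void main(String[] args) {
        Map<String, String> primary = Map.of("_0.si", "aaa", "_0.cfs", "bbb", "segments_2", "ccc");
        // An up-to-date replica passes.
        verifyStoreContent(primary, Map.of("_0.si", "aaa", "_0.cfs", "bbb", "segments_2", "ccc"));
        // A stale replica is reported file by file.
        Diff stale = diff(primary, Map.of("_0.si", "aaa", "_0.cfs", "xxx"));
        System.out.println("identical=" + stale.identical + " different=" + stale.different + " missing=" + stale.missing);
    }
}
```

In a test, a check like this would typically sit inside a retrying assertion so that a replica that is still copying segments does not fail the comparison prematurely.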
}

public void testDropPrimaryDuringReplication() throws Exception {
    int replica_count = 6;
final ?
    assertEquals(replicaSegment.getDeletedDocs(), primarySegment.getDeletedDocs());
    assertEquals(replicaSegment.getSize(), primarySegment.getSize());
}
private void assertIdenticalSegments(int numberOfShards, int numberOfReplicas) throws Exception {
nit: rename assertIdenticalSegments -> verifyStoreContent; it matches the method definition better.
Signed-off-by: Marc Handalian <handalm@amazon.com>
Precommit failure due to spotless check.
LGTM! Thank you @mch2 for making this change.
This flaky failure happened because I unintentionally pushed a commit that unmuted this test. The test is not fixed as part of this PR and remains muted in later commits.
* Fix flaky SegmentReplicationITs.
  This change fixes flakiness with segment replication ITs. It does this by updating the wait condition used to ensure replicas are up to date to wait until a searched docCount is reached instead of output of the Segments API that can change if there are concurrent refreshes. It also does this by updating the method used to assert segment stats to wait until the assertion holds true rather than at a point in time. This method is also updated to assert store metadata directly over API output.
  Signed-off-by: Marc Handalian <handalm@amazon.com>
* Fix error message to print expected and actual doc counts.
  Signed-off-by: Marc Handalian <handalm@amazon.com>
* PR feedback.
  Signed-off-by: Marc Handalian <handalm@amazon.com>
* spotless.
  Signed-off-by: Marc Handalian <handalm@amazon.com>
Signed-off-by: Marc Handalian <handalm@amazon.com>
(cherry picked from commit ade01ec)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Backporting this to |
Signed-off-by: Marc Handalian handalm@amazon.com
Description
This change fixes flakiness in the segment replication ITs. First, the wait condition used to ensure replicas are up to date now waits until a searched docCount is reached, instead of relying on the output of the Segments API, which can change if there are concurrent refreshes.
Second, the method used to assert segment stats now waits until the assertion holds true rather than checking at a single point in time. This method also now asserts on store metadata directly instead of on API output. In doing so, I've moved the method used to compute store metadata onto IndexShard.
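As an illustration of "wait until the assertion holds true rather than at a point in time", here is a minimal, self-contained sketch of a retrying assertion. The tests in this PR use the test framework's `assertBusy`; the `BusyAssert` class, `assertEventually` method, and `CheckedAssertion` interface below are hypothetical names invented for this sketch and are not that implementation. The simulated replica doc counter is likewise an assumption standing in for a real search.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicLong;

/** Sketch only: a hypothetical retry helper, not the OpenSearch assertBusy implementation. */
final class BusyAssert {

    /** A block of assertions that may fail while the system is still converging. */
    @FunctionalInterface
    interface CheckedAssertion {
        void run() throws Exception;
    }

    /** Re-runs the assertion with a capped backoff until it passes or the timeout elapses. */
    static void assertEventually(CheckedAssertion assertion, Duration timeout) throws Exception {
        final long deadline = System.nanoTime() + timeout.toNanos();
        long sleepMillis = 1;
        while (true) {
            try {
                assertion.run();
                return; // the assertion held, so we are done
            } catch (AssertionError e) {
                if (System.nanoTime() >= deadline) {
                    throw e; // out of time: surface the most recent failure
                }
            }
            Thread.sleep(sleepMillis);
            sleepMillis = Math.min(sleepMillis * 2, 100);
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulated replica whose searchable doc count catches up with the primary over time.
        final long expectedDocs = 10;
        final AtomicLong replicaDocs = new AtomicLong();
        Thread replication = new Thread(() -> {
            for (int i = 0; i < expectedDocs; i++) {
                replicaDocs.incrementAndGet();
                try {
                    Thread.sleep(5);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });
        replication.start();

        // Wait on the doc count itself, and report expected and actual counts on failure.
        assertEventually(() -> {
            long actual = replicaDocs.get();
            if (actual != expectedDocs) {
                throw new AssertionError("Expected " + expectedDocs + " docs but found " + actual);
            }
        }, Duration.ofSeconds(5));
        replication.join();
        System.out.println("replica caught up");
    }
}
```

A point-in-time check would fail whenever it ran before the last document became visible; the retrying form only fails if the condition never holds within the timeout.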
I've run the tests in this file about 1k times locally without seeing any issues. I will open this PR and run the checks a few times to test from CI.
Issues Resolved
#5669 - note that not all tests in this issue will be resolved by this change, only those with doc count mismatches, particularly testReplicationAfterPrimaryRefreshAndFlush.
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.