-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixing flaky test testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness #3646
Conversation
…onIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness by adding dedicated cluster manager node Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
✅ Gradle Check success 9b30f1a711d9a06dfa96996513124b21a60f263c |
…lueAndLoadAwareness Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
Ran the test 100 times now. Succeeds every time.
|
Seems like both gradle check failures are from a flaky test - #3650 Refiring. |
start gradle check |
Thank you @imRishN for this PR. Appreciate for taking time in fixing this flaky test. Previously it has been observed that a flaky test rarely fail when run in isolation as single test. I suspect the test will still pass without your fix. Running entire gradle check will provide a better picture as it is what running on CI today. Can you give it a try ? |
@@ -364,18 +364,22 @@ public void testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness() throws E | |||
.put("cluster.routing.allocation.awareness.force.zone.values", "a,b,c") | |||
.put("cluster.routing.allocation.load_awareness.skew_factor", "0.0") | |||
.put("cluster.routing.allocation.load_awareness.provisioned_capacity", Integer.toString(nodeCountPerAZ * 3)) | |||
.put("cluster.routing.allocation.allow_rebalance", "indices_primaries_active") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@imRishN From test failure trace (MasterNotDiscoveredException), it is not clear if it is an actual issue or a flaky one. Can you explain how existing test is identified as flaky and changes here fixes it ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test here now adds a dedicated cluster manager node where as previously there was no dedicated cluster manager setup and the test was randomly killing half the nodes in a particular zone. I assume MasterNotDiscoveredException
was coming when a node that was stopped was an active master that time and hence the exception was thrown sometimes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @imRishN for the clarification.
The build passes locally |
start gradle check |
Test (flaky) failure. Tracked in #3579
|
start gradle check |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
@@ -364,18 +364,22 @@ public void testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness() throws E | |||
.put("cluster.routing.allocation.awareness.force.zone.values", "a,b,c") | |||
.put("cluster.routing.allocation.load_awareness.skew_factor", "0.0") | |||
.put("cluster.routing.allocation.load_awareness.provisioned_capacity", Integer.toString(nodeCountPerAZ * 3)) | |||
.put("cluster.routing.allocation.allow_rebalance", "indices_primaries_active") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @imRishN for the clarification.
nodeCountPerAZ, | ||
Settings.builder().put(commonSettings).put("node.attr.zone", "a").build() | ||
); | ||
List<String> nodes_in_zone_b = internalCluster().startNodes( | ||
List<String> nodes_in_zone_b = internalCluster().startDataOnlyNodes( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Looks like nodes_in_zone_b
is not used after declaration. In that case, it can be removed.
nodeCountPerAZ, | ||
Settings.builder().put(commonSettings).put("node.attr.zone", "b").build() | ||
); | ||
List<String> nodes_in_zone_c = internalCluster().startNodes( | ||
List<String> nodes_in_zone_c = internalCluster().startDataOnlyNodes( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above
…reness (#3646) * Fixing flaky test org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness by adding dedicated cluster manager node Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
Signed-off-by: Rishab Nahata rnnahata@amazon.com
Description
Caused by #3563
Issues Resolved
#3603
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.