[BUG] failing IT test : SegmentReplicationRelocationIT #6065
Comments
Another occurrence - https://build.ci.opensearch.org/job/gradle-check/10155/
Other tests inside SegmentReplicationRelocationIT are also flaky; muting them all. The test does not fail locally with the provided seed regardless of iteration count (passes 1000/1000). On CI the test fails while waiting for the cluster health response because of a pending operation, which suggests some load on the cluster.
Next Steps:
@dreamer-89 I took a look at the cluster health wait in the test:
clusterHealthResponse = client().admin()
.cluster()
.prepareHealth()
.setWaitForEvents(Priority.LANGUID)
.setWaitForNoRelocatingShards(true)
.setTimeout(ACCEPTABLE_RELOCATION_TIME)
.execute()
.actionGet();
assertEquals(clusterHealthResponse.isTimedOut(), false);

The assertion fails with the following trace:
The test is using WAIT_UNTIL for every index request, causing the old primary to have pending operations waiting on a refresh, which blocks relocation. The issue is that the test uses WAIT_UNTIL but also disables auto refresh in the index settings; a minimal sketch of this combination is shown below.
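A minimal sketch of that combination, assuming the usual OpenSearch integration-test client and that auto refresh is disabled via the index refresh interval setting (the exact setting used by the test is not shown above):

```java
// Sketch only: auto refresh disabled plus a WAIT_UNTIL indexing request. The request
// holds its operation permit until a refresh makes the doc visible, but no refresh is
// scheduled automatically, so relocation stalls waiting for the permit to be released.
client().admin().indices().prepareCreate("test")
    .setSettings(Settings.builder().put("index.refresh_interval", -1))   // assumed setting
    .get();

client().prepareIndex("test")
    .setId("1")
    .setSource("field", "value")
    .setRefreshPolicy(WriteRequest.RefreshPolicy.WAIT_UNTIL)   // waits for the doc to become visible
    .get();
```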
The second way this fails: I chose to remove the index setting and refresh policy entirely; after that update the test still fails with a timeout, but with a different exception. This error is thrown after the new primary has done a round of segrep to sync with the old primary, flipped to a writeable engine, and recovered any missing ops from the xlog. The issue here is that the old primary ack'd 511 docs, meaning everything up to 511 is persisted in the xlog of the new primary. However, the global checkpoint is still 510, and that is what is used as the upper bound for the number of docs to recover once the new primary is flipped. Added some more detailed logs to show this...
The first line updates the old primary's replicationTracker to 511 for the new primary's allocationId; however, once the new primary is flipped to a writeable engine, its processed checkpoint is only 510. After the engine is flipped, the new primary's checkpoint tracker resets its processed and persisted seqNos to the local checkpoint, in this case 510. Fix: there are a couple of options here. I suggest we go with option A - in the interest of avoiding reindexing at all costs, we flush the new segments and copy them over. We could go with option B as well as a safety net, or at a minimum add an assertion that the maxSeqNo is equal to the latest seqNo in the replication checkpoint. The third way this fails: once the new primary is active and has recovered ops from the xlog, it does not publish a new checkpoint automatically, because at the time of refresh the shard is not in a valid state. This causes the replica to not sync and to be stale in the assertion. Logs (node_t2 is the new primary, node_t0 the old):
Fix options: IMO we need to sync the replica with option A rather than depending on some future event to occur. A hedged illustration of proactively syncing is sketched below.
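The actual fix options referenced above are not captured here, so purely as an illustration of "sync proactively rather than wait for a future event", something along these lines could be sketched; the helper name and the exact hook point are hypothetical, not the real OpenSearch implementation:

```java
// Illustration only: after the relocated primary finishes replaying xlog ops,
// refresh it and push the resulting checkpoint so replicas start a replication
// round right away instead of staying stale until some future refresh.
void syncReplicasAfterPrimaryActivation(IndexShard newPrimary) {
    newPrimary.refresh("post-relocation-sync");       // fold recovered ops into the latest segments
    publishLatestReplicationCheckpoint(newPrimary);   // hypothetical helper wrapping the publish transport action
}
```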
Ok so after making all of these fixes, we still have a problem. During relocation we block operations on the primary, which waits until all outstanding permits are released. If one of those permits is from a WAIT_UNTIL request, the relocation hangs indefinitely. This happens because wait_until does not complete until replicas have refreshed on a particular translog location; replicas with SR do not refresh at all until receiving a new set of segments, but at this point the primary is not refreshing and cannot forcefully publish a checkpoint, because publishCheckpoint requires an op permit on the primary. The easy way to fix this is to implement a maybeRefresh method in NRTEngine so a refresh at least triggers its RefreshListeners; however, this breaks the meaning of wait_until, because it would then only wait until the ops are durable, not visible. I think the only way around this currently is to flip to a polling architecture instead of pushing/publishing checkpoints to replicas (a minimal sketch of the idea is below). I was able to get this going fairly quickly and these tests pass 100% of the time for me now... will put up a draft asap...
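A minimal sketch of that polling idea, assuming it runs on the replica's thread pool; fetchLatestCheckpointFromPrimary, replicationInProgress, and startReplication are hypothetical stand-ins for whatever the draft actually wires up:

```java
// Sketch: the replica periodically asks its primary for the latest checkpoint and
// pulls segments when it is behind, instead of the primary pushing checkpoints.
threadPool.scheduleWithFixedDelay(() -> {
    ReplicationCheckpoint primaryCheckpoint = fetchLatestCheckpointFromPrimary();   // hypothetical transport call
    if (primaryCheckpoint.isAheadOf(replicaShard.getLatestReplicationCheckpoint())
        && replicationInProgress() == false) {
        startReplication(primaryCheckpoint);   // hypothetical: kicks off a segment copy round
    }
}, TimeValue.timeValueSeconds(1), ThreadPool.Names.GENERIC);
```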
@ashking94 FYI - #6315
While polling fixes the WAIT_UNTIL issue above, it is a somewhat larger change with its own considerations, namely how to reduce traffic to primaries that are not actively indexing. I think polling could be the right solution going forward, but to allow time to vet that thoroughly, a simpler change is in #6366. That change simply avoids attempting to acquire a lock on the primary by not sending a publishCheckpoint request to itself, allowing replicas to receive checkpoints and proceed with running a round of replication. To summarize the issues above: we will still need to cover edge cases where publish requests fail. I am thinking a good solution is actually to use polling on the replica, but only trigger requests if we recognize the shard is behind (we will know this based on the xlog noOp write) and we are not actively copying segments. We should also add retries + jitter to the publish request; a rough sketch of that is below.
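For the retries-plus-jitter piece, a hedged sketch of what that could look like around the publish request; sendPublishCheckpointRequest is a hypothetical placeholder for the actual transport call:

```java
// Sketch: exponential backoff with random jitter so failed publish-checkpoint
// requests are retried without every sender retrying in lockstep.
void publishWithRetry(ReplicationCheckpoint checkpoint) throws Exception {
    final int maxAttempts = 5;
    final long baseDelayMs = 100;
    final java.util.Random random = new java.util.Random();
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            sendPublishCheckpointRequest(checkpoint);   // hypothetical transport call
            return;
        } catch (Exception e) {
            if (attempt == maxAttempts) {
                throw e;                                // out of attempts, surface the failure
            }
            long backoff = baseDelayMs * (1L << (attempt - 1));
            long jitter = random.nextInt((int) baseDelayMs);
            Thread.sleep(backoff + jitter);             // wait before the next attempt
        }
    }
}
```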
Describe the bug
https://build.ci.opensearch.org/job/gradle-check/10147/
To Reproduce
Steps to reproduce the behavior:
Expected behavior
A clear and concise description of what you expected to happen.
Plugins
Please list all plugins currently enabled.
Screenshots
If applicable, add screenshots to help explain your problem.
Host/Environment (please complete the following information):
Additional context
Add any other context about the problem here.