FsBlobStoreRepositoryIT#testSnapshotRestore fails reproducibly #39299
Pinging @elastic/es-distributed
Also reproduces on
It seems that we are not correctly flushing all completed operations before taking the snapshot. The offending shard flushes with one operation still in-flight (NB local checkpoint is 66, max seqno is 67):
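As a quick illustration of the invariant at play here (plain Java, not actual Elasticsearch code): a flushed commit only contains a complete history once the local checkpoint has caught up with the max seq_no, so a local checkpoint of 66 against a max seq_no of 67 means exactly one operation is missing from the commit.

```java
// Toy illustration only; not Elasticsearch code, just the seq_no invariant described above.
final class SeqNoInvariantDemo {
    /** A commit has a complete history only when no assigned seq_no is still in flight. */
    static boolean commitHasFullHistory(long localCheckpoint, long maxSeqNo) {
        return localCheckpoint == maxSeqNo;
    }

    public static void main(String[] args) {
        // The situation in the log above: local checkpoint 66, max seq_no 67.
        System.out.println(commitHasFullHistory(66, 67)); // false -> one op still in flight
        System.out.println(commitHasFullHistory(67, 67)); // true  -> safe to snapshot
    }
}
```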
Then a short while later we contemplate doing another flush prior to the snapshot but decide to do nothing:
Once the snapshot completes we delete the index and restore it:
We should be filling in the gap with a `NoOp`.
I am looking harder at the gap-filling logic now.
@dnhatn I suspect that the issue with gap-filling lies here: when restoring from a snapshot, this advances the local checkpoint but does not put corresponding entries in the translog, so by the time we get to
@DaveCTurner I was about to ping you. That's the root cause indeed. Great find!
This is an unreleased bug that relates to #38237 and #38904, and it only affects snapshot/restore with soft-deletes enabled. Since #38237, we initialize the local_checkpoint of the restoring shard with the local_checkpoint of the restoring commit (previously we assigned the max_seq_no to the local_checkpoint). This change exposes that refilling the `LocalCheckpointTracker` does not play well with `fillSeqNoGaps`. Suppose the restoring commit consists of seq-0, seq-2, and seq-3 (seq-1 is not in the commit); then `fillSeqNoGaps` will add a `NoOp` for seq-1 only, and once seq-1 is filled the local checkpoint jumps to 3. A peer recovery will then fail because we don't have enough history in the translog. This bug is very similar to #39000: if we remove the sequence-number range check in peer recovery, this issue would be resolved. Moreover, to maintain the safe-commit assumption, I think we need to flush after `fillSeqNoGaps`. /cc @ywelsch @jasontedor
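To make that concrete, here is a small self-contained sketch of the scenario described above (a toy model only; the real `LocalCheckpointTracker`, `fillSeqNoGaps`, and translog are of course more involved):

```java
// Toy model of the seq-0/2/3 example above; not the real Elasticsearch implementation.
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class GapFillDemo {
    public static void main(String[] args) {
        long localCheckpoint = 0;                                     // commit's local checkpoint (seq-1 missing)
        long maxSeqNo = 3;
        TreeSet<Long> processed = new TreeSet<>(List.of(0L, 2L, 3L)); // ops present in the restoring commit
        List<String> translog = new ArrayList<>();                    // restored shard starts with an empty translog

        // "fillSeqNoGaps": write a NoOp for every seq_no up to maxSeqNo that is not yet processed.
        for (long seq = localCheckpoint + 1; seq <= maxSeqNo; seq++) {
            if (processed.add(seq)) {
                translog.add("NoOp#" + seq);                          // only seq-1 gets a NoOp
            }
        }
        // Advance the checkpoint over the now-contiguous processed set.
        while (processed.contains(localCheckpoint + 1)) {
            localCheckpoint++;
        }

        System.out.println("local checkpoint = " + localCheckpoint);  // 3
        System.out.println("translog = " + translog);                 // [NoOp#1]
        // Ops 2 and 3 are absent from the translog, so a peer recovery that needs the
        // full history [0..3] from the translog cannot be served.
    }
}
```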
This issue is resolved by #39006.
I've seen this test fail in apparently the same way, see https://gradle-enterprise.elastic.co/s/j7gznmtonko42. I will reopen; let me know if you prefer a new issue for this.
This seems like a different issue:
Somehow we entered a green state and 5s later timed out waiting for that green state. My suspicion is that this was just a randomly very slow CI run, but I'll investigate a little to see if we can make this wait more resilient :)
It seems this test only fails with `FsRepository` and mostly just barely times out (it takes just a little over 30s to go green). I think just increasing the timeout should be fine as a fix here, since it is generally interesting for this test to check larger amounts of data. Closes #39299
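For reference, a minimal sketch of what the timeout bump could look like, assuming the test extends `ESIntegTestCase` and can use its `ensureGreen(TimeValue, String...)` overload; the index name and the 60s value are illustrative, not taken from the actual fix:

```java
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.test.ESIntegTestCase;

public class SnapshotRestoreTimeoutSketch extends ESIntegTestCase {
    public void testSnapshotAndRestore() {
        // ... create the repository, snapshot the index, delete it, and restore it ...

        // Wait longer than the default 30s: with FsRepository the restore occasionally
        // takes just over 30s before the cluster reaches green health.
        ensureGreen(TimeValue.timeValueSeconds(60), "test-idx");
    }
}
```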
... for some value of the word "reproducibly". After 180 iterations of this command:
I reproduced the failure that occurred here: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+internalClusterTest/989/console. However, I ran over 1000 iterations of just that one test (same command line plus `-Dtests.method=testSnapshotAndRestore`) without a single failure. The presenting complaint is that the cluster failed to get to green health:
This in turn is because peer recoveries were persistently failing:
Here are the full logs from the failure, including some `TRACE`-level ones: fail.log.gz

/cc @benwtrent - I couldn't find another issue about this, but please correct me if I missed it.