
FsBlobStoreRepositoryIT#testSnapshotRestore fails reproducibly #39299

Closed
DaveCTurner opened this issue Feb 22, 2019 · 9 comments · Fixed by #60532
Labels: blocker, :Distributed Coordination/Snapshot/Restore, >test-failure, v7.2.0, v8.0.0-alpha1

Comments

@DaveCTurner (Contributor)

... for some value of the word "reproducibly". After 180 iterations of this command:

./gradlew :server:integTest -Dtests.seed=47534FB84D89BB2E -Dtests.class=org.elasticsearch.repositories.fs.FsBlobStoreRepositoryIT -Dtests.security.manager=true -Dtests.locale=et-EE -Dtests.timezone=Africa/Dar_es_Salaam -Dcompiler.java=11 -Druntime.java=8

I reproduced the failure that occurred here: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+internalClusterTest/989/console. However, I ran over 1000 iterations of just that one test (same command line plus -Dtests.method=testSnapshotAndRestore) without a single failure.

The presenting complaint is that the cluster failed to get to green health:

FAILURE 32.7s | FsBlobStoreRepositoryIT.testSnapshotAndRestore <<< FAILURES!
   > Throwable #1: java.lang.AssertionError: timed out waiting for green state
   >    at __randomizedtesting.SeedInfo.seed([47534FB84D89BB2E:87342CCB90464619]:0)
   >    at org.elasticsearch.test.ESIntegTestCase.ensureColor(ESIntegTestCase.java:975)
   >    at org.elasticsearch.test.ESIntegTestCase.ensureGreen(ESIntegTestCase.java:931)
   >    at org.elasticsearch.test.ESIntegTestCase.ensureGreen(ESIntegTestCase.java:920)
   >    at org.elasticsearch.repositories.blobstore.ESBlobStoreRepositoryIntegTestCase.testSnapshotAndRestore(ESBlobStoreRepositoryIntegTestCase.java:136)
   >    at java.lang.Thread.run(Thread.java:748)

This in turn is because peer recoveries were persistently failing:

  1> [2019-02-22T07:40:30,000][WARN ][o.e.i.c.IndicesClusterStateService] [node_s1] [[efr][0]] marking and sending shard failed due to [failed recovery]
  1> org.elasticsearch.indices.recovery.RecoveryFailedException: [efr][0]: Recovery failed from {node_s0}{VofCxu6oSmuljC2VrCOgWw}{3PNCN7QaTiS_dJ3Ev-Qyaw}{127.0.0.1}{127.0.0.1:35411} into {node_s1}{aJuFo5EUSCynoCzNS60dnA}{E_NgEcNuR6O-skD7Aaar5Q}{127.0.0.1}{127.0.0.1:34247}
  1>    at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.lambda$doRecovery$2(PeerRecoveryTargetService.java:253) ~[main/:?]
  1>    at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$1.handleException(PeerRecoveryTargetService.java:298) ~[main/:?]
  1>    at org.elasticsearch.transport.PlainTransportFuture.handleException(PlainTransportFuture.java:97) ~[main/:?]
  1>    at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1118) ~[main/:?]
  1>    at org.elasticsearch.transport.TcpTransport.lambda$handleException$24(TcpTransport.java:1001) ~[main/:?]
  1>    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [main/:?]
  1>    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
  1>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
  1>    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
  1> Caused by: org.elasticsearch.transport.RemoteTransportException: [node_s0][127.0.0.1:35411][internal:index/shard/recovery/start_recovery]
  1> Caused by: java.lang.IllegalStateException: translog replay failed to cover required sequence numbers (required range [67:67). first missing op is [67]
  1>    at org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$phase2$25(RecoverySourceHandler.java:592) ~[main/:?]
  1>    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61) ~[main/:?]
  1>    at org.elasticsearch.indices.recovery.RecoverySourceHandler.sendBatch(RecoverySourceHandler.java:648) ~[main/:?]
  1>    at org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$sendBatch$26(RecoverySourceHandler.java:634) ~[main/:?]
  1>    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61) ~[main/:?]
  1>    at org.elasticsearch.indices.recovery.RemoteRecoveryTargetHandler.lambda$indexTranslogOperations$4(RemoteRecoveryTargetHandler.java:133) ~[main/:?]
  1>    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61) ~[main/:?]
  1>    at org.elasticsearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:54) ~[main/:?]
  1>    at org.elasticsearch.transport.PlainTransportFuture.handleResponse(PlainTransportFuture.java:87) ~[main/:?]
  1>    at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1108) ~[main/:?]
  1>    at org.elasticsearch.transport.TcpTransport$1.doRun(TcpTransport.java:975) ~[main/:?]
  1>    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) ~[main/:?]
  1>    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[main/:?]
  1>    ... 3 more

Here are the full logs from the failure, including some TRACE-level ones: fail.log.gz

/cc @benwtrent - I couldn't find another issue about this but please correct me if I missed it.
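
For context on the error above: phase2 of a peer recovery replays translog operations from the source to the target and then checks that the replayed operations cover the required sequence-number range. Here is a minimal, self-contained toy model of that check (my own sketch, not the RecoverySourceHandler code) showing why sending a batch of 0 ops with a required range of [67:67] must fail:

```java
import java.util.Collections;
import java.util.List;

// Toy model of the "translog replay failed to cover required sequence numbers"
// failure: every seq no in [requiredStart, requiredEnd] must appear among the
// replayed operations, otherwise the recovery is aborted.
public class RequiredRangeCheckSketch {
    static void checkRequiredRange(List<Long> replayedSeqNos, long requiredStart, long requiredEnd) {
        for (long seqNo = requiredStart; seqNo <= requiredEnd; seqNo++) {
            if (!replayedSeqNos.contains(seqNo)) {
                throw new IllegalStateException("translog replay failed to cover required sequence numbers"
                    + " (required range [" + requiredStart + ":" + requiredEnd + "]). first missing op is [" + seqNo + "]");
            }
        }
    }

    public static void main(String[] args) {
        // In the failing recovery, phase2 sent a batch of 0 ops while seq no 67 was required.
        checkRequiredRange(Collections.emptyList(), 67, 67);
    }
}
```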

@DaveCTurner added the :Distributed Coordination/Snapshot/Restore, >test-failure, and v7.2.0 labels on Feb 22, 2019
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed

@original-brownbear (Member)

Also reproduces on master :)

@DaveCTurner (Contributor, Author)

It seems that we are not correctly flushing all completed operations before taking the snapshot. The offending shard flushes with one operation still in-flight (NB local checkpoint is 66, max seqno is 67):

  1> [2019-02-22T17:39:09,611][TRACE][o.e.i.e.Engine           ] [node_s1] [efr][0] committing writer with commit data [{local_checkpoint=66, max_unsafe_auto_id_timestamp=-1, translog_uuid=f9ujzyJ6QAKvJkRhG8R85w, min_retained_seq_no=0, history_uuid=ZVFJ1nDmSJuqoKqG2J_N-w, translog_generation=3, max_seq_no=67}]

Then a short while later we contemplate doing another flush prior to the snapshot but decide to do nothing:

  1> [2019-02-22T17:39:10,486][TRACE][o.e.i.e.Engine           ] [node_s1] [efr][0] start flush for snapshot
  1> [2019-02-22T17:39:10,487][TRACE][o.e.i.e.Engine           ] [node_s1] [efr][0] acquired flush lock immediately
  1> [2019-02-22T17:39:10,487][TRACE][o.e.i.e.Engine           ] [node_s1] [efr][0] finish flush for snapshot

Once the snapshot completes we delete the index and restore it:

  1> [2019-02-22T17:39:11,837][TRACE][o.e.i.e.Engine           ] [node_s0] [efr][0] recovered maximum sequence number [67] and local checkpoint [66]

We should be filling in the gap with a NoOp that gets recorded to the translog, but it seems that we do not:

  1> [2019-02-22T17:39:12,291][TRACE][o.e.i.r.RecoverySourceHandler] [node_s0] [efr][0][recover to node_s1] recovery [phase2]: sending batch of 0 ops

I am looking harder at the gap-filling logic now.
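
To make the expectation above concrete, here is a minimal, self-contained sketch (my own model of the intended behaviour, not the actual engine code): with a restored local checkpoint of 66 and a max seq no of 67, gap-filling should write a NoOp for seq no 67 into the translog, which is exactly the operation phase2 would then have available to send.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the behaviour described above: fill every seq no between the
// local checkpoint and the max seq no with a NoOp that is also recorded in the
// translog, so that phase2 of a peer recovery has something to send.
public class ExpectedGapFillSketch {
    public static void main(String[] args) {
        long localCheckpoint = 66;  // from the restored commit's user data
        long maxSeqNo = 67;         // from the restored commit's user data
        List<String> translog = new ArrayList<>();

        for (long seqNo = localCheckpoint + 1; seqNo <= maxSeqNo; seqNo++) {
            translog.add("NoOp[" + seqNo + "]"); // the gap-filling NoOp we expect to see
            localCheckpoint = seqNo;             // checkpoint advances as the gap is filled
        }

        System.out.println("local checkpoint = " + localCheckpoint);            // 67
        System.out.println("translog ops available for phase2 = " + translog);  // [NoOp[67]]
    }
}
```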

@DaveCTurner (Contributor, Author)

@dnhatn I suspect that the issue with gap-filling lies here:

https://github.com/elastic/elasticsearch/blob/e00e7003fcf75a5d61ea02320c84fd366afd5535/server/src/main/java/org/elasticsearch/index/engine/InternalEngine.java#L259-L264

When restoring from a snapshot, this advances the local checkpoint but does not put corresponding entries in the translog, so by the time we get to fillSeqNoGaps() there don't seem to be any gaps to fill. Below is a repository containing a snapshot that exhibits the problem. It's possible to restore its primaries (really the efr index is all you need) but peer recoveries from those primaries will fail. Could you take a look?

issue-39299-repo.tar.gz
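
To illustrate the suspected failure mode in the lines linked above (again a toy model of my own, not the InternalEngine code): if the tracker's checkpoint is already advanced to the max seq no when it is restored, the same gap-filling loop finds nothing to do and the translog stays empty.

```java
import java.util.ArrayList;
import java.util.List;

// Same toy loop as before, but with the tracker's checkpoint already advanced
// to max_seq_no during restore: no gap is detected, so no NoOp reaches the translog.
public class SuspectedBugSketch {
    public static void main(String[] args) {
        long maxSeqNo = 67;
        long trackerCheckpoint = maxSeqNo; // suspected: advanced on restore without translog writes
        List<String> translog = new ArrayList<>();

        for (long seqNo = trackerCheckpoint + 1; seqNo <= maxSeqNo; seqNo++) {
            translog.add("NoOp[" + seqNo + "]"); // never executed
        }

        System.out.println("translog ops available for phase2 = " + translog); // []
    }
}
```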

@DaveCTurner removed their assignment on Feb 22, 2019
@dnhatn self-assigned this and unassigned original-brownbear on Feb 22, 2019
@dnhatn (Member) commented Feb 22, 2019

@DaveCTurner I was about to ping you. That's the root cause indeed. Great find!

@dnhatn added the blocker label on Feb 22, 2019
@dnhatn (Member) commented Feb 22, 2019

This is an unreleased bug related to #38237 and #38904. It only affects snapshot/restore with soft-deletes enabled.

Since #38237, we initialize the local_checkpoint of the restoring shard from the local_checkpoint of the restoring commit (previously we assigned it the max_seq_no). This change exposes that refilling the LocalCheckpointTracker does not play well with fillSeqNoGaps. Suppose the restoring commit consists of seq-0, seq-2, and seq-3 (seq-1 is not in the commit): fillSeqNoGaps will add a NoOp only for seq-1, and once seq-1 is filled the local checkpoint jumps to 3. A peer recovery will then fail because the translog does not contain enough history (there are no translog entries for seq-2 and seq-3). This bug is very similar to #39000. If we removed the sequence number range check in peer recovery, this issue would be resolved. Moreover, to maintain the safe-commit assumption, I think we need to flush after fillSeqNoGaps.
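
A worked version of the seq-0/seq-2/seq-3 example as a small standalone model (my own toy code, not the actual LocalCheckpointTracker or fillSeqNoGaps implementation):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy model of the scenario above: the restoring commit holds seq nos 0, 2 and 3
// (seq no 1 is missing), so the commit's local checkpoint is 0.
public class FillSeqNoGapsExample {
    public static void main(String[] args) {
        Set<Long> completed = new TreeSet<>(Arrays.asList(0L, 2L, 3L)); // refilled from the commit
        long localCheckpoint = 0;  // since #38237: taken from the commit, not from max_seq_no
        long maxSeqNo = 3;
        List<String> translog = new ArrayList<>();

        // fillSeqNoGaps writes a NoOp only for seq nos not yet completed: here just seq no 1.
        for (long seqNo = localCheckpoint + 1; seqNo <= maxSeqNo; seqNo++) {
            if (!completed.contains(seqNo)) {
                translog.add("NoOp[" + seqNo + "]");
                completed.add(seqNo);
            }
        }
        // Once seq no 1 is completed, the checkpoint advances contiguously to 3 ...
        while (completed.contains(localCheckpoint + 1)) {
            localCheckpoint++;
        }

        // ... but the translog holds only the NoOp for 1, so a peer recovery that must
        // replay ops up to 3 from the translog cannot find seq nos 2 and 3.
        System.out.println("local checkpoint = " + localCheckpoint); // 3
        System.out.println("translog ops = " + translog);            // [NoOp[1]]
    }
}
```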

/cc @ywelsch @jasontedor

@dnhatn (Member) commented Feb 25, 2019

This issue is resolved by #39006.

@dnhatn closed this as completed on Feb 25, 2019
@javanna (Member) commented Jul 31, 2020

I've seen this test fail in apparently the same way, see https://gradle-enterprise.elastic.co/s/j7gznmtonko42. I will reopen this issue; let me know if you would prefer a new one for this.

@javanna reopened this on Jul 31, 2020
@original-brownbear (Member)

This seems like a different issue:

  1> [2020-07-31T04:23:01,375][INFO ][o.e.c.r.a.AllocationService] [node_sm0] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[zxapvwy][4]]]).
  1> [2020-07-31T04:23:06,653][INFO ][o.e.r.f.FsBlobStoreRepositoryIT] [testSnapshotAndRestore] ensureGreen timed out, cluster state:

Somehow we entered a green state and then, about 5 seconds later, timed out waiting for that same green state. My suspicion is that this was just a randomly very slow CI run, but I'll investigate a little to see if we can make this wait more resilient :)
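
As a sketch of the "more resilient wait" idea (a generic, self-contained example, not the ESIntegTestCase API; healthIsGreen is a hypothetical stand-in for the real cluster-health check): poll with a configurable overall timeout rather than relying on a single fixed 30s cut-off.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.BooleanSupplier;

// Generic polling wait with a configurable overall timeout.
public class ResilientWaitSketch {
    static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration pollInterval)
            throws InterruptedException {
        Instant deadline = Instant.now().plus(timeout);
        while (Instant.now().isBefore(deadline)) {
            if (condition.getAsBoolean()) {
                return true; // condition reached within the (larger) timeout
            }
            Thread.sleep(pollInterval.toMillis());
        }
        return condition.getAsBoolean(); // one final check at the deadline
    }

    public static void main(String[] args) throws InterruptedException {
        BooleanSupplier healthIsGreen = () -> true; // hypothetical stand-in for the cluster health call
        boolean green = waitUntil(healthIsGreen, Duration.ofSeconds(60), Duration.ofMillis(500));
        System.out.println("green within timeout: " + green);
    }
}
```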

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue on Jul 31, 2020:

It seems this test only fails with `FsRepository` and mostly just barely times out (it takes just a little over 30s to go green). I think just increasing the timeout should be fine as a fix here, as it's a little interesting to check larger amounts of data in this test generally speaking.

Closes elastic#39299