
FsBlobStoreRepositoryIT#testSnapshotRestore fails reproducibly #39299

Closed
DaveCTurner opened this issue Feb 22, 2019 · 9 comments · Fixed by #60532
Labels: blocker, :Distributed Coordination/Snapshot/Restore, >test-failure, v7.2.0, v8.0.0-alpha1

Comments

@DaveCTurner (Contributor)

... for some value of the word "reproducibly". After 180 iterations of this command:

./gradlew :server:integTest -Dtests.seed=47534FB84D89BB2E -Dtests.class=org.elasticsearch.repositories.fs.FsBlobStoreRepositoryIT -Dtests.security.manager=true -Dtests.locale=et-EE -Dtests.timezone=Africa/Dar_es_Salaam -Dcompiler.java=11 -Druntime.java=8

I reproduced the failure that occurred here: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+internalClusterTest/989/console. However, I ran over 1000 iterations of just that one test (same command line plus -Dtests.method=testSnapshotAndRestore) without a single failure.

The presenting complaint is that the cluster failed to get to green health:

FAILURE 32.7s | FsBlobStoreRepositoryIT.testSnapshotAndRestore <<< FAILURES!
   > Throwable #1: java.lang.AssertionError: timed out waiting for green state
   >    at __randomizedtesting.SeedInfo.seed([47534FB84D89BB2E:87342CCB90464619]:0)
   >    at org.elasticsearch.test.ESIntegTestCase.ensureColor(ESIntegTestCase.java:975)
   >    at org.elasticsearch.test.ESIntegTestCase.ensureGreen(ESIntegTestCase.java:931)
   >    at org.elasticsearch.test.ESIntegTestCase.ensureGreen(ESIntegTestCase.java:920)
   >    at org.elasticsearch.repositories.blobstore.ESBlobStoreRepositoryIntegTestCase.testSnapshotAndRestore(ESBlobStoreRepositoryIntegTestCase.java:136)
   >    at java.lang.Thread.run(Thread.java:748)

This in turn is because peer recoveries were persistently failing:

  1> [2019-02-22T07:40:30,000][WARN ][o.e.i.c.IndicesClusterStateService] [node_s1] [[efr][0]] marking and sending shard failed due to [failed recovery]
  1> org.elasticsearch.indices.recovery.RecoveryFailedException: [efr][0]: Recovery failed from {node_s0}{VofCxu6oSmuljC2VrCOgWw}{3PNCN7QaTiS_dJ3Ev-Qyaw}{127.0.0.1}{127.0.0.1:35411} into {node_s1}{aJuFo5EUSCynoCzNS60dnA}{E_NgEcNuR6O-skD7Aaar5Q}{127.0.0.1}{127.0.0.1:34247}
  1>    at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.lambda$doRecovery$2(PeerRecoveryTargetService.java:253) ~[main/:?]
  1>    at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$1.handleException(PeerRecoveryTargetService.java:298) ~[main/:?]
  1>    at org.elasticsearch.transport.PlainTransportFuture.handleException(PlainTransportFuture.java:97) ~[main/:?]
  1>    at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1118) ~[main/:?]
  1>    at org.elasticsearch.transport.TcpTransport.lambda$handleException$24(TcpTransport.java:1001) ~[main/:?]
  1>    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [main/:?]
  1>    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
  1>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
  1>    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
  1> Caused by: org.elasticsearch.transport.RemoteTransportException: [node_s0][127.0.0.1:35411][internal:index/shard/recovery/start_recovery]
  1> Caused by: java.lang.IllegalStateException: translog replay failed to cover required sequence numbers (required range [67:67). first missing op is [67]
  1>    at org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$phase2$25(RecoverySourceHandler.java:592) ~[main/:?]
  1>    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61) ~[main/:?]
  1>    at org.elasticsearch.indices.recovery.RecoverySourceHandler.sendBatch(RecoverySourceHandler.java:648) ~[main/:?]
  1>    at org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$sendBatch$26(RecoverySourceHandler.java:634) ~[main/:?]
  1>    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61) ~[main/:?]
  1>    at org.elasticsearch.indices.recovery.RemoteRecoveryTargetHandler.lambda$indexTranslogOperations$4(RemoteRecoveryTargetHandler.java:133) ~[main/:?]
  1>    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61) ~[main/:?]
  1>    at org.elasticsearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:54) ~[main/:?]
  1>    at org.elasticsearch.transport.PlainTransportFuture.handleResponse(PlainTransportFuture.java:87) ~[main/:?]
  1>    at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1108) ~[main/:?]
  1>    at org.elasticsearch.transport.TcpTransport$1.doRun(TcpTransport.java:975) ~[main/:?]
  1>    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) ~[main/:?]
  1>    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[main/:?]
  1>    ... 3 more

Here are the full logs from the failure, including some TRACE-level ones: fail.log.gz

/cc @benwtrent - I couldn't find another issue about this but please correct me if I missed it.
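
For context on the error above: phase2 of a peer recovery replays translog operations from the source to the target and then checks that the replayed operations cover the required sequence-number range. Here is a minimal, self-contained toy model of that check (my own sketch, not the RecoverySourceHandler code) showing why sending a batch of 0 ops with a required range of [67:67] must fail:

```java
import java.util.Collections;
import java.util.List;

// Toy model of the "translog replay failed to cover required sequence numbers"
// failure: every seq no in [requiredStart, requiredEnd] must appear among the
// replayed operations, otherwise the recovery is aborted.
public class RequiredRangeCheckSketch {
    static void checkRequiredRange(List<Long> replayedSeqNos, long requiredStart, long requiredEnd) {
        for (long seqNo = requiredStart; seqNo <= requiredEnd; seqNo++) {
            if (!replayedSeqNos.contains(seqNo)) {
                throw new IllegalStateException("translog replay failed to cover required sequence numbers"
                    + " (required range [" + requiredStart + ":" + requiredEnd + "]). first missing op is [" + seqNo + "]");
            }
        }
    }

    public static void main(String[] args) {
        // In the failing recovery, phase2 sent a batch of 0 ops while seq no 67 was required.
        checkRequiredRange(Collections.emptyList(), 67, 67);
    }
}
```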

@DaveCTurner added the :Distributed Coordination/Snapshot/Restore, >test-failure, and v7.2.0 labels on Feb 22, 2019
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed

@original-brownbear (Member)

Also reproduces on master :)

@DaveCTurner (Contributor, Author)

It seems that we are not correctly flushing all completed operations before taking the snapshot. The offending shard flushes with one operation still in-flight (NB local checkpoint is 66, max seqno is 67):

  1> [2019-02-22T17:39:09,611][TRACE][o.e.i.e.Engine           ] [node_s1] [efr][0] committing writer with commit data [{local_checkpoint=66, max_unsafe_auto_id_timestamp=-1, translog_uuid=f9ujzyJ6QAKvJkRhG8R85w, min_retained_seq_no=0, history_uuid=ZVFJ1nDmSJuqoKqG2J_N-w, translog_generation=3, max_seq_no=67}]

Then a short while later we contemplate doing another flush prior to the snapshot but decide to do nothing:

  1> [2019-02-22T17:39:10,486][TRACE][o.e.i.e.Engine           ] [node_s1] [efr][0] start flush for snapshot
  1> [2019-02-22T17:39:10,487][TRACE][o.e.i.e.Engine           ] [node_s1] [efr][0] acquired flush lock immediately
  1> [2019-02-22T17:39:10,487][TRACE][o.e.i.e.Engine           ] [node_s1] [efr][0] finish flush for snapshot

Once the snapshot completes we delete the index and restore it:

  1> [2019-02-22T17:39:11,837][TRACE][o.e.i.e.Engine           ] [node_s0] [efr][0] recovered maximum sequence number [67] and local checkpoint [66]

We should be filling in the gap with a NoOp that gets recorded to the translog, but it seems that we do not:

  1> [2019-02-22T17:39:12,291][TRACE][o.e.i.r.RecoverySourceHandler] [node_s0] [efr][0][recover to node_s1] recovery [phase2]: sending batch of 0 ops

I am looking harder at the gap-filling logic now.
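
To make the expectation above concrete, here is a minimal, self-contained sketch (my own model of the intended behaviour, not the actual engine code): with a restored local checkpoint of 66 and a max seq no of 67, gap-filling should write a NoOp for seq no 67 into the translog, which is exactly the operation phase2 would then have available to send.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the behaviour described above: fill every seq no between the
// local checkpoint and the max seq no with a NoOp that is also recorded in the
// translog, so that phase2 of a peer recovery has something to send.
public class ExpectedGapFillSketch {
    public static void main(String[] args) {
        long localCheckpoint = 66;  // from the restored commit's user data
        long maxSeqNo = 67;         // from the restored commit's user data
        List<String> translog = new ArrayList<>();

        for (long seqNo = localCheckpoint + 1; seqNo <= maxSeqNo; seqNo++) {
            translog.add("NoOp[" + seqNo + "]"); // the gap-filling NoOp we expect to see
            localCheckpoint = seqNo;             // checkpoint advances as the gap is filled
        }

        System.out.println("local checkpoint = " + localCheckpoint);            // 67
        System.out.println("translog ops available for phase2 = " + translog);  // [NoOp[67]]
    }
}
```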

@DaveCTurner (Contributor, Author)

@dnhatn I suspect that the issue with gap-filling lies here:

https://github.com/elastic/elasticsearch/blob/e00e7003fcf75a5d61ea02320c84fd366afd5535/server/src/main/java/org/elasticsearch/index/engine/InternalEngine.java#L259-L264

When restoring from a snapshot, this advances the local checkpoint but does not put corresponding entries in the translog, so by the time we get to fillSeqNoGaps() there don't seem to be any gaps to fill. Below is a repository containing a snapshot that exhibits the problem. It's possible to restore its primaries (really the efr index is all you need) but peer recoveries from those primaries will fail. Could you take a look?

issue-39299-repo.tar.gz
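
To illustrate the suspected failure mode in the lines linked above (again a toy model of my own, not the InternalEngine code): if the tracker's checkpoint is already advanced to the max seq no when it is restored, the same gap-filling loop finds nothing to do and the translog stays empty.

```java
import java.util.ArrayList;
import java.util.List;

// Same toy loop as before, but with the tracker's checkpoint already advanced
// to max_seq_no during restore: no gap is detected, so no NoOp reaches the translog.
public class SuspectedBugSketch {
    public static void main(String[] args) {
        long maxSeqNo = 67;
        long trackerCheckpoint = maxSeqNo; // suspected: advanced on restore without translog writes
        List<String> translog = new ArrayList<>();

        for (long seqNo = trackerCheckpoint + 1; seqNo <= maxSeqNo; seqNo++) {
            translog.add("NoOp[" + seqNo + "]"); // never executed
        }

        System.out.println("translog ops available for phase2 = " + translog); // []
    }
}
```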

@DaveCTurner removed their assignment on Feb 22, 2019
@dnhatn self-assigned this and unassigned original-brownbear on Feb 22, 2019
@dnhatn (Member) commented Feb 22, 2019

@DaveCTurner I was about to ping you. That's the root cause indeed. Great find!

@dnhatn added the blocker label on Feb 22, 2019
@dnhatn (Member) commented Feb 22, 2019

This is an unreleased bug related to #38237 and #38904. It only affects snapshot/restore with soft-deletes enabled.

Since #38237, we initialize the local_checkpoint of the restoring shard from the local_checkpoint of the restoring commit (previously we assigned it the max_seq_no). This change exposes that refilling the LocalCheckpointTracker does not play well with fillSeqNoGaps. Suppose the restoring commit consists of seq-0, seq-2, and seq-3 (seq-1 is not in the commit): fillSeqNoGaps will add a NoOp only for seq-1, and once seq-1 is filled the local checkpoint jumps to 3. A peer recovery will then fail because the translog does not contain enough history (there are no translog entries for seq-2 and seq-3). This bug is very similar to #39000. If we removed the sequence number range check in peer recovery, this issue would be resolved. Moreover, to maintain the safe-commit assumption, I think we need to flush after fillSeqNoGaps.
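
A worked version of the seq-0/seq-2/seq-3 example as a small standalone model (my own toy code, not the actual LocalCheckpointTracker or fillSeqNoGaps implementation):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy model of the scenario above: the restoring commit holds seq nos 0, 2 and 3
// (seq no 1 is missing), so the commit's local checkpoint is 0.
public class FillSeqNoGapsExample {
    public static void main(String[] args) {
        Set<Long> completed = new TreeSet<>(Arrays.asList(0L, 2L, 3L)); // refilled from the commit
        long localCheckpoint = 0;  // since #38237: taken from the commit, not from max_seq_no
        long maxSeqNo = 3;
        List<String> translog = new ArrayList<>();

        // fillSeqNoGaps writes a NoOp only for seq nos not yet completed: here just seq no 1.
        for (long seqNo = localCheckpoint + 1; seqNo <= maxSeqNo; seqNo++) {
            if (!completed.contains(seqNo)) {
                translog.add("NoOp[" + seqNo + "]");
                completed.add(seqNo);
            }
        }
        // Once seq no 1 is completed, the checkpoint advances contiguously to 3 ...
        while (completed.contains(localCheckpoint + 1)) {
            localCheckpoint++;
        }

        // ... but the translog holds only the NoOp for 1, so a peer recovery that must
        // replay ops up to 3 from the translog cannot find seq nos 2 and 3.
        System.out.println("local checkpoint = " + localCheckpoint); // 3
        System.out.println("translog ops = " + translog);            // [NoOp[1]]
    }
}
```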

/cc @ywelsch @jasontedor

@dnhatn (Member) commented Feb 25, 2019

This issue is resolved by #39006.

@dnhatn closed this as completed on Feb 25, 2019
@javanna (Member) commented Jul 31, 2020

I've seen this test fail in apparently the same way, see https://gradle-enterprise.elastic.co/s/j7gznmtonko42. I will reopen this issue; let me know if you would prefer a new one for this.

@javanna reopened this on Jul 31, 2020
@original-brownbear (Member)

This seems like a different issue:

  1> [2020-07-31T04:23:01,375][INFO ][o.e.c.r.a.AllocationService] [node_sm0] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[zxapvwy][4]]]).
  1> [2020-07-31T04:23:06,653][INFO ][o.e.r.f.FsBlobStoreRepositoryIT] [testSnapshotAndRestore] ensureGreen timed out, cluster state:

Somehow we entered a green state and then, about 5 seconds later, timed out waiting for that same green state. My suspicion is that this was just a randomly very slow CI run, but I'll investigate a little to see if we can make this wait more resilient :)
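
As a sketch of the "more resilient wait" idea (a generic, self-contained example, not the ESIntegTestCase API; healthIsGreen is a hypothetical stand-in for the real cluster-health check): poll with a configurable overall timeout rather than relying on a single fixed 30s cut-off.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.BooleanSupplier;

// Generic polling wait with a configurable overall timeout.
public class ResilientWaitSketch {
    static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration pollInterval)
            throws InterruptedException {
        Instant deadline = Instant.now().plus(timeout);
        while (Instant.now().isBefore(deadline)) {
            if (condition.getAsBoolean()) {
                return true; // condition reached within the (larger) timeout
            }
            Thread.sleep(pollInterval.toMillis());
        }
        return condition.getAsBoolean(); // one final check at the deadline
    }

    public static void main(String[] args) throws InterruptedException {
        BooleanSupplier healthIsGreen = () -> true; // hypothetical stand-in for the cluster health call
        boolean green = waitUntil(healthIsGreen, Duration.ofSeconds(60), Duration.ofMillis(500));
        System.out.println("green within timeout: " + green);
    }
}
```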

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue on Jul 31, 2020:

It seems this test only fails with `FsRepository` and mostly just barely times out (it takes just a little over 30s to go green). I think just increasing the timeout should be fine as a fix here, as it's a little interesting to check larger amounts of data in this test generally speaking.

Closes elastic#39299