[ci] CorruptedFileIT.testReplicaCorruption #32304

andyb-elastic · 2018-07-23T23:10:13Z

Doesn't reproduce locally

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/2385/console

REPRODUCE WITH: ./gradlew :server:integTest \
  -Dtests.seed=172540A0C77C8F60 \
  -Dtests.class=org.elasticsearch.index.store.CorruptedFileIT \
  -Dtests.method="testReplicaCorruption" \
  -Dtests.security.manager=true \
  -Dtests.locale=en-MT \
  -Dtests.timezone=Chile/EasterIsland

ERROR   5.17s J2 | CorruptedFileIT.testReplicaCorruption <<< FAILURES!
   > Throwable #1: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=9856, name=elasticsearch[node_sd3][write][T#2], state=RUNNABLE, group=TGRP-CorruptedFileIT]
   > Caused by: java.lang.AssertionError: shard term already update.  op term [2], shardTerm [3]
   >    at __randomizedtesting.SeedInfo.seed([172540A0C77C8F60]:0)
   >    at org.elasticsearch.index.shard.IndexShard.lambda$acquireReplicaOperationPermit$9(IndexShard.java:2234)
   >    at org.elasticsearch.index.shard.IndexShardOperationPermits.doBlockOperations(IndexShardOperationPermits.java:173)
   >    at org.elasticsearch.index.shard.IndexShardOperationPermits.blockOperations(IndexShardOperationPermits.java:110)
   >    at org.elasticsearch.index.shard.IndexShard.acquireReplicaOperationPermit(IndexShard.java:2233)
   >    at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.doRun(TransportReplicationAction.java:616)
   >    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
   >    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:493)
   >    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:479)
   >    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63)
   >    at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1679)
   >    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723)
   >    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
   >    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   >    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   >    at java.lang.Thread.run(Thread.java:748)

elasticmachine · 2018-07-23T23:10:14Z

Pinging @elastic/es-distributed

jasontedor · 2018-07-24T03:33:05Z

Note this happened in #32118 too.

We've recently seen a number of test failures that tripped an assertion in IndexShard (see issues linked below), leading to the discovery of a race between resetting a replica when it learns about a higher term and when the same replica is promoted to primary.

This commit fixes the race by distinguishing between a cluster state primary term (called pendingPrimaryTerm) and a shard-level operation term. The former is set during the cluster state update or when a replica learns about a new primary. The latter is only incremented under the operation block, which can happen in a delayed fashion.

It also solves the issue where a replica that's still adjusting to the new term receives a cluster state update that promotes it to primary, which can happen in the situation of multiple nodes being shut down in short succession. In that case, the cluster state update thread would call `asyncBlockOperations` in `updateShardState`, which in turn would throw an exception as blocking permits is not allowed while an ongoing block is in place, subsequently failing the shard. This commit therefore extends the IndexShardOperationPermits to allow it to queue multiple blocks (which will all take precedence over operations acquiring permits).

Finally, it also moves the primary activation of the replication tracker under the operation block, so that the actual transition to primary only happens under the operation block.

Relates to #32431, #32304 and #32118
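To make the two-term idea in the description above concrete, here is a minimal toy model in Java. It is not the Elasticsearch implementation: the class `TermSketch` and the methods `learnHigherTerm` and `drainBlocks` are hypothetical names invented for this sketch, and the real code uses asynchronous permits rather than a simple queue drained by hand. The sketch only illustrates that the pending term moves immediately, the operation term moves later under a block, and several blocks can be queued instead of the second one failing.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Toy model only; every name here except pendingPrimaryTerm is hypothetical. */
class TermSketch {
    // Set as soon as a higher term is learned (cluster state update or new primary).
    private long pendingPrimaryTerm;
    // Only advanced while operations are blocked, possibly some time later.
    private long operationPrimaryTerm;
    // Blocks are queued rather than rejected when one is already waiting.
    private final Queue<Runnable> queuedBlocks = new ArrayDeque<>();

    TermSketch(long initialTerm) {
        this.pendingPrimaryTerm = initialTerm;
        this.operationPrimaryTerm = initialTerm;
    }

    /** Records the higher term at once and queues the delayed operation-term bump. */
    synchronized void learnHigherTerm(long newTerm) {
        if (newTerm <= pendingPrimaryTerm) {
            return; // stale or duplicate term, nothing to do
        }
        pendingPrimaryTerm = newTerm;
        queuedBlocks.add(() -> operationPrimaryTerm = newTerm);
    }

    /** Stands in for "operations are blocked": runs every queued block in order. */
    synchronized void drainBlocks() {
        Runnable block;
        while ((block = queuedBlocks.poll()) != null) {
            block.run();
        }
    }

    synchronized long pendingPrimaryTerm() {
        return pendingPrimaryTerm;
    }

    synchronized long operationPrimaryTerm() {
        return operationPrimaryTerm;
    }

    public static void main(String[] args) {
        TermSketch shard = new TermSketch(1);
        shard.learnHigherTerm(2); // replica learns about a new primary
        shard.learnHigherTerm(3); // promoted before the first block has run
        System.out.println("pending=" + shard.pendingPrimaryTerm()
            + " operation=" + shard.operationPrimaryTerm());
        shard.drainBlocks();
        System.out.println("pending=" + shard.pendingPrimaryTerm()
            + " operation=" + shard.operationPrimaryTerm());
    }
}
```

In this model the second call to `learnHigherTerm` arrives while the first block is still queued, which is the situation that previously failed the shard; here it simply adds another block, and both run in order once operations are blocked.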

ywelsch · 2018-08-03T09:04:21Z

Closed by #32442. If this still occurs, please reopen.

andyb-elastic added the >test-failure and :Distributed Indexing/Recovery labels Jul 23, 2018

javanna mentioned this issue Jul 27, 2018

ClusterDisruptionIT#testSendingShardFailure fails on CI #32431

Closed

ywelsch mentioned this issue Jul 27, 2018

Fix race between replica reset and primary promotion #32442

Merged

ywelsch closed this as completed Aug 3, 2018
