Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] ShrinkIndexIT.testShrinkIndexPrimaryTerm fails #28153

Closed
cbuescher opened this issue Jan 9, 2018 · 9 comments · Fixed by #42098
Closed

[CI] ShrinkIndexIT.testShrinkIndexPrimaryTerm fails #28153

cbuescher opened this issue Jan 9, 2018 · 9 comments · Fixed by #42098
Assignees
Labels
:Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. >test Issues or PRs that are addressing/adding tests >test-failure Triaged test failures from CI

Comments

@cbuescher
Copy link
Member

cbuescher commented Jan 9, 2018

The test ShrinkIndexIT.testShrinkIndexPrimaryTerm() failed today on CI:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-darwin-compatibility/1526/console

The first this that looks like an error in the log is this:

13:29:05   1> [2018-01-09T23:27:23,148][WARN ][o.e.i.c.IndicesClusterStateService] [node_s1] [[target][0]] marking and sending shard failed due to [failed to create shard]
13:29:05   1> org.elasticsearch.index.IndexNotFoundException: no such index
13:29:05   1> 	at org.elasticsearch.index.shard.IndexShard.startRecovery(IndexShard.java:2075) ~[main/:?]
13:29:05   1> 	at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:516) ~[main/:?]
13:29:05   1> 	at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:143) ~[main/:?]
13:29:05   1> 	at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:552) ~[main/:?]
13:29:05   1> 	at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:529) ~[main/:?]
13:29:05   1> 	at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:231) ~[main/:?]
13:29:05   1> 	at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$6(ClusterApplierService.java:498) ~[main/:?]
13:29:05   1> 	at java.lang.Iterable.forEach(Iterable.java:75) [?:1.8.0_141]
13:29:05   1> 	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:495) [main/:?]
13:29:05   1> 	at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:482) [main/:?]
13:29:05   1> 	at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:432) [main/:?]
13:29:05   1> 	at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:161) [main/:?]
13:29:05   1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:566) [main/:?]
13:29:05   1> 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) [main/:?]
13:29:05   1> 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) [main/:?]
13:29:05   1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_141]
13:29:05   1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_141]
13:29:05   1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]

The test seems to time out waiting for green state as a consequence:

13:29:07   1> [2018-01-09T23:29:04,372][INFO ][o.e.a.a.i.c.ShrinkIndexIT] [testShrinkIndexPrimaryTerm]: after test
13:29:07 FAILURE 90.4s J0 | ShrinkIndexIT.testShrinkIndexPrimaryTerm <<< FAILURES!
13:29:07    > Throwable #1: java.lang.AssertionError: timed out waiting for green state
13:29:07    > 	at __randomizedtesting.SeedInfo.seed([CD5C54E500E5EC92:E7D0617C1EBDF9DF]:0)
13:29:07    > 	at org.elasticsearch.test.ESIntegTestCase.ensureColor(ESIntegTestCase.java:947)
13:29:07    > 	at org.elasticsearch.test.ESIntegTestCase.ensureGreen(ESIntegTestCase.java:913)
13:29:07    > 	at org.elasticsearch.test.ESIntegTestCase.ensureGreen(ESIntegTestCase.java:902)
13:29:07    > 	at org.elasticsearch.action.admin.indices.create.ShrinkIndexIT.testShrinkIndexPrimaryTerm(ShrinkIndexIT.java:167)
13:29:07    > 	at java.lang.Thread.run(Thread.java:748)
./gradlew :core:integTest -Dtests.seed=CD5C54E500E5EC92 -Dtests.class=org.elasticsearch.action.admin.indices.create.ShrinkIndexIT -Dtests.method="testShrinkIndexPrimaryTerm" -Dtests.security.manager=true -Dtests.locale=ar-AE -Dtests.timezone=Asia/Ust-Nera
@cbuescher cbuescher added >test Issues or PRs that are addressing/adding tests >test-failure Triaged test failures from CI labels Jan 9, 2018
@dnhatn dnhatn self-assigned this Jan 9, 2018
@dnhatn
Copy link
Member

dnhatn commented Jan 9, 2018

Log: testShrinkIndexPrimaryTerm.txt

dnhatn added a commit that referenced this issue Jan 26, 2018
If we have created 210 shards, we may need more than 30 seconds for all
shards become green.

Relates #28153
dnhatn added a commit that referenced this issue Jan 26, 2018
If we have created 210 shards, we may need more than 30 seconds for all
shards become green.

Relates #28153
@dnhatn dnhatn closed this as completed in 414b5de Jan 26, 2018
dnhatn added a commit that referenced this issue Jan 26, 2018
In some cases testShrinkIndexPrimaryTerm creates then 'mutates' 210
shards. If each shard opens more than 10 files (translog, lucene index),
we exceeded the maximum allowed file handles. In our test, the number of
file handles is limited to 2048 by HandleLimitFS. This commit reduces
the number of shards in testShrinkIndexPrimaryTerm to avoid such errors.

Closes #28153
@mayya-sharipova
Copy link
Contributor

mayya-sharipova commented Jul 5, 2018

The same test failed on master Java 10 today.
Is not reproduced locally for me.

REPRODUCE WITH: ./gradlew :server:integTest \
  -Dtests.seed=2A7F48E21D6F8FBF \
  -Dtests.class=org.elasticsearch.action.admin.indices.create.ShrinkIndexIT \
  -Dtests.method="testShrinkIndexPrimaryTerm" \
  -Dtests.security.manager=true \
  -Dtests.locale=hy \
  -Dtests.timezone=America/Sitka

Console log: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+matrix-java-periodic/ES_BUILD_JAVA=java10,ES_RUNTIME_JAVA=java10,nodes=virtual&&linux/158/console

Errors:

16:10:08 FAILURE 4.58s J1 | ShrinkIndexIT.testShrinkIndexPrimaryTerm <<< FAILURES!
16:10:08    > Throwable #1: java.lang.AssertionError: Accounting breaker not reset to 0 on node: node_s2
16:10:08    > Expected: <0L>
16:10:08    >      but: was <1322L>
16:10:08    > 	at __randomizedtesting.SeedInfo.seed([2A7F48E21D6F8FBF:F37D7B03379AF2]:0)
16:10:08    > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
16:10:08    > 	at org.elasticsearch.test.InternalTestCluster.ensureEstimatedStats(InternalTestCluster.java:2020)
16:10:08    > 	at org.elasticsearch.test.TestCluster.assertAfterTest(TestCluster.java:93)
16:10:08    > 	at org.elasticsearch.test.InternalTestCluster.assertAfterTest(InternalTestCluster.java:2051)
16:10:08    > 	at org.elasticsearch.test.ESIntegTestCase.afterInternal(ESIntegTestCase.java:592)
16:10:08    > 	at org.elasticsearch.test.ESIntegTestCase.cleanUpCluster(ESIntegTestCase.java:2185)
16:10:08    > 	at jdk.internal.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
16:10:08    > 	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
16:10:08    > 	at java.base/java.lang.reflect.Method.invoke(Method.java:564)
16:10:08    > 	at java.base/java.lang.Thread.run(Thread.java:844)

@mayya-sharipova mayya-sharipova added the :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. label Jul 5, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@dnhatn
Copy link
Member

dnhatn commented Jul 5, 2018

16:10:08 > Throwable #1: java.lang.AssertionError: Accounting breaker not reset to 0 on node: node_s2
16:10:08 > Expected: <0L>
16:10:08 > but: was <1322L>

@mayya-sharipova I think this test failed because the accounting breaker was not reset correctly. This is a known issue and being tracked at #30290. I added this failure to #30290.

@dnhatn dnhatn closed this as completed Jul 5, 2018
@tlrx
Copy link
Member

tlrx commented Feb 27, 2019

This test failed again today:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+internalClusterTest/1370

With the same error as the one reported in the description of the issue:

FAILURE 30.6s J6 | ShrinkIndexIT.testShrinkIndexPrimaryTerm <<< FAILURES!
   > Throwable #1: java.lang.AssertionError: timed out waiting for green state
   > 	at __randomizedtesting.SeedInfo.seed([6DE07894235EC66A:476C4D0D3D06D327]:0)
   > 	at org.elasticsearch.test.ESIntegTestCase.ensureColor(ESIntegTestCase.java:975)
   > 	at org.elasticsearch.test.ESIntegTestCase.ensureGreen(ESIntegTestCase.java:931)
   > 	at org.elasticsearch.test.ESIntegTestCase.ensureGreen(ESIntegTestCase.java:920)
   > 	at org.elasticsearch.action.admin.indices.create.ShrinkIndexIT.testShrinkIndexPrimaryTerm(ShrinkIndexIT.java:221)
   > 	at java.lang.Thread.run(Thread.java:748)

It does not seem that the accounting breaker is in cause here.

consoleText.txt

@ywelsch
Copy link
Contributor

ywelsch commented May 7, 2019

This test continues to fail (newest failure: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+release-tests/308/console)
It seems to be just too brittle. I wonder if we should make it so that the test waits after each primary failure (i.e. call to failShard) for the primary term to be incremented, i.e., the failure to have taken effect. WDYT @dnhatn ?

dnhatn added a commit that referenced this issue May 14, 2019
This test can create and shuffle 2*(3*5*7) = 210 shards which is quite
heavy for our CI. This commit reduces the load, so we don't timeout on
CI.

Closes #28153
dnhatn added a commit that referenced this issue May 15, 2019
This test can create and shuffle 2*(3*5*7) = 210 shards which is quite
heavy for our CI. This commit reduces the load, so we don't timeout on
CI.

Closes #28153
dnhatn added a commit that referenced this issue May 15, 2019
This test can create and shuffle 2*(3*5*7) = 210 shards which is quite
heavy for our CI. This commit reduces the load, so we don't timeout on
CI.

Closes #28153
dnhatn added a commit that referenced this issue May 16, 2019
This test can create and shuffle 2*(3*5*7) = 210 shards which is quite
heavy for our CI. This commit reduces the load, so we don't timeout on
CI.

Closes #28153
dnhatn added a commit that referenced this issue May 16, 2019
This test can create and shuffle 2*(3*5*7) = 210 shards which is quite
heavy for our CI. This commit reduces the load, so we don't timeout on
CI.

Closes #28153
@alpar-t alpar-t reopened this May 22, 2019
@dnhatn
Copy link
Member

dnhatn commented May 22, 2019

java.lang.AssertionError: Accounting breaker not reset to 0 on node: node_sd2, are there still Lucene indices around?
Expected: <0L>
     but: was <4172L>

Closing this issue again since the actual failure is tracked #30290.

@dnhatn dnhatn closed this as completed May 22, 2019
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this issue May 27, 2019
This test can create and shuffle 2*(3*5*7) = 210 shards which is quite
heavy for our CI. This commit reduces the load, so we don't timeout on
CI.

Closes elastic#28153
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. >test Issues or PRs that are addressing/adding tests >test-failure Triaged test failures from CI
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants