
[CCR] CCRIT.testAutoFollowing is failing #37231

Closed
tlrx opened this issue Jan 8, 2019 · 6 comments · Fixed by #38900
Labels
:Distributed Indexing/CCR (Issues around the Cross Cluster State Replication features), >test-failure (Triaged test failures from CI), v8.0.0-alpha1

Comments

@tlrx
Member

tlrx commented Jan 8, 2019

The test CCRIT.testAutoFollowing is failing on CI:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/1125

With the error:

14:48:58   1> [2019-01-08T17:48:57,831][INFO ][o.e.u.CCRIT              ] [testAutoFollowing] There are still tasks running after this test that might break subsequent tests [cluster:monitor/state, indices:data/read/xpack/ccr/shard_changes, indices:data/read/xpack/ccr/shard_changes[s], xpack/ccr/shard_follow_task[c]].
14:48:58   1> [2019-01-08T17:48:57,832][INFO ][o.e.u.CCRIT              ] [testAutoFollowing] after test
14:48:58 FAILURE 10.2s | CCRIT.testAutoFollowing <<< FAILURES!
14:48:58    > Throwable #1: java.lang.AssertionError: 
14:48:58    > Expected: <1>
14:48:58    >      but: was <0>
14:48:58    > 	at __randomizedtesting.SeedInfo.seed([C7C9CBCAF12212B6:6827EDCAF58BB99B]:0)
14:48:58    > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
14:48:58    > 	at org.elasticsearch.upgrades.CCRIT.assertFollowerGlobalCheckpoint(CCRIT.java:273)
14:48:58    > 	at org.elasticsearch.upgrades.CCRIT.lambda$testAutoFollowing$5(CCRIT.java:123)
14:48:58   2> NOTE: leaving temporary files on disk at: /var/lib/jenkins/workspace/elastic+elasticsearch+master+intake/x-pack/qa/rolling-upgrade/without-system-key/build/testrun/v6.7.0#oneThirdUpgradedTestRunner/J0/temp/org.elasticsearch.upgrades.CCRIT_C7C9CBCAF12212B6-001
14:48:58    > 	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:847)
14:48:58    > 	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:821)
14:48:58   2> NOTE: test params are: codec=Asserting(Lucene80): {}, docValues:{}, maxPointsInLeafNode=1566, maxMBSortInHeap=7.812807438368832, sim=Asserting(org.apache.lucene.search.similarities.AssertingSimilarity@33ae1d18), locale=es-DO, timezone=Europe/Volgograd
14:48:58    > 	at org.elasticsearch.upgrades.CCRIT.testAutoFollowing(CCRIT.java:121)
14:48:58    > 	at java.lang.Thread.run(Thread.java:748)
14:48:58    > 	Suppressed: java.lang.AssertionError: 
14:48:58   2> NOTE: Linux 4.9.0-8-amd64 amd64/Oracle Corporation 1.8.0_192 (64-bit)/cpus=16,threads=1,free=427840024,total=514850816
14:48:58    > Expected: <1>
14:48:58    >      but: was <0>
14:48:58    > 		at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
14:48:58    > 		at org.elasticsearch.upgrades.CCRIT.assertFollowerGlobalCheckpoint(CCRIT.java:273)
14:48:58   2> NOTE: All tests run in this JVM: [IndexAuditUpgradeIT, UpgradeClusterClientYamlTestSuiteIT, RollupIDUpgradeIT, IndexingIT, TokenBackwardsCompatibilityIT, CCRIT]
14:48:58    > 		at org.elasticsearch.upgrades.CCRIT.lambda$testAutoFollowing$5(CCRIT.java:123)
14:48:58    > 		at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:835)
14:48:58    > 		... 39 more
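For context, the failing check boils down to the assertBusy-plus-Hamcrest pattern sketched below: the test repeatedly reads the follower's global checkpoint and expects it to reach 1, but in this run it stayed at 0. This is a hypothetical simplification, not the actual CCRIT code; `fetchFollowerGlobalCheckpoint` is a stand-in for the real stats lookup.

```java
import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.equalTo;

public class FollowerCheckpointCheckSketch {

    // Hypothetical stand-in for reading the follower's global checkpoint from the
    // follower stats API; in the failing run this value stayed at 0 instead of 1.
    static long fetchFollowerGlobalCheckpoint(String followerIndex) {
        return 0L;
    }

    // Mimics the ESTestCase.assertBusy behaviour visible in the stack trace: retry the
    // Hamcrest assertion until it passes or the wait budget runs out, then rethrow the
    // last AssertionError (which surfaces as "Expected: <1> but: was <0>").
    static void assertFollowerGlobalCheckpoint(String followerIndex, long expected) throws InterruptedException {
        long deadline = System.currentTimeMillis() + 10_000;
        AssertionError last;
        do {
            try {
                assertThat(fetchFollowerGlobalCheckpoint(followerIndex), equalTo(expected));
                return;
            } catch (AssertionError e) {
                last = e;
                Thread.sleep(100);
            }
        } while (System.currentTimeMillis() < deadline);
        throw last;
    }
}
```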

It does not reproduce locally for me with:

./gradlew :x-pack:qa:rolling-upgrade:without-system-key:v6.7.0#oneThirdUpgradedTestRunner -Dtests.seed=C7C9CBCAF12212B6 -Dtests.class=org.elasticsearch.upgrades.CCRIT -Dtests.method="testAutoFollowing" -Dtests.security.manager=true -Dtests.locale=es-DO -Dtests.timezone=Europe/Volgograd -Dcompiler.java=11 -Druntime.java=8

The failure looks different from #35937, hence the new issue.

@tlrx tlrx added >test-failure (Triaged test failures from CI), v7.0.0, and :Distributed Indexing/CCR (Issues around the Cross Cluster State Replication features) labels Jan 8, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@martijnvg martijnvg self-assigned this Jan 8, 2019
@tlrx
Member Author

tlrx commented Jan 8, 2019

martijnvg added a commit that referenced this issue Jan 8, 2019
@martijnvg
Member

I've muted the two tests in the CCRIT test class. They seem to fail for the same reason.
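For reference, muting a test in the Elasticsearch suite is typically done with Lucene's `AwaitsFix` annotation pointing at the tracking issue. The sketch below shows the pattern only; it is not the actual muting commit.

```java
import org.apache.lucene.util.LuceneTestCase.AwaitsFix;

public class CCRITMutingSketch {

    // The test runner skips any test carrying AwaitsFix until the annotation is removed,
    // and the bugUrl keeps the mute traceable back to this issue.
    @AwaitsFix(bugUrl = "https://github.com/elastic/elasticsearch/issues/37231")
    public void testAutoFollowing() throws Exception {
        // original test body left unchanged
    }
}
```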

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Jan 8, 2019
If a running shard follow task needs to be restarted and
the remote connection seeds have changed, then the
shard follow task currently fails with a fatal error.

The change creates the remote client lazily and adjusts
the errors a shard follow task should retry.

This issue was found in test failures in the recently added
CCR rolling upgrade tests. The reason this issue occurs
more frequently in the rolling upgrade test is that CCR
is set up in local mode (so the remote connection seed becomes stale) and
all nodes are restarted, which forces the shard follow tasks to get
restarted at some point during the test. Note that these tests
cannot be enabled yet, because this change needs to be backported
to 6.x first (otherwise the issue still occurs on non-upgraded nodes).

I also changed RestartIndexFollowingIT to set up the remote cluster
via persistent settings and to also restart the leader cluster. This
way what happens during the CCR rolling upgrade qa tests also happens
in this test.

Relates to elastic#37231
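The idea behind the fix can be pictured roughly as follows. This is a conceptual sketch with hypothetical names, not the actual CCR shard-follow-task change: the remote client is resolved on every use rather than captured once at task start, and connection problems are treated as retryable instead of fatal.

```java
import java.util.function.Supplier;

class ShardFollowTaskSketch {

    /** Thrown by the stand-in client when the remote cluster cannot be reached. */
    static class ConnectionException extends RuntimeException {}

    /** Minimal stand-in for a remote-cluster client. */
    interface RemoteClient {
        void readShardChanges();
    }

    // Resolved on every call instead of once at task start, so that updated remote
    // connection seeds are picked up after the task is restarted.
    private final Supplier<RemoteClient> remoteClientSupplier;

    ShardFollowTaskSketch(Supplier<RemoteClient> remoteClientSupplier) {
        this.remoteClientSupplier = remoteClientSupplier;
    }

    void fetchChanges(Runnable scheduleRetry, Runnable failTask) {
        try {
            remoteClientSupplier.get().readShardChanges();
        } catch (ConnectionException e) {
            scheduleRetry.run(); // stale seeds or transient connectivity: retry instead of failing
        } catch (RuntimeException e) {
            failTask.run();      // anything else still marks the task as failed
        }
    }
}
```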
martijnvg added a commit that referenced this issue Jan 10, 2019
martijnvg added a commit that referenced this issue Jan 10, 2019
martijnvg added a commit that referenced this issue Jan 10, 2019
@martijnvg
Member

I merged #37239, which fixes the underlying issue that caused this rolling upgrade test to fail.
I will unmute this test soon, once I've verified that the test no longer fails locally. (I was able to reproduce the failure locally.)

martijnvg added a commit that referenced this issue Jan 21, 2019
…rade test,

in order to reduce the likelihood that the test fails because of timing issues.

Relates #37231
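One common way to make such checks less timing-sensitive is to give `ESTestCase.assertBusy` an explicit, longer wait. The fragment below is illustrative only; it assumes it lives inside an `ESTestCase` subclass, `fetchFollowerGlobalCheckpoint` is a hypothetical helper, and the actual change in the referenced commit may differ.

```java
// Requires java.util.concurrent.TimeUnit plus the Hamcrest assertThat/equalTo statics.
assertBusy(() -> {
    long checkpoint = fetchFollowerGlobalCheckpoint("follower-index"); // hypothetical helper
    assertThat(checkpoint, equalTo(1L));
}, 60, TimeUnit.SECONDS);
```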
martijnvg added a commit that referenced this issue Jan 21, 2019
@dnhatn
Member

dnhatn commented Jan 31, 2019

@dnhatn dnhatn reopened this Jan 31, 2019
@dnhatn dnhatn self-assigned this Jan 31, 2019
dnhatn added a commit that referenced this issue Jan 31, 2019
@martijnvg
Member

This failure is expected. In a 3-node cluster, all on version 6.7.0-SNAPSHOT, one node gets upgraded to 7.0.0-SNAPSHOT. In this state a leader index gets created on the upgraded node. The auto follow coordinator auto-follows this index, and the restore is performed on a non-upgraded node. The restore fails because a 6.7.0-SNAPSHOT node can't read the index files from a 7.0.0-SNAPSHOT node:

[2019-02-01T09:59:32,426][WARN ][o.e.s.RestoreService     ] [node-1] [_ccr_local:_latest_] failed to restore snapshot
org.elasticsearch.snapshots.SnapshotRestoreException: [_ccr_local:_latest_/_latest_] the snapshot was created with Elasticsearch version [7.0.0] which is higher than the version of this node [6.7.0]
        at org.elasticsearch.snapshots.RestoreService.validateSnapshotRestorable(RestoreService.java:855) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
        at org.elasticsearch.snapshots.RestoreService.restoreSnapshot(RestoreService.java:197) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
        at org.elasticsearch.xpack.ccr.action.TransportPutFollowAction$1.doRun(TransportPutFollowAction.java:174) [x-pack-ccr-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:834) [?:?]
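The guard that produces this error can be pictured as a simple version comparison. The sketch below is conceptual, not the literal `RestoreService.validateSnapshotRestorable` code, and the `Version` type is a local stand-in.

```java
// A 6.7.0 node cannot read index files written by a 7.0.0 node, so restore refuses
// any snapshot created by a version newer than the restoring node.
class RestoreVersionGuardSketch {

    static final class Version implements Comparable<Version> {
        final int major, minor, revision;
        Version(int major, int minor, int revision) {
            this.major = major; this.minor = minor; this.revision = revision;
        }
        @Override public int compareTo(Version o) {
            if (major != o.major) return Integer.compare(major, o.major);
            if (minor != o.minor) return Integer.compare(minor, o.minor);
            return Integer.compare(revision, o.revision);
        }
        @Override public String toString() { return major + "." + minor + "." + revision; }
    }

    static void validateSnapshotRestorable(Version snapshotVersion, Version nodeVersion) {
        if (snapshotVersion.compareTo(nodeVersion) > 0) {
            throw new IllegalStateException("the snapshot was created with Elasticsearch version ["
                + snapshotVersion + "] which is higher than the version of this node [" + nodeVersion + "]");
        }
    }

    public static void main(String[] args) {
        // Mirrors the failing combination from the log: snapshot from 7.0.0, node at 6.7.0.
        validateSnapshotRestorable(new Version(7, 0, 0), new Version(6, 7, 0));
    }
}
```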

This test is testing a rolling upgrade in a local cluster (which is another form of bi-directional index following, see #38037), and currently we can't support that. The reason it didn't fail before is that the new follower index used to do an ops-based recovery, and we recently switched to a file-based recovery as the initial recovery.

I'm working on adding a qa module that does a rolling upgrade across two clusters set up for unidirectional index following. In that case a rolling upgrade while new leader indices are auto-followed should work without a problem. When that is ready, we can move this test to the new qa module.

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Feb 3, 2019
This test starts 2 clusters, each with 3 nodes.
First the leader cluster is started and tests are run against it, and
then the follower cluster is started and tests execute against the two clusters.

Then the follower cluster is upgraded, one node at a time.
After that the leader cluster is upgraded, one node at a time.
Every time a node is upgraded, tests are run while both clusters are online
(and either the leader cluster or the follower cluster has mixed node versions).

This commit only tests CCR index following, but could be used for CCS tests as well.
In particular for CCR, unidirectional index following is tested during a rolling upgrade.
During the test several indices are created and followed in the leader cluster before or
while the follower cluster is being upgraded.

This test also verifies that attempting to follow an index in the upgraded cluster
from the non-upgraded cluster fails. After both clusters are upgraded, following the
index that previously failed should succeed.

Relates to elastic#37231 and elastic#38037
@jasontedor jasontedor added v8.0.0 and removed v7.0.0 labels Feb 6, 2019
martijnvg added a commit that referenced this issue Feb 12, 2019
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Feb 13, 2019
martijnvg added a commit that referenced this issue Feb 14, 2019
* Add rolling upgrade multi cluster test module (#38277)

* Filter out upgraded version index settings when starting index following (#38838)

The `index.version.upgraded` and `index.version.upgraded_string` settings are likely
to differ between the leader and follower index when a follower index gets restored
on an upgraded node while the leader index is still on non-upgraded nodes.

Closes #38835
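Conceptually, the filtering amounts to stripping those bookkeeping settings before the leader and follower index settings are compared. The sketch below illustrates the idea with generic Java maps; it is not the actual CCR code.

```java
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

class UpgradedSettingsFilterSketch {

    // Settings that legitimately differ when only one side of the follow relationship
    // has been upgraded, so they must not cause the settings comparison to fail.
    private static final Set<String> NON_REPLICATED = Set.of(
        "index.version.upgraded",
        "index.version.upgraded_string");

    static Map<String, String> withoutUpgradeMarkers(Map<String, String> indexSettings) {
        return indexSettings.entrySet().stream()
            .filter(e -> NON_REPLICATED.contains(e.getKey()) == false)
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }
}
```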
martijnvg added a commit that referenced this issue Feb 14, 2019
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Feb 14, 2019
martijnvg added a commit that referenced this issue Feb 14, 2019
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Feb 14, 2019
The rest of `CCRIT` is no longer relevant, because the remaining
test covers the same scenario as the index following test in the rolling upgrade
multi cluster module.

Added the `tests.upgrade_from_version` version to the test. It is not needed
in this branch, but it is in the 6.7 branch.

Closes elastic#37231
martijnvg added a commit that referenced this issue Feb 15, 2019
martijnvg added a commit that referenced this issue Feb 15, 2019
martijnvg added a commit that referenced this issue Feb 15, 2019
martijnvg added a commit that referenced this issue Feb 15, 2019