
[CCR] CCRIT.testAutoFollowing is failing #37231

Closed
tlrx opened this issue Jan 8, 2019 · 6 comments · Fixed by #38900
Labels
:Distributed Indexing/CCR (Issues around the Cross Cluster State Replication features), >test-failure (Triaged test failures from CI), v8.0.0-alpha1

Comments

@tlrx
Member

tlrx commented Jan 8, 2019

The test CCRIT.testAutoFollowing is failing on CI:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/1125

With the error:

14:48:58   1> [2019-01-08T17:48:57,831][INFO ][o.e.u.CCRIT              ] [testAutoFollowing] There are still tasks running after this test that might break subsequent tests [cluster:monitor/state, indices:data/read/xpack/ccr/shard_changes, indices:data/read/xpack/ccr/shard_changes[s], xpack/ccr/shard_follow_task[c]].
14:48:58   1> [2019-01-08T17:48:57,832][INFO ][o.e.u.CCRIT              ] [testAutoFollowing] after test
14:48:58 FAILURE 10.2s | CCRIT.testAutoFollowing <<< FAILURES!
14:48:58    > Throwable #1: java.lang.AssertionError: 
14:48:58    > Expected: <1>
14:48:58    >      but: was <0>
14:48:58    > 	at __randomizedtesting.SeedInfo.seed([C7C9CBCAF12212B6:6827EDCAF58BB99B]:0)
14:48:58    > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
14:48:58    > 	at org.elasticsearch.upgrades.CCRIT.assertFollowerGlobalCheckpoint(CCRIT.java:273)
14:48:58    > 	at org.elasticsearch.upgrades.CCRIT.lambda$testAutoFollowing$5(CCRIT.java:123)
14:48:58   2> NOTE: leaving temporary files on disk at: /var/lib/jenkins/workspace/elastic+elasticsearch+master+intake/x-pack/qa/rolling-upgrade/without-system-key/build/testrun/v6.7.0#oneThirdUpgradedTestRunner/J0/temp/org.elasticsearch.upgrades.CCRIT_C7C9CBCAF12212B6-001
14:48:58    > 	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:847)
14:48:58    > 	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:821)
14:48:58   2> NOTE: test params are: codec=Asserting(Lucene80): {}, docValues:{}, maxPointsInLeafNode=1566, maxMBSortInHeap=7.812807438368832, sim=Asserting(org.apache.lucene.search.similarities.AssertingSimilarity@33ae1d18), locale=es-DO, timezone=Europe/Volgograd
14:48:58    > 	at org.elasticsearch.upgrades.CCRIT.testAutoFollowing(CCRIT.java:121)
14:48:58    > 	at java.lang.Thread.run(Thread.java:748)
14:48:58    > 	Suppressed: java.lang.AssertionError: 
14:48:58   2> NOTE: Linux 4.9.0-8-amd64 amd64/Oracle Corporation 1.8.0_192 (64-bit)/cpus=16,threads=1,free=427840024,total=514850816
14:48:58    > Expected: <1>
14:48:58    >      but: was <0>
14:48:58    > 		at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
14:48:58    > 		at org.elasticsearch.upgrades.CCRIT.assertFollowerGlobalCheckpoint(CCRIT.java:273)
14:48:58   2> NOTE: All tests run in this JVM: [IndexAuditUpgradeIT, UpgradeClusterClientYamlTestSuiteIT, RollupIDUpgradeIT, IndexingIT, TokenBackwardsCompatibilityIT, CCRIT]
14:48:58    > 		at org.elasticsearch.upgrades.CCRIT.lambda$testAutoFollowing$5(CCRIT.java:123)
14:48:58    > 		at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:835)
14:48:58    > 		... 39 more
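For context, the failing check boils down to the assertBusy-plus-Hamcrest pattern sketched below: the test repeatedly reads the follower's global checkpoint and expects it to reach 1, but in this run it stayed at 0. This is a hypothetical simplification, not the actual CCRIT code; `fetchFollowerGlobalCheckpoint` is a stand-in for the real stats lookup.

```java
import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.equalTo;

public class FollowerCheckpointCheckSketch {

    // Hypothetical stand-in for reading the follower's global checkpoint from the
    // follower stats API; in the failing run this value stayed at 0 instead of 1.
    static long fetchFollowerGlobalCheckpoint(String followerIndex) {
        return 0L;
    }

    // Mimics the ESTestCase.assertBusy behaviour visible in the stack trace: retry the
    // Hamcrest assertion until it passes or the wait budget runs out, then rethrow the
    // last AssertionError (which surfaces as "Expected: <1> but: was <0>").
    static void assertFollowerGlobalCheckpoint(String followerIndex, long expected) throws InterruptedException {
        long deadline = System.currentTimeMillis() + 10_000;
        AssertionError last;
        do {
            try {
                assertThat(fetchFollowerGlobalCheckpoint(followerIndex), equalTo(expected));
                return;
            } catch (AssertionError e) {
                last = e;
                Thread.sleep(100);
            }
        } while (System.currentTimeMillis() < deadline);
        throw last;
    }
}
```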

It does not reproduce locally for me with:

./gradlew :x-pack:qa:rolling-upgrade:without-system-key:v6.7.0#oneThirdUpgradedTestRunner -Dtests.seed=C7C9CBCAF12212B6 -Dtests.class=org.elasticsearch.upgrades.CCRIT -Dtests.method="testAutoFollowing" -Dtests.security.manager=true -Dtests.locale=es-DO -Dtests.timezone=Europe/Volgograd -Dcompiler.java=11 -Druntime.java=8

The failure looks different from #35937, hence the new issue.

@tlrx tlrx added >test-failure (Triaged test failures from CI), v7.0.0, and :Distributed Indexing/CCR (Issues around the Cross Cluster State Replication features) labels Jan 8, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@martijnvg martijnvg self-assigned this Jan 8, 2019
@tlrx
Member Author

tlrx commented Jan 8, 2019

martijnvg added a commit that referenced this issue Jan 8, 2019
@martijnvg
Member

I've muted the two tests in the CCRIT test class. They seem to fail for the same reason.
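For reference, muting a test in the Elasticsearch suite is typically done with Lucene's `AwaitsFix` annotation pointing at the tracking issue. The sketch below shows the pattern only; it is not the actual muting commit.

```java
import org.apache.lucene.util.LuceneTestCase.AwaitsFix;

public class CCRITMutingSketch {

    // The test runner skips any test carrying AwaitsFix until the annotation is removed,
    // and the bugUrl keeps the mute traceable back to this issue.
    @AwaitsFix(bugUrl = "https://github.com/elastic/elasticsearch/issues/37231")
    public void testAutoFollowing() throws Exception {
        // original test body left unchanged
    }
}
```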

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Jan 8, 2019
If a running shard follow task needs to be restarted and
the remote connection seeds have changed, then the
shard follow task currently fails with a fatal error.

The change creates the remote client lazily and adjusts
the errors a shard follow task should retry.

This issue was found in test failures in the recently added
CCR rolling upgrade tests. The reason this issue occurs
more frequently in the rolling upgrade test is that CCR
is set up in local mode (so the remote connection seed becomes stale) and
all nodes are restarted, which forces the shard follow tasks to get
restarted at some point during the test. Note that these tests
cannot be enabled yet, because this change needs to be backported
to 6.x first (otherwise the issue still occurs on non-upgraded nodes).

I also changed RestartIndexFollowingIT to set up the remote cluster
via persistent settings and to also restart the leader cluster. This
way what happens during the CCR rolling upgrade qa tests also happens
in this test.

Relates to elastic#37231
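The idea behind the fix can be pictured roughly as follows. This is a conceptual sketch with hypothetical names, not the actual CCR shard-follow-task change: the remote client is resolved on every use rather than captured once at task start, and connection problems are treated as retryable instead of fatal.

```java
import java.util.function.Supplier;

class ShardFollowTaskSketch {

    /** Thrown by the stand-in client when the remote cluster cannot be reached. */
    static class ConnectionException extends RuntimeException {}

    /** Minimal stand-in for a remote-cluster client. */
    interface RemoteClient {
        void readShardChanges();
    }

    // Resolved on every call instead of once at task start, so that updated remote
    // connection seeds are picked up after the task is restarted.
    private final Supplier<RemoteClient> remoteClientSupplier;

    ShardFollowTaskSketch(Supplier<RemoteClient> remoteClientSupplier) {
        this.remoteClientSupplier = remoteClientSupplier;
    }

    void fetchChanges(Runnable scheduleRetry, Runnable failTask) {
        try {
            remoteClientSupplier.get().readShardChanges();
        } catch (ConnectionException e) {
            scheduleRetry.run(); // stale seeds or transient connectivity: retry instead of failing
        } catch (RuntimeException e) {
            failTask.run();      // anything else still marks the task as failed
        }
    }
}
```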
martijnvg added a commit that referenced this issue Jan 10, 2019
martijnvg added a commit that referenced this issue Jan 10, 2019
martijnvg added a commit that referenced this issue Jan 10, 2019
@martijnvg
Member

I merged #37239, which fixes the underlying issue that caused this rolling upgrade test to fail.
I will unmute this test soon, once I've verified that the test no longer fails locally. (I was able to reproduce the failure locally.)

martijnvg added a commit that referenced this issue Jan 21, 2019
…rade test,

in order to reduce the likelihood that the test fails because of timing issues.

Relates #37231
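One common way to make such checks less timing-sensitive is to give `ESTestCase.assertBusy` an explicit, longer wait. The fragment below is illustrative only; it assumes it lives inside an `ESTestCase` subclass, `fetchFollowerGlobalCheckpoint` is a hypothetical helper, and the actual change in the referenced commit may differ.

```java
// Requires java.util.concurrent.TimeUnit plus the Hamcrest assertThat/equalTo statics.
assertBusy(() -> {
    long checkpoint = fetchFollowerGlobalCheckpoint("follower-index"); // hypothetical helper
    assertThat(checkpoint, equalTo(1L));
}, 60, TimeUnit.SECONDS);
```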
martijnvg added a commit that referenced this issue Jan 21, 2019
@dnhatn
Member

dnhatn commented Jan 31, 2019

@dnhatn dnhatn reopened this Jan 31, 2019
@dnhatn dnhatn self-assigned this Jan 31, 2019
dnhatn added a commit that referenced this issue Jan 31, 2019
@martijnvg
Member

This failure is expected. In a 3-node cluster, all on version 6.7.0-SNAPSHOT, one node gets upgraded to 7.0.0-SNAPSHOT. In this state a leader index gets created on the upgraded node. The auto follow coordinator auto-follows this index, and the restore is performed on a non-upgraded node. The restore fails because a 6.7.0-SNAPSHOT node can't read the index files from a 7.0.0-SNAPSHOT node:

[2019-02-01T09:59:32,426][WARN ][o.e.s.RestoreService     ] [node-1] [_ccr_local:_latest_] failed to restore snapshot
org.elasticsearch.snapshots.SnapshotRestoreException: [_ccr_local:_latest_/_latest_] the snapshot was created with Elasticsearch version [7.0.0] which is higher than the version of this node [6.7.0]
        at org.elasticsearch.snapshots.RestoreService.validateSnapshotRestorable(RestoreService.java:855) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
        at org.elasticsearch.snapshots.RestoreService.restoreSnapshot(RestoreService.java:197) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
        at org.elasticsearch.xpack.ccr.action.TransportPutFollowAction$1.doRun(TransportPutFollowAction.java:174) [x-pack-ccr-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:834) [?:?]
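The guard that produces this error can be pictured as a simple version comparison. The sketch below is conceptual, not the literal `RestoreService.validateSnapshotRestorable` code, and the `Version` type is a local stand-in.

```java
// A 6.7.0 node cannot read index files written by a 7.0.0 node, so restore refuses
// any snapshot created by a version newer than the restoring node.
class RestoreVersionGuardSketch {

    static final class Version implements Comparable<Version> {
        final int major, minor, revision;
        Version(int major, int minor, int revision) {
            this.major = major; this.minor = minor; this.revision = revision;
        }
        @Override public int compareTo(Version o) {
            if (major != o.major) return Integer.compare(major, o.major);
            if (minor != o.minor) return Integer.compare(minor, o.minor);
            return Integer.compare(revision, o.revision);
        }
        @Override public String toString() { return major + "." + minor + "." + revision; }
    }

    static void validateSnapshotRestorable(Version snapshotVersion, Version nodeVersion) {
        if (snapshotVersion.compareTo(nodeVersion) > 0) {
            throw new IllegalStateException("the snapshot was created with Elasticsearch version ["
                + snapshotVersion + "] which is higher than the version of this node [" + nodeVersion + "]");
        }
    }

    public static void main(String[] args) {
        // Mirrors the failing combination from the log: snapshot from 7.0.0, node at 6.7.0.
        validateSnapshotRestorable(new Version(7, 0, 0), new Version(6, 7, 0));
    }
}
```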

This test is testing a rolling upgrade in a local cluster (which is another form of bi-directional index following, see #38037), and currently we can't support that. The reason it didn't fail before is that the new follower index used to do an ops-based recovery, and we recently switched to a file-based recovery as the initial recovery.

I'm working on adding a qa module that does a rolling upgrade across two clusters set up for unidirectional index following. In that case a rolling upgrade while new leader indices are auto-followed should work without a problem. When that is ready, we can move this test to the new qa module.

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Feb 3, 2019
This test starts 2 clusters, each with 3 nodes.
First the leader cluster is started and tests are run against it, and
then the follower cluster is started and tests execute against the two clusters.

Then the follower cluster is upgraded, one node at a time.
After that the leader cluster is upgraded, one node at a time.
Every time a node is upgraded, tests are run while both clusters are online
(and either the leader cluster or the follower cluster has mixed node versions).

This commit only tests CCR index following, but could be used for CCS tests as well.
In particular for CCR, unidirectional index following is tested during a rolling upgrade.
During the test several indices are created and followed in the leader cluster before or
while the follower cluster is being upgraded.

This test also verifies that attempting to follow an index in the upgraded cluster
from the non-upgraded cluster fails. After both clusters are upgraded, following the
index that previously failed should succeed.

Relates to elastic#37231 and elastic#38037
@jasontedor jasontedor added v8.0.0 and removed v7.0.0 labels Feb 6, 2019
martijnvg added a commit that referenced this issue Feb 12, 2019
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Feb 13, 2019
martijnvg added a commit that referenced this issue Feb 14, 2019
* Add rolling upgrade multi cluster test module (#38277)

* Filter out upgraded version index settings when starting index following (#38838)

The `index.version.upgraded` and `index.version.upgraded_string` settings are likely
to differ between the leader and follower index when a follower index gets restored
on an upgraded node while the leader index is still on non-upgraded nodes.

Closes #38835
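Conceptually, the filtering amounts to stripping those bookkeeping settings before the leader and follower index settings are compared. The sketch below illustrates the idea with generic Java maps; it is not the actual CCR code.

```java
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

class UpgradedSettingsFilterSketch {

    // Settings that legitimately differ when only one side of the follow relationship
    // has been upgraded, so they must not cause the settings comparison to fail.
    private static final Set<String> NON_REPLICATED = Set.of(
        "index.version.upgraded",
        "index.version.upgraded_string");

    static Map<String, String> withoutUpgradeMarkers(Map<String, String> indexSettings) {
        return indexSettings.entrySet().stream()
            .filter(e -> NON_REPLICATED.contains(e.getKey()) == false)
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }
}
```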
martijnvg added a commit that referenced this issue Feb 14, 2019
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Feb 14, 2019
martijnvg added a commit that referenced this issue Feb 14, 2019
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Feb 14, 2019
The rest of `CCRIT` is no longer relevant, because the remaining
test covers the same scenario as the index following test in the rolling upgrade
multi cluster module.

Added the `tests.upgrade_from_version` version to the test. It is not needed
in this branch, but it is in the 6.7 branch.

Closes elastic#37231
martijnvg added a commit that referenced this issue Feb 15, 2019
martijnvg added a commit that referenced this issue Feb 15, 2019
martijnvg added a commit that referenced this issue Feb 15, 2019
martijnvg added a commit that referenced this issue Feb 15, 2019