
[CI] FullClusterRestartIT.testSnapshotRestore breaks other tests by wiping out persistent tasks #36816

Open
tvernum opened this issue Dec 19, 2018 · 15 comments
Labels
low-risk (An open issue or test failure that is a low risk to future releases), :ml (Machine learning), Team:ML (Meta label for the ML team), >test-failure (Triaged test failures from CI)

Comments

@tvernum (Contributor) commented Dec 19, 2018

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/867/console

Reproduces for me with a variety of seeds.

./gradlew :x-pack:qa:full-cluster-restart:with-system-key:v6.6.0#upgradedClusterTestRunner \
-Dtests.seed=A29E5A089272CB0C \
-Dtests.class=org.elasticsearch.xpack.restart.MlMigrationFullClusterRestartIT \
-Dtests.method="testMigration" \
-Dtests.security.manager=true -Dtests.locale=ms -Dtests.timezone=America/Araguaina \
-Dcompiler.java=11 -Druntime.java=8
MlMigrationFullClusterRestartIT.testMigration <<< FAILURES!
   > Throwable #1: java.lang.AssertionError
   >    at __randomizedtesting.SeedInfo.seed([655FBD85FBD8974E:A6E3BD79B4BC8D2D]:0)
   >    at org.elasticsearch.xpack.restart.MlMigrationFullClusterRestartIT.lambda$waitForMigration$2(MlMigrationFullClusterRestartIT.java:167)
   >    at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:847)
   >    at org.elasticsearch.xpack.restart.MlMigrationFullClusterRestartIT.waitForMigration(MlMigrationFullClusterRestartIT.java:159)
   >    at org.elasticsearch.xpack.restart.MlMigrationFullClusterRestartIT.upgradedClusterTests(MlMigrationFullClusterRestartIT.java:109)
   >    at org.elasticsearch.xpack.restart.MlMigrationFullClusterRestartIT.testMigration(MlMigrationFullClusterRestartIT.java:77)
   >    at java.lang.Thread.run(Thread.java:745)
   >    Suppressed: java.lang.AssertionError
   >            at org.elasticsearch.xpack.restart.MlMigrationFullClusterRestartIT.lambda$waitForMigration$2(MlMigrationFullClusterRestartIT.java:167)
   >            at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:835)
   >            ... 40 more
   >    Suppressed: java.lang.AssertionError
   >            at org.elasticsearch.xpack.restart.MlMigrationFullClusterRestartIT.lambda$waitForMigration$2(MlMigrationFullClusterRestartIT.java:167)
   >            at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:835)
   >            ... 40 more

I'm going to mute the test.

@tvernum added the >test-failure and :ml labels Dec 19, 2018
@elasticmachine (Collaborator)

Pinging @elastic/ml-core

@tvernum (Contributor, Author) commented Dec 19, 2018

Muted in 06b175d

@pgomulka (Contributor)

I have this one failing on 6.6 and a few times on 6.x (build stats).

I wonder if we should backport this change?

./gradlew :x-pack:qa:full-cluster-restart:with-system-key:v6.6.0#upgradedClusterTestRunner \
  -Dtests.seed=D791D0C47DC7EAF4 \
  -Dtests.class=org.elasticsearch.xpack.restart.MlMigrationFullClusterRestartIT \
  -Dtests.method="testMigration" \
  -Dtests.security.manager=true \
  -Dtests.locale=mt \
  -Dtests.timezone=America/Nipigon \
  -Dcompiler.java=11 \
  -Druntime.java=8

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.6+bwc-tests/125/console

@droberts195 (Contributor)

I backported #37077 to 6.6 to fix the reproducible failure that occurs when testing upgrade from 6.6.0 to 6.6.1.

The 6.x failure must be something different as that already had #37077.

@droberts195 self-assigned this Jan 31, 2019
@droberts195 (Contributor)

One of the 6.x failures is https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+bwc-tests/376/consoleText

The error is:

   > Throwable #1: org.junit.ComparisonFailure: expected:<[open]ed> but was:<[clos]ed>
   > 	at __randomizedtesting.SeedInfo.seed([3C4D9B41A53DB36A:FFF19BBDEA59A909]:0)
   > 	at org.elasticsearch.xpack.restart.MlMigrationFullClusterRestartIT.lambda$waitForJobToBeAssigned$0(MlMigrationFullClusterRestartIT.java:181)
   > 	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:848)
   > 	at org.elasticsearch.xpack.restart.MlMigrationFullClusterRestartIT.waitForJobToBeAssigned(MlMigrationFullClusterRestartIT.java:172)
   > 	at org.elasticsearch.xpack.restart.MlMigrationFullClusterRestartIT.upgradedClusterTests(MlMigrationFullClusterRestartIT.java:146)
   > 	at org.elasticsearch.xpack.restart.MlMigrationFullClusterRestartIT.testMigration(MlMigrationFullClusterRestartIT.java:87)
   > 	at java.lang.Thread.run(Thread.java:748)
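
The assertion is waiting for the job to be reported as opened and assigned to a node. As a side note for anyone poking at a stuck run by hand (this is not something the test itself does, and localhost:9200 is assumed), the job state can be checked directly with the 6.x job stats API:

# Show the current state ("opened"/"closed") and node assignment of the job
# that the test migrates and expects to stay open.
curl -s "localhost:9200/_xpack/ml/anomaly_detectors/migration-old-cluster-open-job/_stats?pretty"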

The root cause seems to be a race. From v5.6.10#upgradedClusterTestCluster node0/elasticsearch-6.7.0-SNAPSHOT/logs/full-cluster-restart.log:

[2019-01-30T22:06:25,714][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-0] [migration-old-cluster-open-job] [autodetect/21203] [CResourceMonitor.cc@68] Setting model memory limit to 4096 MB

...

[2019-01-30T22:06:27,641][INFO ][o.e.x.m.j.p.a.ProcessContext] [node-0] Killing job [migration-old-cluster-open-job], because [task has been removed, cancelling locally]
[2019-01-30T22:06:27,649][WARN ][o.e.x.m.j.p.a.o.AutoDetectResultProcessor] [node-0] [migration-old-cluster-open-job] some results not processed due to the process being killed
[2019-01-30T22:06:27,653][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-0] [controller/20292] [CDetachedProcessSpawner.cc@177] Child process with PID 21203 was terminated by signal 15
[2019-01-30T22:06:27,653][DEBUG][o.e.a.a.c.n.t.c.TransportCancelTasksAction] [node-0] Received ban for the parent [lCoBrFUwS42T5jSfuPOm4A:890] on the node [lCoBrFUwS42T5jSfuPOm4A], reason: [task has been removed, cancelling locally]

...

[2019-01-30T22:06:29,935][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-0] [migration-old-cluster-open-job] [autodetect/21319] [CResourceMonitor.cc@68] Setting model memory limit to 4096 MB

...

[2019-01-30T22:06:30,137][ERROR][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] Could not set job state to [opened] for job [migration-old-cluster-open-job]
org.elasticsearch.ResourceNotFoundException: the task with id job-migration-old-cluster-open-job and allocation id 7 doesn't exist
    at org.elasticsearch.persistent.PersistentTasksClusterService$4.execute(PersistentTasksClusterService.java:235) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:642) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:271) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:201) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:136) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:834) [?:?]

Or in plain English:

  1. The node is upgraded and notices that it's supposed to have migration-old-cluster-open-job running on it, so it starts a process for migration-old-cluster-open-job
  2. The node notices that there is no persistent task for migration-old-cluster-open-job, or that the persistent task is not assigned to this node, so it kills the process started in step 1
  3. The node once again notices that it's supposed to have migration-old-cluster-open-job running on it, so it starts another process for migration-old-cluster-open-job
  4. Meanwhile the task state update caused by step 2 has gone through
  5. The task state update triggered by step 3 fails because the task has been updated since the cluster state version that step was working from: the allocation id it captured no longer exists (see the sketch below)
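
That reassignment is visible from outside the node as a change to the persistent task entries (and their allocation ids) in the cluster state. As a debugging aid only, something like the following can be polled against the test cluster while it runs; this uses the standard cluster state API with a filter_path, and localhost:9200 is assumed:

# Dump only the persistent tasks section of the cluster state metadata.
# Polling this while the node starts up shows the ML job task being removed,
# re-created and re-assigned during the race described above.
while true; do
  curl -s "localhost:9200/_cluster/state/metadata?filter_path=metadata.persistent_tasks&pretty"
  sleep 1
done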

I worry that this is a real bug that would affect customers on 6.7. Immediately before the first attempt at starting the job we have:

[2019-01-30T22:06:24,641][INFO ][o.e.x.m.MlConfigMigrator ] [node-0] ml job configurations migrated: [migration-old-cluster-closed-job, migration-old-cluster-open-job]
[2019-01-30T22:06:24,641][INFO ][o.e.x.m.MlConfigMigrator ] [node-0] ml datafeed configurations migrated: [migration-old-cluster-started-datafeed, migration-old-cluster-stopped-datafeed]
[2019-01-30T22:06:24,695][WARN ][o.e.x.w.WatcherService   ] [node-0] not starting watcher, upgrade API run required: .watches[false], .triggered_watches[false]
[2019-01-30T22:06:24,775][WARN ][o.e.x.w.WatcherService   ] [node-0] not starting watcher, upgrade API run required: .watches[false], .triggered_watches[false]
[2019-01-30T22:06:24,831][WARN ][o.e.x.w.WatcherService   ] [node-0] not starting watcher, upgrade API run required: .watches[false], .triggered_watches[false]
[2019-01-30T22:06:24,833][WARN ][o.e.x.m.MlAssignmentNotifier] [node-0] [migration-old-cluster-open-job] No node found to start datafeed [migration-old-cluster-started-datafeed]. Reasons []
[2019-01-30T22:06:24,878][WARN ][o.e.x.w.WatcherService   ] [node-0] not starting watcher, upgrade API run required: .watches[false], .triggered_watches[false]
[2019-01-30T22:06:24,895][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [node-0] Opening job [migration-old-cluster-open-job]

We rewrite the persistent tasks in that migration step, and there could be a race there.

Thankfully the step of migrating configs for open jobs is new in 6.7, so not released yet. We need to scrutinise in detail the code that rewrites the persistent tasks when migrating the open job configs before 6.7 is released.

The race is probably not that likely, as the BWC tests had iterated from 5.6.0 to 5.6.10 before hitting this problem, and there's no reason to think it's unique to upgrading from 5.6.10. So that suggests something like a 9% chance of the race occurring (1 in 11).

@cbuescher (Member)

@droberts195
I've seen several more of these over the weekend on 6.x and master as well, e.g.

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+bwc-tests/394/console

org.junit.ComparisonFailure: expected:<[open]ed> but was:<[clos]ed>
	at __randomizedtesting.SeedInfo.seed([37F2A222C9FBA52A:F44EA2DE869FBF49]:0)
	at org.junit.Assert.assertEquals(Assert.java:115)
	at org.junit.Assert.assertEquals(Assert.java:144)
	at org.elasticsearch.xpack.restart.MlMigrationFullClusterRestartIT.lambda$waitForJobToBeAssigned$0(MlMigrationFullClusterRestartIT.java:181)
	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:848)
	at org.elasticsearch.xpack.restart.MlMigrationFullClusterRestartIT.waitForJobToBeAssigned(MlMigrationFullClusterRestartIT.java:172)

and also:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/1796/console
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+bwc-tests/389/console
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+bwc-tests/388/console
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+multijob-unix-compatibility/os=sles-12/216/console

Should we mute this or are you close to a fix?

@droberts195 (Contributor)

@cbuescher please can you mute it in master and 6.x? I think it's going to be at least a day before I can have a fix ready to merge.

@cbuescher (Member)

Muted on master and 6.x.

@droberts195 (Contributor) commented Feb 4, 2019

I found the problem, and thankfully it is not in code that we expect production users to run. The problem is that FullClusterRestartIT.testSnapshotRestore restores a snapshot with include_global_state=true. Depending on the order in which the tests run, this can wipe out the persistent tasks in the cluster state, or leave them different from the ones the PersistentTasksNodeService has recorded as currently assigned to a node, thus causing a reassignment.

(The fact that restoring global cluster state resets persistent tasks to whatever state they were in at the time the snapshot was taken has been the case ever since persistent tasks were introduced. So this is an effect that production users can experience, but ML config migration doesn't change it one way or the other.)

The tests don't run in parallel so you may think that they should not interact in this way. But MlMigrationFullClusterRestartIT.testMigration is relying on behaviour that is performed automatically on any cluster state update. So the assertions that MlMigrationFullClusterRestartIT.testMigration makes are not looking for the result of actions taken by the test itself, but looking for effects of the cluster starting up on the newer version and doing anything that results in a cluster state update.
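
To make the interaction concrete, here is a minimal sketch of the kind of restore involved; the repository name, snapshot name and filesystem location are hypothetical, and the indices pattern "-*" is only there to skip data indices so that effectively just the global state is restored:

# Register a filesystem repository and take a snapshot that captures the
# global cluster state (hypothetical names and path).
curl -s -X PUT "localhost:9200/_snapshot/test_repo" \
  -H 'Content-Type: application/json' \
  -d '{"type": "fs", "settings": {"location": "/tmp/test_repo"}}'
curl -s -X PUT "localhost:9200/_snapshot/test_repo/snap_1?wait_for_completion=true" \
  -H 'Content-Type: application/json' \
  -d '{"include_global_state": true}'

# Restoring with include_global_state=true later replaces the persistent tasks
# section of the cluster state with whatever it contained at snapshot time,
# regardless of any tasks created or reassigned in the meantime.
curl -s -X POST "localhost:9200/_snapshot/test_repo/snap_1/_restore" \
  -H 'Content-Type: application/json' \
  -d '{"include_global_state": true, "indices": "-*"}'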

The possible solutions I can think of are:

  1. Make sure FullClusterRestartIT.testSnapshotRestore always runs first in the old cluster and last in the upgraded cluster, so that it does not interfere with the state of persistent tasks that other tests rely on (or at least not during the period when those other tests are making assertions that rely on that state)
  2. Move FullClusterRestartIT.testSnapshotRestore into a separate test suite
  3. Move MlMigrationFullClusterRestartIT.testMigration into a separate test suite *
  4. Exclude FullClusterRestartIT.testSnapshotRestore from the X-Pack full cluster restart tests - it's one of the tests that's already run in the OSS full cluster restart tests and duplicated in X-Pack to test with security (so snapshots would only be tested without security)
  5. Split include_global_state into more fine-grained options, so that it's possible to restore just cluster settings, just index templates, or just persistent tasks (a hypothetical request shape is sketched after this list)

* I don't think this is a good option, as meddling with persistent tasks could be affecting other tests too. For example, #32773 might be caused by the same side effects of FullClusterRestartIT.testSnapshotRestore.
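
For option 5, purely as an illustration of the granularity being proposed (this request shape does not exist in any release), the restore body could hypothetically look like:

# Hypothetical, not a real API: a structured include_global_state that lets
# callers choose which parts of the global cluster state to restore.
curl -s -X POST "localhost:9200/_snapshot/test_repo/snap_1/_restore" \
  -H 'Content-Type: application/json' \
  -d '{
        "indices": "-*",
        "include_global_state": {
          "cluster_settings": true,
          "index_templates": true,
          "persistent_tasks": false
        }
      }'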

@droberts195 (Contributor)

For future reference, the log message that alerted me to what was happening was Can't read metadata from store, will not reuse local files during restore. I'm not sure if this is expected; if it isn't, and that bug gets fixed, then I don't think there would have been any sign of the root cause of the ML migration test failure in the logs...

@polyfractal (Contributor)

For test purposes, option 2 seems the easiest/safest. Since we know it could drastically alter other tests, keeping it isolated seems best. Options 1 and 4 would both work but seem more fragile?

Option 5 would probably be a nice feature to have in general; I could see people wanting to selectively restore different parts of the cluster state. But it would be a bigger-impact change, and it wouldn't help retroactively.

@jkakavas (Member) commented Feb 5, 2019

@droberts195 testMappingsUpgrade now fails reproducibly for me on master for a number of seeds.

Would it be worth muting this? I am unsure how other people's PR runs are passing.

Apologies, this was a red herring. The real culprit was unrelated to this issue:

  > Warnings: [[types removal] The response format of get index template requests will change in the next major version. Please start using the `include_type_name` parameter set to `false` in the request to move to the new, typeless response format that will be the default in 7.0.]

which causes all tests that extend AbstractUpgradeTestCase to fail.

@droberts195 changed the title from "[CI] MlMigrationFullClusterRestartIT fails consistently" to "[CI] FullClusterRestartIT.testSnapshotRestore breaks other tests by wiping out persistent tasks" Mar 15, 2019
@droberts195 added the low-risk label Oct 9, 2023
@elasticsearchmachine added the Team:ML label Oct 9, 2023
@elasticsearchmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)

@elasticsearchmachine closed this as not planned Nov 5, 2024
@elasticsearchmachine (Collaborator)

This issue has been closed because it has been open for too long with no activity.

Any muted tests that were associated with this issue have been unmuted.

If the tests begin failing again, a new issue will be opened, and they may be muted again.

@elasticsearchmachine (Collaborator)

This issue is getting re-opened because there are still AwaitsFix mutes for the given test. It will likely be closed again in the future.
