[CI] FullClusterRestartIT.testSnapshotRestore breaks other tests by wiping out persistent tasks #36816
Pinging @elastic/ml-core
Muted in 06b175d
AwaitsFix: #36816 This test fails quite reliably.
I have this one failing on 6.6 and a few times on 6.x (build stats). I wonder if we should back-port this change?
One of the 6.x failures is https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+bwc-tests/376/consoleText The error is:
The root cause seems to be a race. From
...
...
...
Or in plain English:
I worry that this is a real bug that would affect customers on 6.7. Immediately before the first attempt at starting the job we have:
We rewrite the persistent tasks in that migration step, and there could be a race there. Thankfully the step of migrating configs for open jobs is new in 6.7, so not released yet. We need to scrutinise in detail the code that rewrites the persistent tasks when migrating the open job configs before 6.7 is released. The race is probably not that likely: the BWC tests had iterated from 5.6.0 through 5.6.10 before hitting this problem, and there's no reason to think it's unique to upgrading from 5.6.10, so that suggests something like a 9% chance of the race occurring (1 hit in 11 starting versions).
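To make the suspected lost update concrete, here is a minimal, self-contained Java sketch. None of the class or variable names below are Elasticsearch APIs, and the real cluster-state machinery applies updates on a single master thread, so the actual race (if there is one) would have to come from an update task computing its changes from a stale copy of the state; the sketch only shows why a read-modify-write over the persistent tasks is fragile.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative model only: a "migration" thread rewrites every existing persistent
// task while an "open job" thread adds a new task in between. Because the migration
// writes back a map built from a stale read, the newly added task is lost.
public class PersistentTaskRaceSketch {

    // Stand-in for the persistent tasks section of the cluster state.
    static final AtomicReference<Map<String, String>> tasks =
            new AtomicReference<>(new HashMap<>());

    public static void main(String[] args) throws InterruptedException {
        Thread migration = new Thread(() -> {
            Map<String, String> before = tasks.get();               // stale read
            Map<String, String> rewritten = new HashMap<>();
            before.forEach((id, params) -> rewritten.put(id, params + "-migrated"));
            sleep(50);                                               // widen the race window
            tasks.set(rewritten);                                    // clobbers anything added meanwhile
        });

        Thread openJob = new Thread(() -> {
            sleep(10);
            Map<String, String> withJob = new HashMap<>(tasks.get());
            withJob.put("job-task", "params");                       // task for the job being opened
            tasks.set(withJob);
        });

        migration.start();
        openJob.start();
        migration.join();
        openJob.join();

        // With these sleeps this prints an empty map: the "job-task" entry was
        // wiped by the migration's write-back.
        System.out.println("persistent tasks after both updates: " + tasks.get());
    }

    static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```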
@droberts195 https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+bwc-tests/394/console
and also: Should we mute this or are you close to a fix?
@cbuescher please can you mute it in master and 6.x? I think it's going to be at least a day before I can have a fix ready to merge.
Mutes on master and 6.x
I found the problem, and thankfully it is not in the code that we expect production users will run. The problem is that testSnapshotRestore restores a snapshot including the global cluster state, which resets the persistent tasks and wipes out the ones that other tests rely on. (The fact that restoring global cluster state resets persistent tasks to whatever state they were in at the time the snapshot was taken has been the case ever since persistent tasks were introduced. So this is an effect that production users can experience, but ML config migration doesn't change it one way or the other.) The tests don't run in parallel, so you may think that they should not interact in this way. But they run sequentially against the same cluster, and the restored state persists into the tests that follow. The possible solutions I can think of are:
* I don't think this is a good option, as meddling with persistent tasks could be affecting other tests too. For example, #32773 might be caused by the same side effects of testSnapshotRestore.
For future reference, the log message that alerted me to what was happening was |
For test purposes, option 2 seems the easiest/safest. Since we know it could drastically alter other tests, keeping it isolated seems best. Options 1 and 4 would both work but seem more fragile? Option 5 would probably be a nice feature to have in general; I could see people wanting to selectively restore different parts of the cluster state. But it's probably a bigger-impact change, and it wouldn't help retroactively.
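For what it's worth, the restore API already lets callers skip the cluster-wide state: passing `include_global_state: false` restores the indices without touching persistent tasks. A minimal sketch with the plain JDK HTTP client, assuming a cluster on localhost:9200 and repository/snapshot names "repo"/"snap" plus an index pattern that are purely illustrative, not taken from the test:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Restore only the indices from a snapshot, leaving the current global cluster state
// (index templates, persistent tasks, etc.) as it is. Names below are illustrative.
public class RestoreWithoutGlobalState {
    public static void main(String[] args) throws Exception {
        String body = "{ \"indices\": \"restored-*\", \"include_global_state\": false }";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/_snapshot/repo/snap/_restore"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

The same flag exists on snapshot creation, so if the test's snapshot doesn't need to carry global state at all, it could avoid capturing the persistent tasks in the first place.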
Apologies, this was a red herring. The real culprit was unrelated to this issue
which causes all tests that implement
Pinging @elastic/ml-core (Team:ML)
This issue has been closed because it has been open for too long with no activity. Any muted tests that were associated with this issue have been unmuted. If the tests begin failing again, a new issue will be opened, and they may be muted again.
This issue is getting re-opened because there are still AwaitsFix mutes for the given test. It will likely be closed again in the future.
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/867/console
Reproduces for me with a variety of seeds.
I'm going to mute the test.