Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] ML ClassificationIT.testTooLowConfiguredMemoryStillStarts failure #61704

Closed
henningandersen opened this issue Aug 31, 2020 · 4 comments · Fixed by #61846
Closed

[CI] ML ClassificationIT.testTooLowConfiguredMemoryStillStarts failure #61704

henningandersen opened this issue Aug 31, 2020 · 4 comments · Fixed by #61846
Assignees
Labels
:ml Machine learning >test-failure Triaged test failures from CI v7.10.0 v8.0.0-alpha1

Comments

@henningandersen
Copy link
Contributor

henningandersen commented Aug 31, 2020

Build scan:
https://gradle-enterprise.elastic.co/s/xxiunoj23mvfy

Repro line:

./gradlew ':x-pack:plugin:ml:qa:native-multi-node-tests:integTest' --tests "org.elasticsearch.xpack.ml.integration.ClassificationIT.testTooLowConfiguredMemoryStillStarts" \
  -Dtests.seed=7B412D508A8B006F \
  -Dtests.security.manager=true \
  -Dtests.locale=es-UY \
  -Dtests.timezone=Canada/Yukon \
  -Druntime.java=11

Reproduces locally?:
It did reproduce something though a different assertion failure:

org.elasticsearch.xpack.ml.integration.ClassificationIT > testTooLowConfiguredMemoryStillStarts FAILED
    java.lang.AssertionError: Stats were: {"id":"low_memory_analysis","state":"analyzing","progress":[{"phase":"reindexing","progress_percent":100},{"phase":"loading_data","progress_percent":0},{"phase":"feature_selection","progress_percent":0},{"phase":"coarse_parameter_search","progress_percent":0},{"phase":"fine_tuning_parameters","progress_percent":0},{"phase":"final_training","progress_percent":0},{"phase":"writing_results","progress_percent":0},{"phase":"inference","progress_percent":0}],"data_counts":{"training_docs_count":0,"test_docs_count":0,"skipped_docs_count":0},"memory_usage":{"peak_usage_bytes":0,"status":"ok"},"node":{"id":"3KdHUcK_RpqLjHhXJDHjvg","name":"integTest-2","ephemeral_id":"n5isiXUMQYCJxK0_ypRujA","transport_address":"127.0.0.1:35541","attributes":{"testattr":"test","ml.machine_memory":"67459010560","ml.max_open_jobs":"20","xpack.installed":"true","transform.node":"true"}},"assignment_explanation":""}
    Expected: <failed>
         but: was <analyzing>
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.junit.Assert.assertThat(Assert.java:956)
        at org.elasticsearch.xpack.ml.integration.MlNativeDataFrameAnalyticsIntegTestCase.assertIsFailed(MlNativeDataFrameAnalyticsIntegTestCase.java:216)
        at org.elasticsearch.xpack.ml.integration.MlNativeDataFrameAnalyticsIntegTestCase.lambda$waitUntilAnalyticsIsFailed$1(MlNativeDataFrameAnalyticsIntegTestCase.java:161)
        at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:923)
        at org.elasticsearch.xpack.ml.integration.MlNativeDataFrameAnalyticsIntegTestCase.waitUntilAnalyticsIsFailed(MlNativeDataFrameAnalyticsIntegTestCase.java:161)
        at org.elasticsearch.xpack.ml.integration.ClassificationIT.testTooLowConfiguredMemoryStillStarts(ClassificationIT.java:793)

Build-stats does indicate a number of different failure.

Applicable branches:
master and 7.x

Failure history:
https://build-stats.elastic.co/goto/8942e4faac3b7916cf4ca2e0433f82cf

Looks like failures started August 27th, with around 7 failures per day.

This test was also previously reported failing in: #61561

Failure excerpt:

org.elasticsearch.xpack.ml.integration.ClassificationIT > testTooLowConfiguredMemoryStillStarts FAILED
    java.lang.AssertionError: 
    Expected: <0L>
         but: was <1L>
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.junit.Assert.assertThat(Assert.java:956)
        at org.junit.Assert.assertThat(Assert.java:923)
        at org.elasticsearch.xpack.ml.integration.MlNativeDataFrameAnalyticsIntegTestCase.cleanUpAnalytics(MlNativeDataFrameAnalyticsIntegTestCase.java:96)
        at org.elasticsearch.xpack.ml.integration.MlNativeDataFrameAnalyticsIntegTestCase.cleanUpResources(MlNativeDataFrameAnalyticsIntegTestCase.java:87)
        at org.elasticsearch.xpack.ml.integration.MlNativeIntegTestCase.cleanUp(MlNativeIntegTestCase.java:181)
        at org.elasticsearch.xpack.ml.integration.ClassificationIT.cleanup(ClassificationIT.java:111)

    java.lang.AssertionError: test leaves transient cluster metadata behind
    Expected: an empty collection
         but: <[logger.org.elasticsearch.xpack.ml.dataframe.inference, logger.org.elasticsearch.xpack.core.ml.inference]>
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
@henningandersen henningandersen added >test-failure Triaged test failures from CI :ml Machine learning v8.0.0 v7.10.0 labels Aug 31, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (:ml)

@henningandersen
Copy link
Contributor Author

Muted in master and 7.x in #61705

@benwtrent benwtrent self-assigned this Aug 31, 2020
@benwtrent
Copy link
Member

Two issues:

  1. The native process is checking the memory limit on start still. It shouldn't do this.
  2. There is a race condition if the process crashes REALLY early (i.e. failing to start).

The race condition is here:

Basically, the onCrash callback is being called, but then we are raising that the on start failed.

Logs:

[2020-08-31T11:06:13,881][DEBUG][o.e.x.m.d.p.AnalyticsProcessManager] [integTest-0] [low_memory_analysis] creating analytics process with config [{"job_id":"low_memory_analysis","rows":10000,"cols":8,"memory_limit":1024,"threads":1,"results_field":"ml","categorical_fields":["text-field","keyword-field","outer-field.inner-field","alias-to-nested-field","alias-to-keyword-field"],"analysis":{"name":"classification","parameters":{"prediction_field_type":"string","num_classes":2,"dependent_variable":"outer-field.inner-field","num_top_classes":2,"training_percent":100.0,"class_assignment_objective":"maximize_minimum_recall","prediction_field_name":"outer-field.inner-field_prediction"}}}]
[2020-08-31T11:06:13,980][FATAL][o.e.x.m.p.l.CppLogMessageHandler] [integTest-0] [low_memory_analysis] [data_frame_analyzer/14993] [CLogger.cc@399] Input error: memory limit 0MB is too low to perform analysis. You need to give the process at least 26MB, but preferably more.
[2020-08-31T11:06:13,981][INFO ][o.e.x.m.p.AbstractNativeProcess] [integTest-0] [low_memory_analysis] State output finished
[2020-08-31T11:06:13,982][ERROR][o.e.x.m.p.AbstractNativeProcess] [integTest-0] [low_memory_analysis] analytics process stopped unexpectedly: Input error: memory limit 0MB is too low to perform analysis. You need to give the process at least 26MB, but preferably more.

[2020-08-31T11:06:13,982][WARN ][o.e.x.m.d.p.AnalyticsProcessManager] [integTest-0] [low_memory_analysis] crashed with a process context with reason [[low_memory_analysis] analytics process stopped unexpectedly: Input error: memory limit 0MB is too low to perform analysis. You need to give the process at least 26MB, but preferably more.
]
[2020-08-31T11:06:13,981][ERROR][o.e.x.m.d.p.AnalyticsProcessManager] [integTest-0] [low_memory_analysis] failed to start native process
org.elasticsearch.ElasticsearchException: Failed to start data frame analytics process
	at org.elasticsearch.xpack.core.ml.utils.ExceptionsHelper.serverError(ExceptionsHelper.java:51) ~[x-pack-core-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.xpack.ml.dataframe.process.AnalyticsProcessManager.createProcess(AnalyticsProcessManager.java:325) ~[x-pack-ml-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.xpack.ml.dataframe.process.AnalyticsProcessManager$ProcessContext.startProcess(AnalyticsProcessManager.java:486) ~[x-pack-ml-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.xpack.ml.dataframe.process.AnalyticsProcessManager.lambda$runJob$2(AnalyticsProcessManager.java:147) [x-pack-ml-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:647) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]
[2020-08-31T11:06:13,983][DEBUG][o.e.x.m.d.p.AnalyticsProcessManager] [integTest-0] [low_memory_analysis] Stopping process
[2020-08-31T11:06:13,983][DEBUG][o.e.x.m.d.e.DataFrameDataExtractor] [integTest-0] [low_memory_analysis] Data extractor was cancelled
[2020-08-31T11:06:13,983][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [integTest-0] uncaught exception in thread [elasticsearch[integTest-0][generic][T#2]]
org.elasticsearch.ElasticsearchException: Failed to start data frame analytics process
	at org.elasticsearch.xpack.core.ml.utils.ExceptionsHelper.serverError(ExceptionsHelper.java:51) ~[?:?]
	at org.elasticsearch.xpack.ml.dataframe.process.AnalyticsProcessManager.createProcess(AnalyticsProcessManager.java:325) ~[?:?]
	at org.elasticsearch.xpack.ml.dataframe.process.AnalyticsProcessManager$ProcessContext.startProcess(AnalyticsProcessManager.java:486) ~[?:?]
	at org.elasticsearch.xpack.ml.dataframe.process.AnalyticsProcessManager.lambda$runJob$2(AnalyticsProcessManager.java:147) ~[?:?]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:647) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]
[2020-08-31T11:06:14,002][WARN ][o.e.x.m.p.l.CppLogMessageHandler] [integTest-0] [controller/14807] [CDetachedProcessSpawner.cc@193] Child process with PID 14993 has exited with exit code 1
[2020-08-31T11:06:42,352][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [integTest-0] Upgrade mode noop
[2020-08-31T11:06:42,530][INFO ][o.e.x.m.a.TransportStopDataFrameAnalyticsAction] [integTest-0] [low_memory_analysis] Stopping task with force [false]

@benwtrent
Copy link
Member

@dimitris-athanasiou ^^ ping on the race

dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this issue Sep 2, 2020
While starting the data frame analytics process it is possible
to get an exception before the process crash handler is in place.
In addition, right after starting the process, we check the process
is alive to ensure we capture a failed process. However, those exceptions
are unhandled.

This commit catches any exception thrown while starting the process
and sets the task to failed with the root cause error message.

I have also taken the chance to remove some unused parameters
in `NativeAnalyticsProcessFactory`.

Relates elastic#61704
dimitris-athanasiou added a commit that referenced this issue Sep 2, 2020
While starting the data frame analytics process it is possible
to get an exception before the process crash handler is in place.
In addition, right after starting the process, we check the process
is alive to ensure we capture a failed process. However, those exceptions
are unhandled.

This commit catches any exception thrown while starting the process
and sets the task to failed with the root cause error message.

I have also taken the chance to remove some unused parameters
in `NativeAnalyticsProcessFactory`.

Relates #61704
dimitris-athanasiou added a commit that referenced this issue Sep 2, 2020
…61838) (#61847)

While starting the data frame analytics process it is possible
to get an exception before the process crash handler is in place.
In addition, right after starting the process, we check the process
is alive to ensure we capture a failed process. However, those exceptions
are unhandled.

This commit catches any exception thrown while starting the process
and sets the task to failed with the root cause error message.

I have also taken the chance to remove some unused parameters
in `NativeAnalyticsProcessFactory`.

Relates #61704

Backport of #61838
benwtrent added a commit that referenced this issue Sep 2, 2020
benwtrent added a commit to benwtrent/elasticsearch that referenced this issue Sep 2, 2020
benwtrent added a commit that referenced this issue Sep 2, 2020
)

* [ML] unmute testTooLowConfiguredMemoryStillStarts (#61846)

Native PR addresses this test failure: elastic/ml-cpp#1465


closes #61704

closes #61561
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:ml Machine learning >test-failure Triaged test failures from CI v7.10.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants