[ML] Don't treat stale FAILED jobs as OPENING in job allocation #31800

droberts195 · 2018-07-04T15:54:00Z

Job persistent tasks with stale allocation IDs used to always be
considered as OPENING jobs in the ML job node allocation decision.
However, FAILED jobs are not relocated to other nodes, which leads
to them blocking up the nodes they failed on after node restarts.
FAILED jobs should not restrict how many other jobs can open on a
node, regardless of whether they are stale or not.

Closes #31794
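To illustrate the behaviour described above, here is a minimal, self-contained sketch. It is not the code in `TransportOpenJobAction`. `StaleJobAllocationSketch`, `AssignedTask`, `effectiveState` and `countJobsOccupyingNode` are hypothetical names invented for this illustration, and the `stale` flag stands in for the real allocation ID comparison. A stale task is treated as OPENING unless it is FAILED, and CLOSED or FAILED jobs are left out of the node's job count.

```java
import java.util.Arrays;
import java.util.List;

final class StaleJobAllocationSketch {

    // Simplified stand-in for the ML job states (assumption, not the real enum).
    enum JobState { OPENING, OPENED, CLOSING, CLOSED, FAILED }

    // Hypothetical stand-in for a job persistent task assigned to a node.
    static final class AssignedTask {
        final JobState reportedState; // last state recorded for the task
        final boolean stale;          // true if recorded under an old allocation ID

        AssignedTask(JobState reportedState, boolean stale) {
            this.reportedState = reportedState;
            this.stale = stale;
        }
    }

    // State used for the allocation decision: a stale task is assumed to be
    // reopening on this node, unless it FAILED (failed jobs are not relocated).
    static JobState effectiveState(AssignedTask task) {
        if (task.stale && task.reportedState != JobState.FAILED) {
            return JobState.OPENING;
        }
        return task.reportedState;
    }

    // Counts the jobs that should limit how many more jobs can open on the node.
    static long countJobsOccupyingNode(List<AssignedTask> tasks) {
        return tasks.stream()
            .map(StaleJobAllocationSketch::effectiveState)
            .filter(state -> state != JobState.CLOSED && state != JobState.FAILED)
            .count();
    }

    public static void main(String[] args) {
        List<AssignedTask> tasks = Arrays.asList(
            new AssignedTask(JobState.OPENED, false), // counted as OPENED
            new AssignedTask(JobState.OPENED, true),  // stale: counted as OPENING
            new AssignedTask(JobState.FAILED, true)); // stale but FAILED: not counted
        System.out.println(countJobsOccupyingNode(tasks)); // prints 2
    }
}
```

Before this change, the stale FAILED task in the example would have been counted as OPENING, so the node would have appeared to host three jobs instead of two.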

elasticmachine · 2018-07-04T15:54:02Z

Pinging @elastic/ml-core

davidkyle

LGTM. Left a readability suggestion.

davidkyle · 2018-07-04T16:09:25Z

x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/action/TransportOpenJobAction.java

-                    // Don't count FAILED jobs, as they don't consume native memory
-                    if (jobState != JobState.FAILED) {
+                    // Don't count CLOSED or FAILED jobs, as they don't consume native memory
+                    if (jobState != JobState.CLOSED && jobState != JobState.FAILED) {


You can use jobState.isAnyOf(JobState.CLOSED, JobState.FAILED) == false
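As a rough illustration of the suggestion, the sketch below compares the two forms of the condition. The `JobState` enum here is a simplified stand-in, and its `isAnyOf` body is an assumption (a varargs membership check), not a copy of the real class. `IsAnyOfSketch` and both `countsTowardsNodeLoad*` helpers are hypothetical names used only for this example.

```java
import java.util.Arrays;

final class IsAnyOfSketch {

    // Simplified stand-in for the ML JobState enum (assumption).
    enum JobState {
        OPENING, OPENED, CLOSING, CLOSED, FAILED;

        // Assumed behaviour: true if this state equals any of the candidates.
        boolean isAnyOf(JobState... candidates) {
            return Arrays.stream(candidates).anyMatch(candidate -> this == candidate);
        }
    }

    // Condition as written in the diff under review.
    static boolean countsTowardsNodeLoadExplicit(JobState jobState) {
        return jobState != JobState.CLOSED && jobState != JobState.FAILED;
    }

    // Condition as suggested in the review comment.
    static boolean countsTowardsNodeLoad(JobState jobState) {
        return jobState.isAnyOf(JobState.CLOSED, JobState.FAILED) == false;
    }

    public static void main(String[] args) {
        // Both forms should agree for every state.
        for (JobState state : JobState.values()) {
            if (countsTowardsNodeLoad(state) != countsTowardsNodeLoadExplicit(state)) {
                throw new AssertionError("Mismatch for " + state);
            }
            System.out.println(state + " counts towards node load: " + countsTowardsNodeLoad(state));
        }
    }
}
```

Under that assumption the two conditions are equivalent, so the choice is purely about readability: the `isAnyOf` form keeps the excluded states in one argument list.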

* 6.x:
  Test: Do not remove xpack templates when cleaning (#31642)
  SQL: Allow long literals (#31777)
  SQL: Fix incorrect message for aliases (#31792)
  Detach Transport from TransportService (#31727)
  6.3.1 release notes (#31829)
  Add unreleased version 6.3.2
  [ML][TEST] Use java 11 valid time format in DataDescriptionTests (#31817)
  [ML] Don't treat stale FAILED jobs as OPENING in job allocation (#31800)
  [ML] Fix calendar and filter updates from non-master nodes (#31804)
  Fix license header generation on Windows (#31790)
  mark XPackRestIT.test {p0=monitoring/bulk/10_basic/Bulk indexing of monitoring data} as AwaitsFix
  Add JDK11 support without enabling in CI (#31644)
  Watcher: Fix check for currently executed watches (#31137)
  [DOCS] Fixes 6.3.0 release notes (#31771)
  Watcher: Ensure correct method is used to read secure settings (#31753)
  [ML] Rate limit established model memory updates (#31768)
  SQL: Update CLI logo

* master:
  REST high-level client: add get index API (#31703)
  SQL: Allow long literals (#31777)
  SQL: Fix incorrect message for aliases (#31792)
  Test: Do not remove xpack templates when cleaning (#31642)
  Reduce more raw types warnings (#31780)
  Add unreleased version 6.3.2
  Scripting: Remove support for deprecated StoredScript contexts (#31394)
  [ML][TEST] Use java 11 valid time format in DataDescriptionTests (#31817)
  [ML] Don't treat stale FAILED jobs as OPENING in job allocation (#31800)
  [ML] Fix calendar and filter updates from non-master nodes (#31804)
  Fix license header generation on Windows (#31790)
  mark RollupIT.testTwoJobsStartStopDeleteOne as AwaitsFix
  mark SearchAsyncActionTests.testFanOutAndCollect as AwaitsFix
  Correct exclusion of test on JDK 11
  Fix doclint jdk 11
  Add JDK11 support and enable in CI (#31644)
  Watcher: Fix check for currently executed watches (#31137)
  Watcher: Ensure correct method is used to read secure settings (#31753)
  SQL: Update CLI logo

droberts195 added >bug review v7.0.0 :ml Machine learning v6.4.0 v6.3.2 labels Jul 4, 2018

davidkyle approved these changes Jul 4, 2018

Use isAnyOf()

1617e84

droberts195 merged commit 92de94c into elastic:master Jul 5, 2018

droberts195 deleted the ignore_stale_failed_jobs_in_job_allocation branch July 5, 2018 12:26

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
