[ML] Don't treat stale FAILED jobs as OPENING in job allocation #31800

droberts195
Contributor

Job persistent tasks with stale allocation IDs used to always be
considered as OPENING jobs in the ML job node allocation decision.
However, FAILED jobs are not relocated to other nodes, which leads
to them blocking up the nodes they failed on after node restarts.
FAILED jobs should not restrict how many other jobs can open on a
node, regardless of whether they are stale or not.

Closes #31794
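
A minimal sketch of the rule described above, for illustration only. It is not the code changed by this PR: `StaleFailedJobSketch`, `JobTask`, `effectiveState` and `countJobsLimitingNode` are hypothetical names, the `JobState` enum is a simplified stand-in, and staleness is modelled as a plain mismatch between two allocation IDs.

```java
import java.util.List;

final class StaleFailedJobSketch {

    // Simplified stand-in for the ML job states (assumption: reduced set).
    enum JobState { OPENING, OPENED, CLOSED, FAILED }

    /** Hypothetical stand-in for a job persistent task; not the real Elasticsearch class. */
    static final class JobTask {
        final JobState reportedState;
        final long stateAllocationId;   // allocation ID recorded when the state was last written
        final long currentAllocationId; // allocation ID of the task's current assignment

        JobTask(JobState reportedState, long stateAllocationId, long currentAllocationId) {
            this.reportedState = reportedState;
            this.stateAllocationId = stateAllocationId;
            this.currentAllocationId = currentAllocationId;
        }

        // The state is stale if it was written under a different allocation ID.
        boolean isStale() {
            return stateAllocationId != currentAllocationId;
        }
    }

    /**
     * FAILED stays FAILED whether or not the state is stale, because FAILED
     * jobs are not relocated. Any other stale state is treated as OPENING.
     */
    static JobState effectiveState(JobTask task) {
        if (task.reportedState == JobState.FAILED) {
            return JobState.FAILED;
        }
        return task.isStale() ? JobState.OPENING : task.reportedState;
    }

    /** Counts the tasks that should restrict how many other jobs can open on a node. */
    static int countJobsLimitingNode(List<JobTask> tasksOnNode) {
        int count = 0;
        for (JobTask task : tasksOnNode) {
            if (effectiveState(task) != JobState.FAILED) {
                count++;
            }
        }
        return count;
    }
}
```

Under the old behaviour, a stale FAILED task would have been counted as OPENING here; in this sketch it is skipped, so it no longer blocks the node it failed on.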

@elasticmachine
Collaborator

Pinging @elastic/ml-core

Member

@davidkyle davidkyle left a comment

LGTM. I left a readability suggestion.

- // Don't count FAILED jobs, as they don't consume native memory
- if (jobState != JobState.FAILED) {
+ // Don't count CLOSED or FAILED jobs, as they don't consume native memory
+ if (jobState != JobState.CLOSED && jobState != JobState.FAILED) {
Member

You can use jobState.isAnyOf(JobState.CLOSED, JobState.FAILED) == false
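
For readers unfamiliar with the helper, here is a rough sketch of how an `isAnyOf`-style method on an enum can work and how the suggested condition reads. `SketchJobState` and `IsAnyOfSketch` are hypothetical stand-ins written for this example, not the real `JobState` class, whose implementation may differ.

```java
import java.util.Arrays;

// Hypothetical stand-in for JobState, used only to illustrate the suggestion.
enum SketchJobState {
    OPENING, OPENED, CLOSING, CLOSED, FAILED;

    /** Returns true if this state equals any of the given candidates. */
    boolean isAnyOf(SketchJobState... candidates) {
        return Arrays.asList(candidates).contains(this);
    }
}

final class IsAnyOfSketch {

    /** Same check as the two chained != comparisons, written as one call. */
    static boolean countsTowardsNodeLimit(SketchJobState state) {
        return state.isAnyOf(SketchJobState.CLOSED, SketchJobState.FAILED) == false;
    }
}
```

The single call keeps the list of excluded states in one place, so adding another state later only means adding one more argument.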

@droberts195 droberts195 merged commit 92de94c into elastic:master Jul 5, 2018
@droberts195 droberts195 deleted the ignore_stale_failed_jobs_in_job_allocation branch July 5, 2018 12:26
droberts195 added a commit that referenced this pull request Jul 5, 2018
droberts195 added a commit that referenced this pull request Jul 5, 2018
dnhatn added a commit that referenced this pull request Jul 5, 2018
* 6.x:
  Test: Do not remove xpack templates when cleaning (#31642)
  SQL: Allow long literals (#31777)
  SQL: Fix incorrect message for aliases (#31792)
  Detach Transport from TransportService (#31727)
  6.3.1 release notes (#31829)
  Add unreleased version 6.3.2
  [ML][TEST] Use java 11 valid time format in DataDescriptionTests (#31817)
  [ML] Don't treat stale FAILED jobs as OPENING in job allocation (#31800)
  [ML] Fix calendar and filter updates from non-master nodes (#31804)
  Fix license header generation on Windows (#31790)
  mark XPackRestIT.test {p0=monitoring/bulk/10_basic/Bulk indexing of monitoring data} as AwaitsFix
  Add JDK11 support without enabling in CI (#31644)
  Watcher: Fix check for currently executed watches (#31137)
  [DOCS] Fixes 6.3.0 release notes (#31771)
  Watcher: Ensure correct method is used to read secure settings (#31753)
  [ML] Rate limit established model memory updates (#31768)
  SQL: Update CLI logo
dnhatn added a commit that referenced this pull request Jul 5, 2018
* master:
  REST high-level client: add get index API (#31703)
  SQL: Allow long literals (#31777)
  SQL: Fix incorrect message for aliases (#31792)
  Test: Do not remove xpack templates when cleaning (#31642)
  Reduce more raw types warnings (#31780)
  Add unreleased version 6.3.2
  Scripting: Remove support for deprecated StoredScript contexts (#31394)
  [ML][TEST] Use java 11 valid time format in DataDescriptionTests (#31817)
  [ML] Don't treat stale FAILED jobs as OPENING in job allocation (#31800)
  [ML] Fix calendar and filter updates from non-master nodes (#31804)
  Fix license header generation on Windows (#31790)
  mark RollupIT.testTwoJobsStartStopDeleteOne as AwaitsFix
  mark SearchAsyncActionTests.testFanOutAndCollect as AwaitsFix
  Correct exclusion of test on JDK 11
  Fix doclint jdk 11
  Add JDK11 support and enable in CI (#31644)
  Watcher: Fix check for currently executed watches (#31137)
  Watcher: Ensure correct method is used to read secure settings (#31753)
  SQL: Update CLI logo