Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ML: Add upgrade mode docs, hlrc, and fix bug #37942

Merged
merged 8 commits into from
Jan 30, 2019

Conversation

benwtrent
Copy link
Member

This adds reference docs, the HLRC side of things, and fixes a bug for when there are no tasks when the option is set.

@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core

Copy link
Member

@davidkyle davidkyle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

docs/java-rest/high-level/ml/set-upgrade-mode.asciidoc Outdated Show resolved Hide resolved
docs/java-rest/high-level/ml/set-upgrade-mode.asciidoc Outdated Show resolved Hide resolved

Before indices related to {ml} jobs and {dfeeds} can be locked and upgraded, all
currently running jobs and {dfeeds} should be paused. When you set the `enabled`
parameter to `true`, the API pauses all job and {dfeed} tasks, which facilitates
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The upgrade docs say "Stop any X-Pack machine learning jobs that are running before starting the upgrade process. ". Can we therefore change this introductory sentence to something like this?:
"Before you start the upgrade progress, you must stop all jobs and datafeeds. This API simplifies that task."

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Short answer

I don't think we should use the word "stop" because we have a "stop datafeed" API and this endpoint does something different.

We could change the introductory sentence to something like this:

Before you start an upgrade that requires reindexing of ML internal indices, you must halt all jobs and {dfeeds}. This API simplifies that task. This API can also be used during other upgrades to prevent ML jobs shifting nodes multiple times as the upgrade progresses.

Long answer

We should probably expand the section in the upgrade docs to match current reality more closely. In the beginning (5.4/5.5) it was not safe to leave ML jobs running during a rolling upgrade, and the current advice dates back to then.

Subsequently we've worked to make rolling upgrade with open ML jobs safe, at least in terms of things not completely breaking. However, if practical, there is still a benefit to gracefully closing all ML jobs before upgrading and reopening them after the upgrade is complete because doing a graceful close means they'll persist their model state at the moment of closure, so will restart with the exact same model they had before the upgrade. The "if practical" bit is important though, because gracefully closing a large number of jobs or jobs with very large model state can take a long time, and such a delay is not always acceptable. We persist model state in the background at the end of lookback and every 3-4 hours in real-time operation, so a model is unlikely to have changed dramatically between the last background persistence and a forceful close.

Therefore what to do on upgrade is very much an "it depends" situation.

From an ML perspective there are two types of upgrade:

  1. An upgrade where nodes have to be restarted but indices do not have to be reindexed
  2. An upgrade where nodes have to be restarted and indices have to be reindexed

Type 2 here is much rarer than type 1. It only occurs on major version upgrade where ML was used in the previous major version, for example on upgrade from 6.x to 7.x when ML was first used in 5.x.

A type 2 upgrade has the hard requirement that jobs are not running during the reindexing portion of the upgrade, and there are two options to achieve this:

  • Close all ML jobs, do the upgrade, reopen the jobs that were closed - notice that this requires an external person or process to remember which jobs need reopening
  • Enable upgrade mode, do the upgrade, disable upgrade mode

A type 1 upgrade does not have such a hard requirement. There are 3 options:

  • Do the upgrade without closing ML jobs - ML jobs will shift around the available ML nodes as nodes are stopped and started, leading to highest availability of active ML jobs but also highest load on the cluster of ML jobs starting up and restoring model state
  • Close all ML jobs, do the upgrade, reopen the jobs that were closed - notice that this requires an external person or process to remember which jobs need reopening
  • Enable upgrade mode, do the upgrade, disable upgrade mode

So in the type 1 case upgrade mode avoids some of the cluster load from jobs potentially shifting nodes multiple times and loading their model state multiple times as the upgrade progresses. But it does not provide the benefit of persisting the absolute latest model state immediately before upgrade and restoring that on restart.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lcawl this is not stopping either. At least it is not stopping them in the same meaning we use elsewhere in the documentation.

It is stopping the current tasks from executing and disallowing new ones to be started. I will try and fix the wording

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a big fan of "halt", since that seems to just be a synonym of "stop". If it's doing something different (i.e. stopping tasks within the job or datafeed without actually stopping the job or datafeed), then we either need a different word or to be clearer about exactly what it's stopping.

Copy link
Contributor

@lcawl lcawl Jan 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about something like this:

When you upgrade your cluster, in some circumstances you must restart your nodes and
reindex your {ml} indices. In those circumstances, there must be no {ml} jobs running. You can close the {ml} jobs, do the upgrade, then open all the jobs again. Alternatively, you can use this API to temporarily pause the jobs [or whatever works best here... stop the tasks associated with the jobs?] and prevent new jobs from opening. You can also use this API during upgrades that do not require you to reindex your {ml} indices, though stopping jobs is not a requirement in that case.

For more information, see {stack-ref}/upgrading-elastic-stack.html[Upgrading the {stack}].

Note: I've created a task to improve the upgrade docs here: elastic/stack-docs#192

Copy link
Contributor

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few comments, one of which is large and partly relates to the ML upgrade docs. The ML upgrade docs can be changed in a followup PR to keep this one moving, but I added my thoughts here to give the background for other suggestions I made.

docs/java-rest/high-level/ml/set-upgrade-mode.asciidoc Outdated Show resolved Hide resolved
docs/reference/ml/apis/set-upgrade-mode.asciidoc Outdated Show resolved Hide resolved
docs/reference/ml/apis/set-upgrade-mode.asciidoc Outdated Show resolved Hide resolved
docs/reference/ml/apis/set-upgrade-mode.asciidoc Outdated Show resolved Hide resolved

Before indices related to {ml} jobs and {dfeeds} can be locked and upgraded, all
currently running jobs and {dfeeds} should be paused. When you set the `enabled`
parameter to `true`, the API pauses all job and {dfeed} tasks, which facilitates
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Short answer

I don't think we should use the word "stop" because we have a "stop datafeed" API and this endpoint does something different.

We could change the introductory sentence to something like this:

Before you start an upgrade that requires reindexing of ML internal indices, you must halt all jobs and {dfeeds}. This API simplifies that task. This API can also be used during other upgrades to prevent ML jobs shifting nodes multiple times as the upgrade progresses.

Long answer

We should probably expand the section in the upgrade docs to match current reality more closely. In the beginning (5.4/5.5) it was not safe to leave ML jobs running during a rolling upgrade, and the current advice dates back to then.

Subsequently we've worked to make rolling upgrade with open ML jobs safe, at least in terms of things not completely breaking. However, if practical, there is still a benefit to gracefully closing all ML jobs before upgrading and reopening them after the upgrade is complete because doing a graceful close means they'll persist their model state at the moment of closure, so will restart with the exact same model they had before the upgrade. The "if practical" bit is important though, because gracefully closing a large number of jobs or jobs with very large model state can take a long time, and such a delay is not always acceptable. We persist model state in the background at the end of lookback and every 3-4 hours in real-time operation, so a model is unlikely to have changed dramatically between the last background persistence and a forceful close.

Therefore what to do on upgrade is very much an "it depends" situation.

From an ML perspective there are two types of upgrade:

  1. An upgrade where nodes have to be restarted but indices do not have to be reindexed
  2. An upgrade where nodes have to be restarted and indices have to be reindexed

Type 2 here is much rarer than type 1. It only occurs on major version upgrade where ML was used in the previous major version, for example on upgrade from 6.x to 7.x when ML was first used in 5.x.

A type 2 upgrade has the hard requirement that jobs are not running during the reindexing portion of the upgrade, and there are two options to achieve this:

  • Close all ML jobs, do the upgrade, reopen the jobs that were closed - notice that this requires an external person or process to remember which jobs need reopening
  • Enable upgrade mode, do the upgrade, disable upgrade mode

A type 1 upgrade does not have such a hard requirement. There are 3 options:

  • Do the upgrade without closing ML jobs - ML jobs will shift around the available ML nodes as nodes are stopped and started, leading to highest availability of active ML jobs but also highest load on the cluster of ML jobs starting up and restoring model state
  • Close all ML jobs, do the upgrade, reopen the jobs that were closed - notice that this requires an external person or process to remember which jobs need reopening
  • Enable upgrade mode, do the upgrade, disable upgrade mode

So in the type 1 case upgrade mode avoids some of the cluster load from jobs potentially shifting nodes multiple times and loading their model state multiple times as the upgrade progresses. But it does not provide the benefit of persisting the absolute latest model state immediately before upgrade and restoring that on restart.

Copy link
Contributor

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me now.

@lcawl would you like to suggest any further changes to wording before this is merged?

prohibits new job and {dfeed} tasks from starting.

Subsequently, you can call the API with the enabled parameter set to false,
which causes {ml} jobs and {dfeeds} to return to their desired states.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: there's a space at the beginning of the line

@benwtrent benwtrent merged commit 8280a20 into elastic:master Jan 30, 2019
@benwtrent benwtrent deleted the feature/ml-upgrade-mode-docs branch January 30, 2019 12:51
benwtrent added a commit that referenced this pull request Jan 30, 2019
* ML: Add upgrade mode docs, hlrc, and fix bug

* [DOCS] Fixes build error and edits text

* adjusting docs

* Update docs/reference/ml/apis/set-upgrade-mode.asciidoc

Co-Authored-By: benwtrent <ben.w.trent@gmail.com>

* Update set-upgrade-mode.asciidoc

* Update set-upgrade-mode.asciidoc
@@ -224,7 +224,7 @@ public MlMetadataDiff(StreamInput in) throws IOException {
public void writeTo(StreamOutput out) throws IOException {
jobs.writeTo(out);
datafeeds.writeTo(out);
if (out.getVersion().onOrAfter(Version.V_7_0_0)) {
if (out.getVersion().onOrAfter(Version.V_6_7_0)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be possible to fix this separately next time, and add unit tests for serialization that fail with the wrong version conditionals?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants