Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ML: add migrate anomalies assistant #36643

Merged

Conversation

benwtrent
Copy link
Member

This adds an endpoint, transport, and execution service for upgrading indices .ml-anomailes*.

We cannot use the upgrade plugin for these indices as they are already aliased for ML jobs.

The steps of the process are as follows

  1. Create new write indices and include them in the read aliases
  2. Point write aliases to the new write indices
  3. make the old indices read only
  4. reindex the old indices into new indices so that the old data is readable
  5. adjust the read aliases so that they point to the newly reindexed indices
  6. Delete the old indices

This PR does NOT contain docs for this new endpoint, or the HLRC addition. The PR is getting pretty bloated, I will open a new one for docs + HLRC after this one gets approved and merged.

@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core

);

// <1> Create the new write indices and set the read aliases to include them
createNewWriteIndicesIfNecessary(client, metaData, indexNameAndAliasProvider.newWriteIndices(),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The creation of the write indices, and adding them to the read alias are not done at the same time just in case the write indices already exist (from manual creation, or from a previous upgrade attempt), but the aliases still need set.

@droberts195 droberts195 added v6.7.0 and removed v6.6.0 labels Dec 18, 2018
Copy link
Contributor

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great. I couldn't see any major problems on first pass but would like to have another look so please don't merge this year - we'll pick this up again in January.

I left a couple of minor comments about backporting.

One more thing I'd like to do before we merge this is go through all accesses to the results indices and make sure we're always using _search and never get, as get won't work when there are multiple indices. (This is important for #29946 as well as upgrade, which is why the upgrade work is such a big step towards being able to support rolling results indices.) But any changes for this aspect can be done in a separate PR.

"methods": [ "POST" ],
"url": {
"path": "/_ml/anomaly_detectors/results/_upgrade",
"paths": [ "/_ml/anomaly_detectors/results/_upgrade" ],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remember to change to _xpack/ml in the backport to 6.x.

@benwtrent
Copy link
Member Author

run the default distro tests

@benwtrent
Copy link
Member Author

run the gradle build tests 1

@hendrikmuhs
Copy link

In the situation where one of the old indices was successfully reindexed (say index "A" -> "A2") , but another one failed (index "B" -> "B2"). We don't delete index "A" and we fail the migration and send the error to the user.

The user is free to retry the migration, we call the reindex for "A" -> "A2" again. With the parameters we send to the reindex call, we skip conflicts and ONLY create docs. We don't overwrite. So, nothing should happen with this attempt to reindex again.

Thanks @benwtrent, this answers my question. So if I re-run it "fills in" missing documents, so I re-read the whole source index but I do not write everything (only as needed). With other words, if I want - for some reason - to start from scratch I have to delete "A2", "B2", ... before running the migration again.

@benwtrent
Copy link
Member Author

@hendrikmuhs yeah, delete whatever new indices you want to recreate. The caution is if the new write index is deleted, those docs are not stored anywhere else, so that newer data would be lost.

@benwtrent
Copy link
Member Author

run the gradle build tests 1

@benwtrent
Copy link
Member Author

run the gradle build tests 1

@droberts195
Copy link
Contributor

Could you point to where you saw this?

Ah, sorry, I missed that the read aliases are updated in two places.

New write indices are created, we add them to the read aliases:
https://github.com/elastic/elasticsearch/pull/36643/files#diff-7331c0bf65f9aef5ff4d2d3aca4aef95R247

In that case I think we have the reverse problem, which is that as the reindex progresses the UI will see more and more duplicate results, until eventually the reindex is complete, the read aliases are adjusted to only point at the new indices and the UI only sees one copy of each result. Whether this is a major problem in reality depends on how long it takes to get from step <1> to step <5>. If it's only a few seconds any user who sees duplicates will think it's a display quirk, press refresh and by that time the duplicates will have disappeared. But if it's many minutes then they may suffer for long enough to report it as a bug.

This potential problem shouldn't stop this PR being merged, as this PR creates a framework for the ML upgrade steps and we can iterate on the details of what it does in a subsequent PR.


@Override
public Task createTask(long id, String type, String action, TaskId parentTaskId, Map<String, String> headers) {
return new CancellableTask(id, type, action, getDescription(), parentTaskId, headers) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't see where getDescription is overridden? The default implementation in TaskAwareRequest returns an empty string

* Creates a new TypedChainTaskExecutor.
* Each chainedTask is executed in order serially and after each execution the continuationPredicate is tested.
*
* On failures teh failureShortCircuitPredicate is tested.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo teh

@benwtrent
Copy link
Member Author

benwtrent commented Jan 8, 2019

@droberts195 the user should never see duplicates with this process.
1 Write indexes are created
2 Read aliases include the new write indices
3 Write aliases are changed to ONLY point to the new write indices
4 Old read indices are set to read only
5 Old read indices are reindexed to new read indices
6 Read aliases are adjusted atomically to remove from the old read indices and add to the new read indices
7 old read indices are deleted

@benwtrent
Copy link
Member Author

run the gradle build tests 2

@benwtrent
Copy link
Member Author

Jenkins retest this please

Copy link
Contributor

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

the user should never see duplicates with this process.

Thanks for clarifying. I can see that now.

My next doubt is about whether renormalization might need changing in some way, but let's sort that out in a followup PR.

Copy link
Contributor

@dimitris-athanasiou dimitris-athanasiou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link

@hendrikmuhs hendrikmuhs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@benwtrent benwtrent merged commit df3b58c into elastic:master Jan 9, 2019
@benwtrent benwtrent deleted the feature/ml-migrate-anomalies-assistent branch January 9, 2019 20:25
benwtrent added a commit that referenced this pull request Jan 10, 2019
* ML: add migrate anomalies assistant

* adjusting failure handling for reindex

* Fixing request and tests

* Adding tests to blacklist

* adjusting test

* test fix: posting data directly to the job instead of relying on datafeed

* adjusting API usage

* adding Todos and adjusting endpoint

* Adding types to reindexRequest

* removing unreliable "live" data test

* adding index refresh to test

* adding index refresh to test

* adding index refresh to yaml test

* fixing bad exists call

* removing todo

* Addressing remove comments

* Adjusting rest endpoint name

* making service have its own logger

* adjusting validity check for newindex names

* fixing typos

* fixing renaming
benwtrent added a commit that referenced this pull request Jan 10, 2019
@benwtrent
Copy link
Member Author

Customer facing changes from this PR are reverted in: #37879

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants