[ML] Machine learning data frame analytics #43544

Merged: 110 commits merged on Jun 25, 2019

dimitris-athanasiou
Contributor

This merges the initial work that adds a framework for performing
machine learning analytics on data frames. The feature is currently experimental
and requires a platinum license.

A new set of APIs is added which allows the creation of data frame analytics
jobs. Configuration allows specifying different types of analysis to be performed
on a data frame. Initially, only outlier detection is supported.

When a data frame analytics job is started a persistent task is created and started.
The main steps of the task are:

  1. reindex the source index into the dest index
  2. analyze the data through the data_frame_analyzer C++ process
  3. merge the results of the process back into the destination index
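
The three steps above can be sketched in a few lines. This is only an illustrative in-memory sketch, not the Elasticsearch implementation: lists of dicts stand in for the indices, and `run_analytics`, `toy_outlier_score`, and the `outlier_score` field name are hypothetical names chosen for the example. The real analysis runs in the native `data_frame_analyzer` process.

```python
from typing import Callable, Dict, List

Row = Dict[str, float]


def toy_outlier_score(rows: List[Row]) -> List[Row]:
    """Hypothetical stand-in for the native process: distance from the mean."""
    values = [row["value"] for row in rows]
    mean = sum(values) / len(values)
    return [{"outlier_score": abs(v - mean)} for v in values]


def run_analytics(source: List[Row],
                  analyze: Callable[[List[Row]], List[Row]]) -> List[Row]:
    # Step 1: "reindex" by copying every source row into a separate destination.
    dest = [dict(row) for row in source]
    # Step 2: analyze the destination rows; one result is expected per row.
    results = analyze(dest)
    if len(results) != len(dest):
        raise ValueError("analysis must return exactly one result per row")
    # Step 3: merge each result back into its destination row.
    for row, result in zip(dest, results):
        row.update(result)
    return dest


source_index = [{"value": 1.0}, {"value": 1.0}, {"value": 10.0}]
dest_index = run_analytics(source_index, toy_outlier_score)
```

Copying before analyzing means the source rows are never modified; only the destination rows carry the merged results.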

In addition, an evaluation API is added which packages commonly used metrics
for evaluating the results of various analyses.

dimitris-athanasiou and others added 30 commits November 2, 2018 16:22
With this commit, we reindex the source data frame into a new index
before starting the analytics. Note that we should maintain
the settings and the mappings of the source index.
This also prepares for allowing the `DataFrameDataExtractor`
to be reused while joining the results with the raw documents.

Change the download location to load the custom binaries created in elastic/ml-cpp#344.

In order to sanity check that analytics results are joined
correctly with their corresponding dataframe rows, we write
a checksum for each dataframe row which is a 32-bit hash
of the analysis fields. The analytics process includes it in the results.
Upon joining we check that the checksums match.
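
A minimal sketch of the checksum check described above, under stated assumptions: the commit message only says the checksum is a 32-bit hash of the analysis fields, so `zlib.crc32` is used here purely as a stand-in, and the helper names (`row_checksum`, `join_results`) and the result layout (`checksum` and `fields` keys) are hypothetical.

```python
import zlib
from typing import Any, Dict, List, Sequence


def row_checksum(row: Dict[str, Any], analysis_fields: Sequence[str]) -> int:
    # Assumption: CRC32 stands in for the unspecified 32-bit hash.
    # Field values are joined with a separator so ("ab", "c") != ("a", "bc").
    payload = "\x1f".join(str(row[field]) for field in analysis_fields)
    return zlib.crc32(payload.encode("utf-8")) & 0xFFFFFFFF


def join_results(rows: List[Dict[str, Any]],
                 results: List[Dict[str, Any]],
                 analysis_fields: Sequence[str]) -> List[Dict[str, Any]]:
    joined = []
    for position, (row, result) in enumerate(zip(rows, results)):
        # Refuse to join when the result was computed for a different row.
        if row_checksum(row, analysis_fields) != result["checksum"]:
            raise ValueError(f"checksum mismatch at row {position}")
        merged = dict(row)
        merged.update(result["fields"])
        joined.append(merged)
    return joined


fields = ["x", "y"]
rows = [{"x": 1.0, "y": 2.0}, {"x": 3.0, "y": 4.0}]
# Simulated process output: each result echoes the checksum of its row.
results = [{"checksum": row_checksum(r, fields), "fields": {"score": 0.5}}
           for r in rows]
joined_rows = join_results(rows, results, fields)
```

If results arrive in a different order than the rows, the mismatch surfaces as an error instead of a silent mis-join.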
Converts data frame analytics to run as persistent tasks.

Adds the following APIs:

- PUT _ml/data_frame/analysis/{id}
- GET _ml/data_frame/analysis/{id}
- GET _ml/data_frame/analysis/{id}/_stats
- POST _ml/data_frame/analysis/{id}/_start
- DELETE _ml/data_frame/analysis/{id}

* ML: Add query support for dataframe analytics config

* Adding query to reindex, if needed. Also fixing minor bug in extractor

* Adding default query to config, adjusting extractor factory

* adjusting analytics extractor factory

* Adjust config parse to not store default fields of parsed query

* fixing reindex and yml tests

* Only querying on reindex analytics run

* removing unused function

…unning analytics (#38928)

* [Feature][ML] Add authz check for dataframe source index

* fixing origin for client calls and adding headers

* addressing PR comments

* Having bulk request be done with headers in origin

* addressing pr comments and failing test

* making analyses immutable

* adjusting indexnames and privs for security tests

Adds progress reporting. Progress is reported per state.
In particular, this adds progress reporting for the
reindexing state and the analyzing state.

For reindexing, we now store the reindex task id and we use
it to get the task info and calculate progress by taking into
consideration the number of docs created against the total docs.
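
As a rough illustration of the reindexing progress calculation, the ratio of created docs to total docs can be turned into a percentage. The function name `reindex_progress`, the integer percentage, and the handling of an empty source are assumptions made for this sketch; the commit message does not specify them.

```python
def reindex_progress(docs_created: int, total_docs: int) -> int:
    """Return reindexing progress as an integer percentage in [0, 100]."""
    if total_docs <= 0:
        # Assumption: with nothing to reindex, report no progress
        # rather than dividing by zero.
        return 0
    # Clamp in case the created count overshoots the reported total.
    return max(0, min(100, docs_created * 100 // total_docs))


quarter_done = reindex_progress(50, 200)
```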

For analyzing, we read the progress reported by the native process
and store it in memory. The get tasks action has been changed
to route to the node running the process when possible, so that
progress is reported alongside the other stats for
running tasks.

This commit adds integration tests that run in a multi-node environment.
Those tests have revealed some issues which are also fixed here:

- Registering named content correctly
- Wait for task state to be `started` before responding in the start API
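
The second fix, responding only once the task state is `started`, can be illustrated with a simple polling loop. Polling is a technique swapped in for this sketch and is not necessarily how the start API does it; `wait_for_state`, its parameters, and the `starting` state name are hypothetical.

```python
import time
from typing import Callable


def wait_for_state(get_state: Callable[[], str],
                   expected: str = "started",
                   timeout_s: float = 5.0,
                   poll_interval_s: float = 0.05) -> None:
    """Block until get_state() returns the expected state or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        if get_state() == expected:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"state did not become {expected!r} in time")
        time.sleep(poll_interval_s)


# Simulated task that needs two polls before it reports `started`.
observed_states = iter(["starting", "starting", "started"])
wait_for_state(lambda: next(observed_states), poll_interval_s=0.0)
```

A caller that responds only after `wait_for_state` returns never reports success for a task that has not actually started.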
@elasticmachine
Collaborator

Pinging @elastic/ml-core

@dimitris-athanasiou
Contributor Author

@elasticmachine update branch

dimitris-athanasiou merged commit 5fa36da into master Jun 25, 2019
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this pull request Jun 25, 2019
This merges the initial work that adds a framework for performing
machine learning analytics on data frames. The feature is currently experimental
and requires a platinum license. Note that the original commits can be
found in the `feature-ml-data-frame-analytics` branch.

A new set of APIs is added which allows the creation of data frame analytics
jobs. Configuration allows specifying different types of analysis to be performed
on a data frame. Initially, only outlier detection is supported.

The APIs are:

- PUT _ml/data_frame/analysis/{id}
- GET _ml/data_frame/analysis/{id}
- GET _ml/data_frame/analysis/{id}/_stats
- POST _ml/data_frame/analysis/{id}/_start
- POST _ml/data_frame/analysis/{id}/_stop
- DELETE _ml/data_frame/analysis/{id}

When a data frame analytics job is started a persistent task is created and started.
The main steps of the task are:

1. reindex the source index into the dest index
2. analyze the data through the data_frame_analyzer C++ process
3. merge the results of the process back into the destination index

In addition, an evaluation API is added which packages commonly used metrics
for evaluating the results of various analyses:

- POST _ml/data_frame/_evaluate
dimitris-athanasiou added a commit that referenced this pull request Jun 25, 2019
jpountz removed the :ml Machine learning label Jul 5, 2019
droberts195 added :ml Machine learning and removed :ml/Transform Transform labels Jul 8, 2019
dimitris-athanasiou deleted the feature-ml-data-frame-analytics branch May 7, 2020 10:16
8 participants