[ML] Machine learning data frame analytics #43544
Merged: dimitris-athanasiou merged 110 commits into master from feature-ml-data-frame-analytics on Jun 25, 2019
With this commit, before starting the analytics we first reindex the source data frame into a new index. Note that the settings and the mappings of the source index should be maintained.
This also prepares for allowing the `DataFrameDataExtractor` to be reused while joining the results with the raw documents.
Change the download location to load the custom binaries created in elastic/ml-cpp#344.
In order to sanity check that analytics results are joined correctly with their corresponding dataframe rows, we write a checksum for each dataframe row which is a 32-bit hash of the analysis fields. The analytics process includes it in the results. Upon joining we check that the checksums match.
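The checksum idea described above can be sketched as follows. This is only an illustration under stated assumptions: the commit says the checksum is a 32-bit hash of the analysis fields but does not name the hash function, so `zlib.crc32` is a stand-in, and the names `row_checksum`, `join_results`, and `outlier_score` are hypothetical.

```python
import zlib


def row_checksum(row, analysis_fields):
    """Return a 32-bit checksum over the values of the analysis fields.

    crc32 is an assumed stand-in; the actual hash is not stated in the commit.
    """
    # Join the values in a fixed field order so the same row always hashes alike.
    payload = "\x1f".join(str(row[f]) for f in analysis_fields)
    return zlib.crc32(payload.encode("utf-8")) & 0xFFFFFFFF


def join_results(rows, results, analysis_fields):
    """Pair each row with its result and fail if the checksums disagree."""
    joined = []
    for row, result in zip(rows, results):
        expected = row_checksum(row, analysis_fields)
        if result["checksum"] != expected:
            raise ValueError(
                f"checksum mismatch: expected {expected}, got {result['checksum']}"
            )
        joined.append({**row, "outlier_score": result["outlier_score"]})
    return joined


rows = [{"x": 1.0, "y": 2.0}, {"x": 3.5, "y": 0.1}]
fields = ["x", "y"]
# Stand-in for the analytics process: echo each checksum back with a made-up score.
results = [{"checksum": row_checksum(r, fields), "outlier_score": 0.5} for r in rows]
joined = join_results(rows, results, fields)
```

If a result were paired with the wrong row, its echoed checksum would differ from the one recomputed from that row (barring a hash collision), and the join would fail instead of silently writing a result to the wrong document.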
Converts data frame analytics to run as persistent tasks. Adds the following APIs:
- PUT _ml/data_frame/analysis/{id}
- GET _ml/data_frame/analysis/{id}
- GET _ml/data_frame/analysis/{id}/_stats
- POST _ml/data_frame/analysis/{id}/_start
- DELETE _ml/data_frame/analysis/{id}
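As a rough illustration of how the endpoints listed above fit together, the sketch below maps an action name to its HTTP method and path. The methods and paths are taken from the list; the helper `analytics_request`, the action names, and the job id `my-job` are hypothetical, and no request bodies are shown because the commit does not describe them.

```python
BASE = "_ml/data_frame/analysis"

# Action name -> (HTTP method, path suffix), following the list in the commit message.
ROUTES = {
    "put": ("PUT", ""),
    "get": ("GET", ""),
    "stats": ("GET", "/_stats"),
    "start": ("POST", "/_start"),
    "delete": ("DELETE", ""),
}


def analytics_request(action, job_id):
    """Return the (method, path) pair for one data frame analytics action."""
    method, suffix = ROUTES[action]
    return method, f"{BASE}/{job_id}{suffix}"


# A plausible lifecycle for one job: create, start, poll stats, delete.
lifecycle = [analytics_request(a, "my-job") for a in ("put", "start", "stats", "delete")]
```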
* ML: Add query support for dataframe analytics config
* Adding query to reindex, if needed. Also fixing minor bug in extractor
* Adding default query to config, adjusting extractor factory
* adjusting analytics extractor factory
* Adjust config parse to not store default fields of parsed query
* fixing reindex and yml tests
* Only querying on reindex analytics run
* removing unused function
)" This reverts commit 9bb2a02.
…unning analytics (#38928)
* [Feature][ML] Add authz check for dataframe source index
* fixing origin for client calls and adding headers
* addressing PR comments
* Having bulk request be done with headers in origin
* addressing pr comments and failing test
* making analyses immutable
* adjusting indexnames and privs for security tests
Adds progress reporting. Progress is reported per state. In particular, this adds progress reporting for the reindexing state and the analyzing state.
For reindexing, we now store the reindex task id and use it to get the task info and calculate progress from the number of docs created against the total docs. For analyzing, we read the progress reported by the native process and store it in memory.
The get tasks action has been changed to direct to the node running the process when possible. Progress is then reported alongside the rest of the stats for running tasks.
This commit also adds integration tests on the multi-node environment. Those tests revealed some issues which are also fixed here:
- Registering named content correctly
- Wait for task state to be `started` before responding in the start API
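The per-state progress described above might be computed along these lines. This is a minimal sketch, not the actual implementation: the function names, the integer-percent scale, the clamping, and the choice to report 0 when the total is unknown are all assumptions.

```python
def reindex_progress(created, total):
    """Percent of docs created so far, as an integer clamped to 0..100.

    Assumption: an unknown or zero total is reported as 0 percent.
    """
    if total <= 0:
        return 0
    return max(0, min(100, (created * 100) // total))


def progress_report(created, total, analyzing_percent):
    """Combine both states; analyzing_percent is whatever the native process last reported."""
    return {
        "reindexing": reindex_progress(created, total),
        "analyzing": analyzing_percent,
    }


# Example: 50 of 200 docs reindexed, analysis not yet started.
report = progress_report(created=50, total=200, analyzing_percent=0)
```

Keeping the two states separate mirrors the commit's description: reindexing progress is derived from task info on demand, while analyzing progress is a value held in memory and simply passed through.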
dimitris-athanasiou added the >feature, :ml (Machine learning), v8.0.0, :ml/Transform (Transform), and v7.3.0 labels on Jun 24, 2019
Pinging @elastic/ml-core
@elasticmachine update branch
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this pull request on Jun 25, 2019
This merges the initial work that adds a framework for performing machine learning analytics on data frames. The feature is currently experimental and requires a platinum license. Note that the original commits can be found in the `feature-ml-data-frame-analytics` branch.
A new set of APIs is added which allows the creation of data frame analytics jobs. Configuration allows specifying different types of analysis to be performed on a data frame. At first there is support for outlier detection. The APIs are:
- PUT _ml/data_frame/analysis/{id}
- GET _ml/data_frame/analysis/{id}
- GET _ml/data_frame/analysis/{id}/_stats
- POST _ml/data_frame/analysis/{id}/_start
- POST _ml/data_frame/analysis/{id}/_stop
- DELETE _ml/data_frame/analysis/{id}
When a data frame analytics job is started, a persistent task is created and started. The main steps of the task are:
1. reindex the source index into the dest index
2. analyze the data through the data_frame_analyzer c++ process
3. merge the results of the process back into the destination index
In addition, an evaluation API is added which packages commonly used metrics that provide evaluation of various analysis:
- POST _ml/data_frame/_evaluate
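The three task steps described above (reindex, analyze, merge) can be sketched in memory as follows. Everything here is illustrative: indices are plain dicts, the scoring function is a toy stand-in for the data_frame_analyzer process (not its actual outlier detection method), and the result field `ml.outlier_score` is a hypothetical name.

```python
def reindex(source_docs):
    """Step 1: copy the source documents into a new destination index."""
    return {doc_id: dict(doc) for doc_id, doc in source_docs.items()}


def analyze(dest_docs, fields):
    """Step 2: toy stand-in for the native process.

    Scores each doc by its summed absolute distance from the per-field mean.
    """
    if not dest_docs:
        return {}
    n = len(dest_docs)
    means = {f: sum(d[f] for d in dest_docs.values()) / n for f in fields}
    return {
        doc_id: sum(abs(d[f] - means[f]) for f in fields)
        for doc_id, d in dest_docs.items()
    }


def merge(dest_docs, scores):
    """Step 3: write each result back onto its destination document."""
    for doc_id, score in scores.items():
        dest_docs[doc_id]["ml"] = {"outlier_score": score}
    return dest_docs


source = {"a": {"x": 1.0}, "b": {"x": 1.2}, "c": {"x": 9.0}}
dest = merge(reindex(source), analyze(reindex(source), ["x"]))
```

Because the results are merged into the reindexed copy, the source index is left untouched, which is the main reason for the reindex step in this sketch.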
dimitris-athanasiou added a commit that referenced this pull request on Jun 25, 2019