[DOCS] Add total feature importance to regression example #1379

Merged 1 commit on Sep 29, 2020
85 changes: 71 additions & 14 deletions docs/en/stack/ml/df-analytics/flightdata-regression.asciidoc
@@ -123,10 +123,10 @@ exclude fields that either contain erroneous data or describe the
`dependent_variable`.
.. Choose a training percent of `90` which means it randomly selects 90% of the
source data for training.
.. If you want to experiment with <<ml-feature-importance,feature importance>>,
specify a value in the advanced configuration options. In this example, we
choose to return a maximum of 5 feature importance values per document. This
option affects the speed of the analysis, so by default it is disabled.
.. If you want to experiment with <<ml-feature-importance,{feat-imp}>>, specify
a value in the advanced configuration options. In this example, we choose to
return a maximum of 5 {feat-imp} values per document. This option affects the
speed of the analysis, so by default it is disabled.
.. Use the default memory limit for the job. If the job requires more than this
amount of memory, it fails to start. If the available memory on the node is
limited, this setting makes it possible to prevent job execution.
@@ -329,16 +329,24 @@ table to show only testing or training data and you can select which fields are
shown in the table. You can also enable histogram charts to get a better
understanding of the distribution of values in your data.

If you chose to calculate feature importance, the destination index also
contains `ml.feature_importance` objects. Every field that is included in the
{reganalysis} (known as a _feature_ of the data point) is assigned a feature
importance value. However, only the most significant values (in this case, the
top 5) are stored in the index. These values indicate which features had the
biggest (positive or negative) impact on each prediction. In {kib}, you can see
this information displayed in the form of a decision plot:
If you chose to calculate {feat-imp}, the destination index also contains
`ml.feature_importance` objects. Every field that is included in the
{reganalysis} (known as a _feature_ of the data point) is assigned a {feat-imp}
value. This value has both a magnitude and a direction (positive or negative),
which indicates how each field affects a particular prediction. Only the most
significant values (in this case, the top 5) are stored in the index. However,
the trained model metadata also contains the average magnitude of the {feat-imp}
values for each field across all the training data. You can view this
summarized information in {kib}:

[role="screenshot"]
image::images/flights-regression-importance.png["A decision plot for feature importance values in {kib}"]
image::images/flights-regression-total-importance.png["Total {feat-imp} values in {kib}"]

You can also see the {feat-imp} values for each individual prediction in the
form of a decision plot:

[role="screenshot"]
image::images/flights-regression-importance.png["A decision plot for {feat-imp} values in {kib}"]

The decision path starts at a baseline, which is the average of the predictions
for all the data points in the training data set. From there, the feature
@@ -350,12 +358,60 @@ delay. This type of information can help you to understand how models arrive at
their predictions. It can also indicate which aspects of your data set are most
influential or least useful when you are training and tuning your model.
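
The decision path described above can be sketched in a few lines of Python. This is a minimal illustration, not how {kib} draws the plot: the baseline and the importance numbers are made up, `decision_path` is a hypothetical helper name, and the sketch assumes that the path is formed by adding each signed feature importance value to the baseline in turn until the final prediction is reached.

```python
from itertools import accumulate


def decision_path(baseline, importances):
    """Return the running totals from the baseline to the final prediction.

    importances is a list of (feature_name, importance) pairs, where each
    importance is a signed contribution to one prediction.
    """
    values = [value for _, value in importances]
    # With initial=baseline, accumulate yields baseline, baseline + v1, ...
    return list(accumulate(values, initial=baseline))


# Hypothetical numbers for illustration only (not taken from the data set).
baseline = 47.0
importances = [
    ("FlightTimeMin", 12.5),
    ("DistanceKilometers", -3.0),
    ("dayOfWeek", 1.5),
]

path = decision_path(baseline, importances)
prediction = path[-1]  # last point on the path is the predicted value
```

Each entry in `path` corresponds to one step on the plot, so a large jump between neighboring entries marks a feature that moved this particular prediction a lot.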

If you do not use {kib}, you can see the same information by using the standard
{es} search command to view the results in the destination index.
If you do not use {kib}, you can see summarized {feat-imp} values by using the
{ref}/get-inference.html[get trained model API] and the individual values by
searching the destination index.

.API example
[%collapsible]
====
[source,console]
--------------------------------------------------
GET _ml/inference/model-flight-delays*?include=total_feature_importance
--------------------------------------------------
// TEST[skip:TBD]
The snippet below shows an example of the total feature importance details in
the trained model metadata:
[source,console-result]
----
{
"count" : 1,
"trained_model_configs" : [
{
"model_id" : "model-flight-delays-1601312043770",
...
"metadata" : {
...
"total_feature_importance" : [
{
"feature_name" : "dayOfWeek",
"importance" : {
"mean_magnitude" : 0.38674590521018903, <1>
"min" : -9.42823116446923, <2>
"max" : 8.707461689065173 <3>
}
},
{
"feature_name" : "OriginWeather",
"importance" : {
"mean_magnitude" : 0.18548393012368913,
"min" : -9.079576266629092,
"max" : 5.142479101907649
}
...
----
<1> This value is the average of the absolute {feat-imp} values for the
`dayOfWeek` field across all the training data.
<2> This value is the minimum {feat-imp} value across all the training data for
this field.
<3> This value is the maximum {feat-imp} value across all the training data for
this field.
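
If you fetch this response from a script, a small helper can rank the fields by their average magnitude. The following is a sketch under stated assumptions: `rank_by_mean_magnitude` is a hypothetical name, the `response` dictionary is a hand-trimmed copy of the example above with the elided parts left out, and no request is sent to a cluster.

```python
def rank_by_mean_magnitude(response):
    """Return (feature_name, mean_magnitude) pairs, largest magnitude first."""
    ranked = []
    for config in response.get("trained_model_configs", []):
        # Models without this metadata simply contribute nothing.
        totals = config.get("metadata", {}).get("total_feature_importance", [])
        for entry in totals:
            ranked.append(
                (entry["feature_name"], entry["importance"]["mean_magnitude"])
            )
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Trimmed copy of the example response; only the fields shown above are kept.
response = {
    "count": 1,
    "trained_model_configs": [
        {
            "model_id": "model-flight-delays-1601312043770",
            "metadata": {
                "total_feature_importance": [
                    {
                        "feature_name": "OriginWeather",
                        "importance": {
                            "mean_magnitude": 0.18548393012368913,
                            "min": -9.079576266629092,
                            "max": 5.142479101907649,
                        },
                    },
                    {
                        "feature_name": "dayOfWeek",
                        "importance": {
                            "mean_magnitude": 0.38674590521018903,
                            "min": -9.42823116446923,
                            "max": 8.707461689065173,
                        },
                    },
                ]
            },
        }
    ],
}

ranking = rank_by_mean_magnitude(response)
for name, magnitude in ranking:
    print(f"{name}: {magnitude:.3f}")
```

Sorting by `mean_magnitude` rather than by `min` or `max` ranks fields by their typical influence across the training data instead of by a single extreme prediction.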
To see the top {feat-imp} values for each prediction, search the destination
index. For example:
[source,console]
--------------------------------------------------
GET model-flight-delays/_search
@@ -399,6 +455,7 @@ The snippet below shows a part of a document with the annotated results:
}
...
----
====

[[flightdata-regression-evaluate]]