[ML] Return total SHAP per feature as a new result type #1387

valeriy42 · 2020-07-06T13:29:37Z

This PR adds computation of the total feature importance values and outputs them as a new result type.

Example outputs:

Regression

{
    "row_results": {
        "checksum": 0,
        "results": {
            "ml": {
                "target_prediction": 1358.2039794921876,
                "is_training": true,
                "feature_importance": [
                    {
                        "feature_name": "c1",
                        "importance": 448.3700609767421
                    },
                    {
                        "feature_name": "c2",
                        "importance": 1146.2576129663276
                    },
                    {
                        "feature_name": "c3",
                        "importance": -388.896988459571
                    },
                    {
                        "feature_name": "c4",
                        "importance": 158.7185644712811
                    }
                ]
            }
        }
    }
},
{
    "model_metadata": {
        "total_feature_importance": [
            {
                "feature_name": "c4",
                "importance": {
                    "mean_magnitude": 233.4600671221131,
                    "min": -565.7664157184156,
                    "max": 468.8953979253651
                }
            },
            {
                "feature_name": "c3",
                "importance": {
                    "mean_magnitude": 227.52681995349807,
                    "min": -474.187447119175,
                    "max": 500.79764582176218
                }
            },
            {
                "feature_name": "c1",
                "importance": {
                    "mean_magnitude": 479.8491325534919,
                    "min": -584.6059620924166,
                    "max": 601.3424189083114
                }
            },
            {
                "feature_name": "c2",
                "importance": {
                    "mean_magnitude": 729.7375145579323,
                    "min": -1438.469059491588,
                    "max": 1428.738023747545
                }
            }
        ]
    }
}

Binary classification

{
    "row_results": {
        "checksum": 0,
        "results": {
            "ml": {
                "target_prediction": "foo",
                "prediction_probability": 0.9632688006864724,
                "prediction_score": 0.9632688006864724,
                "is_training": true,
                "feature_importance": [
                    {
                        "feature_name": "c1",
                        "classes": [
                            {
                                "class_name": "foo",
                                "importance": -0.050036180361561228
                            },
                            {
                                "class_name": "bar",
                                "importance": 0.050036180361561228
                            }
                        ]
                    },
                    {
                        "feature_name": "c2",
                        "classes": [
                            {
                                "class_name": "foo",
                                "importance": -2.787898169333443
                            },
                            {
                                "class_name": "bar",
                                "importance": 2.787898169333443
                            }
                        ]
                    },
                    {
                        "feature_name": "c3",
                        "classes": [
                            {
                                "class_name": "foo",
                                "importance": -0.9016447487592819
                            },
                            {
                                "class_name": "bar",
                                "importance": 0.9016447487592819
                            }
                        ]
                    },
                    {
                        "feature_name": "c4",
                        "classes": [
                            {
                                "class_name": "foo",
                                "importance": 0.4345632399908005
                            },
                            {
                                "class_name": "bar",
                                "importance": -0.4345632399908005
                            }
                        ]
                    }
                ]
            }
        }
    }
},
{
    "model_metadata": {
        "total_feature_importance": [
            {
                "feature_name": "c4",
                "classes": [
                    {
                        "class_name": "foo",
                        "importance": {
                            "mean_magnitude": 0.5077140893490363,
                            "min": -1.2245953772608847,
                            "max": 1.2245953772608847
                        }
                    },
                    {
                        "class_name": "bar",
                        "importance": {
                            "mean_magnitude": 0.5077140893490363,
                            "min": -1.2245953772608847,
                            "max": 1.2245953772608847
                        }
                    }
                ]
            },
            {
                "feature_name": "c3",
                "classes": [
                    {
                        "class_name": "foo",
                        "importance": {
                            "mean_magnitude": 0.37436172343432769,
                            "min": -1.3221622827321056,
                            "max": 1.3221622827321056
                        }
                    },
                    {
                        "class_name": "bar",
                        "importance": {
                            "mean_magnitude": 0.37436172343432769,
                            "min": -1.3221622827321056,
                            "max": 1.3221622827321056
                        }
                    }
                ]
            },
            {
                "feature_name": "c1",
                "classes": [
                    {
                        "class_name": "foo",
                        "importance": {
                            "mean_magnitude": 1.0116256529234005,
                            "min": -2.4239089033397378,
                            "max": 2.4239089033397378
                        }
                    },
                    {
                        "class_name": "bar",
                        "importance": {
                            "mean_magnitude": 1.0116256529234005,
                            "min": -2.4239089033397378,
                            "max": 2.4239089033397378
                        }
                    }
                ]
            },
            {
                "feature_name": "c2",
                "classes": [
                    {
                        "class_name": "foo",
                        "importance": {
                            "mean_magnitude": 1.878800695094461,
                            "min": -3.288343526748284,
                            "max": 3.288343526748284
                        }
                    },
                    {
                        "class_name": "bar",
                        "importance": {
                            "mean_magnitude": 1.878800695094461,
                            "min": -3.288343526748284,
                            "max": 3.288343526748284
                        }
                    }
                ]
            }
        ]
    }
}

Multi-class classification

{
    "row_results": {
        "checksum": 0,
        "results": {
            "ml": {
                "target_prediction": "foo",
                "prediction_probability": 0.9462761477876273,
                "prediction_score": 0.17141987121246278,
                "is_training": true,
                "top_classes": [
                    {
                        "class_name": "foo",
                        "class_probability": 0.9462761477876273,
                        "class_score": 0.17141987121246278
                    },
                    {
                        "class_name": "bar",
                        "class_probability": 0.034511190424692039,
                        "class_score": 0.034511190424692039
                    },
                    {
                        "class_name": "baz",
                        "class_probability": 0.019212661787680529,
                        "class_score": 0.003818541178138616
                    }
                ],
                "feature_importance": [
                    {
                        "feature_name": "c1",
                        "classes": [
                            {
                                "class_name": "foo",
                                "importance": 0.27949101986590138
                            },
                            {
                                "class_name": "baz",
                                "importance": -0.12386717688503159
                            },
                            {
                                "class_name": "bar",
                                "importance": -0.1556238429808595
                            }
                        ]
                    },
                    {
                        "feature_name": "c2",
                        "classes": [
                            {
                                "class_name": "foo",
                                "importance": 1.663225619115169
                            },
                            {
                                "class_name": "baz",
                                "importance": -1.72288680107119
                            },
                            {
                                "class_name": "bar",
                                "importance": 0.05966118195592636
                            }
                        ]
                    },
                    {
                        "feature_name": "c3",
                        "classes": [
                            {
                                "class_name": "foo",
                                "importance": 0.22379061504358678
                            },
                            {
                                "class_name": "baz",
                                "importance": -0.23352199623126147
                            },
                            {
                                "class_name": "bar",
                                "importance": 0.00973138118767497
                            }
                        ]
                    },
                    {
                        "feature_name": "c4",
                        "classes": [
                            {
                                "class_name": "foo",
                                "importance": -0.24052346877901588
                            },
                            {
                                "class_name": "baz",
                                "importance": 0.19615020783390645
                            },
                            {
                                "class_name": "bar",
                                "importance": 0.04437326094510882
                            }
                        ]
                    }
                ]
            }
        }
    }
},
{
    "model_metadata": {
        "total_feature_importance": [
            {
                "feature_name": "c4",
                "classes": [
                    {
                        "class_name": "foo",
                        "importance": {
                            "mean_magnitude": 0.24209656030374336,
                            "min": -0.5757885311922144,
                            "max": 0.6352558320805585
                        }
                    },
                    {
                        "class_name": "baz",
                        "importance": {
                            "mean_magnitude": 0.21362926518754464,
                            "min": -0.6975561926823535,
                            "max": 0.5758437812831863
                        }
                    },
                    {
                        "class_name": "bar",
                        "importance": {
                            "mean_magnitude": 0.0346461346585683,
                            "min": -0.15232748182282736,
                            "max": 0.10645868140524567
                        }
                    }
                ]
            },
            {
                "feature_name": "c3",
                "classes": [
                    {
                        "class_name": "foo",
                        "importance": {
                            "mean_magnitude": 0.30818045910633587,
                            "min": -0.7931796597779941,
                            "max": 0.3339785961510332
                        }
                    },
                    {
                        "class_name": "baz",
                        "importance": {
                            "mean_magnitude": 0.3302457015751672,
                            "min": -0.45783991999546966,
                            "max": 0.8242004300074223
                        }
                    },
                    {
                        "class_name": "bar",
                        "importance": {
                            "mean_magnitude": 0.05158362329758712,
                            "min": -0.31757994080335757,
                            "max": 0.2435538443329867
                        }
                    }
                ]
            },
            {
                "feature_name": "c1",
                "classes": [
                    {
                        "class_name": "foo",
                        "importance": {
                            "mean_magnitude": 0.6477535654120877,
                            "min": -1.9137505247875509,
                            "max": 1.287337819860563
                        }
                    },
                    {
                        "class_name": "baz",
                        "importance": {
                            "mean_magnitude": 0.7520521962038734,
                            "min": -1.531931414792879,
                            "max": 1.6810760229277138
                        }
                    },
                    {
                        "class_name": "bar",
                        "importance": {
                            "mean_magnitude": 0.16574602557111424,
                            "min": -0.3434321409380257,
                            "max": 0.4223094459672269
                        }
                    }
                ]
            },
            {
                "feature_name": "c2",
                "classes": [
                    {
                        "class_name": "foo",
                        "importance": {
                            "mean_magnitude": 1.101426013839722,
                            "min": -2.2925533638349937,
                            "max": 1.7987193407562752
                        }
                    },
                    {
                        "class_name": "baz",
                        "importance": {
                            "mean_magnitude": 1.1717215530182037,
                            "min": -1.8971843995291227,
                            "max": 2.6284289404335188
                        }
                    },
                    {
                        "class_name": "bar",
                        "importance": {
                            "mean_magnitude": 0.18929675306403358,
                            "min": -0.43649354351400246,
                            "max": 0.612025928728523
                        }
                    }
                ]
            }
        ]
    }
}

Closes #974

EDIT: I updated the format example above.

valeriy42 · 2020-07-06T13:44:32Z

This targets v7.9.0 only if we manage to get a Java parser implemented in time.

tveasey

I made a couple of style suggestions. My main comments, I think, are:

- We should normalise by the document count. (You should be able to switch the map's value type to a CBasicStatistics::SSampleMean<TVector>::TAccumulator to do this. Note, though, that you have to initialise this with a zero vector of the correct size.)
- It would be nice to compute these quantities accurately, i.e. not just by summing over the top importances for each document.
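The normalisation suggested above can be sketched roughly as follows. This is a minimal illustration only: `SImportanceStats` and `CTotalImportance` are hypothetical names, not the ml-cpp `CBasicStatistics` accumulators, and it assumes that `mean_magnitude` means the average absolute SHAP value over all documents.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical stand-in for the accumulator discussed in the review; this is
// NOT the ml-cpp CBasicStatistics API.
struct SImportanceStats {
    double sumMagnitude = 0.0;
    double min = std::numeric_limits<double>::infinity();
    double max = -std::numeric_limits<double>::infinity();
    std::size_t count = 0;

    void add(double shap) {
        sumMagnitude += std::fabs(shap);
        min = std::min(min, shap);
        max = std::max(max, shap);
        ++count;
    }

    // Mean of |SHAP| over all documents seen so far; 0 if none were added.
    double meanMagnitude() const {
        return count == 0 ? 0.0 : sumMagnitude / static_cast<double>(count);
    }
};

// One accumulator per feature, sized up front (the analogue of initialising
// with a zero vector of the correct size).
class CTotalImportance {
public:
    explicit CTotalImportance(std::size_t numberFeatures)
        : m_Stats(numberFeatures) {}

    // shapValues[i] is the SHAP value of feature i for a single document.
    void addDocument(const std::vector<double>& shapValues) {
        assert(shapValues.size() == m_Stats.size());
        for (std::size_t i = 0; i < shapValues.size(); ++i) {
            m_Stats[i].add(shapValues[i]);
        }
    }

    const SImportanceStats& feature(std::size_t i) const { return m_Stats.at(i); }

private:
    std::vector<SImportanceStats> m_Stats;
};
```

Calling `addDocument` once per row with every feature's SHAP value, rather than only the top importances, gives each feature the same denominator, which is the point of both review comments.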

lib/api/CDataFrameTrainBoostedTreeClassifierRunner.cc

lib/api/CDataFrameTrainBoostedTreeRegressionRunner.cc

droberts195 · 2020-07-14T10:37:50Z

We agreed to defer this from 7.9 to 7.10, so I altered the labels.

benwtrent · 2020-07-16T18:19:28Z

lib/api/CDataFrameTrainBoostedTreeClassifierRunner.cc

+                writer.Double(item.second(0));
+            } else {
+                for (int j = 0; j < item.second.size() && j < numberClasses; ++j) {
+                    writer.Key(classValues[j]);


This will not work for storage in ES. This index stores the information for all trained models and indexing the class names for the feature importances will not scale.

I propose this format:

{
    "feature_name": "c4",
    "importance": 0.4810469375580312,
    "class_importance": [
        {
            "class_name": "foo",
            "importance": 0.24052346877901588
        },
        {
            "class_name": "baz",
            "importance": 0.19615020783390645
        },
        {
            "class_name": "bar",
            "importance": 0.04437326094510882
        }
    ]
}

class_importance will be a nested data type that allows aggregations and searches for specific models and classnames.
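A rough sketch of emitting this shape, where class names are values rather than keys, might look like the following. The helper names `totalMagnitude` and `toJson` are hypothetical. The top-level `importance` is assumed here to be the sum of the per-class magnitudes, which matches the numbers in the example up to floating-point rounding but is not confirmed by the discussion. Names are assumed not to need JSON escaping.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// (class name, importance) pair; a hypothetical helper type for this sketch.
using TClassImportance = std::pair<std::string, double>;

// Assumption: the top-level importance is the sum of per-class magnitudes.
double totalMagnitude(const std::vector<TClassImportance>& classes) {
    double total = 0.0;
    for (const auto& c : classes) {
        total += std::fabs(c.second);
    }
    return total;
}

// Writes one feature entry with class names stored as values inside a
// "class_importance" array instead of being used as JSON keys.
// Assumption: names contain no characters that need JSON escaping.
std::string toJson(const std::string& featureName,
                   const std::vector<TClassImportance>& classes) {
    assert(!featureName.empty());
    std::ostringstream out;
    out.precision(17);
    out << "{\"feature_name\":\"" << featureName
        << "\",\"importance\":" << totalMagnitude(classes)
        << ",\"class_importance\":[";
    for (std::size_t i = 0; i < classes.size(); ++i) {
        if (i > 0) {
            out << ',';
        }
        out << "{\"class_name\":\"" << classes[i].first
            << "\",\"importance\":" << classes[i].second << '}';
    }
    out << "]}";
    return out.str();
}
```

Because every entry has the same fixed field names, the set of mapped fields does not grow with the number of distinct class names across models.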

benwtrent · 2020-07-16T19:25:36Z

Java side parsing: elastic/elasticsearch#59725

We will probably have to mute the integration tests, merge the C++ side, then unmute them and merge the Java parsing side.

benwtrent · 2020-08-10T17:04:18Z

@valeriy42 the new format has min, max, and mean? Are there any other changes we are considering?

valeriy42 · 2020-08-11T07:25:31Z

> the new format has min, max, mean ? Are there any other changes we are considering?

@benwtrent I cannot think of anything more. min and max are useful for visualization to define the axis range.
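To illustrate the axis-range use of `min` and `max`, here is a minimal sketch that derives one shared axis interval from the per-feature values. `SImportanceRange` and `axisRange` are hypothetical names, and including zero in the interval is an assumption made for this sketch, not something the discussion specifies.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical holder for the "min" and "max" fields of one feature.
struct SImportanceRange {
    double min;
    double max;
};

// Smallest interval [lo, hi] that covers every feature's [min, max] and also
// contains zero, so that all features can share one axis.
std::pair<double, double> axisRange(const std::vector<SImportanceRange>& features) {
    double lo = 0.0;
    double hi = 0.0;
    for (const auto& f : features) {
        assert(f.min <= f.max);
        lo = std::min(lo, f.min);
        hi = std::max(hi, f.max);
    }
    return {lo, hi};
}
```

With the regression example above, feature c2 has the widest range, so it alone determines the shared interval.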

benwtrent · 2020-08-11T12:41:01Z

@valeriy42 it would be nice if the per class importance looked like the regression importance. That way we have consistent JSON objects.

{
    "feature_name": "c2",
    "classes": [
        {
            "class_name": "baz",
            "importance": {
                "mean_magnitude": 1.1717215530182037,
                "min": -1.8971843995291227,
                "max": 2.6284289404335188
            }
        }...
    ]
}

valeriy42 · 2020-08-11T12:43:34Z

@benwtrent you are right! I overlooked that in my last commit. I'll fix it accordingly.

tveasey

Only a couple of minor comments and a suggestion for one additional bit of testing, which I think is worthwhile. Otherwise, LGTM.

include/api/CInferenceModelMetadata.h

lib/api/CDataFrameTrainBoostedTreeClassifierRunner.cc

lib/api/unittest/CDataFrameAnalyzerFeatureImportanceTest.cc

Activate the output of the model metadata and the corresponding unit tests for total feature importance. The implementation itself was introduced in #1387; however, I need to fix the documentation, since it was originally attributed to v7.10. Hence, I mark this PR as an enhancement to rectify the docs.

valeriy42 added 2 commits July 6, 2020 15:16

code commit

10e5a6c

Unit test added

df13ed0

valeriy42 added :ml >enhancement v7.9.0 v8.0.0 labels Jul 6, 2020

valeriy42 requested a review from tveasey July 6, 2020 13:30

changelog updated

874140c

unit test updated

6f40db9

tveasey reviewed Jul 6, 2020

use accumulate

38a180a

droberts195 added v7.10.0 and removed v7.9.0 labels Jul 14, 2020

solution with unique ptr compiles

08bd4ec

benwtrent reviewed Jul 16, 2020

benwtrent mentioned this pull request Jul 16, 2020

[ML] handle new model metadata stream from native process elastic/elasticsearch#59725

Merged

valeriy42 added the WIP label Jul 17, 2020

valeriy42 added 6 commits July 17, 2020 12:07

cleaning up

13357af

total importance mean variance

3f7ec0a

total importance mean variance min max

0ba97a0

remove variance

30c109b

Merge branch 'total-shap' of https://github.com/valeriy42/ml-cpp into total-shap

f7689c3

Merge branch 'master' of https://github.com/elastic/ml-cpp into total-shap

d3758d4

valeriy42 added 2 commits August 11, 2020 11:20

Fixing unit tests

8700f5f

cleaning up

6fe6399

multiclass format change

f8126c8

valeriy42 removed the WIP label Aug 12, 2020

change result format for binary classification

1672019

tveasey approved these changes Aug 13, 2020

valeriy42 added 3 commits August 13, 2020 13:13

Unit tests extended

3f8f6c2

remove const_cast

1c3cfaf

fix test failure

1452f28

benwtrent mentioned this pull request Aug 13, 2020

[ML] updating feature_importance results mapping elastic/elasticsearch#61104

Merged

valeriy42 merged commit 3f1b575 into elastic:master Aug 13, 2020

valeriy42 deleted the total-shap branch August 13, 2020 18:32

valeriy42 added a commit to valeriy42/ml-cpp that referenced this pull request Aug 13, 2020

[ML] Return total SHAP per feature as a new result type (elastic#1387)

c75748b

This PR adds computation of the total feature importance values.

This was referenced Aug 13, 2020

[7.x][ML] Return total SHAP per feature as a new result type #1455

Merged

[ML] Activate model metadata output #1456

Merged

valeriy42 added a commit that referenced this pull request Aug 14, 2020

[7.x][ML] Return total SHAP per feature as a new result type (#1455)

e169729

This PR adds computation of the total feature importance values. Backport of #1387.

benwtrent added a commit to elastic/elasticsearch that referenced this pull request Aug 14, 2020

[ML] updating feature_importance results mapping (#61104)

69f7066

This updates the feature_importance mapping change from elastic/ml-cpp#1387

benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Aug 14, 2020

[ML] updating feature_importance results mapping (elastic#61104)

bef0652

This updates the feature_importance mapping change from elastic/ml-cpp#1387

benwtrent added a commit to elastic/elasticsearch that referenced this pull request Aug 14, 2020

[ML] updating feature_importance results mapping (#61104) (#61144)

7c3bfb9

This updates the feature_importance mapping change from elastic/ml-cpp#1387

benwtrent mentioned this pull request Aug 14, 2020

[7.x][ML] Activate model metadata output (#1456) #1457

Merged

davidkyle mentioned this pull request Aug 18, 2020

[ML][7.x] handle new model metadata stream from native process elastic/elasticsearch#61251

Merged

valeriy42 mentioned this pull request Sep 1, 2020

[7.x][ML] Activate model metadata output #1466

Closed

This was referenced Sep 28, 2020

[DOCS] Add total feature importance to overview elastic/stack-docs#1378

Merged

[DOCS] Add total feature importance to regression example elastic/stack-docs#1379

Merged

[DOCS] Add total feature importance to classification example elastic/stack-docs#1382

Merged

Mpdreamz mentioned this pull request Nov 16, 2020

7.10.1 Meta Ticket elastic/elasticsearch-net#5096

Closed

61 tasks

stevejgordon mentioned this pull request Dec 17, 2020

7.11.0 Meta Ticket elastic/elasticsearch-net#5198

Closed
