Enabled feature_importances_ for our ForestDML and ForestDRLearner estimators #306

Merged
merged 17 commits into master from vasilis/feature_importances
Nov 9, 2020

Conversation

@vsyrgkanis vsyrgkanis commented Nov 7, 2020

This required changing the subsampled honest forest code so that it no longer alters the arrays of sklearn's tree structures, but instead stores two additional arrays required for prediction. The extra memory allocation increases the running time to roughly 1.5 times the original, so fitting becomes slightly slower.
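Roughly, the new approach looks like the following minimal sketch (hypothetical helper names, not the PR's actual code), where tree is a fitted sklearn DecisionTreeRegressor trained on the split half and (X_est, y_est) is the held-out estimation half:

import numpy as np

def honest_node_stats(tree, X_est, y_est):
    # Sparse indicator of which nodes each estimation sample passes through.
    path = tree.decision_path(X_est)                    # (n_samples, n_nodes)
    numerator = path.T @ y_est                          # sum of outcomes per node
    denominator = np.asarray(path.sum(axis=0)).ravel()  # estimation samples per node
    return numerator, denominator

def honest_predict(tree, numerator, denominator, X):
    leaves = tree.apply(X)                              # leaf id of each query point
    counts = np.maximum(denominator[leaves], 1.0)       # guard against empty leaves
    return numerator[leaves] / counts

The sklearn tree arrays are left untouched; only these side arrays are allocated per tree, which is where the extra memory and the roughly 1.5x running-time overhead comes from.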

However, this enables correct feature_importances_ calculation and, in the future, correct SHAP calculation (fixes #297): the tree entries are now consistent with a tree from a RandomForestRegressor, so SHAP's tree logic can be applied if we recast the subsampled honest forest as a RandomForestRegressor. SHAP's additivity will still be violated, since the prediction of the subsampled honest forest is not simply the aggregation of the per-tree predictions but a more complex weighted average; nevertheless, we can still call SHAP and get meaningful SHAP values. One discrepancy is that SHAP explains a different value than what effect returns: it explains the average of the predictions of the individual honest tree regressors, whereas the prediction of an honest forest is not the average of the tree predictions. Fully resolving this small discrepancy would require reworking SHAP's tree explainer algorithm to account for such alternative aggregations of tree predictors.

This enables the following two uses:

from econml.dml import ForestDMLCateEstimator
from sklearn.ensemble import RandomForestRegressor
import shap
import sklearn.ensemble
import copy
est3 = ForestDMLCateEstimator(model_y=RandomForestRegressor(),
                              model_t=RandomForestRegressor(),
                              n_estimators=1000,
                              subsample_fr=.8,
                              min_samples_leaf=10,
                              min_impurity_decrease=0.001,
                              verbose=0, min_weight_fraction_leaf=.01)
est3.fit(Y, T, X, W)  # Y, T, X, W: outcome, treatment, features and controls
print(est3.feature_importances_)
# Recast the fitted CATE forest as a RandomForestRegressor so that shap's
# tree-explainer logic can be applied to it.
model = copy.deepcopy(est3.model_cate)
model.__class__ = sklearn.ensemble.RandomForestRegressor
explainer = shap.Explainer(model, X)
shap_values = explainer(X)
shap.plots.beeswarm(shap_values)

from econml.drlearner import ForestDRLearner
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
import numpy as np
import shap
import sklearn.ensemble
import copy
est3 = ForestDRLearner(model_regression=RandomForestRegressor(),
                       model_propensity=RandomForestClassifier(min_samples_leaf=10),
                       min_propensity=1e-3,
                       n_estimators=1000,
                       subsample_fr=.8,
                       min_samples_leaf=10,
                       min_impurity_decrease=0.001,
                       verbose=0, min_weight_fraction_leaf=.01)
est3.fit(Y, T, X, W)
for t in np.unique(T):
    if t > 0:  # skip the baseline treatment level
        print(est3.feature_importances_(T=t))
        # Recast the fitted forest for treatment level t so shap's
        # tree-explainer logic can be applied to it.
        model = copy.deepcopy(est3.model_cate(T=t))
        model.__class__ = sklearn.ensemble.RandomForestRegressor
        explainer = shap.Explainer(model, X)
        shap_values = explainer(X)
        shap.plots.beeswarm(shap_values)

Side benefits:

  1. 6x speed-up of SubsampledHonestForest, obtained by converting dense but sparsely represented matrices to a dense representation before a loop that performs many slicing operations on them (see the sketch below).
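As a rough illustration of this pattern (hypothetical helpers, not the PR's code), compare slicing a scipy sparse matrix inside a loop with densifying it once up front:

import numpy as np
from scipy.sparse import csr_matrix

def batch_means_sparse(X_sparse, batches):
    # Repeatedly slicing the sparse matrix inside the loop is slow.
    return [np.asarray(X_sparse[rows].mean(axis=0)).ravel() for rows in batches]

def batch_means_dense(X_sparse, batches):
    # Densify once, then slice the plain ndarray inside the loop.
    X_dense = np.asarray(X_sparse.todense())
    return [X_dense[rows].mean(axis=0) for rows in batches]

X = csr_matrix(np.random.rand(1000, 20))   # dense content, sparse container
batches = [np.arange(i, i + 50) for i in range(0, 1000, 50)]
assert np.allclose(batch_means_sparse(X, batches), batch_means_dense(X, batches))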

… but rather create auxiliary numpy arrays that store the numerator and denominator of every node. This enables consistent feature_importance calculation and also potentially more accurate shap_values calculation.
@vsyrgkanis vsyrgkanis added the enhancement New feature or request label Nov 7, 2020
…ure_importances_. Added tests that the feature_importances_ API is working in test_drlearner and test_dml.
vsyrgkanis and others added 5 commits November 8, 2020 00:58
Co-authored-by: Keith Battocchi <kebatt@microsoft.com>
…ree level was causing trouble, since with honesty and sample splitting the per-tree feature importance can often be negative (an increase in variance). Now averaging the un-normalized feature importances. There is still a small caveat in how the current version uses impurity; added that as a TODO.
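A rough illustration of the aggregation change described in this commit (hypothetical function, not the PR's code; whether a final forest-level normalization is applied is an assumption here):

import numpy as np

def forest_feature_importances(per_tree_importances):
    # per_tree_importances: (n_trees, n_features) raw, un-normalized impurity
    # reductions per feature; honest per-tree entries may be negative.
    avg = per_tree_importances.mean(axis=0)
    total = avg.sum()
    # Assumed forest-level normalization so importances sum to one when possible.
    return avg / total if total > 0 else avg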
…orest, which now makes feature_importances_ exactly correct with no need to re-implement the method. Impurities are now computed on the estimation sample and replace the pre-calculated node impurities.
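A sketch of the honest-impurity idea (hypothetical helper, not the PR's implementation): recompute each node's outcome variance on the estimation half, which then stands in for the impurities sklearn pre-computed on the split half, so the built-in feature_importances_ logic can be reused as-is:

import numpy as np

def honest_node_impurities(tree, X_est, y_est):
    # Which nodes each estimation sample passes through.
    path = tree.decision_path(X_est)                 # sparse (n_samples, n_nodes)
    counts = np.maximum(np.asarray(path.sum(axis=0)).ravel(), 1.0)
    sum_y = path.T @ y_est
    sum_y2 = path.T @ (y_est ** 2)
    mean_y = sum_y / counts
    return sum_y2 / counts - mean_y ** 2             # per-node outcome variance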
@vsyrgkanis vsyrgkanis requested a review from kbattocchi November 8, 2020 21:01
…rallel_add_trees_ of ensemble.py. This leads to a 6-fold speed-up, as we were previously doing many slicing operations on sparse matrices, which are very slow!
@kbattocchi kbattocchi left a comment

Mostly looks fine, though I'm not familiar enough with the tree internals to vouch for correctness. Please consider addressing the comments I left before merging.

@vsyrgkanis vsyrgkanis merged commit 61cd136 into master Nov 9, 2020
@vsyrgkanis vsyrgkanis deleted the vasilis/feature_importances branch November 16, 2020 22:54
Successfully merging this pull request may close these issues.

Shap interpretability with SubsampledHonestForest