The subsampled honest forest is currently not exactly compatible with SHAP, because of some internal re-workings, even though it inherits from scikit-learn's RandomForestRegressor. As a result, getting SHAP values for the learned model is not trivial. We should enable SHAP interpretability of the subsampled honest forest, so that we can interpret ForestDML, ForestDRLearner, and ForestDRIV.
…timators (#306)
This required changing the subsampled honest forest code a bit so that it does not alter the arrays of sklearn's tree structures but instead stores two additional arrays required for prediction. This adds roughly 1.5x to the original running time, so the forest is slightly slower due to the extra memory allocation.
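For intuition, here is a minimal sketch of the kind of aggregation these two arrays enable; the names `trees`, `numerators`, and `denominators` are hypothetical stand-ins for the stored structures, not the actual attribute names in the code:

```python
import numpy as np

def honest_forest_predict(trees, numerators, denominators, X):
    """Sketch: predict by summing per-node numerators and denominators
    across trees, then dividing once at the end.

    numerators[i][k]   -- sum of estimation-sample targets in node k of tree i
    denominators[i][k] -- count of estimation-sample points in node k of tree i
    """
    num = np.zeros(X.shape[0])
    den = np.zeros(X.shape[0])
    for tree, n, d in zip(trees, numerators, denominators):
        leaves = tree.apply(X)  # leaf index of each sample in this tree
        num += n[leaves]
        den += d[leaves]
    # The division happens after summing across trees, so the result is a
    # weighted average, NOT the mean of per-tree predictions.
    return num / den
```

Because the division happens after the cross-tree sums, the forest prediction is not the mean of per-tree outputs, which is the key point behind the additivity caveat discussed next.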
However, this enables correct feature_importance calculation and, in the future, correct SHAP calculation (fixes #297): the tree entries are now consistent with a tree in a RandomForestRegressor, so SHAP's logic can be applied if we recast the subsampled honest forest as a RandomForestRegressor. SHAP's additivity will still be violated, since the prediction of the subsampled honest forest is not just the aggregation of the predictions across the trees but a more complex weighted average; even so, we can still call SHAP and get meaningful SHAP numbers. One discrepancy is that SHAP explains a different value than what effect returns: it explains the average of the predictions of each honest tree regressor, whereas the prediction of an honest forest is not the average of the tree predictions. A full solution to this small discrepancy would require re-working SHAP's tree explainer algorithm to account for such alternative aggregations of tree predictors.
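As a rough illustration of the recasting idea, the sketch below shows one way to hand such a forest to shap. The `__class__` reassignment and the helper name are illustrative assumptions, not a supported API, and it presumes the tree arrays are already sklearn-consistent as described above:

```python
import shap
from sklearn.ensemble import RandomForestRegressor

def shap_values_for_honest_forest(forest, X):
    # Illustrative hack: present the honest forest as a plain
    # RandomForestRegressor so that shap's TreeExplainer recognizes it.
    # The tree entries are consistent with sklearn trees, so the
    # traversal logic applies.
    forest.__class__ = RandomForestRegressor
    explainer = shap.TreeExplainer(forest)
    # Additivity does not hold exactly, because the honest forest
    # prediction is a weighted average across trees rather than the plain
    # mean that TreeExplainer assumes, so we disable the check.
    return explainer.shap_values(X, check_additivity=False)
```

The returned values are still meaningful attributions per feature; they just decompose the mean-of-trees prediction rather than the exact honest-forest prediction.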
* changed the subsampled honest forest to not alter the entries of each tree but rather create auxiliary numpy arrays that store the numerator and denominator of every node. This enables consistent feature_importance calculation and potentially more accurate shap_values calculation.
* added feature importances in the DR learner example notebook
* added feature_importances_ to the DML example notebook
* enabled feature_importances_ as an attribute of ForestDML and ForestDRLearner
* fixed the doctest in the subsampled honest forest, which was producing the old feature_importances_. Added tests in test_drlearner and test_dml that the feature_importances_ API is working.
* transformed sparse matrices to dense matrices after the dot product in parallel_add_trees_ of ensemble.py. This yields a 6-fold speed-up, since we were previously doing many slicing operations on sparse matrices, which are very slow (see the sketch after this list).
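To illustrate the last bullet, here is a hedged sketch of the pattern; the matrix names and sizes are made up, not the actual ones in parallel_add_trees_:

```python
import numpy as np
from scipy import sparse

# Stand-ins for the sparse structures produced during tree fitting.
indicator = sparse.random(10000, 500, density=0.01, format="csr")
weights = sparse.random(500, 500, density=0.05, format="csr")

# Slow: keep the product sparse and slice it repeatedly; every row slice
# of a sparse matrix allocates a new sparse object.
product_sparse = indicator.dot(weights)
rows_slow = [product_sparse[i].toarray() for i in range(100)]

# Fast: densify once, right after the dot product, then take cheap
# numpy views when slicing.
product_dense = indicator.dot(weights).toarray()
rows_fast = [product_dense[i] for i in range(100)]
```

Densifying once trades a single allocation for the many small allocations that repeated sparse slicing incurs, which is where the reported speed-up comes from.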