RFgroove - Importance Measure and Selection for Groups of Variables with Random Forests

Implements an importance measure for groups of features and the Recursive Feature Elimination (RFE) algorithm of Guyon et al. (2002).

Based on the following articles:

  • B. Gregorutti, B. Michel, P. Saint-Pierre (2017). Correlation and variable importance in random forests. arXiv link
  • B. Gregorutti, B. Michel, P. Saint-Pierre (2015). Grouped variable importance with random forests and application to multiple functional data analysis. arXiv link
  • I. Guyon, J. Weston, S. Barnhill, & V. Vapnik (2002). Gene selection for cancer classification using support vector machines, Mach. Learn., 46(1-3), 389–422.
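To make the idea of a grouped importance concrete, here is a minimal sketch of a grouped permutation importance: all columns of a group are shuffled with the same row permutation, and the importance is the resulting increase in mean squared error. This is not the code of `rfgroove.importance.grouped_importance`. The function name `grouped_permutation_importance`, its parameters, and the toy linear predictor are assumptions made for illustration, and the sketch evaluates on one fixed data set rather than on the out-of-bag samples of each tree.

```python
import numpy as np


def grouped_permutation_importance(predict, X, y, groups, n_repeats=10, seed=0):
    """Mean increase in squared error when each group's columns are permuted jointly.

    `predict` is any callable mapping an (n, p) array to n predictions,
    for example the `predict` method of a fitted random forest.
    (Hypothetical helper, not part of rfgroove.)
    """
    rng = np.random.default_rng(seed)
    baseline = np.mean((y - predict(X)) ** 2)
    importances = []
    for group in groups:
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # One shared row permutation for the whole group: dependence
            # inside the group is kept, its link with y is broken.
            rows = rng.permutation(X.shape[0])
            X_perm[:, group] = X[rows][:, group]
            increases.append(np.mean((y - predict(X_perm)) ** 2) - baseline)
        importances.append(float(np.mean(increases)))
    return np.array(importances)


# Toy check: columns 0 and 1 drive y, columns 2 and 3 are pure noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
y = 2.0 * X[:, 0] + X[:, 1]


def predict(A):
    # Stand-in for a fitted model (assumption for this example).
    return 2.0 * A[:, 0] + A[:, 1]


imp = grouped_permutation_importance(predict, X, y, groups=[[0, 1], [2, 3]])
print(imp)  # first entry clearly positive, second entry zero
```

With a fitted scikit-learn forest, `regr.predict` could be passed as `predict` in the same way.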

REQUIREMENTS

  • numpy
  • joblib
  • scikit-learn

INSTALLATION

git clone git@github.com:bgregorutti/rfgroove-py.git
cd rfgroove-py/
pip install .

CODE EXAMPLES

Feature importance:

from sklearn.ensemble import RandomForestRegressor
from rfgroove.dataset_generation import gaussian_multidimensional
from rfgroove.importance import grouped_importance

# Build a dataset
X, y = gaussian_multidimensional(c=.5, size=1000, n_features_per_groups=5)

# Fit a RF model
regr = RandomForestRegressor(n_estimators=100, oob_score=True, max_samples=.1)
regr.fit(X, y)

# Compute the feature importance measure
groups = [list(range(5)), list(range(5, 10))] + [[k] for k in range(10, 15)]
imp = grouped_importance(regr, X, y, groups)

See test/test_importance.py.
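The `groups` argument in the example above is a list of lists of column indices: two groups of five columns followed by five single-column groups. The small check below is a sketch, not part of rfgroove, that verifies such a list covers every column exactly once. Whether the package requires a full partition is not stated, so treat this as a conservative assumption. The helper name `check_groups` and the column count of 15 are also assumptions, the latter inferred from the indices used in the example.

```python
def check_groups(groups, n_features):
    """Raise ValueError unless `groups` partitions range(n_features).

    Hypothetical helper for illustration; not provided by rfgroove.
    """
    flat = [j for group in groups for j in group]
    if len(flat) != len(set(flat)):
        raise ValueError("a column index appears in more than one group")
    if set(flat) != set(range(n_features)):
        raise ValueError("groups must cover columns 0..n_features-1 exactly")
    return True


# Same grouping as in the example: columns 0-4, columns 5-9, then 10..14 alone.
groups = [list(range(5)), list(range(5, 10))] + [[k] for k in range(10, 15)]
check_groups(groups, n_features=15)
print([len(g) for g in groups])
```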

Feature selection:

from sklearn.ensemble import RandomForestRegressor
from rfgroove.dataset_generation import gaussian_multidimensional
from rfgroove.selection import RFE

# Build a dataset
X, y = gaussian_multidimensional(c=.5, size=1000, n_features_per_groups=5)

# Instantiate a RandomForestRegressor as the base model
base = RandomForestRegressor(n_estimators=1000, bootstrap=True, oob_score=True, max_samples=.1)

# Run the selection algorithm
groups = [list(range(5)), list(range(5, 10))] + [[k] for k in range(10, 15)]
selector = RFE(base, groups, n_jobs=-1)
selector.fit(X, y)

See test/test_selection.py.
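The selection step can be pictured as a backward elimination over groups: fit a model on the remaining groups, record its validation error, compute a grouped permutation importance for each remaining group, drop the least important one, and repeat. The sketch below illustrates that loop under stated assumptions and is not the implementation behind `rfgroove.selection.RFE`. The names `grouped_rfe`, `group_importance` and `LeastSquares` are hypothetical, a least-squares model stands in for the forest to keep the example light, and a held-out validation set replaces out-of-bag estimates.

```python
import numpy as np


class LeastSquares:
    """Tiny stand-in model with fit/predict (hypothetical, illustration only)."""

    def fit(self, X, y):
        self.coef_, *_ = np.linalg.lstsq(X, y, rcond=None)
        return self

    def predict(self, X):
        return X @ self.coef_


def group_importance(model, X, y, cols, rng, n_repeats=5):
    """Mean increase in squared error when columns `cols` are permuted jointly."""
    baseline = np.mean((y - model.predict(X)) ** 2)
    increases = []
    for _ in range(n_repeats):
        X_perm = X.copy()
        X_perm[:, cols] = X[rng.permutation(len(X))][:, cols]
        increases.append(np.mean((y - model.predict(X_perm)) ** 2) - baseline)
    return float(np.mean(increases))


def grouped_rfe(make_model, X_train, y_train, X_val, y_val, groups, seed=0):
    """Backward elimination of groups; returns (elimination order, error path)."""
    rng = np.random.default_rng(seed)
    remaining = list(range(len(groups)))  # indices into `groups`
    eliminated, errors = [], []
    while remaining:
        cols = [j for g in remaining for j in groups[g]]
        model = make_model().fit(X_train[:, cols], y_train)
        X_val_sub = X_val[:, cols]
        errors.append(float(np.mean((y_val - model.predict(X_val_sub)) ** 2)))
        # Importances are recomputed at every step on the reduced matrix.
        scores, start = [], 0
        for g in remaining:
            local = list(range(start, start + len(groups[g])))
            scores.append(group_importance(model, X_val_sub, y_val, local, rng))
            start += len(groups[g])
        worst = remaining[int(np.argmin(scores))]
        eliminated.append(worst)
        remaining.remove(worst)
    return eliminated, errors


# Toy data: only group 0 (columns 0 and 1) carries signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(600, 5))
y = 3.0 * X[:, 0] + 3.0 * X[:, 1] + 0.1 * rng.normal(size=600)
groups = [[0, 1], [2, 3], [4]]
order, errors = grouped_rfe(LeastSquares, X[:400], y[:400], X[400:], y[400:], groups)
print(order, errors)  # the informative group is eliminated last
```

`errors[i]` is the validation error of the model fitted just before the `i`-th elimination, so the step with the smallest error indicates which set of groups to keep. Any estimator whose `fit` returns the fitted object, such as scikit-learn's `RandomForestRegressor`, could be supplied through `make_model` instead of `LeastSquares`.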
