This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Gradient Feature Selection (Ready to review) #1734

Merged
merged 32 commits into from
Nov 25, 2019
Changes from 28 commits
185f698
first update
xuehui1991 Nov 13, 2019
306ecf1
update by folder naming
xuehui1991 Nov 13, 2019
61b4719
add gradient feature selection example
xuehui1991 Nov 13, 2019
5371111
add examples
xuehui1991 Nov 14, 2019
405ba9c
delete unused example
xuehui1991 Nov 14, 2019
024e9c7
update by pylint
xuehui1991 Nov 15, 2019
0488949
update by pylint
xuehui1991 Nov 15, 2019
b5c865c
update learnability by info from pylint
xuehui1991 Nov 15, 2019
9d9d118
fix pylint in fgtrain
xuehui1991 Nov 15, 2019
15d416a
update fginitlize and learnability by pylint
xuehui1991 Nov 15, 2019
39c99b5
update by evan's response
xuehui1991 Nov 18, 2019
4364f8a
add gbdt_selector
xuehui1991 Nov 18, 2019
d2d8328
update gbdt_selector
xuehui1991 Nov 18, 2019
5420202
refine the example folder structure
xuehui1991 Nov 18, 2019
635f0d9
update feature engineering doc
xuehui1991 Nov 18, 2019
11290dc
update docs of feature selector
xuehui1991 Nov 18, 2019
0b11826
update doc of gradientfeature selector
xuehui1991 Nov 18, 2019
319abe5
update docs of GBDTSelector
xuehui1991 Nov 18, 2019
4a3338c
update examples of gradientfeature selector
xuehui1991 Nov 18, 2019
ef0899f
update folder structure
xuehui1991 Nov 19, 2019
e43cfef
update docs by folder structure
xuehui1991 Nov 19, 2019
565c211
test pylint
xuehui1991 Nov 20, 2019
d710d8f
test
xuehui1991 Nov 20, 2019
1497999
Merge remote-tracking branch 'upstream/master' into diff_feature_sele…
xuehui1991 Nov 20, 2019
9c509a6
update by pylint
xuehui1991 Nov 20, 2019
7050556
update by pylint
xuehui1991 Nov 20, 2019
63ce6a0
update docs and remove some dependency
xuehui1991 Nov 20, 2019
cee67af
remove unused code
xuehui1991 Nov 21, 2019
0845ce9
update by comments
xuehui1991 Nov 21, 2019
d1c6ac0
update by comments
xuehui1991 Nov 21, 2019
4ef2bb7
move the feature selection example path
xuehui1991 Nov 22, 2019
f86342b
delete unused dependency
xuehui1991 Nov 22, 2019
65 changes: 65 additions & 0 deletions docs/en_US/FeatureEngineering/GBDTSelector.md
@@ -0,0 +1,65 @@
## GBDTSelector

GBDTSelector is based on [LightGBM](https://github.com/microsoft/LightGBM), which is a gradient boosting framework that uses tree-based learning algorithms.

When data is passed to the GBDT model, the model constructs the boosted trees. The feature importance comes from scores computed during construction, which indicate how useful or valuable each feature was in building the boosted decision trees within the model.

We could use this method as a strong baseline in Feature Selector, especially when using the GBDT model as a classifier or regressor.

For now, `importance_type` can be `split` or `gain`. We plan to support customized `importance_type` in the future, which means users could define how to calculate the `feature score` themselves.
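As a rough illustration of how a ranking by importance turns into a list of selected indices, the sketch below sorts a made-up list of per-feature scores and keeps the `topk` largest. The helper `select_topk` and the score values are hypothetical and exist only for this example; they are not part of NNI or LightGBM.

```python
def select_topk(importances, topk):
    """Return the indices of the `topk` largest scores (hypothetical helper)."""
    # Sort feature indices by descending score; sorted() is stable,
    # so features with equal scores keep their original (lower-index-first) order.
    ranked = sorted(range(len(importances)), key=lambda i: -importances[i])
    return ranked[:topk]


# Made-up 'gain' scores for five features.
gain_scores = [0.0, 12.5, 3.1, 40.2, 3.1]
selected = select_topk(gain_scores, topk=3)
print(selected)
```

Returning indices rather than scores matches what the usage example prints, so the result can be used directly to slice columns out of `X`.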

### Usage

First, install the dependency:

```
pip install lightgbm
```

Then

```python
from sklearn.model_selection import train_test_split
from nni.feature_engineering.gbdt_selector import GBDTSelector

# load data
...
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# initialize a selector
fgs = GBDTSelector()
# fit data
fgs.fit(X_train, y_train, ...)
# get important features
# returns the indices of the important features
print(fgs.get_selected_features(10))

...
```

You can also refer to the examples in `/examples/trials/feature-selection/gbdt_selector/`.


**Requirement of classArgs**

For now, classArgs has no parameters.

**Requirement of `fit` FuncArgs**

* **X** (array-like, required) - The training input samples, with shape = [n_samples, n_features].

* **y** (array-like, required) - The target values (class labels in classification, real numbers in regression), with shape = [n_samples].

* **lgb_params** (dict, required) - The parameters for the LightGBM model. For details, see [here](https://lightgbm.readthedocs.io/en/latest/Parameters.html).

* **eval_ratio** (float, required) - The ratio of the data size, used to split self.X into evaluation data and training data.

* **early_stopping_rounds** (int, required) - The early stopping setting in LightGBM. For details, see [here](https://lightgbm.readthedocs.io/en/latest/Parameters.html).

* **importance_type** (str, required) - Either 'split' or 'gain'. 'split' means 'result contains numbers of times the feature is used in a model', and 'gain' means 'result contains total gains of splits which use the feature'. For details, see [here](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.Booster.html#lightgbm.Booster.feature_importance).

* **num_boost_round** (int, required) - The number of boosting rounds. For details, see [here](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.train.html#lightgbm.train).
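To make the role of `eval_ratio` more concrete, here is a minimal sketch of one plausible reading: a fraction of the samples is held out for evaluation (for example, to drive early stopping) and the rest is used for training. The helper `split_train_eval` is hypothetical, and holding out the last rows without shuffling is an assumption made for simplicity; the selector itself may split differently.

```python
def split_train_eval(rows, labels, eval_ratio):
    """Hold out the last `eval_ratio` fraction of samples (hypothetical helper)."""
    if not 0.0 < eval_ratio < 1.0:
        raise ValueError("eval_ratio must be strictly between 0 and 1")
    # Keep at least one evaluation sample even for very small inputs.
    n_eval = max(1, int(len(rows) * eval_ratio))
    n_train = len(rows) - n_eval
    return rows[:n_train], rows[n_train:], labels[:n_train], labels[n_train:]


# Ten toy samples with two features each.
rows = [[i, i * 2] for i in range(10)]
labels = [i % 2 for i in range(10)]
X_fit, X_eval, y_fit, y_eval = split_train_eval(rows, labels, eval_ratio=0.1)
print(len(X_fit), len(X_eval))
```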

**Requirement of `get_selected_features` FuncArgs**

* **topk** (int, required) - The number of top-ranked features to return, as in `get_selected_features(10)` in the usage example above.

87 changes: 87 additions & 0 deletions docs/en_US/FeatureEngineering/GradientFeatureSelector.md
@@ -0,0 +1,87 @@
## GradientFeatureSelector

The algorithm in GradientFeatureSelector comes from ["Feature Gradients: Scalable Feature Selection via Discrete Relaxation"](https://arxiv.org/pdf/1908.10382.pdf).

GradientFeatureSelector is a gradient-based search algorithm for feature selection.

1) This approach extends a recent result on the estimation of learnability in the sublinear data regime by showing that the calculation can be performed iteratively (i.e., in mini-batches) and in **linear time and space** with respect to both the number of features D and the sample size N.

2) This, along with a discrete-to-continuous relaxation of the search domain, allows for an **efficient, gradient-based** search algorithm among feature subsets for very **large datasets**.

3) Crucially, this algorithm is capable of finding **higher-order correlations** between features and targets for both the N > D and N < D regimes, as opposed to approaches that do not consider such interactions and/or only consider one regime.
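The idea of a discrete-to-continuous relaxation can be illustrated with a toy sketch. Instead of searching over binary keep/drop masks, each feature gets a real-valued score that is squashed into (0, 1) by a sigmoid, and the scores are updated by gradient descent. Note the assumptions: this sketch swaps in a plain squared-error objective with a sparsity penalty, not the learnability estimate from the paper, and `relaxed_feature_search` and all of its parameters are hypothetical names.

```python
import math


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def relaxed_feature_search(X, y, penalty=0.1, learning_rate=0.5, n_steps=200):
    """Toy relaxed search: learn a soft mask in (0, 1) per feature (hypothetical)."""
    n_samples, n_features = len(X), len(X[0])
    scores = [0.0] * n_features  # zero scores mean every mask value starts at 0.5
    for _ in range(n_steps):
        mask = [sigmoid(s) for s in scores]
        # Residuals of the masked model: pred_i = sum_j mask_j * x_ij
        residuals = [y[i] - sum(mask[j] * X[i][j] for j in range(n_features))
                     for i in range(n_samples)]
        for j in range(n_features):
            # d(loss)/d(mask_j) for loss = mean(residual^2) + penalty * sum(mask)
            corr = sum(residuals[i] * X[i][j] for i in range(n_samples)) / n_samples
            grad_mask = -2.0 * corr + penalty
            # Chain rule through the sigmoid.
            scores[j] -= learning_rate * grad_mask * mask[j] * (1.0 - mask[j])
    return [sigmoid(s) for s in scores]


# Feature 0 fully determines y; feature 1 is uncorrelated with it.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [row[0] for row in X]
mask = relaxed_feature_search(X, y)
selected = [j for j, m in enumerate(mask) if m > 0.5]
print(selected)
```

Thresholding the soft mask at 0.5 turns the continuous solution back into a discrete subset, which is the general shape of the approach; the real selector adds higher-order interactions, mini-batching, and grouping on top.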


### Usage

```python
from sklearn.model_selection import train_test_split
from nni.feature_engineering.gradient_selector import FeatureGradientSelector

# load data
...
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# initialize a selector
fgs = FeatureGradientSelector(n_features=10)
# fit data
fgs.fit(X_train, y_train)
# get important features
# returns the indices of the important features
print(fgs.get_selected_features())

...
```

You can also refer to the examples in `/examples/trials/feature-selection/gradient_feature_selector/`.


**Requirement of classArgs**

* **order** (int, optional, default = 4) - What order of interactions to include. Higher orders may be more accurate but increase the run time. 12 is the maximum allowed order.

* **penalty** (int, optional, default = 1) - Constant that multiplies the regularization term.

* **n_features** (int, optional, default = None) - If None, will automatically choose number of features based on search. Otherwise, the number of top features to select.

* **max_features** (int, optional, default = None) - If not None, will use the 'elbow method' to determine the number of features with max_features as the upper limit.

* **learning_rate** (float, optional, default = 1e-1) - The learning rate.

* **init** (*zero, on, off, onhigh, offhigh, or sklearn, optional, default = zero*) - How to initialize the vector of scores. 'zero' is the default.

* **n_epochs** (int, optional, default = 1) - The number of epochs to run.

* **shuffle** (bool, optional, default = True) - Shuffle "rows" prior to an epoch.

* **batch_size** (int, optional, default = 1000) - Number of "rows" to process at a time.

* **target_batch_size** (int, optional, default = 1000) - Number of "rows" to accumulate gradients over. Useful when many rows will not fit into memory but are needed for accurate estimation.

* **classification** (bool, optional, default = True) - If True, problem is classification, else regression.

* **ordinal** (bool, optional, default = True) - If True, problem is ordinal classification. Requires classification to be True.

* **balanced** (bool, optional, default = True) - If True, each class is weighted equally in optimization; otherwise, weighting is done via the support of each class. Requires classification to be True.

* **preprocess** (str, optional, default = 'zscore') - Either 'zscore', which centers the data and normalizes it to unit variance, or 'center', which only centers the data to 0 mean.

* **soft_grouping** (bool, optional, default = True) - If True, groups represent features that come from the same source. Used to encourage sparsity of groups and features within groups.

* **verbose** (int, optional, default = 0) - Controls the verbosity when fitting. Set to 0 for no printing, or to 1 or higher to print every `verbose` gradient steps.

* **device** (str, optional, default = 'cpu') - 'cpu' to run on CPU and 'cuda' to run on GPU. Runs much faster on GPU.
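The difference between the two preprocessing modes described in the list above can be sketched on a single column. This is an illustration only: the helper `preprocess_column` is hypothetical, and using the population standard deviation is an assumption, since the documentation does not say which estimator the selector uses.

```python
import statistics


def preprocess_column(values, mode="zscore"):
    """Center a column, and optionally scale it to unit variance (hypothetical)."""
    mean = statistics.mean(values)
    centered = [v - mean for v in values]
    if mode == "center":
        return centered
    if mode == "zscore":
        std = statistics.pstdev(values)  # assumption: population standard deviation
        # A constant column has zero variance, so it is only centered.
        return [c / std for c in centered] if std > 0 else centered
    raise ValueError("mode must be 'zscore' or 'center'")


column = [2.0, 4.0, 6.0]
centered = preprocess_column(column, mode="center")
zscored = preprocess_column(column, mode="zscore")
print(centered)
print(zscored)
```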


**Requirement of `fit` FuncArgs**

* **X** (array-like, required) - The training input samples, with shape = [n_samples, n_features].

* **y** (array-like, required) - The target values (class labels in classification, real numbers in regression), with shape = [n_samples].

* **groups** (array-like, optional, default = None) - Groups of columns that must be selected as a unit, with shape = [n_features]. For example, [0, 0, 1, 2] specifies that the first two columns are part of one group.
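To show how a `groups` array such as [0, 0, 1, 2] can be read, the sketch below maps each group id to the column indices that carry it. The helper `columns_by_group` is hypothetical and is not part of the selector's API.

```python
from collections import defaultdict


def columns_by_group(groups):
    """Map each group id to the column indices that belong to it (hypothetical)."""
    members = defaultdict(list)
    for column, group_id in enumerate(groups):
        members[group_id].append(column)
    return dict(members)


groups = [0, 0, 1, 2]
grouped = columns_by_group(groups)
print(grouped)
```

Here columns 0 and 1 share group 0, so they would be kept or dropped together, while columns 2 and 3 each form their own group.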

**Requirement of `get_selected_features` FuncArgs**

For now, the `get_selected_features` function has no parameters.

3 changes: 3 additions & 0 deletions docs/en_US/FeatureEngineering/Overview.md
@@ -0,0 +1,3 @@
# FeatureEngineering

We are glad to announce the alpha release of the Feature Engineering toolkit on top of NNI. It is still in the experimental phase and might evolve based on usage feedback. We'd like to invite you to use it, give feedback, and even contribute.
better to give more description, the design, supported algorithms, etc.

@@ -0,0 +1,65 @@
# Copyright (c) Microsoft Corporation
# All rights reserved.
#
# MIT License
#
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
# documentation files (the "Software"), to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and
# to permit persons to whom the Software is furnished to do so, subject to the following conditions:
# The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING
# BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
# DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

import bz2
import urllib.request
import numpy as np

from sklearn.datasets import load_svmlight_file
from sklearn.model_selection import train_test_split

from nni.feature_engineering.gbdt_selector import GBDTSelector

url_zip_train = 'https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/rcv1_train.binary.bz2'
urllib.request.urlretrieve(url_zip_train, filename='train.bz2')

f_svm = open('train.svm', 'wt')
with bz2.open('train.bz2', 'rb') as f_zip:
    data = f_zip.read()
    f_svm.write(data.decode('utf-8'))
f_svm.close()

X, y = load_svmlight_file('train.svm')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

lgb_params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': {'l2', 'l1'},
    'num_leaves': 20,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0}

eval_ratio = 0.1
early_stopping_rounds = 10
importance_type = 'gain'
num_boost_round = 1000
topk = 10

selector = GBDTSelector()
selector.fit(X_train, y_train,
             lgb_params = lgb_params,
             eval_ratio = eval_ratio,
             early_stopping_rounds = early_stopping_rounds,
             importance_type = importance_type,
             num_boost_round = num_boost_round)

print("selected features\t", selector.get_selected_features(topk=topk))

@@ -0,0 +1,55 @@
# Copyright (c) Microsoft Corporation
# All rights reserved.
#
# MIT License
#
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
# documentation files (the "Software"), to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and
# to permit persons to whom the Software is furnished to do so, subject to the following conditions:
# The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING
# BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
# DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


import bz2
import urllib.request
import numpy as np

from sklearn.datasets import load_svmlight_file
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

from nni.feature_engineering.gradient_selector import FeatureGradientSelector

url_zip_train = 'https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/rcv1_train.binary.bz2'
urllib.request.urlretrieve(url_zip_train, filename='train.bz2')

f_svm = open('train.svm', 'wt')
with bz2.open('train.bz2', 'rb') as f_zip:
    data = f_zip.read()
    f_svm.write(data.decode('utf-8'))
f_svm.close()


X, y = load_svmlight_file('train.svm')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

fgs = FeatureGradientSelector(n_features=10)
fgs.fit(X_train, y_train)

print("selected features\t", fgs.get_selected_features())

pipeline = make_pipeline(FeatureGradientSelector(n_epochs=1, n_features=10), LogisticRegression())
pipeline = make_pipeline(SelectFromModel(ExtraTreesClassifier(n_estimators=50)), LogisticRegression())
pipeline.fit(X_train, y_train)

print("Pipeline Score: ", pipeline.score(X_train, y_train))
Empty file.
59 changes: 59 additions & 0 deletions src/sdk/pynni/nni/feature_engineering/feature_selector.py
@@ -0,0 +1,59 @@
# Copyright (c) Microsoft Corporation. All rights reserved.
#
# MIT License
#
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and
# associated documentation files (the "Software"), to deal in the Software without restriction,
# including without limitation the rights to use, copy, modify, merge, publish, distribute,
# sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all copies or
# substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT
# NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
# DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT
# OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
# ==================================================================================================

import logging

_logger = logging.getLogger(__name__)


class FeatureSelector():

    def __init__(self, **kwargs):
        self.selected_features_ = None
        self.X = None
        self.y = None


    def fit(self, X, y, **kwargs):
        """
        Fit the training data to FeatureSelector

        Parameters
        ----------
        X : array-like numpy matrix
            The training input samples, with shape [n_samples, n_features].
        y : array-like numpy matrix
            The target values (class labels in classification, real numbers in
            regression), with shape [n_samples].
        """
        self.X = X
        self.y = y


    def get_selected_features(self):
        """
        Get the features selected by the fitted FeatureSelector

        Returns
        -------
        list :
            The indices of the important features.
        """
        return self.selected_features_
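As a sketch of how a concrete selector might build on this base class, the example below ranks columns by variance and stores the winning indices in `selected_features_`. Everything here is an assumption for illustration: `VarianceSelector` and its `topk` argument are hypothetical, and the small `FeatureSelector` stand-in only mirrors the class in this diff so that the snippet runs without NNI installed.

```python
class FeatureSelector:
    """Stand-in mirroring the base class in this diff (not imported from NNI)."""

    def __init__(self, **kwargs):
        self.selected_features_ = None
        self.X = None
        self.y = None

    def fit(self, X, y, **kwargs):
        self.X = X
        self.y = y

    def get_selected_features(self):
        return self.selected_features_


class VarianceSelector(FeatureSelector):
    """Hypothetical subclass: keeps the `topk` columns with the largest variance."""

    def __init__(self, topk=1, **kwargs):
        super().__init__(**kwargs)
        self.topk = topk

    def fit(self, X, y, **kwargs):
        super().fit(X, y, **kwargs)  # stores X and y on the instance
        n_samples = len(X)
        variances = []
        for j in range(len(X[0])):
            column = [row[j] for row in X]
            mean = sum(column) / n_samples
            variances.append(sum((v - mean) ** 2 for v in column) / n_samples)
        # Rank column indices by descending variance and keep the first topk.
        ranked = sorted(range(len(variances)), key=lambda j: -variances[j])
        self.selected_features_ = ranked[:self.topk]
        return self


selector = VarianceSelector(topk=1)
selector.fit([[1.0, 10.0], [1.0, 20.0], [1.0, 30.0]], [0, 1, 0])
print(selector.get_selected_features())
```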
@@ -0,0 +1 @@
from .gbdt_selector import GBDTSelector