.. _slep_015:

==================================
SLEP015: Feature Names Propagation
==================================

:Author: Thomas J Fan
:Status: Rejected
:Type: Standards Track
:Created: 2020-10-03

Abstract
########

This SLEP proposes adding a ``get_feature_names_out`` method to all
transformers and a ``feature_names_in_`` attribute to all estimators.
The ``feature_names_in_`` attribute is set during ``fit`` when the
input ``X`` contains feature names.

Motivation
##########

``scikit-learn`` is commonly used as part of a larger data processing
pipeline. When this pipeline is used to transform data, the result is a
NumPy array, discarding column names. The current workflow for
extracting feature names requires calling ``get_feature_names`` on the
transformer that created the features. This interface can be cumbersome
when used together with a pipeline that transforms multiple sets of
columns::

    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    X = pd.DataFrame({'letter': ['a', 'b', 'c'],
                      'pet': ['dog', 'snake', 'dog'],
                      'distance': [1, 2, 3]})
    y = [0, 0, 1]
    orig_cat_cols, orig_num_cols = ['letter', 'pet'], ['distance']

    ct = ColumnTransformer(
        [('cat', OneHotEncoder(), orig_cat_cols),
         ('num', StandardScaler(), orig_num_cols)])
    pipe = make_pipeline(ct, LogisticRegression()).fit(X, y)

    cat_names = (pipe['columntransformer']
                 .named_transformers_['onehotencoder']
                 .get_feature_names(orig_cat_cols))

    feature_names = np.r_[cat_names, orig_num_cols]

The ``feature_names`` extracted above correspond to the features passed
directly into ``LogisticRegression``. As demonstrated above, extracting
``feature_names`` requires knowing the order of the selected categories
in the ``ColumnTransformer``. Furthermore, if the pipeline includes a
feature selection step, such as ``SelectKBest``, the ``get_support``
method would need to be used to infer which column names were selected.

Solution
########

This SLEP proposes adding a ``feature_names_in_`` attribute to all
estimators, which stores the feature names of ``X`` seen during ``fit``.
It will also be used for validation in non-``fit`` methods such as
``transform`` or ``predict``. If ``X`` is not a recognized container
with column names, ``feature_names_in_`` may be undefined; in that case,
no validation is performed.
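
The fit-time capture and later validation described above can be sketched as
follows. This is an illustrative sketch, not scikit-learn's implementation:
``MiniFrame``, ``NamedInputMixin``, and ``IdentityTransformer`` are
hypothetical stand-ins, with a DataFrame-like input duck-typed through a
``columns`` attribute.

```python
class MiniFrame:
    """Toy stand-in for a pandas DataFrame: named columns plus rows."""
    def __init__(self, columns, rows):
        self.columns = list(columns)
        self.rows = rows

class NamedInputMixin:
    """Records feature names at fit time and validates them later."""
    def _check_feature_names(self, X, reset):
        names = getattr(X, "columns", None)
        if reset:
            if names is not None:
                # Input has recognizable column names: remember them.
                self.feature_names_in_ = list(names)
            return
        # Non-fit call: validate only when names were seen during fit.
        stored = getattr(self, "feature_names_in_", None)
        if stored is not None and names is not None and list(names) != stored:
            raise ValueError(
                f"Feature names {list(names)} do not match "
                f"fit-time names {stored}")

class IdentityTransformer(NamedInputMixin):
    def fit(self, X, y=None):
        self._check_feature_names(X, reset=True)
        return self

    def transform(self, X):
        self._check_feature_names(X, reset=False)
        return X.rows

X = MiniFrame(["letter", "pet", "distance"], [["a", "dog", 1]])
t = IdentityTransformer().fit(X)
print(t.feature_names_in_)  # ['letter', 'pet', 'distance']
```

Passing a frame with reordered columns to ``transform`` would then raise a
``ValueError``, mirroring the validation behavior proposed above.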

Secondly, this SLEP proposes adding
``get_feature_names_out(input_features=None)`` to all transformers. By
default, the input features are determined by the ``feature_names_in_``
attribute. The feature names of a pipeline can then be easily extracted
as follows::

    pipe[:-1].get_feature_names_out()
    # ['cat__letter_a', 'cat__letter_b', 'cat__letter_c',
    #  'cat__pet_dog', 'cat__pet_snake', 'num__distance']

Note that ``get_feature_names_out`` does not require ``input_features``
here because the feature names were stored in the pipeline itself. These
names are passed to each step's ``get_feature_names_out`` method to
obtain the output feature names of the ``Pipeline`` itself.

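The chaining of step names described above can be sketched roughly as
follows. ``OneHotStub``, ``ScalerStub``, and ``pipeline_feature_names_out``
are hypothetical illustrations, not scikit-learn code:

```python
class OneHotStub:
    """Pretends each input feature expands into two indicator columns."""
    def get_feature_names_out(self, input_features):
        return [f"{name}_{v}" for name in input_features for v in ("a", "b")]

class ScalerStub:
    """Scaling keeps feature names unchanged."""
    def get_feature_names_out(self, input_features):
        return list(input_features)

def pipeline_feature_names_out(steps, input_features):
    # Feed each step's output names into the next step's input names.
    names = list(input_features)
    for step in steps:
        names = step.get_feature_names_out(names)
    return names

names = pipeline_feature_names_out([OneHotStub(), ScalerStub()], ["letter"])
print(names)  # ['letter_a', 'letter_b']
```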
Enabling Functionality
######################

The following enhancements are **not** part of this SLEP. They become
possible if this SLEP is accepted.

1. This SLEP enables us to implement an ``array_out`` keyword argument to
   all ``transform`` methods to specify the array container output by
   ``transform``. An implementation of ``array_out`` requires
   ``feature_names_in_`` to validate that the names in ``fit`` and
   ``transform`` are consistent. It also needs a way to map from input
   feature names to output feature names, which is provided by
   ``get_feature_names_out``.

2. An alternative to ``array_out``: transformers in a pipeline may wish
   to have feature names passed in as ``X``. This can be enabled by
   adding an ``array_input`` parameter to ``Pipeline``::

       pipe = make_pipeline(ct, MyTransformer(), LogisticRegression(),
                            array_input='pandas')

   In this case, the pipeline will construct a pandas DataFrame to be
   passed into ``MyTransformer`` and ``LogisticRegression``. The feature
   names will be constructed by calling ``get_feature_names_out`` as data
   passes through the ``Pipeline``. This feature implies that
   ``Pipeline`` itself constructs the DataFrame.
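
A rough sketch of this idea, with plain dicts standing in for pandas
DataFrames; ``DoubleStub`` and ``transform_with_names`` are hypothetical
illustrations, not part of any proposed API:

```python
class DoubleStub:
    """Doubles every value; output names are derived from input names."""
    def get_feature_names_out(self, input_features):
        return [f"doubled__{n}" for n in input_features]

    def transform(self, frame):
        return [[2 * v for v in row] for row in frame["rows"]]

def transform_with_names(steps, frame):
    # After each step, rebuild a named container so the next step
    # receives data together with its feature names.
    for step in steps:
        rows = step.transform(frame)
        names = step.get_feature_names_out(frame["columns"])
        frame = {"columns": names, "rows": rows}
    return frame

out = transform_with_names([DoubleStub()],
                           {"columns": ["distance"], "rows": [[1], [2]]})
print(out)  # {'columns': ['doubled__distance'], 'rows': [[2], [4]]}
```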

Considerations and Limitations
##############################

1. ``get_feature_names_out`` will construct names using the
   name-generation specification from :ref:`slep_007`.

2. For a ``Pipeline`` with only one estimator, slicing with ``[:-1]``
   will not work, and one would need to access the feature names
   directly::

       pipe1 = make_pipeline(StandardScaler(), LogisticRegression())
       pipe1[:-1].feature_names_in_  # Works

       pipe2 = make_pipeline(LogisticRegression())
       pipe2[:-1].feature_names_in_  # Does not work

   ``pipe2[:-1]`` raises an error because it would result in a pipeline
   with no steps. We could work around this by allowing pipelines with no
   steps.

3. ``feature_names_in_`` can be any 1-D array-like of strings, such as a
   list or an ndarray.

4. Meta-estimators will delegate the setting and validation of
   ``feature_names_in_`` to their inner estimators. A meta-estimator will
   define ``feature_names_in_`` by referencing its inner estimators. For
   example, ``Pipeline`` can use ``steps[0].feature_names_in_`` as its
   input feature names. If the inner estimators do not define
   ``feature_names_in_``, then the meta-estimator will not define
   ``feature_names_in_`` either.
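
The delegation rule can be sketched as follows; ``MiniPipeline`` and the
inner classes are hypothetical illustrations, not scikit-learn's
implementation:

```python
class InnerWithNames:
    feature_names_in_ = ["letter", "pet"]

class InnerWithoutNames:
    pass

class MiniPipeline:
    def __init__(self, steps):
        self.steps = steps

    @property
    def feature_names_in_(self):
        # Delegate to the first step; stay undefined when it is undefined.
        first = self.steps[0]
        if not hasattr(first, "feature_names_in_"):
            raise AttributeError("feature_names_in_ is not defined")
        return first.feature_names_in_

print(MiniPipeline([InnerWithNames()]).feature_names_in_)
# ['letter', 'pet']
```

Raising ``AttributeError`` from the property makes
``hasattr(pipe, "feature_names_in_")`` report ``False`` when the inner
estimator never saw feature names.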

Backward compatibility
######################

1. This SLEP is fully backward compatible with previous versions. With
   the introduction of ``get_feature_names_out``, ``get_feature_names``
   will be deprecated. Note that the signature of
   ``get_feature_names_out`` will always contain ``input_features``,
   which can be used or ignored. This standardizes the interface for
   getting feature names.

2. The inclusion of a ``get_feature_names_out`` method will not introduce
   any overhead to estimators.

3. The inclusion of a ``feature_names_in_`` attribute will increase the
   size of estimators because they store the feature names. Users can
   remove the attribute with ``del est.feature_names_in_`` if they want
   to drop the stored names and disable validation.

Alternatives
############

There have been many attempts to address this issue:

1. An ``array_out`` keyword parameter in ``transform``: This approach
   requires third party estimators to unwrap and wrap array containers in
   ``transform``, which places more burden on third party estimator
   maintainers. Furthermore, ``array_out`` with sparse data introduces
   overhead when data is passed along a ``Pipeline``. This overhead comes
   from constructing a sparse data container that carries the feature
   names.

2. :ref:`slep_007`: ``SLEP007`` introduces a ``feature_names_out_``
   attribute, while this SLEP proposes a ``get_feature_names_out`` method
   to accomplish the same task. The benefit of the method is that it can
   be used even if the feature names were not passed in ``fit`` with a
   dataframe. For example, in a ``Pipeline`` the feature names are not
   passed through to each step, and ``get_feature_names_out`` together
   with slicing can be used to get the names at each step.

3. :ref:`slep_012`: The ``InputArray`` was developed to work around the
   overhead of using a pandas ``DataFrame`` or an xarray ``DataArray``.
   Introducing another data structure into the Python data ecosystem
   would place more burden on third party estimator maintainers.

References and Footnotes
########################

.. [1] Each SLEP must either be explicitly labeled as placed in the
   public domain (see this SLEP as an example) or licensed under the
   `Open Publication License`_.

.. _Open Publication License: https://www.opencontent.org/openpub/

Copyright
#########

This document has been placed in the public domain. [1]_