-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SLEP010 n_features_in_ attribute #22
Merged
Merged
Changes from all commits
Commits
Show all changes
12 commits
Select commit
Hold shift + click to select a range
354a6a0
slep10
NicolasHug df083c4
added copyright
NicolasHug ecff33d
addressed comments
NicolasHug 08630ed
move motivation for solution below
NicolasHug 5a247e7
addressed comments
NicolasHug 732dc34
added argument that its ok to break test suite compat
NicolasHug f26bc32
style comments
NicolasHug 78a0d8e
update vectorizers
NicolasHug 8d4ccb6
added warning instead of exception in common check
NicolasHug 593e92c
now propose validate_data(X, y=None, reset=True)
NicolasHug 9cee1c9
added note about private API
NicolasHug 2f37147
update SLEP to only make it about n_features_in_ attribute
NicolasHug File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,109 @@ | ||
.. _slep_010: | ||
|
||
===================================== | ||
SLEP010: ``n_features_in_`` attribute | ||
===================================== | ||
|
||
:Author: Nicolas Hug | ||
:Status: Under review | ||
:Type: Standards Track | ||
:Created: 2019-11-23 | ||
|
||
Abstract | ||
######## | ||
|
||
This SLEP proposes the introduction of a public ``n_features_in_`` attribute | ||
for most estimators (where relevant). | ||
|
||
Motivation | ||
########## | ||
|
||
Knowing the number of features that an estimator expects is useful for | ||
inspection purposes. This is also useful for implementing the feature names | ||
propagation (`SLEP 8 | ||
<https://github.com/scikit-learn/enhancement_proposals/pull/18>`_) . For | ||
example any of the scaler can easily create feature names if they know | ||
``n_features_in_``. | ||
|
||
Solution | ||
######## | ||
|
||
The proposed solution is to replace most calls to ``check_array()`` or | ||
``check_X_y()`` by calls to a newly created private method:: | ||
|
||
def _validate_data(self, X, y=None, reset=True, **check_array_params) | ||
... | ||
|
||
The ``_validate_data()`` method will call ``check_array()`` or | ||
``check_X_y()`` function depending on the ``y`` parameter. | ||
|
||
If the ``reset`` parameter is True (default), the method will set the | ||
``n_feature_in_`` attribute of the estimator, regardless of its potential | ||
previous value. This should typically be used in ``fit()``, or in the first | ||
``partial_fit()`` call. Passing ``reset=False`` will not set the attribute but | ||
instead check against it, and potentially raise an error. This should typically | ||
be used in ``predict()`` or ``transform()``, or on subsequent calls to | ||
``partial_fit``. | ||
|
||
In most cases, the ``n_features_in_`` attribute exists only once ``fit`` has | ||
been called, but there are exceptions (see below). | ||
|
||
A new common check is added: it makes sure that for most estimators, the | ||
``n_features_in_`` attribute does not exist until ``fit`` is called, and | ||
that its value is correct. Instead of raising an exception, this check will | ||
raise a warning for the next two releases. This will give downstream | ||
packages some time to adjust (see considerations below). | ||
|
||
Since the introduced method is private, third party libraries are | ||
recommended not to rely on it. | ||
|
||
The logic that is proposed here (calling a stateful method instead of a | ||
stateless function) is a pre-requisite to fixing the dataframe column | ||
ordering issue: with a stateless ``check_array``, there is no way to raise | ||
an error if the column ordering of a dataframe was changed between ``fit`` | ||
and ``predict``. This is however out os scope for this SLEP, which only focuses | ||
on the introduction of the ``n_features_in_`` attribute. | ||
|
||
Considerations | ||
############## | ||
|
||
The main consideration is that the addition of the common test means that | ||
existing estimators in downstream libraries will not pass our test suite, | ||
unless the estimators also have the ``n_features_in_`` attribute. | ||
|
||
The newly introduced checks will only raise a warning instead of an exception | ||
for the next 2 releases, so this will give more time for downstream packages | ||
to adjust. | ||
|
||
There are other minor considerations: | ||
|
||
- In most meta-estimators, the input validation is handled by the | ||
sub-estimator(s). The ``n_features_in_`` attribute of the meta-estimator | ||
is thus explicitly set to that of the sub-estimator, either via a | ||
``@property``, or directly in ``fit()``. | ||
- Some estimators like the dummy estimators do not validate the input | ||
(the 'no_validation' tag should be True). The ``n_features_in_`` attribute | ||
should be set to None, though this is not enforced in the common check. | ||
- Some estimators expect a non-rectangular input: the vectorizers. These | ||
estimators expect dicts or lists, not a ``n_samples * n_features`` matrix. | ||
``n_features_in_`` makes no sense here and these estimators just don't have | ||
the attribute. | ||
- Some estimators may know the number of input features before ``fit`` is | ||
called: typically the ``SparseCoder``, where ``n_feature_in_`` is known at | ||
NicolasHug marked this conversation as resolved.
Show resolved
Hide resolved
|
||
``__init__`` from the ``dictionary`` parameter. In this case the attribute | ||
is a property and is available right after object instantiation. | ||
|
||
References and Footnotes | ||
------------------------ | ||
|
||
.. [1] Each SLEP must either be explicitly labeled as placed in the public | ||
domain (see this SLEP as an example) or licensed under the `Open | ||
Publication License`_. | ||
|
||
.. _Open Publication License: https://www.opencontent.org/openpub/ | ||
|
||
|
||
Copyright | ||
--------- | ||
|
||
This document has been placed in the public domain. [1]_ |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,7 @@ | ||
SLEPs under review | ||
================== | ||
|
||
Nothing here | ||
.. toctree:: | ||
:maxdepth: 1 | ||
|
||
slep010/proposal |
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we say "private" do we mean that we do not authorise third party libraries to rely on this API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. I added a note.