-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SLEP010 n_features_in_ attribute #22
Changes from 9 commits
354a6a0
df083c4
ecff33d
08630ed
5a247e7
732dc34
f26bc32
78a0d8e
8d4ccb6
593e92c
9cee1c9
2f37147
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,112 @@ | ||
.. _slep_010: | ||
|
||
===================================== | ||
SLEP010: ``n_features_in_`` attribute | ||
===================================== | ||
|
||
:Author: Nicolas Hug | ||
:Status: Under review | ||
:Type: Standards Track | ||
:Created: 2019-11-23 | ||
|
||
Abstract | ||
######## | ||
|
||
This SLEP proposes the introduction of a public ``n_features_in_`` attribute | ||
for most estimators (where relevant). This attribute is automatically set | ||
when calling new methods on ``BaseEstimator`` -- ``_validate_X()`` or | ||
``_validate_X_y()`` -- which are meant to replace ``check_array`` and | ||
``check_X_y`` in most cases, calling those under the hood. | ||
|
||
Motivation | ||
########## | ||
|
||
Knowing the number of features that an estimator expects is useful for | ||
inspection purposes, as well as for input validation. | ||
NicolasHug marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
Solution | ||
######## | ||
|
||
The proposed solution is to replace most calls to ``check_array()`` or | ||
``check_X_y()`` by calls to two newly created private methods:: | ||
|
||
def _validate_X(self, X, check_n_features=False, **check_array_params) | ||
... | ||
|
||
def _validate_X_y(self, X, y, check_n_features=False, **check_X_y_params) | ||
... | ||
|
||
The ``_validate_XXX()`` methods will call the corresponding ``check_XXX()`` | ||
functions. | ||
|
||
The ``check_n_features`` parameter is False by default and can be set to True | ||
to raise an error when ``self.n_features_in_ != X.shape[1]``. The idea is to | ||
leave it to False in ``fit()`` and set it to True in ``predict()`` or | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should we instead have There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you give an example that would handle both Does There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Forgot about this point. I'm -.5 on having If we want to do two methods There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is it rather I think the point is that where we really want the two validation methods to differ is whether they are storing state or checking against state. The There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thoughts here? Is There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry this was hidden for some reason. I think |
||
``transform()``. | ||
|
||
In most cases, the attribute exists only once ``fit`` has been called, but | ||
there are exceptions (see below). | ||
|
||
A new common check is added: it makes sure that for most estimators, the | ||
``n_features_in_`` attribute does not exist until ``fit`` is called, and | ||
that its value is correct. Instead of raising an exception, this check will | ||
raise a warning for the next two releases. This will give downstream | ||
packages some time to adjust (see considerations below). | ||
|
||
The logic that is proposed here (calling a stateful method instead of a | ||
stateless function) is a pre-requisite to fixing the dataframe column | ||
ordering issue: with a stateless `check_array`, there is no way to raise an | ||
error if the column ordering of a dataframe was changed between ``fit`` and | ||
``predict``. | ||
|
||
Considerations | ||
############## | ||
|
||
The main consideration is that the addition of the common test means that | ||
existing estimators in downstream libraries will not pass our test suite, | ||
unless the estimators also have the `n_features_in_` attribute (which can be | ||
done by updating calls to ``check_XXX`` into calls to ``_validate_XXX``). | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think given that this was the main concern it should be discussed a bit? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Calling There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We have done such things to support different versions of scipy or python, I don't see how that should be a blocker. We can have clear guidelines in our docs and tell people how they can handle it, or even backport these methods to older versions. |
||
|
||
The newly introduced checks will only raise a warning instead of an exception | ||
for the next 2 releases, so this will give more times for downstream packages | ||
to adjust. | ||
|
||
Note that we have never guaranteed any kind of backward compatibility | ||
regarding the test suite: see e.g. `#12328 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. (We should probably support the |
||
<https://github.com/scikit-learn/scikit-learn/pull/12328>`_, `14680 | ||
<https://github.com/scikit-learn/scikit-learn/pull/14680>`_, or `9270 | ||
<https://github.com/scikit-learn/scikit-learn/pull/9270>`_ which all add new | ||
checks. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. These are categorically different, since estimators implemented according to our developers' guide would continue to work after these checks were added. The current proposal does not do that, without an update to the developers' guide, hence making the new requirement force the version of scikit-learn with which an estimator is compatible. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What about when That check fails on And I'm sure we have tons of instances like that where our tests are so specific that they will break existing well-behaved estimators. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. StackingClassifier and StackingRegressor required a change in the common tests (I think sample weight tests), so adding those sample weight tests also would have broken other estimators that are implemented according to the developers guide. In some sense the requirement added here is qualitatively different in that it requires a new attribute. But I'm not sure if that's a difference in practice. We add a test and the third-party developer has his test break, and needs to change the code to make the test work. I'm not sure if it makes a difference to the third-party developer whether the breakage was due to an implicit detail of the tests or an explicit change of the API. I would argue the second one might actually be less annoying. What is unfortunate about the change is that it makes it hard for a third-party developer to be compatible with several versions of scikit-learn. However, I would suggest that they keep using If/when we do feature name consistency, this will might be a bit trickier because it might require a bit more code to implement, but I don't think it's that bad. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What I'm asking is that these aspects be considered and noted, so that the vote on the SLEP takes this into account rather than allowing the reviewers to miss this compatibility issue. |
||
|
||
There are other minor considerations: | ||
|
||
- In most meta-estimators, the input validation is handled by the | ||
sub-estimator(s). The ``n_features_in_`` attribute of the meta-estimator | ||
is thus explicitly set to that of the sub-estimator, either via a | ||
``@property``, or directly in ``fit()``. | ||
- Some estimators like the dummy estimators do not validate the input | ||
(the 'no_validation' tag should be True). The ``n_features_in_`` attribute | ||
should be set to None, though this is not enforced in the common tests. | ||
- Some estimators expect a non-rectangular input: the vectorizers. These | ||
estimators expect dicts or lists, not a ``n_samples * n_features`` matrix. | ||
``n_features_in_`` makes no sense here and these estimators just don't have | ||
the attribute. | ||
- Some estimators may know the number of input features before ``fit`` is | ||
called: typically the ``SparseCoder``, where ``n_feature_in_`` is known at | ||
NicolasHug marked this conversation as resolved.
Show resolved
Hide resolved
|
||
``__init__`` from the ``dictionary`` parameter. In this case the attribute | ||
is a property and is available right after object instantiation. | ||
|
||
References and Footnotes | ||
------------------------ | ||
|
||
.. [1] Each SLEP must either be explicitly labeled as placed in the public | ||
domain (see this SLEP as an example) or licensed under the `Open | ||
Publication License`_. | ||
|
||
.. _Open Publication License: https://www.opencontent.org/openpub/ | ||
|
||
|
||
Copyright | ||
--------- | ||
|
||
This document has been placed in the public domain. [1]_ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,7 @@ | ||
SLEPs under review | ||
================== | ||
|
||
Nothing here | ||
.. toctree:: | ||
:maxdepth: 1 | ||
|
||
slep010/proposal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the status of these methods in terms of API compatibility? Do we document them similarly to public methods? I don't think we have any other instances of "protected" (in the Java sense) API that we provide compatibility guarantees on.
Indeed, would it be more consistent to establish these as public helper functions which nonetheless take
self
(orest
) as a first argument and mutate its attributes?? Doing so would also make it easier for downstream packages to backport the latest_validate_X
implementation.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are two separate points, right, public/private and method/function.
public vs private: Do we want to have backward compatibility guarantees for these? In terms of API or in terms of behavior? If we add checking of column names, that's not backward compatible behavior and we're certainly planning that.
method vs function:
from a pure design point a method seems like the right thing to do. I do agree with it being easier to backport a function. In what settings would they want to backport that, though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or I guess the column name checking would need deprecation anyway and then it's backward compatible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The point is that we can't suddenly require the caller of this method to pass an additional param.
Column name checking could be FutureWarninged for a while, or could be regarded as a bug fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I don't follow your first point. Was are you arguing for?
If it's private, we could (in theory), right? And are you thinking about something specific or just in general?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thoughts on this are:
We have an API contract (via the common checks) saying "estimators need to have a n_features_in_ attrribute", and maybe later "column names must be consistent".
But the way third party libraries implement this is up to them. If they want to use our helpers, fine, but they're private.