SLEP010 n_features_in_ attribute #22

Merged
merged 12 commits on Nov 7, 2019
112 changes: 112 additions & 0 deletions slep010/proposal.rst
@@ -0,0 +1,112 @@
.. _slep_010:

=====================================
SLEP010: ``n_features_in_`` attribute
=====================================

:Author: Nicolas Hug
:Status: Under review
:Type: Standards Track
:Created: 2019-11-23

Abstract
########

This SLEP proposes the introduction of a public ``n_features_in_`` attribute
for most estimators (where relevant). This attribute is automatically set
when calling new methods on ``BaseEstimator`` -- ``_validate_X()`` or
``_validate_X_y()`` -- which are meant to replace ``check_array`` and

Member:
What is the status of these methods in terms of API compatibility? Do we document them similarly to public methods? I don't think we have any other instances of "protected" (in the Java sense) API that we provide compatibility guarantees on.

Indeed, would it be more consistent to establish these as public helper functions which nonetheless take self (or est) as a first argument and mutate its attributes? Doing so would also make it easier for downstream packages to backport the latest _validate_X implementation.

Member:
These are two separate points, right: public/private and method/function.

public vs private: Do we want backward compatibility guarantees for these? In terms of API or in terms of behavior? If we add checking of column names, that's not backward-compatible behavior, and we're certainly planning to add that.

method vs function:
from a pure design point of view, a method seems like the right thing to do. I do agree that a function would be easier to backport. In what settings would they want to backport it, though?

Member:
Or I guess the column name checking would need a deprecation cycle anyway, and then it's backward compatible?

Member:
The point is that we can't suddenly require the caller of this method to pass an additional param.

Column name checking could be FutureWarninged for a while, or could be regarded as a bug fix.

Member:
Sorry, I don't follow your first point. What are you arguing for?

If it's private, we could (in theory), right? And are you thinking about something specific or just in general?

Member Author:
My thoughts on this are:

We have an API contract (via the common checks) saying "estimators need to have an n_features_in_ attribute", and maybe later "column names must be consistent".

But the way third-party libraries implement this is up to them. If they want to use our helpers, fine, but they're private.

``check_X_y`` in most cases, calling those under the hood.

Motivation
##########

Knowing the number of features that an estimator expects is useful for
inspection purposes, as well as for input validation.
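
For illustration, a minimal usage sketch (assuming the attribute is set at
``fit`` time as proposed; ``LogisticRegression`` is just an arbitrary example
estimator)::

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=100, n_features=5, random_state=0)
    est = LogisticRegression().fit(X, y)

    # The fitted estimator reports how many features it was trained on, which
    # can be used both for inspection and to validate inputs at predict time.
    assert est.n_features_in_ == 5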

Solution
########

The proposed solution is to replace most calls to ``check_array()`` or
``check_X_y()`` by calls to two newly created private methods::

    def _validate_X(self, X, check_n_features=False, **check_array_params):
        ...

    def _validate_X_y(self, X, y, check_n_features=False, **check_X_y_params):
        ...

The ``_validate_XXX()`` methods will call the corresponding ``check_XXX()``
functions.

The ``check_n_features`` parameter is False by default and can be set to True
to raise an error when ``self.n_features_in_ != X.shape[1]``. The idea is to
leave it as False in ``fit()`` and set it to True in ``predict()`` or

Member:
Should we instead have _validate_for_fit and _validate_after_fit?

Member Author:
Can you give an example that would handle both check_n_features and check_column_names?

Does _validate_after_fit also count as _validate_for_subsequent_calls_to_partial_fit?

Member:
Forgot about this point. I'm -.5 on having check_n_features. I'm not sure if two methods are the way to go (though it might be good?).
My suggestion was to have a parameter fit=True (or initial_fit or fresh_fit or new_fit or even reset).
This is also relevant with warm_start; then it even differs between calls to fit.

If we want to do two methods, _validate_for_fit sounds good, but the other one is harder to name, unless we say _validate_after_fit means _validate_after_initial_call_to_fit.
That wouldn't be too bad, but maybe it's slightly misleading?

Member:
Is it rather _validate_fit and _validate_consistency? I'm happy with _validate_data(initial=True) or _validate_data(reset=True).

I think the point is that where we really want the two validation methods to differ is in whether they store state or check against state. The X vs X_y distinction is confusing outside of supervised learning.

Member:
Thoughts here? Are _validate_X and _validate_X_y the right pair of methods?

Member:
Sorry, this was hidden for some reason. I think _validate_data is better.
I'm also OK with _validate_fit and _validate_consistency, though one might argue it should be _validate_initial_fit or something like that; but maybe that's splitting hairs.

``transform()``.

In most cases, the attribute exists only once ``fit`` has been called, but
there are exceptions (see below).
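
For concreteness, here is a simplified sketch of what ``_validate_X`` could
look like (illustrative only; the exact implementation, error message, and
attribute handling are not prescribed by this SLEP)::

    from sklearn.utils.validation import check_array

    def _validate_X(self, X, check_n_features=False, **check_array_params):
        # Delegate the actual validation/conversion to the existing helper.
        X = check_array(X, **check_array_params)
        if check_n_features:
            # predict()/transform(): the data must match what was seen in fit().
            if X.shape[1] != self.n_features_in_:
                raise ValueError(
                    f"X has {X.shape[1]} features, but this estimator was "
                    f"fitted with {self.n_features_in_} features."
                )
        else:
            # fit(): record the number of features for later checks.
            self.n_features_in_ = X.shape[1]
        return X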

A new common check is added: it makes sure that for most estimators, the
``n_features_in_`` attribute does not exist until ``fit`` is called, and
that its value is correct. Instead of raising an exception, this check will
raise a warning for the next two releases. This will give downstream
packages some time to adjust (see considerations below).

The logic that is proposed here (calling a stateful method instead of a
stateless function) is a pre-requisite to fixing the dataframe column
ordering issue: with a stateless ``check_array``, there is no way to raise an
error if the column ordering of a dataframe was changed between ``fit`` and
``predict``.
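
For instance (purely hypothetical, not part of this proposal; the
``_feature_names_in`` attribute below is invented for illustration), a
stateful method could additionally record column names at ``fit`` time and
compare them on later calls::

    def _validate_X(self, X, check_n_features=False, **check_array_params):
        if hasattr(X, "columns"):  # e.g. a pandas DataFrame
            if check_n_features:
                # Later calls: detect columns reordered or renamed since fit().
                if list(X.columns) != self._feature_names_in:
                    raise ValueError("Column names or order changed since fit.")
            else:
                # fit(): remember the column names (hypothetical attribute).
                self._feature_names_in = list(X.columns)
        # ... remainder as in the sketch above ...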

Considerations
##############

The main consideration is that the addition of the common test means that
existing estimators in downstream libraries will not pass our test suite,
unless the estimators also have the ``n_features_in_`` attribute (which can be
done by updating calls to ``check_XXX`` into calls to ``_validate_XXX``).

Member:
I think, given that this was the main concern, it should be discussed a bit?
I would argue that we constantly change check_estimator and add new requirements, and that we have never guaranteed that tests will be forward-compatible.

Member:
Calling _validate_XXX will not work if the downstream library supports Scikit-learn < 0.22. They will need to backport BaseEstimator from 0.22, or handle two cases.

Member:
We have done such things to support different versions of scipy or Python; I don't see how that should be a blocker. We can have clear guidelines in our docs telling people how they can handle it, or even backport these methods to older versions.


The newly introduced checks will only raise a warning instead of an exception
for the next two releases, which will give downstream packages more time to
adjust.
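
For example, one version-agnostic pattern a downstream package could use in
the meantime (a sketch only, not a recommendation made by this SLEP;
``MyEstimator`` is hypothetical) is to keep calling ``check_array`` and set
the attribute directly::

    from sklearn.base import BaseEstimator
    from sklearn.utils.validation import check_array

    class MyEstimator(BaseEstimator):
        def fit(self, X, y=None):
            X = check_array(X)
            # Setting the attribute manually satisfies the new common check
            # without relying on helpers that only exist in recent releases.
            self.n_features_in_ = X.shape[1]
            # ... actual fitting logic ...
            return self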

Note that we have never guaranteed any kind of backward compatibility
regarding the test suite: see e.g. `#12328

Member:
(We should probably support the :issue: role in this repo)

<https://github.com/scikit-learn/scikit-learn/pull/12328>`_, `14680
<https://github.com/scikit-learn/scikit-learn/pull/14680>`_, or `9270
<https://github.com/scikit-learn/scikit-learn/pull/9270>`_ which all add new
checks.

Member:
These are categorically different, since estimators implemented according to our developers' guide would continue to work after these checks were added. The current proposal does not preserve that without an update to the developers' guide, so the new requirement effectively dictates which versions of scikit-learn an estimator can be compatible with.

Member Author:
What about when check_classifiers_classes was introduced (can't find the original PR)?

That check fails on HistGradientBoostingClassifier with its default parameters (min_samples_leaf=20 is too high for this dataset with 30 samples).

And HistGradientBoostingClassifier is definitely implemented according to our developers' guide.

I'm sure we have tons of instances like that where our tests are so specific that they will break existing well-behaved estimators.

Member:
StackingClassifier and StackingRegressor required a change in the common tests (I think the sample weight tests), so adding those sample weight tests would also have broken other estimators that are implemented according to the developers' guide.

In some sense the requirement added here is qualitatively different in that it requires a new attribute. But I'm not sure that's a difference in practice: we add a test, the third-party developer's tests break, and they need to change their code to make the tests pass.

I'm not sure if it makes a difference to the third-party developer whether the breakage was due to an implicit detail of the tests or an explicit change of the API. I would argue the second one might actually be less annoying.

What is unfortunate about the change is that it makes it hard for a third-party developer to be compatible with several versions of scikit-learn. However, I would suggest that they keep using check_array and implement n_features_in_ themselves.

If/when we do feature name consistency, this might be a bit trickier because it might require a bit more code to implement, but I don't think it's that bad.

Member:
What I'm asking is that these aspects be considered and noted, so that the vote on the SLEP takes this into account rather than allowing the reviewers to miss this compatibility issue.


There are other minor considerations:

- In most meta-estimators, the input validation is handled by the
sub-estimator(s). The ``n_features_in_`` attribute of the meta-estimator
is thus explicitly set to that of the sub-estimator, either via a
``@property``, or directly in ``fit()`` (see the sketch after this list).
- Some estimators like the dummy estimators do not validate the input
(the 'no_validation' tag should be True). The ``n_features_in_`` attribute
should be set to None, though this is not enforced in the common tests.
- Some estimators expect a non-rectangular input: the vectorizers. These
estimators expect dicts or lists, not an ``n_samples * n_features`` matrix.
``n_features_in_`` makes no sense here and these estimators just don't have
the attribute.
- Some estimators may know the number of input features before ``fit`` is
called: typically the ``SparseCoder``, where ``n_features_in_`` is known at
``__init__`` from the ``dictionary`` parameter. In this case the attribute
is a property and is available right after object instantiation.
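
As mentioned in the first bullet above, a meta-estimator could expose the
attribute of its fitted sub-estimator roughly like this (illustrative sketch
only; ``MyMetaEstimator`` is hypothetical)::

    from sklearn.base import BaseEstimator, clone

    class MyMetaEstimator(BaseEstimator):
        def __init__(self, estimator):
            self.estimator = estimator

        def fit(self, X, y=None):
            # Input validation is delegated to the sub-estimator.
            self.estimator_ = clone(self.estimator).fit(X, y)
            return self

        @property
        def n_features_in_(self):
            # Mirror the fitted sub-estimator's attribute.
            return self.estimator_.n_features_in_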

References and Footnotes
------------------------

.. [1] Each SLEP must either be explicitly labeled as placed in the public
domain (see this SLEP as an example) or licensed under the `Open
Publication License`_.

.. _Open Publication License: https://www.opencontent.org/openpub/


Copyright
---------

This document has been placed in the public domain. [1]_
5 changes: 4 additions & 1 deletion under_review.rst
@@ -1,4 +1,7 @@
SLEPs under review
==================

-Nothing here
+.. toctree::
+   :maxdepth: 1
+
+   slep010/proposal