SLEP010 n_features_in_ attribute #22

Merged
merged 12 commits into from
Nov 7, 2019
112 changes: 112 additions & 0 deletions slep010/proposal.rst
@@ -0,0 +1,112 @@
.. _slep_010:

=====================================
SLEP010: ``n_features_in_`` attribute
=====================================

:Author: Nicolas Hug
:Status: Under review
:Type: Standards Track
:Created: 2019-11-23

Abstract
########

This SLEP proposes the introduction of a public ``n_features_in_`` attribute
for most estimators (where relevant). This attribute is automatically set
when calling a new method ``BaseEstimator._validate_data(X, y=None)`` which
is meant to replace ``check_array`` and ``check_X_y`` in most cases, calling
those under the hood.

Motivation
##########

Knowing the number of features that an estimator expects is useful for
inspection purposes, as well as for input validation.

Solution
########

The proposed solution is to replace most calls to ``check_array()`` or
``check_X_y()`` by calls to a newly created private method::
Member

When we say "private" do we mean that we do not authorise third party libraries to rely on this API?

Member Author

yes. I added a note.


    def _validate_data(self, X, y=None, reset=True, **check_array_params):
        ...

The ``_validate_data()`` method will call either ``check_array()`` or
``check_X_y()``, depending on whether ``y`` is provided.

If the ``reset`` parameter is True (default), the method will set the
``n_features_in_`` attribute of the estimator, overwriting any previous
value. This should typically be used in ``fit()``, or in the first
``partial_fit()`` call. Passing ``reset=False`` will not set the attribute but
instead check the input against it, and raise an error on a mismatch. This
should typically be used in ``predict()`` or ``transform()``, or on subsequent
calls to ``partial_fit()``.
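
The ``reset`` semantics described above can be sketched roughly as follows.
This is a toy illustration, not the scikit-learn implementation: the classes
``ValidationSketch`` and ``MeanPredictor`` are hypothetical, the input is
assumed to be a plain list of rows, and the feature count stands in for the
real ``check_array()`` / ``check_X_y()`` calls.

```python
class ValidationSketch:
    """Toy stand-in for the proposed ``_validate_data`` logic (hypothetical)."""

    def _validate_data(self, X, y=None, reset=True):
        # A real implementation would call check_array() or check_X_y() here;
        # this sketch only counts the columns of a list-of-rows input.
        n_features = len(X[0])
        if reset:
            # fit() or first partial_fit(): overwrite any previous value.
            self.n_features_in_ = n_features
        elif n_features != self.n_features_in_:
            # predict(), transform(), later partial_fit(): check only.
            raise ValueError(
                f"X has {n_features} features, but this estimator "
                f"expects {self.n_features_in_} features as input."
            )
        return X if y is None else (X, y)


class MeanPredictor(ValidationSketch):
    """Minimal estimator that always predicts the mean of the training targets."""

    def fit(self, X, y):
        X, y = self._validate_data(X, y, reset=True)
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        X = self._validate_data(X, reset=False)
        return [self.mean_] * len(X)
```

Under these assumptions, refitting on data with a different number of columns
silently updates ``n_features_in_``, while ``predict()`` on such data raises.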

In most cases, the ``n_features_in_`` attribute exists only once ``fit`` has
been called, but there are exceptions (see below).

A new common check is added: it makes sure that for most estimators, the
``n_features_in_`` attribute does not exist until ``fit`` is called, and
that its value is correct. Instead of raising an exception, this check will
raise a warning for the next two releases. This will give downstream
packages some time to adjust (see considerations below).
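
As a rough sketch of what such a check could look like during the warning
period, assuming a list-of-rows input: the function name
``check_n_features_in_sketch`` and the choice of ``FutureWarning`` are
illustrative assumptions, and the sketch ignores the exceptions listed in the
considerations below.

```python
import warnings


def check_n_features_in_sketch(estimator, X, y):
    """Hypothetical common check: warn, rather than fail, on violations."""
    problems = []
    # The attribute should not exist before fit for most estimators.
    if hasattr(estimator, "n_features_in_"):
        problems.append("n_features_in_ exists before fit")
    estimator.fit(X, y)
    # After fit, the attribute should equal the number of columns of X.
    if getattr(estimator, "n_features_in_", None) != len(X[0]):
        problems.append("n_features_in_ is missing or wrong after fit")
    for message in problems:
        # Warning category chosen for illustration only.
        warnings.warn(message, FutureWarning)
    return problems
```

Once the warning period is over, the same conditions would raise instead of
warn.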

The logic that is proposed here (calling a stateful method instead of a
stateless function) is a pre-requisite to fixing the dataframe column
ordering issue: with a stateless ``check_array``, there is no way to raise
an error if the column ordering of a dataframe was changed between ``fit``
and ``predict``.
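
To illustrate why state is needed, here is a minimal sketch of how a stateful
validator could detect a change in column ordering. It is not part of this
proposal: the class ``ColumnOrderSketch``, the method ``_validate_columns``
and the attribute ``_fit_columns`` are hypothetical names, and column names
are passed as a plain list rather than read from a dataframe.

```python
class ColumnOrderSketch:
    """Hypothetical: remember column names at fit, compare at predict."""

    def _validate_columns(self, columns, reset=True):
        columns = list(columns)
        if reset:
            # fit(): record the ordering seen during training.
            self._fit_columns = columns
        elif columns != self._fit_columns:
            # predict(): a stateless function has nothing to compare against.
            raise ValueError(
                f"Column ordering changed: fitted on {self._fit_columns}, "
                f"got {columns}."
            )
        return columns
```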

Considerations
##############

The main consideration is that the addition of the common test means that
existing estimators in downstream libraries will not pass our test suite,
unless the estimators also have the ``n_features_in_`` attribute (which can be
done by updating calls to ``check_XXX()`` into calls to ``_validate_data()``).

The newly introduced checks will only raise a warning instead of an exception
for the next 2 releases, so this will give more time for downstream packages
to adjust.

Note that we have never guaranteed any kind of backward compatibility
regarding the test suite: see e.g. `#12328
<https://github.com/scikit-learn/scikit-learn/pull/12328>`_, `#14680
<https://github.com/scikit-learn/scikit-learn/pull/14680>`_, or `#9270
<https://github.com/scikit-learn/scikit-learn/pull/9270>`_ which all add new
checks.

Member

(We should probably support the :issue: role in this repo)

Member

These are categorically different, since estimators implemented according to our developers' guide would continue to work after these checks were added. The current proposal does not do that, without an update to the developers' guide, hence making the new requirement force the version of scikit-learn with which an estimator is compatible.

Member Author

What about when check_classifiers_classes was introduced (can't find the original PR)?

That check fails on HistGradientBoostingClassifier with its default parameters (min_samples_leaf=20 is too high for this dataset with 30 samples).

And HistGradientBoostingClassifier is definitely implemented according to our developers guide.

I'm sure we have tons of instances like that where our tests are so specific that they will break existing well-behaved estimators.

Member

StackingClassifier and StackingRegressor required a change in the common tests (I think sample weight tests), so adding those sample weight tests also would have broken other estimators that are implemented according to the developers guide.

In some sense the requirement added here is qualitatively different in that it requires a new attribute. But I'm not sure if that's a difference in practice. We add a test and the third-party developer has his test break, and needs to change the code to make the test work.

I'm not sure if it makes a difference to the third-party developer whether the breakage was due to an implicit detail of the tests or an explicit change of the API. I would argue the second one might actually be less annoying.

What is unfortunate about the change is that it makes it hard for a third-party developer to be compatible with several versions of scikit-learn. However, I would suggest that they keep using check_array and implement n_features_in_ themselves.

If/when we do feature name consistency, this might be a bit trickier because it might require a bit more code to implement, but I don't think it's that bad.

Member

What I'm asking is that these aspects be considered and noted, so that the vote on the SLEP takes this into account rather than allowing the reviewers to miss this compatibility issue.


There are other minor considerations:

- In most meta-estimators, the input validation is handled by the
  sub-estimator(s). The ``n_features_in_`` attribute of the meta-estimator
  is thus explicitly set to that of the sub-estimator, either via a
  ``@property``, or directly in ``fit()``.
- Some estimators like the dummy estimators do not validate the input
  (the 'no_validation' tag should be True). The ``n_features_in_`` attribute
  should be set to None, though this is not enforced in the common tests.
- Some estimators expect a non-rectangular input: the vectorizers. These
  estimators expect dicts or lists, not a ``n_samples * n_features`` matrix.
  ``n_features_in_`` makes no sense here and these estimators just don't have
  the attribute.
- Some estimators may know the number of input features before ``fit`` is
  called: typically the ``SparseCoder``, where ``n_features_in_`` is known at
  ``__init__`` from the ``dictionary`` parameter. In this case the attribute
  is a property and is available right after object instantiation.
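
The two property-based cases in the list above (meta-estimators and
estimators that know their input width at construction) might look roughly
like this. All class names are hypothetical, inputs are plain lists of rows,
and the dictionary is assumed to have one row per component and one column
per feature. A real meta-estimator would also fit a clone of the
sub-estimator rather than the object it was given.

```python
class InnerSketch:
    """Hypothetical sub-estimator that sets n_features_in_ in fit()."""

    def fit(self, X, y=None):
        self.n_features_in_ = len(X[0])
        return self


class MetaSketch:
    """Hypothetical meta-estimator delegating to its fitted sub-estimator."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        # Simplification: fits the given object directly, without cloning.
        self.estimator_ = self.estimator.fit(X, y)
        return self

    @property
    def n_features_in_(self):
        # Raises AttributeError before fit, since estimator_ does not exist,
        # so hasattr(meta, "n_features_in_") is False until fit() is called.
        return self.estimator_.n_features_in_


class FixedDictionarySketch:
    """Hypothetical estimator whose input width is known at construction."""

    def __init__(self, dictionary):
        self.dictionary = dictionary

    @property
    def n_features_in_(self):
        # Available right after instantiation: one column per input feature.
        return len(self.dictionary[0])
```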

References and Footnotes
------------------------

.. [1] Each SLEP must either be explicitly labeled as placed in the public
   domain (see this SLEP as an example) or licensed under the `Open
   Publication License`_.

.. _Open Publication License: https://www.opencontent.org/openpub/


Copyright
---------

This document has been placed in the public domain. [1]_
5 changes: 4 additions & 1 deletion under_review.rst
@@ -1,4 +1,7 @@
SLEPs under review
==================

Nothing here
.. toctree::
    :maxdepth: 1

    slep010/proposal