SLEP010 n_features_in_ attribute #22

Merged
merged 12 commits on Nov 7, 2019
81 changes: 81 additions & 0 deletions slep010/proposal.rst
@@ -0,0 +1,81 @@
.. _slep_010:

=================================
SLEP010: n_features_in_ attribute
=================================

Member
this is being treated as markup. Put it in ` or ``

:Author: Nicolas Hug
:Status: Under review
:Type: Standards Track
:Created: 2019-11-23

Abstract
########

This SLEP proposes the introduction of a public ``n_features_in_`` attribute
for most estimators (where relevant). This attribute is automatically set
when calling ``_validate_X()`` or ``_validate_X_y()``, which are meant to
replace ``check_array`` and ``check_X_y`` (both are still called under the
hood).

Motivation
##########

Knowing the number of features that an estimator expects is useful for
inspection purposes, as well as for input validation.

The logic that is proposed here (calling a stateful method instead of a
stateless function) is also a prerequisite to fixing the dataframe column
ordering issue: at the moment, there is no way to raise an error if the
column ordering of a dataframe was changed between ``fit`` and ``predict``.
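
To make this concrete, here is a small illustration (data and estimator are
arbitrary; this shows the behaviour at the time of this SLEP, before any
column checks existed)::

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    X = pd.DataFrame({"age": [20, 30, 40, 50], "income": [3, 6, 9, 12]})
    y = [0, 0, 1, 1]

    clf = LogisticRegression().fit(X, y)

    # Swapping the columns changes the meaning of each feature, but a
    # stateless check_array() only sees a 2d array of the right shape,
    # so no error can be raised here.
    clf.predict(X[["income", "age"]])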

Solution
########

The proposed solution is to replace most calls to ``check_array()`` or
``check_X_y()`` with calls to two newly created private methods::

    def _validate_X(self, X, check_n_features=False, **check_array_params):
        ...

    def _validate_X_y(self, X, y, check_n_features=False, **check_X_y_params):
        ...

The ``_validate_XXX()`` methods will call the corresponding ``check_XXX()``
functions.
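
Purely as an illustration of the intended mechanics (the SLEP specifies the
interface, not this exact body), ``_validate_X`` could look roughly like the
sketch below; ``_validate_X_y`` would be analogous, delegating to
``check_X_y``::

    from sklearn.utils import check_array

    class BaseEstimator:
        # Illustrative sketch only; the real method would live on
        # sklearn.base.BaseEstimator and may differ in detail.
        def _validate_X(self, X, check_n_features=False, **check_array_params):
            X = check_array(X, **check_array_params)
            if check_n_features:
                # predict()/transform() time: compare against the fitted value.
                if X.shape[1] != self.n_features_in_:
                    raise ValueError(
                        f"X has {X.shape[1]} features, but this estimator "
                        f"was fitted with {self.n_features_in_} features."
                    )
            else:
                # fit() time: record the number of features for later checks.
                self.n_features_in_ = X.shape[1]
            return X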

The ``check_n_features`` parameter is False by default and can be set to True
to raise an error when ``self.n_features_in_ != X.shape[1]``. The idea is to
leave it as False in ``fit()`` and set it to True in ``predict()`` or
``transform()``.

Member
Should we instead have _validate_for_fit and _validate_after_fit?

Member Author
Can you give an example that would handle both check_n_features and check_column_names?

Does _validate_after_fit also count as _validate_for_subsequent_calls_to_partial_fit?

Member
Forgot about this point. I'm -0.5 on having check_n_features. I'm not sure if two methods are the way to go (though it might be good?).
My suggestion was to have a parameter fit=True (or initial_fit, fresh_fit, new_fit, or even reset).
This is also relevant when having warm_start; then it's even different across calls to fit.

If we want two methods, _validate_for_fit sounds good, but the other one is harder to name, unless we say _validate_after_fit means _validate_after_initial_call_to_fit. Which wouldn't be too bad, but maybe slightly misleading?

Member
Is it rather _validate_fit and _validate_consistency? I'm happy with _validate_data(initial=True) or _validate_data(reset=True).

I think the point is that where we really want the two validation methods to differ is whether they are storing state or checking against state. The X vs X_y distinction is confusing outside of supervised learning.

Member
Thoughts here? Are _validate_X and _validate_X_y the right pair of methods?

Member
Sorry, this was hidden for some reason. I think _validate_data is better.
I'm also ok with _validate_fit and _validate_consistency. Though one might argue it should be _validate_initial_fit or something like that? But maybe that's splitting hairs.
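
Continuing the sketch above, a hypothetical estimator (``MyRegressor`` is a
made-up name) would use the pair as follows::

    class MyRegressor(BaseEstimator):
        def fit(self, X, y):
            # check_n_features is left as False: n_features_in_ is recorded.
            X, y = self._validate_X_y(X, y)
            ...  # fitting logic goes here
            return self

        def predict(self, X):
            # check_n_features=True: raises if the number of features of X
            # differs from self.n_features_in_.
            X = self._validate_X(X, check_n_features=True)
            ...  # prediction logic goes here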

In most cases, the attribute exists only once ``fit`` has been called, but
there are exceptions (see below).

A new common check is added: it makes sure that for most esitmators, the
``n_features_in_`` attribute does not exist until ``fit`` is called, and
that its value is correct.

Member
Suggested change:
- A new common check is added: it makes sure that for most esitmators, the
+ A new common check is added: it makes sure that for most estimators, the
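
A rough sketch of what the common check could assert (the name
``check_n_features_in`` and the toy data are assumptions, not the actual
implementation)::

    import numpy as np

    def check_n_features_in(name, estimator):
        # Hypothetical version of the common check; the actual test in
        # sklearn.utils.estimator_checks may differ in name and detail.
        rng = np.random.RandomState(0)
        X = rng.normal(size=(30, 3))
        y = rng.randint(0, 2, size=30)

        # Before fit, the attribute must not exist.
        assert not hasattr(estimator, "n_features_in_")

        estimator.fit(X, y)

        # After fit, it must equal the number of columns passed to fit.
        assert estimator.n_features_in_ == X.shape[1]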

Considerations
##############

The main consideration is that the addition of the common test means that
existing estimators in downstream libraries will not pass our test suite
unless they replace their calls to ``check_XXX`` with calls to
``_validate_XXX``.

Member
We can also have a deprecation period for this test in estimator checks.

Member
This is not really true / the point, right? They don't need to use our private methods, though they can. They "just" need to provide n_features_in_. The mechanism they use for that doesn't matter. Again, I would separate interface and implementation more.

There are other minor considerations:

- In most meta-estimators, the input validation is handled by the
  sub-estimator(s). The ``n_features_in_`` attribute of the meta-estimator
  is thus explicitly set to that of the sub-estimator, either via a
  ``@property``, or directly in ``fit()`` (see the sketch after this list).
- Some estimators like the dummy estimators do not validate the input
  (the 'no_validation' tag should be True). The ``n_features_in_`` attribute
  should be set to None, though this is not enforced in the common tests.
- Some estimators expect a non-rectangular input: the vectorizers. These
  estimators never have an ``n_features_in_`` attribute (they never call
  ``check_array`` anyway).

Member
That's no excuse. They should be applicable for check_array... What should their behaviour be if they were tested, or if someone were implementing a vectorizer?

Member Author
I don't understand your comment.

Are you simply suggesting to make clear that n_features_in_ doesn't make sense here, since their input isn't an n_samples x n_features matrix?

Member
I'm saying that you should remove "(they never call check_array anyway)" because there should be a policy stated here for such estimators, regardless of whether they are currently tested sufficiently.

Member
But there should be a way for them to comply with the new requirements and pass check_estimator.

- Some estimators may know the number of input features before ``fit`` is
  called: typically the ``SparseCoder``, where ``n_features_in_`` is known at
  ``__init__`` from the ``dictionary`` parameter. In this case the attribute
  is set in ``__init__``.
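
To illustrate the meta-estimator item above, here is a hedged sketch of the
``@property`` variant (``MyMetaEstimator`` is a made-up class)::

    from sklearn.base import BaseEstimator, clone

    class MyMetaEstimator(BaseEstimator):
        def __init__(self, estimator):
            self.estimator = estimator

        def fit(self, X, y):
            # Input validation is delegated to the sub-estimator, whose own
            # fit() sets its n_features_in_.
            self.estimator_ = clone(self.estimator).fit(X, y)
            return self

        @property
        def n_features_in_(self):
            # Expose the sub-estimator's value rather than re-validating X.
            return self.estimator_.n_features_in_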
5 changes: 4 additions & 1 deletion under_review.rst
@@ -1,4 +1,7 @@
 SLEPs under review
 ==================

-Nothing here
+.. toctree::
+   :maxdepth: 1
+
+   slep010/proposal