
SLEP010 n_features_in_ attribute #22

Merged: 12 commits, Nov 7, 2019

Conversation

@NicolasHug changed the title from [MRG] SLEP 10 n_features_in_ attribute to SLEP010 n_features_in_ attribute on Sep 23, 2019
@adrinjalali left a comment:

We should probably also explain what happens to a pipeline when at least one estimator doesn't provide the API; that would also help formulate what happens when people use a third-party estimator that hasn't updated its API yet.

The main consideration is that the addition of the common test means that
existing estimators in downstream libraries will not pass our test suite,
unless they update their calls to ``check_XXX`` into calls to
``_validate_XXX``.

We can also have a deprecation period for this test in estimator checks.


This is not really true / the point, right? They don't need to use our private methods, though they can. They "just" need to provide n_features_in_. The mechanism they use for that doesn't matter. Again, I would separate interface and implementation more.

@NicolasHug

We should probably also explain what happens to a pipeline if there's at least one estimator which does not provide the API,

I don't understand your intent here? Pipelines just delegate to the first step (other steps are purely ignored).

@adrinjalali

I don't understand your intent here? Pipelines just delegate to the first step (other steps are purely ignored).

True. I was thinking of feature names API. That concern doesn't apply here.

@amueller

Yeah this API will be quite simple tbh.
I wonder if we should discuss n_features_out_ in this SLEP as well. Unfortunately I missed the meeting this morning. Who argued for the SLEP? @rth @ogrisel @jnothman thoughts on whether n_features_out_ should be included here?

@NicolasHug

Maybe we should keep SLEPs simple when we can? (i.e. have a separate one for n_features_out_ if needed)

@amueller

Happy to keep it simple. But if the discussion is only around backward compatibility of check_estimator then the SLEP will be a copy & paste.

@NicolasHug

@scikit-learn/core-devs can we merge and vote please?

@thomasjpfan left a comment:

So there are three options for n_features_in_:

  1. int -> rectangular data.
  2. None -> does not validate.
  3. Not defined -> non-rectangular data.

Is this what a third party library would need to know at a glance?
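As an aside, here is a minimal sketch (the helper name is hypothetical) of how a third-party library might branch on those three states:

def describe_n_features_in(estimator):
    # Not defined -> the estimator consumes non-rectangular data.
    if not hasattr(estimator, "n_features_in_"):
        return "non-rectangular data (e.g. a vectorizer)"
    # None -> the estimator does not validate its input.
    if estimator.n_features_in_ is None:
        return "input is not validated"
    # An int -> rectangular data with that many features.
    return f"rectangular data with {estimator.n_features_in_} features"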

In most cases, the attribute exists only once ``fit`` has been called, but
there are exceptions (see below).

A new common check is added: it makes sure that for most esitmators, the
Suggested change:
- A new common check is added: it makes sure that for most esitmators, the
+ A new common check is added: it makes sure that for most estimators, the

The main consideration is that the addition of the common test means that
existing estimators in downstream libraries will not pass our test suite,
unless the estimators also have the `n_features_in_` attribute (which can be
done by updating calls to ``check_XXX`` into calls to ``_validate_XXX``).

I think, given that this was the main concern, it should be discussed a bit.
I would argue that we constantly change check_estimator and add new requirements, and that we have never guaranteed that tests will be forward-compatible.


Calling _validate_XXX will not work if the downstream library supports Scikit-learn < 0.22. They will need to backport BaseEstimator from 0.22, or handle two cases.


We have done such things to support different versions of scipy or python; I don't see how that should be a blocker. We can have clear guidelines in our docs and tell people how they can handle it, or even backport these methods to older versions.

@adrinjalali

I'm happy to have this merged and have a separate but very similar one for n_features_out_.

@amueller left a comment:

good to merge from my side

.. _slep_010:

=================================
SLEP010: n_features_in_ attribute

this is being treated as markup. Put it in ` or ``


done by updating calls to ``check_XXX`` into calls to ``_validate_XXX``).

Note that we have never guaranteed any kind of backward compatibility
regarding the test suite: see e.g. `#12328
<https://github.com/scikit-learn/scikit-learn/pull/12328>`_, `14680
<https://github.com/scikit-learn/scikit-learn/pull/14680>`_, or `9270
<https://github.com/scikit-learn/scikit-learn/pull/9270>`_ which all add new
checks.

(We should probably support the :issue: role in this repo)

These are categorically different, since estimators implemented according to our developers' guide would continue to work after those checks were added. The current proposal does not do that without an update to the developers' guide, hence the new requirement effectively pins the scikit-learn versions with which an estimator can be compatible.

@NicolasHug replied:

What about when check_classifiers_classes was introduced (can't find the original PR)?

That check fails on HistGradientBoostingClassifier with its default parameters (min_samples_leaf=20 is too high for this dataset with 30 samples).

And HistGradientBoostingClassifier is definitely implemented according to our developers' guide.

I'm sure we have tons of instances like that where our tests are so specific that they will break existing well-behaved estimators.


StackingClassifier and StackingRegressor required a change in the common tests (I think sample weight tests), so adding those sample weight tests also would have broken other estimators that are implemented according to the developers guide.

In some sense the requirement added here is qualitatively different in that it requires a new attribute. But I'm not sure that's a difference in practice: we add a test, the third-party developer's tests break, and they need to change their code to make the tests pass.

I'm not sure if it makes a difference to the third-party developer whether the breakage was due to an implicit detail of the tests or an explicit change of the API. I would argue the second one might actually be less annoying.

What is unfortunate about the change is that it makes it hard for a third-party developer to be compatible with several versions of scikit-learn. However, I would suggest that they keep using check_array and implement n_features_in_ themselves.

If/when we do feature name consistency, this might be a bit trickier because it might require a bit more code to implement, but I don't think it's that bad.
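To illustrate that suggestion, a minimal sketch of an estimator that stays compatible with several scikit-learn versions (the class name is hypothetical):

from sklearn.base import BaseEstimator
from sklearn.utils import check_array

class ThirdPartyEstimator(BaseEstimator):
    def fit(self, X, y=None):
        # Keep using the long-standing public helper rather than the new
        # private method, which only exists in newer scikit-learn versions...
        X = check_array(X)
        # ...and set the new attribute by hand: only the attribute matters,
        # not the mechanism, so this passes the new common check on any
        # scikit-learn version.
        self.n_features_in_ = X.shape[1]
        return self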


What I'm asking is that these aspects be considered and noted, so that the vote on the SLEP takes this into account rather than allowing the reviewers to miss this compatibility issue.

should be set to None, though this is not enforced in the common tests.
- Some estimators expect a non-rectangular input: the vectorizers. These
estimators never have a ``n_features_in_`` attribute (they never call
``check_array`` anyway).

That's no excuse; they should still be accounted for by check_array... What should their behaviour be were they tested, or were someone implementing a vectorizer?

@NicolasHug replied:

I don't understand your comment.

Are you simply suggesting to make it clear that n_features_in_ doesn't make sense here, since their input isn't an n_samples x n_features matrix?


I'm saying that you should remove "(they never call check_array anyway)" because there should be a policy stated here for such estimators, regardless of whether they are currently tested sufficiently.


But there should be a way for them to comply with the new requirements and pass check_estimator.

@jnothman left a comment:

Given the aforementioned backwards compatibility issues (both in terms of new requirements, and the burden of implementing estimators compatible with multiple versions of scikit-learn), I think this needs more work.

@jnothman left a comment:

We also have the option of warning, rather than failing, about non-compliance, for a couple of releases.

@thomasjpfan commented Sep 28, 2019

If we go down the path of warning, it should be shown to library developers and not users.

Edit: Oops, we can have check_estimator warn...

@NicolasHug

Thanks Joel for your feedback.

Andy and I discussed this, and I updated the SLEP: it now proposes a single method _validate_data(X, y=None, reset=True, ...)

It seems that the only remaining discussion is #22 (comment)
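For context, a minimal sketch of how the proposed method would be used, assuming a scikit-learn version where BaseEstimator provides _validate_data (the estimator itself is a toy example):

import numpy as np
from sklearn.base import BaseEstimator

class MeanPredictor(BaseEstimator):
    def fit(self, X, y):
        # reset=True (the default): records X.shape[1] as n_features_in_.
        X, y = self._validate_data(X, y)
        self.mean_ = np.mean(y)
        return self

    def predict(self, X):
        # reset=False: checks X.shape[1] against the stored n_features_in_
        # and raises a ValueError on mismatch.
        X = self._validate_data(X, reset=False)
        return np.full(X.shape[0], self.mean_)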


The proposed solution is to replace most calls to ``check_array()`` or
``check_X_y()`` by calls to a newly created private method::

When we say "private" do we mean that we do not authorise third party libraries to rely on this API?

@NicolasHug replied:

Yes, I added a note.

@NicolasHug

If we're all ok to make the method private and not recommend libraries to use it, let's merge please. I'd like the implementation PR to be merged before we release.

@jnothman commented Oct 24, 2019 via email

@GaelVaroquaux commented Oct 25, 2019 via email

@NicolasHug

I don't think there is any great need to rush this to release.

It is indeed too late now to get anything useful merged before the release.
But I'd still like to get this done soon. All the effort put into scikit-learn/scikit-learn#13603 is pretty much lost now because we've taken so long to come to a consensus.

Personally I had understood that a major goal of this work was to
facilitate data consistency validation

Yes, this work will be useful for consistency checks, and also for feature names propagation. I still don't think that means we should make the method public. But if you disagree, please let it be known.

@thomasjpfan

This SLEP has been cut down to say "Please define n_features_in_ to pass check_estimator in the future".

For most of this SLEP's timeline, I have understood its goal to be improving data consistency validation. We want this because it allows for data validation checks in our meta-estimators (mostly pipeline and column transformer). To actually perform these checks, the estimator needs to remember information about the data used in fit. If we want to perform these checks on third-party estimators, we need those estimators to also remember information about the data. Here are the options discussed in this PR:

  1. This SLEP in its current state, which asks estimators to define n_features_in_. (The private methods are not recommended for use.)
  2. Create public methods that will help do some of the work. (I do not think this would be good, since these public methods are only for developers, and it would not be good to expose them to users.)
  3. Create public functions that will help do some of the work. The public functions will need to be passed self and will add attributes to the estimator.

I think there is a path for the public function to work:

# 0.23
def validate_data(est, check_n_features_in=True, ...):
    ...
    est.n_features_in_ = ...

# 0.24
# When we move to FutureWarning for deprecations, we can use the
# developer-focused DeprecationWarning to signal this.
# (Also, check_estimator will warn.)
def validate_data(est, check_n_features_in=True,
                  check_feature_columns=False, ...):
    ...

# 0.25
def validate_data(est, check_n_features_in=True,
                  check_feature_columns=False,
                  check_other_things=False, ...):
    ...

# 0.26 (check_feature_columns defaults to True)
def validate_data(est, check_n_features_in=True,
                  check_feature_columns=True,
                  check_other_things=False, ...):
    ...

Closing

I saw this SLEP as "laying out the foundation to provide data validation for third-party libraries". This means getting third-party libraries to call validate_data, which right now would do something really simple. As we add more checks to validate_data, third-party libraries would only need to pass another flag to validate_data to enable the new feature.

At the moment, we do not have a compelling story on "What are the benefits of defining n_features_in_?" besides "to pass check_estimator in the future". We need to describe where this will be used, and how we will use these features.

@NicolasHug commented Oct 25, 2019

@thomasjpfan I'm afraid your understanding of the SLEP is outdated. We have dropped the check_something parameters in favor of reset=True/False, which replaces all of these.

Also, I don't think the method vs function discussion is relevant anymore.*

The current point of friction is whether the method should be private or not. I'd like to hear Joel's input on that.

*EDIT (OK, I understand now that if we want to make it public it should rather be a function)

@NicolasHug

At the moment, we do not have a compelling story on "What are the benefits on defining n_features_in_" besides "to pass check_estimator in the future". We need to describe where this will be used, and how we will use this features.

I agree on that.

@amueller, could you please write out your thoughts on why n_features_in_ will also help the feature_names SLEP?

@amueller commented Nov 6, 2019

I agree that we should try to move forward with this as it's blocking a lot of things, but we should also not rush it.

I think n_features_in_ and n_features_out_ are useful for model inspection on their own.
They will also help with feature_names because they allow us to create feature names. For example, any of the scalers can easily create feature names if it knows n_features_in_, and anything in components can create feature names if it knows n_features_out_.
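To illustrate that point, a minimal sketch with a hypothetical helper name: a transformer that maps features one-to-one can synthesize names from n_features_in_ alone.

def make_feature_names(transformer, input_names=None):
    # A one-to-one transformer (e.g. a scaler) passes input names straight
    # through; invent generic names when none are known.
    n = transformer.n_features_in_
    if input_names is None:
        input_names = [f"x{i}" for i in range(n)]
    return list(input_names)

After fitting, say, a StandardScaler on three columns, make_feature_names(scaler) would return ["x0", "x1", "x2"].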

There are conceptually two things that I want:
a) adding two required attributes, n_features_in_ and n_features_out_ to all transformers (where applicable).

b) refactor the validation API to allow column name consistency checking and feature name propagation.

I don't think we should make any helpers public at this stage, so b) would be an internal refactoring, which wouldn't require a SLEP.
If we want to create a public function in utils, that would also probably not require a SLEP (though I also don't really see the benefit of that for now).

Honestly I was a bit skeptical of the need for a SLEP here overall, but if anything, we need a SLEP for a), the addition of required attributes.
In that case, we should probably write a SLEP for both of them that just says "we'll require these attributes, they are helpful for model inspection".

Even if we end up deciding we want a public function to validate n_features_in_ I'm not sure why that would require a SLEP. We've certainly changed check_estimator frequently in the past without week-long discussions.

@amueller commented Nov 6, 2019

Btw, I'm also happy to accept the SLEP in its current state, which basically says "we require n_features_in_". I can write another one for n_features_out_ later... I hope that'll be quicker...

@NicolasHug

Thanks for the feedback.

I updated the SLEP to only make it about the n_features_in_ attribute (2f37147). The _validate_data method is only a private helper that does nothing but set and validate the attribute.

I'll leave the n_features_out_ attribute to another SLEP, for simplicity.

@amueller @adrinjalali @jnothman @thomasjpfan

@amueller commented Nov 6, 2019

good to merge and vote from my side

@adrinjalali

Nice, let's vote then \o/

@adrinjalali merged commit 953457d into scikit-learn:master on Nov 7, 2019
@NicolasHug

Thanks for the merge and the reviews!

@jnothman should I wait until after the release to call for a vote, or are you happy if we do it now?

@GaelVaroquaux

I'm happy with the overall proposal:

  • I think that n_features_in_ is not controversial
  • I like the use of a validation method a lot; it will add useful flexibility

A couple points (that could be added to the SLEP, or to the PR implementing it):

  • I would rather not set anything in init, because it is a change of behavior (things can be set in fit)
  • How will the check be dealt with in check_estimator? Should we consider adding a parameter to test for this attribute? Should we consider having a parameter "minimal=False" for check_estimator that checks only the minimal set of requirements?

@jnothman commented Dec 4, 2019

How will the check be dealt with in check_estimator? Should we consider adding a parameter to test for this attribute? Should we consider having a parameter "minimal=False" for check_estimator that checks only the minimal set of requirements?

I'd be keen to not put any additional burden on estimator developers... the question then is whether we require meta-estimators to be flexible to the case where n_features_in_ is or is not present in the base estimator?

@NicolasHug

I would rather not set anything in init, because it is a change of behavior (things can be set in fit)

We don't set anything in init. There's only the case of SparseCoder, where n_features_in_ is a property available right after instantiation, but the init code is unchanged. We could artificially call check_is_fitted in the property, if that's what you mean.
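For reference, a minimal sketch of that SparseCoder pattern (the class is a hypothetical stand-in):

class SparseCoderLike:
    def __init__(self, dictionary):
        # __init__ only stores the parameter; no extra behavior runs here.
        self.dictionary = dictionary  # shape (n_components, n_features)

    @property
    def n_features_in_(self):
        # Derived from a constructor argument, so it is available right
        # after instantiation rather than only after fit.
        return self.dictionary.shape[1]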

How will the check be dealt with in check_estimator? Should we consider adding a parameter to test for this attribute?

The check is run depending on the no_validation tag. You can take a look at https://github.com/scikit-learn/scikit-learn/pull/13603/files#diff-a95fe0e40350c536a5e303e87ac979c4R2679 for details.

Should we consider having a parameter "minimal=False" for check_estimator that checks only the minimal set of requirements?

I'm in favor of having different levels of rigidity for the tests. I think we're already discussing that w.r.t. the error message checks.

@GaelVaroquaux commented Dec 4, 2019 via email

@amueller commented Dec 4, 2019

Definitely +1 on having a non-strict mode for tests, see scikit-learn/scikit-learn#13969

@amueller commented Dec 4, 2019

the question then is whether we require meta-estimators to be flexible to the case where n_features_in_ is or is not present in the base estimator?

Excellent question. I think we should error for any behavior that would require it, but not enforce its presence. Right now the only such behavior I can think of is accessing n_features_in_ on the meta-estimator. It might at some point also be feature_names_in_ or something like that.
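A minimal sketch of that behavior (the meta-estimator is hypothetical): delegating via a property means an error is raised only when the attribute is actually accessed.

from sklearn.base import BaseEstimator, clone

class MetaSketch(BaseEstimator):
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    @property
    def n_features_in_(self):
        # Raises AttributeError lazily, only when someone accesses the
        # attribute and the wrapped estimator does not define it.
        return self.estimator_.n_features_in_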

@jnothman commented Dec 5, 2019 via email

@amueller commented Dec 6, 2019

What do you mean by dynamic property? Doesn't @if_delegate_has_attribute together with delegating work?

@jnothman commented Dec 7, 2019 via email
