
SLEP010 n_features_in_ attribute #22

Merged: 12 commits, Nov 7, 2019

Conversation

@NicolasHug changed the title from [MRG] SLEP 10 n_features_in_ attribute to SLEP010 n_features_in_ attribute on Sep 23, 2019
@adrinjalali left a comment:

We should probably also explain what happens to a pipeline when at least one estimator doesn't provide the API; that would also help formulate what happens when people use a third-party estimator that hasn't updated its API yet.

The main consideration is that the addition of the common test means that
existing estimators in downstream libraries will not pass our test suite,
unless they update their calls to ``check_XXX`` into calls to
``_validate_XXX``.

We can also have a deprecation period for this test in estimator checks.


This is not really true / the point, right? They don't need to use our private methods, though they can. They "just" need to provide n_features_in_. The mechanism they use for that doesn't matter. Again, I would separate interface and implementation more.

@NicolasHug

We should probably also explain what happens to a pipeline if there's at least one estimator which does not provide the API,

I don't understand your intent here? Pipelines just delegate to the first step (other steps are purely ignored).

@adrinjalali

I don't understand your intent here? Pipelines just delegate to the first step (other steps are purely ignored).

True. I was thinking of feature names API. That concern doesn't apply here.

@amueller

Yeah this API will be quite simple tbh.
I wonder if we should discuss n_features_out_ in this SLEP as well. Unfortunately I missed the meeting this morning. Who argued for the SLEP? @rth @ogrisel @jnothman thoughts on whether n_features_out_ should be included here?

@NicolasHug

Maybe we should keep SLEPs simple when we can? (i.e. have a separate one for n_features_out_ if needed)

@amueller

Happy to keep it simple. But if the discussion is only around backward compatibility of check_estimator then the SLEP will be a copy & paste.

@NicolasHug

@scikit-learn/core-devs can we merge and vote please?

@thomasjpfan left a comment:

So there are three options for n_features_in_:

  1. int -> rectangular data.
  2. None -> does not validate.
  3. Not defined -> non-rectangular data.

Is this what a third party library would need to know at a glance?
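As an aside, here is a minimal sketch (the helper name is hypothetical) of how a third-party library might branch on those three states:

def describe_n_features_in(estimator):
    # Not defined -> the estimator consumes non-rectangular data.
    if not hasattr(estimator, "n_features_in_"):
        return "non-rectangular data (e.g. a vectorizer)"
    # None -> the estimator does not validate its input.
    if estimator.n_features_in_ is None:
        return "input is not validated"
    # An int -> rectangular data with that many features.
    return f"rectangular data with {estimator.n_features_in_} features"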

In most cases, the attribute exists only once ``fit`` has been called, but
there are exceptions (see below).

A new common check is added: it makes sure that for most esitmators, the
Suggested change:
- A new common check is added: it makes sure that for most esitmators, the
+ A new common check is added: it makes sure that for most estimators, the

The main consideration is that the addition of the common test means that
existing estimators in downstream libraries will not pass our test suite,
unless the estimators also have the `n_features_in_` attribute (which can be
done by updating calls to ``check_XXX`` into calls to ``_validate_XXX``).

I think, given that this was the main concern, it should be discussed a bit.
I would argue that we constantly change check_estimator and add new requirements, and that we have never guaranteed that tests will be forward-compatible.


Calling _validate_XXX will not work if the downstream library supports Scikit-learn < 0.22. They will need to backport BaseEstimator from 0.22, or handle two cases.


We have done such things to support different versions of scipy or python; I don't see how that should be a blocker. We can have clear guidelines in our docs and tell people how they can handle it, or even backport these methods to older versions.

@adrinjalali

I'm happy to have this merged and have a separate but very similar one for n_features_out_.

@amueller left a comment:

good to merge from my side

.. _slep_010:

=================================
SLEP010: n_features_in_ attribute

this is being treated as markup. Put it in ` or ``


done by updating calls to ``check_XXX`` into calls to ``_validate_XXX``).

Note that we have never guaranteed any kind of backward compatibility
regarding the test suite: see e.g. `#12328
<https://github.com/scikit-learn/scikit-learn/pull/12328>`_, `14680
<https://github.com/scikit-learn/scikit-learn/pull/14680>`_, or `9270
<https://github.com/scikit-learn/scikit-learn/pull/9270>`_ which all add new
checks.

(We should probably support the :issue: role in this repo)

These are categorically different, since estimators implemented according to our developers' guide would continue to work after those checks were added. The current proposal does not do that without an update to the developers' guide, hence the new requirement effectively pins the scikit-learn versions with which an estimator can be compatible.

@NicolasHug replied:

What about when check_classifiers_classes was introduced (can't find the original PR)?

That check fails on HistGradientBoostingClassifier with its default parameters (min_samples_leaf=20 is too high for this dataset with 30 samples).

And HistGradientBoostingClassifier is definitely implemented according to our developers' guide.

I'm sure we have tons of instances like that where our tests are so specific that they will break existing well-behaved estimators.


StackingClassifier and StackingRegressor required a change in the common tests (I think sample weight tests), so adding those sample weight tests also would have broken other estimators that are implemented according to the developers guide.

In some sense the requirement added here is qualitatively different in that it requires a new attribute. But I'm not sure that's a difference in practice: we add a test, the third-party developer's tests break, and they need to change their code to make the tests pass.

I'm not sure if it makes a difference to the third-party developer whether the breakage was due to an implicit detail of the tests or an explicit change of the API. I would argue the second one might actually be less annoying.

What is unfortunate about the change is that it makes it hard for a third-party developer to be compatible with several versions of scikit-learn. However, I would suggest that they keep using check_array and implement n_features_in_ themselves.

If/when we do feature name consistency, this might be a bit trickier because it might require a bit more code to implement, but I don't think it's that bad.
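To illustrate that suggestion, a minimal sketch of an estimator that stays compatible with several scikit-learn versions (the class name is hypothetical):

from sklearn.base import BaseEstimator
from sklearn.utils import check_array

class ThirdPartyEstimator(BaseEstimator):
    def fit(self, X, y=None):
        # Keep using the long-standing public helper rather than the new
        # private method, which only exists in newer scikit-learn versions...
        X = check_array(X)
        # ...and set the new attribute by hand: only the attribute matters,
        # not the mechanism, so this passes the new common check on any
        # scikit-learn version.
        self.n_features_in_ = X.shape[1]
        return self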


What I'm asking is that these aspects be considered and noted, so that the vote on the SLEP takes this into account rather than allowing the reviewers to miss this compatibility issue.

should be set to None, though this is not enforced in the common tests.
- Some estimators expect a non-rectangular input: the vectorizers. These
estimators never have a ``n_features_in_`` attribute (they never call
``check_array`` anyway).

That's no excuse; they should still be accounted for by check_array... What should their behaviour be were they tested, or were someone implementing a vectorizer?

@NicolasHug replied:

I don't understand your comment.

Are you simply suggesting to make it clear that n_features_in_ doesn't make sense here, since their input isn't an n_samples x n_features matrix?


I'm saying that you should remove "(they never call check_array anyway)" because there should be a policy stated here for such estimators, regardless of whether they are currently tested sufficiently.


But there should be a way for them to comply with the new requirements and pass check_estimator.

@jnothman left a comment:

Given the aforementioned backwards compatibility issues (both in terms of new requirements, and the burden of implementing estimators compatible with multiple versions of scikit-learn), I think this needs more work.

@jnothman left a comment:

We also have the option of warning, rather than failing, about non-compliance, for a couple of releases.

@thomasjpfan commented Sep 28, 2019

If we go down the path of warning, it should be shown to library developers and not users.

Edit: Oops, we can have check_estimator warn...

@NicolasHug

Thanks Joel for your feedback.

Andy and I discussed this, and I updated the SLEP: it now proposes a single method _validate_data(X, y=None, reset=True, ...)

It seems that the only remaining discussion is #22 (comment)
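For context, a minimal sketch of how the proposed method would be used, assuming a scikit-learn version where BaseEstimator provides _validate_data (the estimator itself is a toy example):

import numpy as np
from sklearn.base import BaseEstimator

class MeanPredictor(BaseEstimator):
    def fit(self, X, y):
        # reset=True (the default): records X.shape[1] as n_features_in_.
        X, y = self._validate_data(X, y)
        self.mean_ = np.mean(y)
        return self

    def predict(self, X):
        # reset=False: checks X.shape[1] against the stored n_features_in_
        # and raises a ValueError on mismatch.
        X = self._validate_data(X, reset=False)
        return np.full(X.shape[0], self.mean_)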


The proposed solution is to replace most calls to ``check_array()`` or
``check_X_y()`` by calls to a newly created private method::

When we say "private" do we mean that we do not authorise third party libraries to rely on this API?

@NicolasHug replied:

Yes, I added a note.

@NicolasHug

If we're all ok to make the method private and not recommend libraries to use it, let's merge please. I'd like the implementation PR to be merged before we release.

@jnothman commented Oct 24, 2019 via email

@GaelVaroquaux commented Oct 25, 2019 via email

@NicolasHug

I don't think there is any great need to rush this to release.

It is indeed too late now to get anything useful merged before the release.
But I'd still like to get this done soon. All the effort put into scikit-learn/scikit-learn#13603 is pretty much lost now because we've taken so long to come to a consensus.

Personally I had understood that a major goal of this work was to
facilitate data consistency validation

Yes, this work will be useful for consistency checks, and also for feature names propagation. I still don't think that means we should make the method public. But if you disagree, please let it be known.

@thomasjpfan

This SLEP has been cut down to say "Please define n_features_in_ to pass check_estimator in the future".

For most of this SLEP's timeline, I have understood its goal to be improving data consistency validation. We want this because it allows for data validation checks in our meta-estimators (mostly pipeline and column transformer). To actually perform these checks, the estimator needs to remember information about the data used in fit. If we want to perform these checks on third-party estimators, we need those estimators to also remember information about the data. Here are the options discussed in this PR:

  1. This SLEP in its current state, which asks estimators to define n_features_in_. (The private methods are not recommended for use.)
  2. Create public methods that will help do some of the work. (I do not think this would be good, since these public methods are only for developers, and it would not be good to expose them to users.)
  3. Create public functions that will help do some of the work. The public functions will need to be passed self and will add attributes to the estimator.

I think there is a path for the public function to work:

# 0.23
def validate_data(est, check_n_features_in=True, ...):
    ...
    est.n_features_in_ = ...

# 0.24
# When we move to FutureWarning for deprecations, we can use the
# developer-focused DeprecationWarning to signal this.
# (Also, check_estimator will warn.)
def validate_data(est, check_n_features_in=True,
                  check_feature_columns=False, ...):
    ...

# 0.25
def validate_data(est, check_n_features_in=True,
                  check_feature_columns=False,
                  check_other_things=False, ...):
    ...

# 0.26 (check_feature_columns defaults to True)
def validate_data(est, check_n_features_in=True,
                  check_feature_columns=True,
                  check_other_things=False, ...):
    ...

Closing

I saw this SLEP as "laying out the foundation to provide data validation for third-party libraries". This means getting third-party libraries to call validate_data, which right now would do something really simple. As we add more checks to validate_data, third-party libraries would only need to pass another flag to validate_data to enable the new feature.

At the moment, we do not have a compelling story on "What are the benefits of defining n_features_in_?" besides "to pass check_estimator in the future". We need to describe where this will be used, and how we will use these features.

@NicolasHug commented Oct 25, 2019

@thomasjpfan I'm afraid your understanding of the SLEP is outdated. We have dropped the check_something parameters in favor of reset=True/False, which replaces all of these.

Also, I don't think the method vs function discussion is relevant anymore.*

The current point of friction is whether the method should be private or not. I'd like to hear Joel's input on that.

*EDIT (OK, I understand now that if we want to make it public it should rather be a function)

@NicolasHug

At the moment, we do not have a compelling story on "What are the benefits on defining n_features_in_" besides "to pass check_estimator in the future". We need to describe where this will be used, and how we will use this features.

I agree on that.

@amueller, could you please write out your thoughts on why n_features_in_ will also help the feature_names SLEP?

@amueller commented Nov 6, 2019

I agree that we should try to move forward with this as it's blocking a lot of things, but we should also not rush it.

I think n_features_in_ and n_features_out_ are useful for model inspection on their own.
They will also help with feature_names because they allow us to create feature names. For example, any of the scalers can easily create feature names if it knows n_features_in_, and anything in components can create feature names if it knows n_features_out_.
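To illustrate that point, a minimal sketch with a hypothetical helper name: a transformer that maps features one-to-one can synthesize names from n_features_in_ alone.

def make_feature_names(transformer, input_names=None):
    # A one-to-one transformer (e.g. a scaler) passes input names straight
    # through; invent generic names when none are known.
    n = transformer.n_features_in_
    if input_names is None:
        input_names = [f"x{i}" for i in range(n)]
    return list(input_names)

After fitting, say, a StandardScaler on three columns, make_feature_names(scaler) would return ["x0", "x1", "x2"].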

There are conceptually two things that I want:
a) adding two required attributes, n_features_in_ and n_features_out_ to all transformers (where applicable).

b) refactor the validation API to allow column name consistency checking and feature name propagation.

I don't think we should make any helpers public at this stage, so b) would be an internal refactoring, which wouldn't require a SLEP.
If we want to create a public function in utils, that would also probably not require a SLEP (though I also don't really see the benefit of that for now).

Honestly I was a bit skeptical of the need for a SLEP here overall, but if anything, we need a SLEP for a), the addition of required attributes.
In that case, we should probably write a SLEP for both of them that just says "we'll require these attributes, they are helpful for model inspection".

Even if we end up deciding we want a public function to validate n_features_in_ I'm not sure why that would require a SLEP. We've certainly changed check_estimator frequently in the past without week-long discussions.

@amueller commented Nov 6, 2019

Btw, I'm also happy to accept the SLEP in its current state, which basically says "we require n_features_in_". I can write another one for n_features_out_ later... I hope that'll be quicker...

@NicolasHug

Thanks for the feedback.

I updated the SLEP to only make it about the n_features_in_ attribute (2f37147). The _validate_data method is only a private helper that does nothing but set and validate the attribute.

I'll leave the n_features_out_ attribute to another SLEP, for simplicity.

@amueller @adrinjalali @jnothman @thomasjpfan

@amueller commented Nov 6, 2019

good to merge and vote from my side

@adrinjalali

Nice, let's vote then \o/

@adrinjalali merged commit 953457d into scikit-learn:master on Nov 7, 2019
@NicolasHug

Thanks for the merge and the reviews!

@jnothman should I wait until after the release to call for a vote, or are you happy if we do it now?

@GaelVaroquaux

I'm happy with the overall proposal:

  • I think that n_features_in_ is not controversial
  • I like the use of a validation method a lot; it will add useful flexibility

A couple points (that could be added to the SLEP, or to the PR implementing it):

  • I would rather not set anything in init, because it is a change of behavior (things can be set in fit)
  • How will the check be dealt with in check_estimator? Should we consider adding a parameter to test for this attribute? Should we consider having a parameter "minimal=False" for check_estimator that checks only the minimal set of requirements?

@jnothman commented Dec 4, 2019

How will the check be dealt with in check_estimator? Should we consider adding a parameter to test for this attribute? Should we consider having a parameter "minimal=False" for check_estimator that checks only the minimal set of requirements?

I'd be keen to not put any additional burden on estimator developers... the question then is whether we require meta-estimators to be flexible to the case where n_features_in_ is or is not present in the base estimator?

@NicolasHug

I would rather not set anything in init, because it is a change of behavior (things can be set in fit)

We don't set anything in init. There's only the case of SparseCoder, where n_features_in_ is a property available right after instantiation, but the init code is unchanged. We could artificially call check_is_fitted in the property, if that's what you mean.
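For reference, a minimal sketch of that SparseCoder pattern (the class is a hypothetical stand-in):

class SparseCoderLike:
    def __init__(self, dictionary):
        # __init__ only stores the parameter; no extra behavior runs here.
        self.dictionary = dictionary  # shape (n_components, n_features)

    @property
    def n_features_in_(self):
        # Derived from a constructor argument, so it is available right
        # after instantiation rather than only after fit.
        return self.dictionary.shape[1]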

How will the check be dealt with in check_estimator? Should we consider adding a parameter to test for this attribute?

The check is run depending on the no_validation tag. You can take a look at https://github.com/scikit-learn/scikit-learn/pull/13603/files#diff-a95fe0e40350c536a5e303e87ac979c4R2679 for details.

Should we consider having a parameter "minimal=False" for check_estimator that checks only the minimal set of requirements?

I'm in favor of having different levels of rigidity for the tests. I think we're already discussing that w.r.t. the error message checks.

@GaelVaroquaux commented Dec 4, 2019 via email

@amueller commented Dec 4, 2019

Definitely +1 on having a non-strict mode for tests, see scikit-learn/scikit-learn#13969

@amueller commented Dec 4, 2019

the question then is whether we require meta-estimators to be flexible to the case where n_features_in_ is or is not present in the base estimator?

Excellent question. I think we should error for any behavior that would require it, but not enforce its presence. Right now the only such behavior I can think of is accessing n_features_in_ on the meta-estimator. It might at some point also be feature_names_in_ or something like that.
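A minimal sketch of that behavior (the meta-estimator is hypothetical): delegating via a property means an error is raised only when the attribute is actually accessed.

from sklearn.base import BaseEstimator, clone

class MetaSketch(BaseEstimator):
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    @property
    def n_features_in_(self):
        # Raises AttributeError lazily, only when someone accesses the
        # attribute and the wrapped estimator does not define it.
        return self.estimator_.n_features_in_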

@jnothman commented Dec 5, 2019 via email

@amueller commented Dec 6, 2019

What do you mean by dynamic property? Doesn't @if_delegate_has_attribute together with delegating work?

@jnothman commented Dec 7, 2019 via email
