SLEP010 n_features_in_ attribute #22
Conversation
We should probably also explain what happens to a pipeline if there's at least one estimator which does not provide the API, which can probably also help in formulating what happens when people use a third party estimator which hasn't fixed their API yet.
slep010/proposal.rst (Outdated):

    The main consideration is that the addition of the common test means that
    existing estimators in downstream libraries will not pass our test suite,
    unless they update their calls to ``check_XXX`` into calls to
    ``_validate_XXX``.
We can also have a deprecation period for this test in estimator checks.
This is not really true / the point, right? They don't need to use our private methods, though they can. They "just" need to provide n_features_in_. The mechanism they use for that doesn't matter. Again I would separate interface and implementation more.
I don't understand your intent here? Pipelines just delegate to the first step (other steps are purely ignored).

True. I was thinking of the feature names API. That concern doesn't apply here.

Maybe we should keep SLEPs simple when we can? (i.e. have a separate one for n_features_out_ if needed)

Happy to keep it simple. But if the discussion is only around backward compatibility of

@scikit-learn/core-devs can we merge and vote please?
So there are three options for n_features_in_:

- int -> rectangular data.
- None -> does not validate.
- Not defined -> non-rectangular data.
Is this what a third party library would need to know at a glance?
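As a stand-alone sketch of how a third-party library might interpret those three states at a glance (the helper and class names below are illustrative, not scikit-learn API):

```python
# Hypothetical helper interpreting the three documented states of
# n_features_in_; names here are assumptions for illustration only.
def describe_n_features_in(estimator):
    if not hasattr(estimator, "n_features_in_"):
        return "non-rectangular data"   # attribute not defined (e.g. vectorizers)
    if estimator.n_features_in_ is None:
        return "does not validate"      # estimator performs no input validation
    return "rectangular data"           # an int: number of features seen in fit

# Minimal stand-ins for the three kinds of estimators:
class TabularEstimator:
    n_features_in_ = 10        # int

class NoValidationEstimator:
    n_features_in_ = None      # None

class VectorizerLike:
    pass                       # attribute not defined
```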
slep010/proposal.rst (Outdated):

    In most cases, the attribute exists only once ``fit`` has been called, but
    there are exceptions (see below).

    A new common check is added: it makes sure that for most esitmators, the
Suggested change:

-   A new common check is added: it makes sure that for most esitmators, the
+   A new common check is added: it makes sure that for most estimators, the
slep010/proposal.rst (Outdated):

    The main consideration is that the addition of the common test means that
    existing estimators in downstream libraries will not pass our test suite,
    unless the estimators also have the `n_features_in_` attribute (which can be
    done by updating calls to ``check_XXX`` into calls to ``_validate_XXX``).
I think given that this was the main concern it should be discussed a bit? I would argue that we constantly change check_estimator and add new requirements, and that we have never guaranteed that tests will be forward-compatible.
Calling _validate_XXX will not work if the downstream library supports scikit-learn < 0.22. They will need to backport BaseEstimator from 0.22, or handle two cases.
We have done such things to support different versions of scipy or python, I don't see how that should be a blocker. We can have clear guidelines in our docs and tell people how they can handle it, or even backport these methods to older versions.
I'm happy to have this merged and have a separate but very similar one for
good to merge from my side
slep010/proposal.rst (Outdated):

    .. _slep_010:

    =================================
    SLEP010: n_features_in_ attribute
this is being treated as markup. Put it in ` or ``
slep010/proposal.rst (Outdated):

    done by updating calls to ``check_XXX`` into calls to ``_validate_XXX``).

    Note that we have never guaranteed any kind of backward compatibility
    regarding the test suite: see e.g. `#12328
(We should probably support the :issue: role in this repo)
slep010/proposal.rst (Outdated):

    <https://github.com/scikit-learn/scikit-learn/pull/12328>`_, `14680
    <https://github.com/scikit-learn/scikit-learn/pull/14680>`_, or `9270
    <https://github.com/scikit-learn/scikit-learn/pull/9270>`_ which all add new
    checks.
These are categorically different, since estimators implemented according to our developers' guide would continue to work after these checks were added. The current proposal does not do that without an update to the developers' guide, hence the new requirement effectively pins the version of scikit-learn with which an estimator is compatible.
What about when check_classifiers_classes was introduced (can't find the original PR)? That check fails on HistGradientBoostingClassifier with its default parameters (n_samples_leaves=20 is too high for this dataset with 30 samples). And HistGradientBoostingClassifier is definitely implemented according to our developers' guide.

I'm sure we have tons of instances like that where our tests are so specific that they will break existing well-behaved estimators.
StackingClassifier and StackingRegressor required a change in the common tests (I think sample weight tests), so adding those sample weight tests also would have broken other estimators that are implemented according to the developers' guide.

In some sense the requirement added here is qualitatively different in that it requires a new attribute. But I'm not sure that's a difference in practice. We add a test, and the third-party developer has their tests break and needs to change the code to make them pass.

I'm not sure it makes a difference to the third-party developer whether the breakage was due to an implicit detail of the tests or an explicit change of the API. I would argue the second one might actually be less annoying.

What is unfortunate about the change is that it makes it hard for a third-party developer to be compatible with several versions of scikit-learn. However, I would suggest that they keep using check_array and implement n_features_in_ themselves.

If/when we do feature name consistency, this might be a bit trickier because it might require a bit more code to implement, but I don't think it's that bad.
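The "keep using check_array and implement n_features_in_ yourself" suggestion could be sketched like this (a stand-alone illustration, with numpy standing in for scikit-learn's validation helpers; the class name is hypothetical):

```python
import numpy as np

class ThirdPartyEstimator:
    """Sketch: set n_features_in_ in fit without scikit-learn's private
    _validate_data, so the estimator can stay compatible with older
    scikit-learn versions. np.asarray stands in for check_array here."""

    def fit(self, X, y=None):
        X = np.asarray(X)                    # stand-in for check_array(X)
        self.n_features_in_ = X.shape[1]     # remember the fitted width
        return self

    def predict(self, X):
        X = np.asarray(X)
        if X.shape[1] != self.n_features_in_:
            raise ValueError(
                f"X has {X.shape[1]} features, but this estimator was "
                f"fitted with {self.n_features_in_} features."
            )
        return np.zeros(X.shape[0])          # dummy predictions
```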
What I'm asking is that these aspects be considered and noted, so that the vote on the SLEP takes this into account rather than allowing the reviewers to miss this compatibility issue.
slep010/proposal.rst (Outdated):

    should be set to None, though this is not enforced in the common tests.
    - Some estimators expect a non-rectangular input: the vectorizers. These
      estimators never have a ``n_features_in_`` attribute (they never call
      ``check_array`` anyway).
That's no excuse. They should be applicable for check_array... What should their behaviour be were they tested, or were someone implementing a vectorizer?
I don't understand your comment. Are you simply suggesting to make clear that n_features_in_ doesn't make sense here since their input isn't an n_samples x n_features matrix?
I'm saying that you should remove "(they never call check_array anyway)" because there should be a policy stated here for such estimators, regardless of whether they are currently tested sufficiently.
But there should be a way for them to comply with the new requirements and pass check_estimator.
Given the aforementioned backwards compatibility issues (both in terms of new requirements, and the burden of implementing estimators compatible with multiple versions of scikit-learn), I think this needs more work.
We also have the option of warning, rather than failing, about non-compliance, for a couple of releases.
Edit: Oops, we can have
Thanks Joel for your feedback. We discussed with Andy, and I updated the SLEP to now propose a single method. It seems that the only remaining discussion is #22 (comment)
slep010/proposal.rst (Outdated):

    ########

    The proposed solution is to replace most calls to ``check_array()`` or
    ``check_X_y()`` by calls to a newly created private method::
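The literal block after the ``::`` is elided in this excerpt. As a rough, stand-alone illustration of the idea (not scikit-learn's actual implementation; the exact signature is an assumption), such a private method could look like:

```python
import numpy as np

class BaseEstimatorSketch:
    # Simplified stand-in for the proposed private validation method.
    def _validate_data(self, X, y=None, reset=True):
        X = np.asarray(X)  # real code would call check_array / check_X_y
        if reset:
            # fit-time: record the number of features
            self.n_features_in_ = X.shape[1]
        elif X.shape[1] != self.n_features_in_:
            # predict/transform-time: enforce consistency with fit
            raise ValueError("number of features does not match fit")
        return X if y is None else (X, np.asarray(y))
```

By routing all validation through one method, estimators get n_features_in_ "for free" instead of setting it by hand after each check_array call.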
When we say "private" do we mean that we do not authorise third party libraries to rely on this API?
yes. I added a note.
If we're all ok to make the method private and not recommend libraries use it, let's merge please. I'd like the implementation PR to be merged before we release.
I don't think there is any great need to rush this to release.
If we are keeping these helper methods private then they should be regarded
as an implementation detail here: the essence of the SLEP becomes
"estimators should provide n_features_in_, and this will be required in
check_estimator two versions after introduction. No public helpers will be
provided to assist third party libraries with this."
If you think this can pass the vote and lead to quick merge of the
implementation, then go ahead and merge this and bring it to vote.
Personally I had understood that a major goal of this work was to
facilitate data consistency validation between fit and predict/transform,
which was extensible to the column reordering issues. Perhaps that API
would then require another SLEP, where this SLEP wouldn't provide any
leverage towards fixing the column ordering bug for scikit-learn-compatible
estimators.
    We also have the option of warning, rather than failing, about non-compliance, for a couple of releases.

That's an idea that I like.
It is indeed too late now to get anything useful merged before the release.
Yes, this work will be useful for consistency checks, and also for feature names propagation. I still don't think that means we should make the method public. But if you disagree, please let it be known.
This SLEP has been cut down to say "Please define n_features_in_ to pass check_estimator in the future". For most of this SLEP's timeline, I have understood the goal of this SLEP to be to improve data consistency validation. We want to do this because it allows for data validation checks in our meta-estimators (mostly pipeline and column transformer). To actually perform these checks, the estimator needs to remember information about the data used in fit. If we want to perform these checks on third-party estimators, we need these estimators to also remember information about the data. Here are the options discussed in this PR:
I think there is a path for the public function to work:

    # 0.23
    def validate_data(est, check_n_features_in=True, ...):
        ...
        est.n_features_in_ = ...

    # 0.24
    # When we move to FutureWarning for deprecations, we can use the developer
    # focused DeprecationWarning to signal this. (Also check_estimator will warn.)
    def validate_data(est, check_n_features_in=True,
                      check_feature_columns=False, ...)

    # 0.25
    def validate_data(est, check_n_features_in=True,
                      check_feature_columns=False,
                      check_other_things=False, ...)

    # 0.26 (check_feature_columns=True)
    def validate_data(est, check_n_features_in=True,
                      check_feature_columns=True,
                      check_other_things=False, ...)

Closing

I saw this SLEP as "laying out the foundation to provide data validation for third party libraries". This means getting third party libraries to call

At the moment, we do not have a compelling story on "What are the benefits of defining n_features_in_?" besides "to pass check_estimator in the future". We need to describe where this will be used, and how we will use these features.
@thomasjpfan I'm afraid your understanding of the SLEP is outdated. We have dropped the

Also, I don't think the method vs function discussion is relevant anymore.* The current point of friction is whether the method should be private or not. I'd like to hear Joel's input on that.

*EDIT: OK, I understand now that if we want to make it public it should rather be a function.
I agree on that. @amueller, could you please write out your thoughts on why
I agree that we should try to move forward with this as it's blocking a lot of things, but we should also not rush it. I think there are conceptually two things that I want:

b) refactor the validation API to allow column name consistency checking and feature name propagation.

I don't think we should make any helpers public at this stage, so b) would be an internal refactoring, which wouldn't require a SLEP. Honestly I was a bit skeptical of the need for a SLEP here overall, but if anything then we need a SLEP for a), the addition of required attributes. Even if we end up deciding we want a public function to validate
Btw, I'm also happy to accept the SLEP in the current stage, which is basically saying "we require n_features_in_". I can write another one for
Thanks for the feedback. I updated the SLEP to only make it about the

I'll leave the
good to merge and vote from my side
Nice, let's vote then \o/
Thanks for the merge and the reviews! @jnothman should I wait until after the release to call for a vote, or are you happy if we do it now?
I'm happy with the overall proposal:
A couple points (that could be added to the SLEP, or to the PR implementing it):
I'd be keen to not put any additional burden on estimator developers... the question then is whether we require meta-estimators be flexible to the case that
We don't set anything in init. There's only the case of the SparseCoder where n_features_in_ is a property which is available right after instantiation, but the init code is unchanged.

The check is run depending on the no_validation tag. You can take a look at https://github.com/scikit-learn/scikit-learn/pull/13603/files#diff-a95fe0e40350c536a5e303e87ac979c4R2679 for details.

I'm in favor of having different levels of rigidity for the tests. I think we're already discussing that w.r.t. the error message checks.
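The tag-based opt-out described above could look roughly like this (a stand-alone sketch; the check name and the tag-access mechanism here are assumptions, not the actual common-test code):

```python
# Illustrative sketch of a common check that honours a "no_validation"
# estimator tag; _more_tags and the tag name are assumptions here.
def check_n_features_in(estimator):
    tags = getattr(estimator, "_more_tags", lambda: {})()
    if tags.get("no_validation", False):
        return "skipped"                      # estimator opts out via tag
    assert hasattr(estimator, "n_features_in_")
    return "checked"

class ValidatingEstimator:
    n_features_in_ = 4                        # normal rectangular estimator

class NonValidatingEstimator:
    def _more_tags(self):
        return {"no_validation": True}        # e.g. a vectorizer-like estimator
```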
    We don't set anything in init. There's only the case of the SparseCoder where n_features_in_ is a property which is available right after instantiation, but the init code is unchanged.

All good, thank you!

    The check is run depending on the no_validation tag. You can take a look at https://github.com/scikit-learn/scikit-learn/pull/13603/files#diff-a95fe0e40350c536a5e303e87ac979c4R2679 for details.

This is great!! Thanks!

Thorough work!! Awesome. Thank you very much.
Definitely +1 on having a non-strict mode for tests, see scikit-learn/scikit-learn#13969
Excellent question. I think we should error for any behavior that would require it, but not enforce its presence. Right now the only behavior I can think of is accessing
But that also dictates that n_features_in_ on a meta-estimator is implemented either as a dynamic property, or in fit but resiliently...
What do you mean by dynamic property? Doesn't @if_delegate_has_attribute together with delegating work?
Yes that works
See PR scikit-learn/scikit-learn#13603
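The delegation pattern discussed above can be sketched as follows (a minimal stand-alone illustration; the class names are hypothetical, and a real meta-estimator would clone and fit its sub-estimator):

```python
# Sketch: a meta-estimator exposes n_features_in_ as a dynamic property
# delegating to its fitted sub-estimator. Accessing it before fit raises
# AttributeError, giving the if_delegate-style behaviour discussed above.
class MetaEstimator:
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        self.estimator_ = self.estimator  # sketch: would clone and fit here
        return self

    @property
    def n_features_in_(self):
        # self.estimator_ only exists after fit, so this property is
        # effectively "not defined" on an unfitted meta-estimator.
        return self.estimator_.n_features_in_

class Inner:
    n_features_in_ = 7
```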