[python-package] require `scikit-learn>=0.24.2`, make scikit-learn estimators compatible with `scikit-learn>=1.6.0dev` #6651

vnherdeiro · 2024-09-11T10:42:47Z

(edit: taken over by @jameslamb, description re-written below)

raises minimum supported version to scikit-learn>=0.24.2
implements __sklearn_tags__() (replacement for _more_tags()) for scikit-learn estimators
starts using sklearn.utils.validation.validate_data() in fit() and predict()
adds tests confirming that scikit-learn estimators reject inputs with the wrong number of features

Notes for Reviewers

see https://scikit-learn.org/dev/whats_new/v1.6.html and scikit-learn/scikit-learn#29677

vnherdeiro · 2024-09-11T13:53:31Z

Update:

The change introduced in scikit-learn/scikit-learn#29677 makes it hard to subclass a sklearn estimator in a codebase while staying compatible with both sklearn < 1.6.0 and sklearn >= 1.6.0. Essentially, the former looks up ._more_tags() and ignores __sklearn_tags__(), while the latter looks up __sklearn_tags__() and forbids the existence of a ._more_tags() method.

The issue is discussed here:
scikit-learn/scikit-learn#29801

and it looks like the restriction against defining both ._more_tags() and __sklearn_tags__() simultaneously will be relaxed. If it goes through, let's park this MR until lightgbm decides to require scikit-learn>=1.6.0.

adrinjalali · 2024-09-12T10:33:53Z

@vnherdeiro note that it's possible already to support both with this method (scikit-learn/scikit-learn#29677 (comment)), however, the version check and @available_if are going to be unnecessary once we merge scikit-learn/scikit-learn#29801

vnherdeiro · 2024-09-12T12:03:47Z

Correct, I am waiting for that PR to go in to bring back _more_tags. Using @available_if would require another sklearn import and make the code less readable, I reckon.


jameslamb · 2024-09-15T03:08:09Z

Thanks for starting on this @vnherdeiro . I've documented it in an issue: #6653 (and added that to the PR description).

Note there that I intentionally put the exact error messages in plain text instead of just referring to _more_tags() ... that helps people find this work from search engines.

Note also that the _more_tags() thing is only 1 of 3 breaking changes in scikit-learn that lightgbm will have to adjust to in order to get those tests passing again with scikit-learn==1.6.0.

jameslamb

Thanks for starting on this! Please see scikit-learn/scikit-learn#29801 (comment):

The story becomes "If you want to support multiple scikit-learn versions, define both."

I think we should leave _more_tags() untouched and add __sklearn_tags__(). And have self.__sklearn_tags__() call self._more_tags() to get its data, so we don't define things like _xfail_checks twice.
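A minimal sketch of that "define both" idea follows. It does not import scikit-learn: `InputTags`, `Tags`, `BaseStub`, and `SketchEstimator` are hypothetical stand-ins invented for illustration, and the tag names shown are only a simplified subset, so treat it as an outline of the pattern rather than the code in this PR.

```python
from dataclasses import dataclass, field


@dataclass
class InputTags:
    """Hypothetical, simplified stand-in for a scikit-learn tags dataclass."""
    allow_nan: bool = False
    sparse: bool = False


@dataclass
class Tags:
    """Hypothetical stand-in for the top-level tags object."""
    input_tags: InputTags = field(default_factory=InputTags)


class BaseStub:
    """Stand-in for a base class that already provides __sklearn_tags__()."""

    def __sklearn_tags__(self):
        return Tags()


class SketchEstimator(BaseStub):
    def _more_tags(self):
        # Dictionary-style tags, read by older scikit-learn versions.
        return {"allow_nan": True, "X_types": ["2darray", "sparse"]}

    def __sklearn_tags__(self):
        # Dataclass-style tags, read by newer scikit-learn versions.
        # Values are taken from _more_tags() so they are defined only once.
        tags = super().__sklearn_tags__()
        more_tags = self._more_tags()
        tags.input_tags.allow_nan = more_tags["allow_nan"]
        tags.input_tags.sparse = "sparse" in more_tags["X_types"]
        return tags


tags = SketchEstimator().__sklearn_tags__()
```

Keeping `_more_tags()` as the single source of truth means a tag only has to be changed in one place while both scikit-learn version ranges are supported.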

Do you have time to do that in the next few days? We need to fix this to unblock CI here, so if you don't have time to fix it this week please let me know and I will work on this.


vnherdeiro · 2024-09-15T12:12:23Z

@jameslamb I have just pushed a __sklearn_tags__() that attempts a conversion from _more_tags(). I added a warning for arguments outside the currently handled scope, to catch changes to the arguments in _more_tags() (they don't seem to change much, though).

adrinjalali

Not a maintainer here, but coming from sklearn side. Leaving thoughts hoping it'd help.

python-package/lightgbm/sklearn.py

jameslamb

Thanks for this.

I've reviewed the dataclasses at https://github.com/scikit-learn/scikit-learn/blob/e2ee93156bd3692722a39130c011eea313628690/sklearn/utils/_tags.py and agree with the choices you've made about how to map the dictionary-formatted values from _more_tags() to the dataclass attributes scikit-learn now prefers.

Please see the other comments about simplifying this.

python-package/lightgbm/sklearn.py

jameslamb · 2024-10-05T06:25:55Z

Ok, this is ready for another review!

But I understand if reviewers would like to wait until CI is fixed before reviewing (#6663).

trivialfis

The change looks great! Thank you for the heads-up

StrikerRUS

Very impressive work!
I left some minor comments below:

python-package/lightgbm/compat.py

StrikerRUS · 2024-10-05T16:57:54Z

python-package/lightgbm/sklearn.py

@@ -144,6 +147,32 @@ def _get_weight_from_constructed_dataset(dataset: Dataset) -> Optional[np.ndarra
    return weight


+def _num_features_for_raw_input(X: _LGBM_ScikitMatrixLike) -> int:


_num_features() was added in version 0.24:
scikit-learn/scikit-learn@b4d5ad6
I think we can move this into compat.py and try to import _num_features() first, then emulate it with this function in case of an ImportError.

This approach will benefit from auto upstream updates of _num_features() in future versions.

Very good suggestion, thanks! I attempted this and found that it exposed some other complexity, which I've tried to describe in code comments and these inline GitHub comments:

https://github.com/microsoft/LightGBM/pull/6651/files#r1788923387

https://github.com/microsoft/LightGBM/pull/6651/files#r1788920665

To simplify the implementation a bit, I'm now also proposing:

calling validate_data(reset=True), which will internally call _num_features() on scikit-learn>=1.6

directly and unconditionally importing _num_features() and calling it in the pre-1.6 validate_data() implemented in compat.py (so no separate implementation to maintain!)

raising lightgbm's scikit-learn floor to >=0.24.2 so users will always have a version at runtime with _num_features() defined

The new floor on scikit-learn>=0.24.2 should not impact users much. That version was released in April 2021 and did not have wheels for Python versions newer than 3.9 (PyPI release page), so I think it's unlikely many people will try to use the next release of lightgbm with such an old version of scikit-learn.

But this is the first time we've had a floor on that dependency, so for awareness: cc @borchero @jmoralez @guolinke @shiyu1994

Sounds reasonable for me!

StrikerRUS · 2024-10-05T17:03:42Z

python-package/lightgbm/sklearn.py

+        # _LGBMModelBase.__sklearn_tags__() cannot be called unconditionally,
+        # because that method isn't defined for scikit-learn<1.6
+        if not hasattr(_LGBMModelBase, "__sklearn_tags__"):
+            from sklearn import __version__ as sklearn_version


I think we can safely import this in compat.py.

__version__ has been in __init__.py since at least 2011:
https://github.com/scikit-learn/scikit-learn/blob/dacdd3ad7b455a46b5e344ecfeaf5a369b554860/sklearn/__init__.py#L50

I was originally thinking that it'd be good for users to not incur the cost of this import when it's only needed in an error message... but I guess since it's a top-level attribute of sklearn, it will already have been imported anyway by the time any other sklearn imports have run.

I've moved this to compat.py, thanks for the suggestion.

python-package/lightgbm/sklearn.py

tests/python_package_test/test_sklearn.py


python-package/lightgbm/compat.py

jameslamb · 2024-10-06T06:38:16Z

python-package/lightgbm/sklearn.py

@@ -1067,6 +1137,21 @@ def n_features_in_(self) -> int:
            raise LGBMNotFittedError("No n_features_in found. Need to call fit beforehand.")
        return self._n_features_in

+    @n_features_in_.setter


If you pass reset=True to sklearn.utils.validation.validate_data(), it will try to:

set estimator.n_features_in_ (code link)

delete estimator.feature_names_in_ (code link)

We want the "set estimator.n_features_in_" behavior, because without it we have to manually set estimator.n_features_in_ in fit().

Doing that requires determining the number of features in X, which requires either re-implementing something like sklearn.utils.validation._num_features() (as I originally tried to do) or just calling that function directly. But that function can't safely be called directly before calling check_array(), because it raises a TypeError on 1-D inputs, which violates the check_fit1d estimator check (code link).

So here, I'm proposing that we do the following:

add a setter for n_features_in_ and a deleter for feature_names_in_

pass reset=True at fit() time to validate_data()

modify the pre-1.6 implementation of validate_data() in compat.py to match
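To illustrate why the setter and deleter matter, here is a self-contained sketch. `SketchEstimator` and `reset_feature_info()` are hypothetical stand-ins: the latter only imitates the two side effects of `reset=True` described above, and the no-op deleter and the `Column_<i>` default names are assumptions made for this example, not a copy of the PR's code.

```python
class SketchEstimator:
    """Hypothetical estimator exposing property-based feature metadata."""

    def __init__(self):
        self._n_features_in = -1

    @property
    def n_features_in_(self):
        if self._n_features_in < 0:
            raise AttributeError("No n_features_in found. Need to call fit beforehand.")
        return self._n_features_in

    @n_features_in_.setter
    def n_features_in_(self, value):
        # Without a setter, assigning to this property raises AttributeError.
        self._n_features_in = value

    @property
    def feature_names_in_(self):
        # Assumed default names for inputs that carry no column names.
        return [f"Column_{i}" for i in range(self.n_features_in_)]

    @feature_names_in_.deleter
    def feature_names_in_(self):
        # Assumption: deleting is a no-op, so the generated names stay available.
        pass


def reset_feature_info(estimator, X):
    """Stand-in for the two side effects of reset=True described above."""
    estimator.n_features_in_ = len(X[0])
    if hasattr(estimator, "feature_names_in_"):
        del estimator.feature_names_in_


est = SketchEstimator()
reset_feature_info(est, [[1.0, 2.0], [3.0, 4.0]])
```

If the property had no setter or no deleter, both statements inside `reset_feature_info()` would raise `AttributeError`, which is the situation the proposal is meant to avoid.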

python-package/lightgbm/sklearn.py

jameslamb · 2024-10-06T06:59:55Z

@StrikerRUS your comments were definitely not "minor", they really helped a lot! I've re-thought a lot of this PR based on trying to implement those suggestions.

This is ready for another review. Thank you for all your reviewing effort here, I know this change has become quite complex and there are many competing constraints it's trying to satisfy.


StrikerRUS

Hope this time my review comments will be really minor 😄

StrikerRUS · 2024-10-08T13:13:42Z

python-package/lightgbm/compat.py

+
+            # NOTE: check_X_y() calls check_array() internally, so only need to call one or the other of them here
+            if no_val_y:
+                X = check_array(X, accept_sparse=accept_sparse, force_all_finite=ensure_all_finite)


Adds ensure_min_samples=ensure_min_samples,

Suggested change

-                X = check_array(X, accept_sparse=accept_sparse, force_all_finite=ensure_all_finite)
+                X = check_array(
+                    X,
+                    accept_sparse=accept_sparse,
+                    force_all_finite=ensure_all_finite,
+                    ensure_min_samples=ensure_min_samples,
+                )

I intentionally omitted ensure_min_samples. It's already not being passed in the one place it's used on master:

LightGBM/python-package/lightgbm/sklearn.py

Line 1007 in 0643230

X = _LGBMCheckArray(X, accept_sparse=True, force_all_finite=False)

This call to check_array() only happens in predict(), so I also think we should avoid any more validation than absolutely necessary to comply with the scikit-learn API, since applications calling predict() might care more about latency than those calling fit().

OK, fine with me. However, you should be aware that the default value of this argument is 1, not None.

I did not realize that! I'm glad you mentioned it, just checked and it looks like omitting this argument from the call still results in that validation being performed.

    if ensure_min_samples > 0:
        n_samples = _num_samples(array)
        if n_samples < ensure_min_samples:
            raise ValueError(
                "Found array with %d sample(s) (shape=%s) while a"
                " minimum of %d is required%s."
                % (n_samples, array.shape, ensure_min_samples, context)
            )

https://github.com/scikit-learn/scikit-learn/blob/be52df50f1e9e9a6546248ccd7160a0a289f482c/sklearn/utils/validation.py#L1125-L1132

If that's the case, then my point about avoiding the overhead at predict() time doesn't matter... we're getting that overhead anyway. I guess the scikit-learn interface is probably not what you'd choose for low-latency predictions anyway... I will change this to pass through ensure_min_samples and set that to 1 in the call in sklearn.py, to make it explicit.

I just made this change in 8ef1deb. Now ensure_min_samples=1 will be passed at predict() time.

Thanks for the suggestion and talking through it with me.
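The point about the default can be seen in a small stand-alone sketch. `check_min_samples()` is a hypothetical helper that only mirrors the snippet quoted above (it is not scikit-learn's `check_array()`): because the default is 1, omitting the argument still rejects empty input.

```python
def check_min_samples(array, ensure_min_samples=1):
    """Hypothetical stand-in mirroring the minimum-samples check quoted above."""
    if ensure_min_samples > 0:
        n_samples = len(array)
        if n_samples < ensure_min_samples:
            raise ValueError(
                "Found array with %d sample(s) while a minimum of %d is required."
                % (n_samples, ensure_min_samples)
            )
    return array


# Omitting ensure_min_samples does not skip the check: the default of 1 applies.
try:
    check_min_samples([])
    rejected_empty_input = False
except ValueError:
    rejected_empty_input = True

# Only an explicit 0 disables the check in this sketch.
accepted_with_zero = check_min_samples([], ensure_min_samples=0) == []
```

Passing the value explicitly, as the final commit does with `ensure_min_samples=1`, makes that behavior visible at the call site.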

python-package/lightgbm/sklearn.py

tests/python_package_test/test_sklearn.py

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

StrikerRUS

LGTM!

That was really challenging!

@vnherdeiro Thank you for starting the work!

@adrinjalali Thanks for your help here!

@jameslamb Thanks a ton for the huge work done here!

jameslamb · 2024-10-09T23:43:58Z

Thanks everyone for the help, and especially @StrikerRUS for a thorough review of a very complex change!

vnherdeiro · 2024-10-10T07:05:59Z

Thanks for all the work @jameslamb Feeling glad this went in!

__sklearn_tags__ replacing sklearn's BaseEstimator._more_tags_

1adb77b

vnherdeiro requested review from guolinke, jameslamb, shiyu1994, jmoralez, borchero and StrikerRUS as code owners September 11, 2024 10:42

vnherdeiro added 5 commits September 11, 2024 12:01

fixing tags dict -> dataclass

8ed87d2

fixing wrong import

32ec431

remove type hint

ade9798

remove type hint

2085a12

fix linting

a9ec348

triggering new CI (scikit-learn dev has changed)

fcc4e12

jameslamb mentioned this pull request Sep 15, 2024

[ci] [python-package] scikit-learn compatibility tests fail with scikit-learn 1.6.dev0 #6653

Closed

jameslamb requested changes Sep 15, 2024


jameslamb changed the title ~~__sklearn_tags__ replacing sklearn's BaseEstimator._more_tags_~~ [python-package] make scikit-learn tags compatible with scikit-learn>=1.16 Sep 15, 2024

jameslamb added in progress fix labels Sep 15, 2024

jameslamb mentioned this pull request Sep 15, 2024

[ci] [python-package] temporarily stop testing against scikit-learn nightlies, load lib_lightgbm earlier #6654

Merged

jameslamb changed the title ~~[python-package] make scikit-learn tags compatible with scikit-learn>=1.16~~ [python-package] make scikit-learn estimator tags compatible with scikit-learn>=1.16 Sep 15, 2024

bringing back _more_tags, adding convertsion from more_tags to sklearn_tags

3b15646

vnherdeiro changed the title ~~[python-package] make scikit-learn estimator tags compatible with scikit-learn>=1.16~~ [python-package] make scikit-learn estimator tags compatible with scikit-learn>=1.6.0dev Sep 15, 2024

lint fix

34d9eb4

adrinjalali reviewed Sep 15, 2024


jameslamb previously requested changes Sep 16, 2024


trivialfis approved these changes Oct 5, 2024


StrikerRUS reviewed Oct 5, 2024


jameslamb and others added 6 commits October 5, 2024 22:19

Merge branch 'master' of github.com:microsoft/LightGBM into fix_sklearn_more_tags_deprecation

722474d

Apply suggestions from code review

f2cb2fe

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

move __version__ import to compat.py, test with all ML tasks

86b5ab3

just set the setters and deleters

125f4ea

set floor of scikit-learn>=0.24.2, fix ordering of n_features_in_ setting for older versions

4233d70

fix conflicts

330df3f

jameslamb reviewed Oct 6, 2024


jameslamb reviewed Oct 6, 2024


jameslamb changed the title ~~[python-package] make scikit-learn estimator tags compatible with scikit-learn>=1.6.0dev~~ [python-package] require scikit-learn>=0.24.2, make scikit-learn estimators compatible with scikit-learn>=1.6.0dev Oct 6, 2024

jameslamb reviewed Oct 6, 2024


Update python-package/lightgbm/sklearn.py

e8e4cdb

jameslamb requested a review from StrikerRUS October 6, 2024 06:49

jameslamb added 4 commits October 6, 2024 16:23

Merge branch 'master' into fix_sklearn_more_tags_deprecation

0b0ea24

forgot to commit ... fix comment

f22e494

Merge branch 'master' of github.com:microsoft/LightGBM into fix_sklearn_more_tags_deprecation

b124797

Merge branch 'fix_sklearn_more_tags_deprecation' of github.com:vnherdeiro/LightGBM into fix_sklearn_more_tags_deprecation

beab71c

StrikerRUS reviewed Oct 8, 2024


jameslamb and others added 3 commits October 8, 2024 14:30

Apply suggestions from code review

c6e6fad

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

Merge branch 'master' into fix_sklearn_more_tags_deprecation

e3eabac

predict_proba() shape is (num_data, num_classes) for multi-class

d8762e5

StrikerRUS approved these changes Oct 9, 2024


pass ensure_min_samples=1 at predict() time too

8ef1deb

jameslamb removed the awaiting review label Oct 9, 2024

jameslamb merged commit 7eae66a into microsoft:master Oct 9, 2024
48 checks passed
