[Feature Request] Auto early stopping in Sklearn API #3313

Closed
rohan-gt opened this issue Aug 18, 2020 · 12 comments · May be fixed by #5808

Comments

@rohan-gt

Is it possible to perform early stopping using cross-validation or automatically sampling data from the provided train set without explicitly specifying an eval set?

@guolinke
Collaborator

lgb.cv supports early stopping. Does that meet your request?

@rohan-gt
Author

rohan-gt commented Aug 18, 2020

@guolinke I was actually looking for the same feature within the Sklearn API. Changed the title now

rohan-gt changed the title from "[Feature Request] Auto early stopping" to "[Feature Request] Auto early stopping in Sklearn API" on Aug 19, 2020
@kmedved

kmedved commented Aug 21, 2020

This is how sklearn's HistGradientBoostingClassifier performs early stopping (by sampling the training data). There are significant benefits to this in terms of compatibility with the rest of the sklearn ecosystem, since most sklearn tools don't allow for passing validation data, or early stopping rounds.

Enabling this sort of functionality would allow a significant speedup in hyperparameter searching by taking advantage of sklearn's cross_val_score or RandomizedSearchCV, which are efficiently multiprocessed and can evaluate either multiple sets of parameters or multiple folds at once. For many datasets, this scales better than throwing more cores at LightGBM directly.

Ideally, of course, this would be implemented as an option and would not replace the existing behavior.

@jameslamb
Collaborator

For your consideration, we did have a discussion about this with the scikit-learn maintainers in #2270. Using early stopping with a random subset of the data (not a validation set you create yourself) can lead to misleading results, because of information leaking from the training data to the validation data.

That being said...I personally favor adding automatic early stopping to the scikit-learn interface specifically, even if that means that we use train_test_split() like they do and set some early_stopping_rounds to pass through to LightGBM. The goal of the scikit-learn API is to allow people who are using scikit-learn to plug in LightGBM as a possible model in things like GridSearchCV. Even if we disagree with the decision that scikit-learn made about early stopping for HistGradientBoostingClassifier, now that that decision has been made I think that LightGBM's scikit-learn interface should adapt to it.

But I am not a Python maintainer here, so will defer to @guolinke and others.

@kmedved

kmedved commented Aug 21, 2020

Thanks @jameslamb - that's helpful background, and I see the concerns (especially since you can't pass a cv object into HistGradientBoostingClassifier, so you are at the mercy of train_test_split).

I would find this functionality helpful despite these drawbacks, but it is obviously not essential.

@rohan-gt
Author

rohan-gt commented Nov 9, 2020

@guolinke is it possible to add this functionality like @jameslamb mentioned?

@StrikerRUS
Collaborator

> now that that decision has been made I think that LightGBM's scikit-learn interface should adapt to it.

But please note that things might change:

> Nothing really defined yet, but we're actually trying to go in the reverse direction. ... Basically, we're trying to move any parameter that is data-specific into fit, or at least out of __init__. Though again, nothing definite for now.
#2966 (comment)

I expect some changes in the sklearn public API in the (near) future.

@StrikerRUS
Collaborator

Closed in favor of tracking this in #2302. We decided to keep all feature requests in one place.

Contributions of this feature are welcome! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.

@ClaudioSalvatoreArcidiacono
Contributor

ClaudioSalvatoreArcidiacono commented Mar 26, 2023

I have been working on this feature lately. It would be great if someone could review it :)
Here is the PR Link

@lorenzwalthert

Timely PR, I was looking for exactly this feature 😄. @ClaudioSalvatoreArcidiacono it seems your PR passes all CI but is blocked from review until you sign your commits the way the maintainers expect. It would be great if this PR could cross the finish line before new merge conflicts arrive.

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed.
To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues
including a reference to this.

github-actions bot locked this as resolved and limited conversation to collaborators on Aug 15, 2023
@jameslamb
Collaborator

Sorry, this was locked accidentally. Just unlocked it.

microsoft unlocked this conversation on Aug 18, 2023

7 participants