Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[dask] add support for eval sets and custom eval functions #4101

Merged
merged 110 commits into from
Jun 28, 2021

Conversation

ffineis
Copy link
Contributor

@ffineis ffineis commented Mar 23, 2021

Followup PR regarding #3952 - implements eval_set functionality for lightgbm.dask but without early stopping.

This is implemented this to work with all eval-* parameters:

  • multiple eval sets (i.e. multiple (X, y) pairs in eval_set)
  • eval_names
  • eval_sample_weight
  • eval_class_weight (for DaskLGBMClassifier only)
  • eval_init_score
  • eval_group (for DaskLGBMRanker only)

When an individual eval_set, eval_sample_weight, eval_init_score, or eval_group is the same as (data, label)/sample_weight/init_score/group, just use the latter instead of having to compute the training set/weights/init_score/group multiple times.

This is all that's going on, making little mini eval sets out of delayed parts in a consistent manner:
distributed_eval_sets

Other things to know:

  • Raises warning when a worker does not receive any eval_set parts. This check is now performed prior to client.submit found in _train function. Model training still completes in this scenario, but depending on which worker returns its futures_classifier, best_score_ and evals_result_ attributes can be empty or contain data. Moreover, when a worker is missing eval_set entirely, this will fail out once early_stopping_rounds becomes supported - local worker calls to model.fit(..., eval_data=None, early_stopping_rounds=x) will throw a exception.
  • See comment as to why early_stopping_rounds is explicitly omitted from this PR.
  • This adds 32 tests! I'm happy to trim it down a little bit to 16.

ffineis and others added 30 commits February 8, 2021 23:15
_train_part model.fit args to lines

Co-authored-by: James Lamb <jaylamb20@gmail.com>
_train_part model.fit args to lines, pt2

Co-authored-by: James Lamb <jaylamb20@gmail.com>
_train_part model.fit args to lines pt3

Co-authored-by: James Lamb <jaylamb20@gmail.com>
dask_model.fit args to lines

Co-authored-by: James Lamb <jaylamb20@gmail.com>
Co-authored-by: James Lamb <jaylamb20@gmail.com>
use is instead of id()

Co-authored-by: James Lamb <jaylamb20@gmail.com>
Co-authored-by: James Lamb <jaylamb20@gmail.com>
Co-authored-by: James Lamb <jaylamb20@gmail.com>
Co-authored-by: James Lamb <jaylamb20@gmail.com>
Co-authored-by: James Lamb <jaylamb20@gmail.com>
Co-authored-by: James Lamb <jaylamb20@gmail.com>
Co-authored-by: James Lamb <jaylamb20@gmail.com>
Co-authored-by: James Lamb <jaylamb20@gmail.com>
Co-authored-by: James Lamb <jaylamb20@gmail.com>
@ffineis
Copy link
Contributor Author

ffineis commented Jun 20, 2021

I believe we should add the following note about custom eval function back to fit() method signature now.

_lgbmmodel_doc_custom_eval_note = """

Ah, you just mean copying the contents of this note, right?

Screen Shot 2021-06-19 at 11 15 23 PM

Happy to duplicate. But could we just copy a link or say "see note for custom eval_metric functions in Sklearn API docs"?

@StrikerRUS
Copy link
Collaborator

StrikerRUS commented Jun 20, 2021

@ffineis

Ah, you just mean copying the contents of this note, right?

For this PR I'm totally fine with just one line of a concatenation in Dask code like this one

) + "\n\n" + _lgbmmodel_doc_custom_eval_note

Ideally, I think we can templatize it like other docstrings with shape types later.

Or is it OK to use wording array-like for Dask Arrays?
cc @jameslamb

@ffineis
Copy link
Contributor Author

ffineis commented Jun 21, 2021

@ffineis

Ah, you just mean copying the contents of this note, right?

For this PR I'm totally fine with just one line of a concatenation in Dask code like this one

) + "\n\n" + _lgbmmodel_doc_custom_eval_note

Ideally, I think we can templatize it like other docstrings with shape types later.

Or is it OK to use wording array-like for Dask Arrays?
cc @jameslamb

AH ok thanks, this makes sense. Addressed in 5d4ddc8 unless James thinks the custom eval note should be reformatted like _lgbmmodel_doc_fit in this PR (I think this makes sense for another follow-up, as it will mostly be adding changes to sklearn.py).

Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are in progress of migrating to f-strings: #4136.

Also, during rendering and checking my current suggestions I noticed that there are no Returns sections for fit() methods of Dask estimators. Is it OK?

python-package/lightgbm/dask.py Outdated Show resolved Hide resolved
python-package/lightgbm/dask.py Outdated Show resolved Hide resolved
python-package/lightgbm/dask.py Outdated Show resolved Hide resolved
@jameslamb
Copy link
Collaborator

Or is it OK to use wording array-like for Dask Arrays?

I think that's ok. If it causes confusion in the future we can make it more Dask-specific.

Also, during rendering and checking my current suggestions I noticed that there are no Returns sections for fit() methods of Dask estimators. Is it OK?

I guess they should have a return block similar to those in the equivalent scikit-learn estimators, but there are no return sections for e.g. DaskLGBMClassifier.fit() on master (https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.DaskLGBMClassifier.html#lightgbm.DaskLGBMClassifier.fit), so it's not something this PR needs to fix.

ffineis and others added 3 commits June 22, 2021 22:33
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
@StrikerRUS
Copy link
Collaborator

@jameslamb

I guess they should have a return block similar to those in the equivalent scikit-learn estimators ...

Created #4402.

Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ffineis Thank you so much for all your hard work! Very important enhancement.
LGTM!

@ffineis
Copy link
Contributor Author

ffineis commented Jun 23, 2021

@ffineis Thank you so much for all your hard work! Very important enhancement.
LGTM!

Thanks @StrikerRUS !! Appreciate the thorough vetting.

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did another review tonight, looks good to me! I noticed one thing but it's very small, so I'm going to approve / merge this and open a follow-up PR for it.

Thank you SO MUCH for your help with this very impactful contribution to the Dask interface.

@jameslamb jameslamb changed the title [Dask] eval_sets [dask] add support for eval sets and custom eval functions Jun 28, 2021
@jameslamb jameslamb merged commit b5502d1 into microsoft:master Jun 28, 2021
@github-actions
Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants