[python-package] [dask] Add DaskLGBMRanker #3708
Conversation
sorry it's taking me so long to review and test this! I'm planning to review it tomorrow.
Hey, yeah no problem. Unfortunately, I can't get the test_dask.py tests passing 100% reliably. The errors are random (sometimes the tests pass) and hard to reproduce. Failures seem limited to GPU builds. The strange part is that when I execute the tests directly in the container (i.e. instead of through a bash shell), they have passed every time. It's only when running them via the Dockerfile that the failures show up.
Hey @StrikerRUS, sorry to drag you in here! Wondering if you might have any thoughts on networking port/socket issues?
I'm sorry, I have no ideas...
This is looking awesome, thank you! Sorry it took so long to get you a first review. Now that I have a testing framework set up for reviewing LightGBM Dask stuff (https://github.com/jameslamb/lightgbm-dask-testing), I'll be able to give you feedback a lot more quickly.
I tested this PR on that setup and everything looks good! I like the way you set up the unit tests, thank you so much for all the effort you put into that!
I left some recommended changes. I also have one more question below.
although it's possible to use Dask arrays if the trainset is built in group-wise chunks. The constraints of the ranking task force us to not break up contiguous groups, which is why chunks/partitions need to be constructed from groups somehow.
Does this mean "one chunk / partition per group" or do you mean "all documents for a group need to be in the same chunk / partition"? Like, if I have 10 groups could I split them into 2 partitions?
"""Helper function taken from dask_ml.metrics: computes coefficient of determination.""" | ||
numerator = ((dy_true - dy_pred) ** 2).sum(axis=0) | ||
denominator = ((dy_true - dy_pred.mean(axis=0)) ** 2).sum(axis=0) | ||
return (1 - numerator / denominator).compute() |
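(As an aside, a minimal sketch of how a helper like this could be checked against scikit-learn; the name _r2_score and the toy data are made up for illustration, not taken from the PR.)

```python
import numpy as np
import dask.array as da
from sklearn.metrics import r2_score


def _r2_score(dy_true, dy_pred):
    """Coefficient of determination on dask arrays, mirroring the helper above."""
    numerator = ((dy_true - dy_pred) ** 2).sum(axis=0)
    denominator = ((dy_true - dy_true.mean(axis=0)) ** 2).sum(axis=0)
    return (1 - numerator / denominator).compute()


rng = np.random.RandomState(42)
y_true = rng.rand(1000)
y_pred = y_true + 0.1 * rng.rand(1000)

# the dask-based helper should agree with sklearn's r2_score on the same data
assert np.isclose(
    _r2_score(da.from_array(y_true, chunks=100), da.from_array(y_pred, chunks=100)),
    r2_score(y_true, y_pred),
)
```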
why was it necessary to replace the dask-ml one with this function? If that's just to cut out our testing dependency on dask-ml (which would be much appreciated!), could you make that a separate PR with this change plus removing dask-ml from here:
Line 71 in f2695da
conda install -q -y -n $CONDA_ENV dask dask-ml distributed joblib matplotlib numpy pandas psutil pytest scikit-learn scipy
I think that PR would be small and non-controversial, so we could review and merge it quickly.
Cool, yeah I'll follow up #3708 with a PR to remove dask-ml as a test dependency. See my reply to the comment re: accuracy_score (above).
Addressed in d6e5209
hey @ffineis is it cool if I pick up the "remove testing dependency on dask-ml" thing?
If you want to make the contribution I'll totally respect that, but if it's just extra work you'd rather not do, I'm happy to make a PR right now for it. Wanna be respectful of your time.
Yep, the latter - you can place whole separate groups into different partitions, as long as all documents in a group are in the same partition. A Dask DataFrame is amenable to preserving group boundaries by using set_index; Dask.array does not seem to have a similar "group partitions according to this column" functionality, which is why the arrays have to be built in group-wise chunks.
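(A rough sketch of the DataFrame route described here; column names and sizes are made up for illustration, not taken from the PR.)

```python
import numpy as np
import pandas as pd
import dask.dataframe as dd

# toy ranking data: 10 queries ("groups") of 20 documents each, rows sorted by query id
rng = np.random.RandomState(0)
df = pd.DataFrame({
    "query_id": np.repeat(np.arange(10), 20),
    "feature_1": rng.rand(200),
    "feature_2": rng.rand(200),
    "relevance": rng.randint(0, 3, size=200),
})

# set_index shuffles rows so that all rows sharing a query_id land in the same
# partition: groups may be spread over different partitions, but no single group
# is ever split across a partition boundary
ddf = dd.from_pandas(df, npartitions=4).set_index("query_id")
print(ddf.divisions)  # partition boundaries expressed as query_id values
```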
Hmmm ok so latest test issues (only on Azure pipeline):
ok I can also try to look into those errors
Hey, so an update on this. All I've been able to find is this - dask/dask-ml#611 - which looks like the same exception, triggered by a timeout during fixture teardown. I'm thinking that we just use a client fixture that explicitly closes down the cluster on teardown.
FWIW I ran this with num_tests = 100 and couldn't reproduce the teardown exception.
Thanks for the investigation @ffineis. Since this is only an error in fixture teardown, I'm good with your proposal. I want to go as fast as possible on Dask changes between now and LightGBM 3.2.0, so if this helps get this PR through I think it's ok, and we can try to fix the upstream issue later as a maintenance task. I'm convinced by what you wrote and by my own tests that this wouldn't cover up a legitimate problem in lightgbm.dask. Can you just add a comment or docstring (whichever you want) on that fixture, linking to the dask-ml issue?
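(For reference, a client fixture along the lines being discussed might look roughly like this; a hedged sketch, not the exact code that ended up in test_dask.py.)

```python
import pytest
from distributed import Client, LocalCluster


@pytest.fixture()
def client():
    """Dask client backed by a fresh LocalCluster for each test.

    The cluster is shut down explicitly at teardown to work around intermittent
    teardown errors; see https://github.com/dask/dask-ml/issues/611.
    """
    cluster = LocalCluster(n_workers=2, threads_per_worker=1, dashboard_address=None)
    client = Client(cluster)
    yield client
    client.close(timeout=30)
    cluster.close(timeout=30)
```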
ayyy those changes seem to have worked! Nice idea closing down the cluster. Seeing that that works, maybe in the future I'll just cut every test over to that approach.
But now that this seems to be working ok, don't touch it in this PR 😆 I'll give this another review right now. Thanks for working through it!
This is looking great, thank you! I really really like the functions you've added for creating a ranking dataset in tests.
I added a few minor comments. After you address those, I'd like to request another review from @StrikerRUS for general Python things.
For the other TODOs that got pulled out of this PR, I've created the following issues:

- use dicts for parts: [dask] use dictionaries instead of tuples for parts in Dask training #3795
- factor out testing dependency on dask-ml: [dask] remove testing dependency on dask-ml #3796
Haha thx, man I was ready to just stop using the
Sure! I'd be happy to do this.
Hey, sorry I never got back to you on this - yeah, totally agree that special dask-GPU support is the way to go! Just a heads up - I got the teardown error on a DaskLGBMRegressor task during a GPU test (logs here), so I added the pytest.skip for all GPU tasks in 56bb2c9.
Hey, can we merge? Just want to unblock the several PRs I feel are waiting in the wings, haha.
Thanks for all the hard work! From my perspective, this is ready to be merged.
However, I'd still like @StrikerRUS to do one more review for general Python things or similarity of patterns between here and the non-Dask sklearn interface.
@ffineis Thank you very much for the awesome contribution! Below are just my minor comments, please check.
python-package/lightgbm/dask.py (outdated)

class DaskLGBMRanker(_LGBMModel, LGBMRanker):
    """Docstring is inherited from the lightgbm.LGBMRanker."""

    def fit(self, X, y=None, sample_weight=None, group=None, client=None, **kwargs):
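(For orientation, a rough sketch of how this interface might be exercised, based on the outdated signature shown above. The way group is expressed here, as an array of group sizes partitioned to line up with the chunks of X, is an assumption made for illustration and is not stated in this diff.)

```python
import numpy as np
import dask.array as da
from distributed import Client
from lightgbm.dask import DaskLGBMRanker

client = Client()  # or connect to an existing cluster

# 10 groups of 10 documents each; every chunk of X holds exactly one group,
# and the group-size array is chunked so that each of its chunks describes
# the groups contained in the corresponding chunk of X
rng = np.random.RandomState(42)
X = da.from_array(rng.rand(100, 20), chunks=(10, 20))
y = da.from_array(rng.randint(0, 4, size=100), chunks=10)
g = da.from_array(np.full(10, 10), chunks=1)

ranker = DaskLGBMRanker(n_estimators=10)
ranker.fit(X, y, group=g, client=client)

preds = ranker.predict(X)        # dask array of relevance scores
local_model = ranker.to_local()  # plain (non-Dask) LGBMRanker
```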
@jameslamb What about init_score? Is it supported, or should we add a feature request for it?

LightGBM/python-package/lightgbm/sklearn.py, lines 396 to 397 in 1c60cfa:

init_score : array-like of shape = [n_samples] or None, optional (default=None)
    Init score of training data.
we should have a feature request. I'll write it up and add a link here.

@ffineis, could you add init_score=None here between sample_weight and group, so the order matches the sklearn interface for LGBMRanker?

LightGBM/python-package/lightgbm/sklearn.py, line 962 in 1c60cfa:

sample_weight=None, init_score=None, group=None,

That way, if people pass positional arguments to fit(), they won't accidentally have their init_score interpreted as group.
And can you just then add a check like this?
if init_score is not None:
    raise RuntimeError("init_score is not currently supported in lightgbm.dask")
@StrikerRUS speaking of positional arguments, I'll open another issue where we can discuss how to handle the client
argument. But let's leave that out of this PR.
@jameslamb
Yes, sure! I agree with all your points.
Yep, no prob
@@ -96,6 +234,8 @@ def test_classifier(output, centers, client, listen_port):
    assert_eq(y, p2)
    assert_eq(p1_proba, p2_proba, atol=0.3)

    client.close()
@jameslamb Do you agree with this change, which is unrelated to the Ranker itself?
Yes, I'm ok with this. I'm not 100% sure that it is unrelated, even though it's touching non-ranker tests. Since this PR is close to being merged and since there aren't other PRs waiting on this fix (the timeout / shutdown issues in Dask tests have not been as severe elsewhere), I'm ok leaving it.
rnkvec_local = dask_ranker.to_local().predict(X)

# distributed and to-local scores should be the same.
assert_eq(rnkvec_dask, rnkvec_local)
@jameslamb I like this test very much! Can we add the same for the other estimators? It seems that the classifier tests do not assert that the real Dask predict() equals .to_local().predict().
the classifier tests do check that:

- comparison between Dask predict() and non-Dask predict(): assert_eq(p1, p2)
- comparison between Dask predict() and .to_local().predict(): assert_eq(p1, p2)

I'd be open to a PR that makes it more explicit and cuts down the total number of tests by combining test_classifier_local_predict (def test_classifier_local_predict(client, listen_port):) with test_classifier (def test_classifier(output, centers, client, listen_port):).

Not an action we need on this PR, I'll write it up as a "good first issue".
comparison between Dask predict() and .to_local().predict():

Actually, this checks to_local().predict(dX) against local predict(). It is not the same as comparing to_local().predict(dX) and Dask predict(), given that the asserts you're pointing to are placed in two different (in theory unrelated) tests, I think.
it's confusing because it's two tests, so I think we should merge the tests in the future like I said above (not anything you need to do on here, @ffineis).

I'm saying that taken together, they show transitively that dask_model.predict() == dask_model.to_local().predict(), even though neither one does that direct comparison:

- (line 92) dask_model.predict() == sklearn_model.predict()
- (line 138) dask_model.to_local().predict() == sklearn_model.predict()

But that's only partially true, since test_classifier_local_predict only does binary classification, and only on array data. So combining them would give more coverage and would show whether that result holds up for multi-class classification or for other input data types.
it's confusing because it's two tests, so I think we should merge the tests in the future like I said above

Totally agree!

I'm saying that taken together, they show transitively that dask_model.predict() == dask_model.to_local().predict()

Also agree. But two different tests can take different inputs (params, data), can be modified separately in the future, etc., so it might not be a fully fair comparison. Merging them into one test will solve all of these issues. 🙂
I lost one of my suggestions during the previous review, about issubclass, and the docstring for the group param still doesn't match the same docstring in other places.

Everything else looks good to me. Thanks!

I'm approving this PR so as not to delay it any longer and to let @jameslamb merge it finally.

You can commit my current suggestions right from the web interface. Just click the Commit suggestion button if you agree with a suggestion. No need to waste time re-typing them locally and then pushing with git.
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
All the tests are passing! 🎉 I just took one more look at the diff, and all looks good. So I'm going to merge this. Thanks so much for this fantastic contribution @ffineis, and to @StrikerRUS for the help with thorough reviews.
Thanks @jameslamb and @StrikerRUS for bearing with me!! Glad we could make it work.
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
Hello,
This is a PR to include support for a DaskLGBMRanker. Previously, only DaskLGBMClassifier and DaskLGBMRegressor were supported. This PR was originally for the dask-lightgbm repo, but it was migrated here after the incorporation of the recent lightgbm.dask module. Tests were added and were passing in an image built from a modification of dockerfile-python.

Note: it's easier to use Dask DataFrames with the DaskLGBMRanker than arrays, through the use of df.set_index, although it's possible to use Dask arrays if the trainset is built in group-wise chunks. The constraints of the ranking task force us to not break up contiguous groups, which is why chunks/partitions need to be constructed from groups somehow. There are no tests for sparse arrays with DaskLGBMRanker at the moment.

Thanks! Excited for this new module.
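(To make the "built in group-wise chunks" idea above concrete, here is an illustrative helper; the function name and the exact re-chunking strategy are made up for this sketch and are not part of the PR.)

```python
import numpy as np
import dask.array as da


def to_groupwise_chunks(X, group_ids):
    """Re-chunk a local array so that chunk boundaries line up with group boundaries.

    Assumes rows are already sorted by group id, so that no query group ends up
    split across two chunks/partitions.
    """
    sizes = np.bincount(group_ids)      # documents per group, in group order
    boundaries = np.cumsum(sizes)[:-1]  # row offsets where a new group starts
    blocks = np.split(X, boundaries, axis=0)
    return da.concatenate([da.from_array(b, chunks=b.shape) for b in blocks])


# example: 5 groups with uneven sizes
group_ids = np.repeat(np.arange(5), [3, 7, 4, 6, 5])
X = np.random.rand(len(group_ids), 10)
dX = to_groupwise_chunks(X, group_ids)
print(dX.chunks[0])  # (3, 7, 4, 6, 5): one chunk per group
```

In practice, one chunk per group may be too fine-grained; several whole groups could just as well be packed into each chunk, as long as no group straddles a chunk boundary.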