[dask] Support pred_contrib in Dask predict() methods (fixes #3713) #3774
Conversation
Just to clarify (referencing LightGBM/python-package/lightgbm/sklearn.py, lines 871 to 880 at d2c5545):
Thanks for so actively improving the Dask module!
Please check my minor comments below.
dask_classifier = dlgbm.DaskLGBMClassifier(
    time_out=5,
    local_listen_port=listen_port,
    tree_learner='data'
)
Add n_estimators=10 and num_leaves=10? #3786.
oh good idea
added in 6428589
ok I just added this again (was lost because of a bad merge conflict resolution, sorry)
else:
    expected_num_cols = (dX.shape[1] + 1) * num_classes

if isinstance(dX, dask.dataframe.core.DataFrame):
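The column-count arithmetic this test asserts can be written out as a small helper. A minimal sketch, assuming LightGBM's usual pred_contrib layout (one contribution per feature plus one base value, per class); the helper name expected_pred_contrib_cols is hypothetical, not from the PR:

```python
def expected_pred_contrib_cols(n_features, num_classes=None):
    """Expected column count of predict(pred_contrib=True) output.

    Each class gets one contribution per feature plus one base value;
    regression and binary classification behave like a single class.
    """
    per_class = n_features + 1  # feature contributions + base value
    if num_classes is None or num_classes <= 2:
        return per_class
    return per_class * num_classes


# For a 10-feature, 3-class problem: (10 + 1) * 3 = 33 columns.
print(expected_pred_contrib_cols(10, 3))  # 33
```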
I might be wrong, but according to the docs and sources, we can use it without the core part, so we don't depend on the inner implementation.
https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame
https://github.com/dask/dask/blob/72304a94c98ace592f01df91e3d9e89febda307c/dask/dataframe/__init__.py#L3
- if isinstance(dX, dask.dataframe.core.DataFrame):
+ if isinstance(dX, dask.dataframe.DataFrame):
ooo that's a good idea, let me try that
added in 6428589 and it worked ok
ok added this back again (now as dd.DataFrame)
dask_regressor = dlgbm.DaskLGBMRegressor(
    time_out=5,
    local_listen_port=listen_port,
    tree_learner='data'
)
Add n_estimators=10 and num_leaves=10? #3786.
added in 6428589
added back again
I spent about an hour today trying to get tests for raw_score and pred_leaf working. I'm not sure if it's that "raw_score and pred_leaf are not supported", or more if I was just making mistakes in the tests. I'll write up feature requests and link them here and in #2302. So as of this PR, those parameters will be understood by the Dask predict() methods, but they are not covered by tests.
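The pass-through behavior described here can be sketched with a toy wrapper. This is an illustration only, not the actual Dask module code; DummyBooster and dask_predict are made-up names:

```python
class DummyBooster:
    """Stand-in for a fitted model whose predict() accepts extra keywords."""

    def predict(self, X, raw_score=False, pred_leaf=False, pred_contrib=False):
        # Echo back which flags were received, to make pass-through visible.
        return {"raw_score": raw_score,
                "pred_leaf": pred_leaf,
                "pred_contrib": pred_contrib}


def dask_predict(model, X, **kwargs):
    # The wrapper does not inspect the keywords; it simply forwards them,
    # so flags like raw_score reach the underlying predict() even when
    # distributed support for them is untested.
    return model.predict(X, **kwargs)


result = dask_predict(DummyBooster(), X=None, raw_score=True)
print(result["raw_score"])  # True
```

The design point is that forwarding **kwargs unchanged keeps the Dask wrapper's signature in sync with the underlying estimator, at the cost of deferring validation to the wrapped predict().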
It seems there were some merge conflicts, because some of my previous comments are not addressed even though you said they were.
so weird! Yeah, maybe a bad resolution of a merge conflict, sorry.
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
Thanks, LGTM!
I'm afraid merging this PR will cause conflicts with #3708, so I'm not touching anything.
now that #3708 has been merged, I'll fix merge conflicts here and then merge this
@jameslamb Can we remove this strange branch?
yep definitely! I just deleted it, sorry
@jameslamb Can I remove it?
WHAT I'm so confused. Yes please remove it, I'm sorry. maybe my remotes are set up wrong on some local clone.
No problem, thanks!
Perhaps... It is 2281 commits behind
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
The predict() method on Dask model objects doesn't correctly handle pred_contrib=True today. It fails to return the full matrix of feature contributions. Thanks to @pseudotensor for pointing out this bug (#3713).
This PR fixes that.
Notes for Reviewers
I found that the results of predict(pred_contrib=True) are different between the Dask interface and sklearn model objects, if n_workers in the Dask cluster is greater than 1. I observed that both feature contribution values and the "base value" in the pred_contrib output are often different. This is true for regression, binary classification, and multi-class classification. The differences are larger than what I think could be attributed to numeric precision issues.
@guolinke, should I expect that pred_contrib outputs are different between multi-machine training and single-machine training? I'm unsure if this is because of an issue in the Dask interface or if it's something that is LightGBM-wide.
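One worker-count-independent sanity check when comparing such runs is the decomposition property of contribution outputs: each row's feature contributions plus its base value should sum to that row's raw score. The arrays below are made-up stand-ins, not real LightGBM output:

```python
import numpy as np

# Hypothetical pred_contrib output for 2 rows and 3 features:
# columns 0..2 are feature contributions, the last column is the base value.
contrib = np.array([
    [0.5, -0.2, 0.1, 1.0],
    [0.3, 0.0, -0.4, 1.0],
])

# Raw scores the same (hypothetical) model would return with raw_score=True.
raw = np.array([1.4, 0.9])

# The decomposition property: rows of pred_contrib sum to the raw score.
np.testing.assert_allclose(contrib.sum(axis=1), raw)
print("contributions sum to raw scores")
```

If this invariant holds for both single-machine and distributed predictions but the individual contribution values still differ, the difference lies in how contributions are attributed, not in the predictions themselves.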
References
I found these conversations useful while working through this: