Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[dask] Support pred_contrib in Dask predict() methods (fixes #3713) #3774

Merged
merged 16 commits into from
Jan 22, 2021

Conversation

jameslamb
Copy link
Collaborator

The predict() method on Dask model objects doesn't correctly handle pred_contrib=True today. It fails to return the full matrix of feature contributions.

Thanks to @pseudotensor for pointing out this bug (#3713).

This PR fixes that.

Notes for Reviewers

I found that the results of predict(pred_contrib=True) are different between the Dask interface and sklearn model objects, if n_workers in the Dask cluster is greater than 1. I observed that both feature contribution values and the "base value" in the pred_contrib output are often different. This is true for regression, binary classification, and multi-class classification. The differences are larger than what I think could be attributed to numeric precision issues.

@guolinke , should I expect that pred_contrib outputs are different between multi-machine training and single-machine training? I'm unsure if this is because of an issue in the Dask interface or if it's something that is LightGBM-wide.

References

I found these conversations useful while working through this:

@jameslamb jameslamb requested a review from StrikerRUS January 18, 2021 18:08
@StrikerRUS
Copy link
Collaborator

@jameslamb

The predict() method on Dask model objects doesn't correctly handle pred_contrib=True today.

Just to clarify: are raw_score=True and pred_leaf=True fully supported now? If yes, I think we need tests, if no then feature requests.

def predict(self, X, raw_score=False, start_iteration=0, num_iteration=None,
pred_leaf=False, pred_contrib=False, **kwargs):
"""Docstring is inherited from the LGBMModel."""
result = self.predict_proba(X, raw_score, start_iteration, num_iteration,
pred_leaf, pred_contrib, **kwargs)
if callable(self._objective) or raw_score or pred_leaf or pred_contrib:
return result
else:
class_index = np.argmax(result, axis=1)
return self._le.inverse_transform(class_index)

Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for so actively improving Dask module!
Please check my minor comments below.

python-package/lightgbm/dask.py Outdated Show resolved Hide resolved
Comment on lines 106 to 110
dask_classifier = dlgbm.DaskLGBMClassifier(
time_out=5,
local_listen_port=listen_port,
tree_learner='data'
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add n_estimators=10 and num_leaves=10?
#3786.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh good idea

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added in 6428589

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok I just added this again (was lost because of a bad merge conflict resolution, sorry)

else:
expected_num_cols = (dX.shape[1] + 1) * num_classes

if isinstance(dX, dask.dataframe.core.DataFrame):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be wrong, but according to docs and sources, we can use it without core part to not depend on inner implementation.
https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame
https://github.com/dask/dask/blob/72304a94c98ace592f01df91e3d9e89febda307c/dask/dataframe/__init__.py#L3

Suggested change
if isinstance(dX, dask.dataframe.core.DataFrame):
if isinstance(dX, dask.dataframe.DataFrame):

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ooo that's a good idea, let me try that

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added in 6428589 and it worked ok

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok added this back again (now as dd.DataFrame)

Comment on lines 229 to 233
dask_regressor = dlgbm.DaskLGBMRegressor(
time_out=5,
local_listen_port=listen_port,
tree_learner='data'
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add n_estimators=10 and num_leaves=10?
#3786.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added in 6428589

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added back again

@jameslamb
Copy link
Collaborator Author

jameslamb commented Jan 19, 2021

Just to clarify: are raw_score=True and pred_leaf=True fully supported now? If yes, I think we need tests, if no then feature requests.

I spent about an hour today trying to get the tests for raw_score and pred_leaf working. Both are only somewhat supported (for some combinations of task and input type). I faced several errors related to what I think is one root cause: when using map_blocks() or map_partitions(), you have to provide some metadata to help Dask understand the shape of the function call result. This is described in the meta param of https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.map_partitions, for example.

I'll write up feature requests and link them here and in #2302. I'm not sure if it's that "raw_score and pred_leaf are not supported", or more if I was just making mistakes in the tests.

So as of this PR, those parameters will be understood by predict(), but whether or not it succeeds won't be guaranteed.

UPDATE: added #3792 and #3793

@StrikerRUS
Copy link
Collaborator

UPDATE: added #3792 and #3793

Thank you so much!

Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems there were some merge conflicts because some my previous comments are not addressed but you said they were.

tests/python_package_test/test_dask.py Show resolved Hide resolved
python-package/lightgbm/dask.py Show resolved Hide resolved
python-package/lightgbm/dask.py Outdated Show resolved Hide resolved
python-package/lightgbm/dask.py Outdated Show resolved Hide resolved
python-package/lightgbm/dask.py Outdated Show resolved Hide resolved
python-package/lightgbm/dask.py Outdated Show resolved Hide resolved
tests/python_package_test/test_dask.py Outdated Show resolved Hide resolved
tests/python_package_test/test_dask.py Show resolved Hide resolved
tests/python_package_test/test_dask.py Outdated Show resolved Hide resolved
@jameslamb
Copy link
Collaborator Author

Seems there were some merge conflicts because some my previous comments are not addressed but you said they were.

so weird! Yeah, maybe a bad resolution of a merge conflict, sorry.

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM!

I'm afraid merging this PR will cause conflicts with #3708, so I'm not touching anything.

@jameslamb
Copy link
Collaborator Author

now that #3708 has been merged, I'll fix merge conflicts here and then merge this

@jameslamb jameslamb merged commit d9a96c9 into microsoft:master Jan 22, 2021
@jameslamb jameslamb deleted the fix/dask-predcontrib branch January 22, 2021 06:33
@StrikerRUS
Copy link
Collaborator

@jameslamb Can we remove this strange branch?

image

@jameslamb
Copy link
Collaborator Author

yep definitely! I just deleted it, sorry

@StrikerRUS
Copy link
Collaborator

@jameslamb
Ah, it is here again! 😄

Can I remove it?

image

@jameslamb
Copy link
Collaborator Author

WHAT I'm so confused. Yes please remove it, I'm sorry. maybe my remotes are set up wrong on some local clone.

@StrikerRUS
Copy link
Collaborator

No problem, thanks!

maybe my remotes are set up wrong on some local clone.

Perhaps... It is 2281 commits behind master, should be VERY old setup. 🙂

@github-actions
Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 24, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants