Avoid default tokenization in Dask #10398
Conversation
All changes in this file are related to the recent deprecation of `dask_cudf.from_dask_dataframe`.
Thank you for the fix! We recommend that users scatter the booster before running prediction; this way we can reuse the same booster already on the workers across multiple prediction calls instead of transferring it repeatedly. Does the PR change that?
This PR should produce the same behavior as before dask/dask#10883. I'm not entirely sure what is causing the "cancelled" error. It may also be the case that deterministic tokenization is itself working "fine" and the failure has another cause.
Thank you for the detailed explanation!
# TODO: Implement proper tokenization to avoid unnecessary re-computation in
# Dask. However, default tokenization causes problems after
# https://github.com/dask/dask/pull/10883
return uuid.uuid4()
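For illustration, here is a stdlib-only sketch (not dask's actual tokenizer; function names are hypothetical) contrasting the random-uuid approach this change restores with a deterministic hash along the lines of what dask/dask#10883 introduced:

```python
import hashlib
import pickle
import uuid


def random_token(obj) -> str:
    # Mirrors the opt-out in this PR: every call returns a fresh token,
    # so Dask will never reuse a cached result for the object.
    return uuid.uuid4().hex


def deterministic_token(obj) -> str:
    # Rough stand-in for the post-10883 behavior: equal inputs always
    # produce equal tokens, which lets Dask deduplicate work.
    return hashlib.sha256(pickle.dumps(obj)).hexdigest()


assert random_token((1, 2)) != random_token((1, 2))
assert deterministic_token((1, 2)) == deterministic_token((1, 2))
```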
Two consecutive calls to tokenize() will return two different values, and this can cause problems. I would advise caching the output: dask/dask#11179 (comment)
Thanks @crusaderky! The purpose of this PR was to effectively roll back behavior to "match" tokenization before 10883. However, I agree that it makes sense to cache the value.
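A minimal sketch of the cached variant, assuming dask's `__dask_tokenize__` hook (the class and attribute names here are hypothetical stand-ins, not xgboost's actual implementation):

```python
import uuid


class Booster:
    """Hypothetical stand-in for xgboost's Booster."""

    def __dask_tokenize__(self) -> str:
        # Generate the random token once and cache it, so repeated
        # tokenize() calls on the same object agree, while distinct
        # objects still get distinct tokens.
        if not hasattr(self, "_dask_token"):
            self._dask_token = uuid.uuid4().hex
        return self._dask_token
```

With this, the token is stable for a given object's lifetime but still avoids collisions between different boosters.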
--------- Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>
I spent some time investigating a `concurrent.futures._base.CancelledError` error in `test_gpu_with_dask.py::TestDistributedGPU::test_categorical`, and eventually bisected the cause to dask/dask#10883. That PR changed how `Booster` objects are tokenized by dask, and effectively moved from using a random `uuid`-based approach to a deterministic hash. I haven't had a chance to figure out exactly why the new deterministic hash causes problems, but this PR opts out of the new approach for now.

cc @trivialfis

Addresses part of #10379