
Avoid default tokenization in Dask #10398

Merged: 6 commits merged into dmlc:master on Jun 14, 2024

Conversation

@rjzamora (Contributor) commented on Jun 6, 2024

I spent some time investigating a concurrent.futures._base.CancelledError raised in test_gpu_with_dask.py::TestDistributedGPU::test_categorical and eventually bisected the cause to dask/dask#10883.

That PR changed how Booster objects are tokenized by dask, effectively moving from a random, uuid-based approach to a deterministic hash. I haven't had a chance to figure out exactly why the new deterministic hash causes problems, but this PR opts out of the new approach for now.

cc @trivialfis

Addresses part of #10379

@rjzamora (Contributor, Author) commented on the diff:

All changes in this file are related to the recent deprecation of dask_cudf.from_dask_dataframe.

@trivialfis (Member) commented on Jun 7, 2024

Thank you for the fix! We recommend that users scatter the booster before running prediction; that way, the booster already on the workers can be reused across multiple prediction calls instead of being transferred repeatedly. Does the PR change that?
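
For context, the recommended pattern looks roughly like the sketch below. The cluster setup, data, and parameters are illustrative placeholders, and it assumes (as this thread implies) that xgboost.dask.predict accepts a scattered Future for the model:

import dask.array as da
import xgboost as xgb
from dask.distributed import Client

client = Client()  # assume an existing cluster

X = da.random.random((1000, 8), chunks=(250, 8))
y = da.random.random(1000, chunks=250)
dtrain = xgb.dask.DaskDMatrix(client, X, y)

output = xgb.dask.train(client, {"tree_method": "hist"}, dtrain)

# Scatter the booster once so a copy lives on every worker...
booster_future = client.scatter(output["booster"], broadcast=True)

# ...then reuse the same future across prediction calls instead of
# shipping the booster again each time.
pred1 = xgb.dask.predict(client, booster_future, X)
pred2 = xgb.dask.predict(client, booster_future, X)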

@rjzamora (Contributor, Author) commented on Jun 7, 2024

> We recommend that users scatter the booster before running prediction; that way, the booster already on the workers can be reused across multiple prediction calls instead of being transferred repeatedly. Does the PR change that?

This PR should produce the same behavior as dask<2024.2.1. You should still be able to scatter the model ahead of time and use the futures for multiple prediction calls. However, if you scatter the same model twice, dask won't recognize that the model hasn't changed, and so the same model will need to be transferred again (with different key names).
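
A minimal sketch of the key behavior just described; the Model class and local cluster are illustrative stand-ins, not code from this PR:

import uuid
from dask.distributed import Client

class Model:
    def __dask_tokenize__(self):
        # Random token: every tokenize() call yields a fresh value,
        # mirroring the uuid-based behavior this PR restores.
        return uuid.uuid4().hex

client = Client(processes=False)
model = Model()
f1 = client.scatter(model)
f2 = client.scatter(model)
# The two futures get different keys, so dask treats the second scatter
# as new data and transfers the model again.
print(f1.key != f2.key)  # True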

I'm not entirely sure what is causing the "cancelled" error for dask>=2024.2.1, but it seems likely that dask's attempt to tokenize Booster objects deterministically is a bit "off": different/retrained models can be incorrectly hashed to the same token. For example, if a model is retrained, you would want dask to use new keys for the corresponding futures. In dask>=2024.2.1, this does not seem to happen (in test_empty_partition, I added a breakpoint and found that both bst["booster"] and bst_empty["booster"] tokenize to the same value).

It may also be the case that deterministic tokenization is working "fine" for Booster objects, but xgboost happens to be using dask in a way that only works if a newly scattered model is always treated as unique.
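
To make the expectation concrete, here is a small illustration of how deterministic tokenization is supposed to behave; plain dicts stand in for Booster objects:

from dask.base import tokenize

model_a = {"trees": [0.1, 0.2]}
model_b = {"trees": [0.1, 0.2]}  # distinct object, identical content
assert tokenize(model_a) == tokenize(model_b)  # equal content, equal token: expected
model_c = {"trees": [0.1, 0.3]}  # "retrained": content differs
assert tokenize(model_a) != tokenize(model_c)  # a retrained model must get a new token

The failure described above is that two genuinely different Booster objects behave like model_a and model_b rather than like model_a and model_c.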

@rjzamora changed the title from "[WIP] Avoid default tokenization in Dask" to "Avoid default tokenization in Dask" on Jun 7, 2024
@rjzamora marked this pull request as ready for review on June 7, 2024
@trivialfis (Member) commented:

Thank you for the detailed explanation!

@trivialfis merged commit dc14f98 into dmlc:master on Jun 14, 2024 (28 of 29 checks passed).
def __dask_tokenize__(self):  # dask hook used by tokenize() for custom objects
    # TODO: Implement proper tokenization to avoid unnecessary re-computation
    # in Dask. However, default tokenization causes problems after
    # https://github.com/dask/dask/pull/10883
    return uuid.uuid4()

@crusaderky commented:

Two consecutive calls to tokenize() will return two different values, and this can cause problems. I would advise caching the output: dask/dask#11179 (comment)
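
A minimal sketch of that caching suggestion; the _token attribute name is assumed for illustration:

import uuid

class Booster:  # stand-in for the real xgboost Booster
    def __dask_tokenize__(self):
        # Draw the random token once per object and reuse it, so repeated
        # tokenize() calls on the same object agree while distinct objects
        # still get distinct tokens.
        if not hasattr(self, "_token"):
            self._token = uuid.uuid4().hex
        return self._token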

@rjzamora (Contributor, Author) replied:

Thanks @crusaderky! The purpose of this PR was to effectively roll back behavior to "match" tokenization before dask/dask#10883. However, I agree that it makes sense to cache the value.

trivialfis added a commit to trivialfis/xgboost referencing this pull request on Jun 15, 2024 (Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>).
@rjzamora deleted the avoid-default-tokenize branch on August 13, 2024.