
Avoid default tokenization in Dask #10398

Merged: 6 commits merged into dmlc:master on Jun 14, 2024

Conversation

@rjzamora (Contributor) commented on Jun 6, 2024

I spent some time investigating a concurrent.futures._base.CancelledError raised in test_gpu_with_dask.py::TestDistributedGPU::test_categorical and eventually bisected the cause to dask/dask#10883.

That PR changed how Booster objects are tokenized by dask, effectively moving from a random, uuid-based approach to a deterministic hash. I haven't had a chance to figure out exactly why the new deterministic hash causes problems, but this PR opts out of the new approach for now.

cc @trivialfis

Addresses part of #10379

@rjzamora (Contributor, Author) commented on the diff:

All changes in this file are related to the recent deprecation of dask_cudf.from_dask_dataframe.

@trivialfis (Member) commented on Jun 7, 2024

Thank you for the fix! We recommend that users scatter the booster before running prediction; that way, the booster already on the workers can be reused across multiple prediction calls instead of being transferred repeatedly. Does the PR change that?
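
For context, the recommended pattern looks roughly like the sketch below. The cluster setup, data, and parameters are illustrative placeholders, and it assumes (as this thread implies) that xgboost.dask.predict accepts a scattered Future for the model:

import dask.array as da
import xgboost as xgb
from dask.distributed import Client

client = Client()  # assume an existing cluster

X = da.random.random((1000, 8), chunks=(250, 8))
y = da.random.random(1000, chunks=250)
dtrain = xgb.dask.DaskDMatrix(client, X, y)

output = xgb.dask.train(client, {"tree_method": "hist"}, dtrain)

# Scatter the booster once so a copy lives on every worker...
booster_future = client.scatter(output["booster"], broadcast=True)

# ...then reuse the same future across prediction calls instead of
# shipping the booster again each time.
pred1 = xgb.dask.predict(client, booster_future, X)
pred2 = xgb.dask.predict(client, booster_future, X)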

@rjzamora (Contributor, Author) commented on Jun 7, 2024

> We recommend that users scatter the booster before running prediction; that way, the booster already on the workers can be reused across multiple prediction calls instead of being transferred repeatedly. Does the PR change that?

This PR should produce the same behavior as dask<2024.2.1. You should still be able to scatter the model ahead of time and use the futures for multiple prediction calls. However, if you scatter the same model twice, dask won't recognize that the model hasn't changed, and so the same model will need to be transferred again (with different key names).
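
A minimal sketch of the key behavior just described; the Model class and local cluster are illustrative stand-ins, not code from this PR:

import uuid
from dask.distributed import Client

class Model:
    def __dask_tokenize__(self):
        # Random token: every tokenize() call yields a fresh value,
        # mirroring the uuid-based behavior this PR restores.
        return uuid.uuid4().hex

client = Client(processes=False)
model = Model()
f1 = client.scatter(model)
f2 = client.scatter(model)
# The two futures get different keys, so dask treats the second scatter
# as new data and transfers the model again.
print(f1.key != f2.key)  # True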

I'm not entirely sure what is causing the "cancelled" error for dask>=2024.2.1, but it seems likely that dask's attempt to tokenize Booster objects deterministically is a bit "off": different/retrained models can be incorrectly hashed to the same token. For example, if a model is retrained, you would want dask to use new keys for the corresponding futures. In dask>=2024.2.1, this does not seem to happen (in test_empty_partition, I added a breakpoint and found that both bst["booster"] and bst_empty["booster"] tokenize to the same value).

It may also be the case that deterministic tokenization is working "fine" for Booster objects, but xgboost happens to be using dask in a way that only works if a newly scattered model is always treated as unique.
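
To make the expectation concrete, here is a small illustration of how deterministic tokenization is supposed to behave; plain dicts stand in for Booster objects:

from dask.base import tokenize

model_a = {"trees": [0.1, 0.2]}
model_b = {"trees": [0.1, 0.2]}  # distinct object, identical content
assert tokenize(model_a) == tokenize(model_b)  # equal content, equal token: expected
model_c = {"trees": [0.1, 0.3]}  # "retrained": content differs
assert tokenize(model_a) != tokenize(model_c)  # a retrained model must get a new token

The failure described above is that two genuinely different Booster objects behave like model_a and model_b rather than like model_a and model_c.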

@rjzamora changed the title from "[WIP] Avoid default tokenization in Dask" to "Avoid default tokenization in Dask" on Jun 7, 2024
@rjzamora marked this pull request as ready for review on June 7, 2024
@trivialfis (Member) commented:

Thank you for the detailed explanation!

@trivialfis merged commit dc14f98 into dmlc:master on Jun 14, 2024 (28 of 29 checks passed).
def __dask_tokenize__(self):  # dask hook used by tokenize() for custom objects
    # TODO: Implement proper tokenization to avoid unnecessary re-computation
    # in Dask. However, default tokenization causes problems after
    # https://github.com/dask/dask/pull/10883
    return uuid.uuid4()

@crusaderky commented:

Two consecutive calls to tokenize() will return two different values, and this can cause problems. I would advise caching the output: dask/dask#11179 (comment)
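
A minimal sketch of that caching suggestion; the _token attribute name is assumed for illustration:

import uuid

class Booster:  # stand-in for the real xgboost Booster
    def __dask_tokenize__(self):
        # Draw the random token once per object and reuse it, so repeated
        # tokenize() calls on the same object agree while distinct objects
        # still get distinct tokens.
        if not hasattr(self, "_token"):
            self._token = uuid.uuid4().hex
        return self._token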

@rjzamora (Contributor, Author) replied:

Thanks @crusaderky! The purpose of this PR was to effectively roll back behavior to "match" tokenization before dask/dask#10883. However, I agree that it makes sense to cache the value.

trivialfis added a commit to trivialfis/xgboost referencing this pull request on Jun 15, 2024 (Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>).
@rjzamora deleted the avoid-default-tokenize branch on August 13, 2024.