Fix dask prediction. #4941

Merged: 8 commits into dmlc:master from fix-dask-predict on Oct 15, 2019
Conversation

trivialfis
Member

No description provided.

@trivialfis requested a review from RAMitchell on October 13, 2019 10:18
@RAMitchell
Member

I am also coming up against this assertion on some training runs: `assert X_parts.shape[1] == 1`

@trivialfis
Member Author

Em... I can't reproduce what you said. Will keep looking.

@trivialfis
Member Author

Let me create some benchmark scripts for prediction.

@trivialfis
Member Author

trivialfis commented Oct 14, 2019

@RAMitchell

That's the reason I added so many assertions. Here are 3 assertions you might hit:

  • Inconsistent partitioning (the assertion added in this PR): this might be due to unbalanced data. You can call data.rechunk (or data.repartition if it's a DataFrame) on both X and y to make it work (see the sketch after this list), but that throws performance out of the window. Luckily this doesn't usually happen with real datasets, since you load the label and data together from one CSV. I added the suggestion to the error message.
  • cols in (0, c): the number of columns for each partition/block must be the same. Your input is not correct, possibly due to sparsity in your dataset. Added an error message for that.
  • X_parts.shape[1] == 1: no idea. Could you give a reproducible script so I can check?
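A minimal sketch of the rechunk/repartition workaround from the first bullet, assuming dask and NumPy are available; the shapes and chunk sizes below are made up for illustration and are not from this PR:

```python
import numpy as np
import dask.array as da

# Hypothetical X and y whose row chunks do not line up.
X = da.from_array(np.random.rand(1000, 32), chunks=(300, 32))
y = da.from_array(np.random.rand(1000), chunks=250)

# Align the label's row chunks with the data's row chunks so each worker
# sees matching (X, y) partitions. For dask DataFrames, the analogous
# call is `repartition`.
y = y.rechunk(X.chunks[0])

print(X.chunks[0])  # (300, 300, 300, 100)
print(y.chunks)     # ((300, 300, 300, 100),)
```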

@RAMitchell
Member

Here was the problem. I had the following:

partition_size = 1000
X = da.from_array(data.X_train, partition_size)

The reason this fails is that the dataset had 2000 features, more than the partition size. When you specify a single value for the partition size, it actually sets all dimensions to this value, not just the first.

This fixes it:

partition_size = 1000
X = da.from_array(data.X_train, (partition_size, data.X_train.shape[1]))

I think you can reproduce this bug in your demo (https://github.com/dmlc/xgboost/blob/master/demo/dask/gpu_training.py) by setting n = 1001.

I found this unintuitive; how can we make this better? We could coerce the data into the correct dimensions and log a warning instead, or just fail with a better message. The demos will also need updating.

Maybe something like:
"Warning: Data should be partitioned by row, re-partitioning. To avoid this specify the number of columns for your dask DataFrame/Array explicitly. e.g. chunks=(partition_size, X.shape[1])"

@codecov-io

Codecov Report

❗ No coverage uploaded for pull request base (master@05d4751).
The diff coverage is 0%.

@@            Coverage Diff            @@
##             master    #4941   +/-   ##
=========================================
  Coverage          ?   71.05%           
=========================================
  Files             ?       11           
  Lines             ?     2301           
  Branches          ?        0           
=========================================
  Hits              ?     1635           
  Misses            ?      666           
  Partials          ?        0
Impacted Files                   Coverage Δ
python-package/xgboost/dask.py   18.5% <0%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 05d4751...f4cc1df.

@trivialfis merged commit 2ebdec8 into dmlc:master on Oct 15, 2019
@trivialfis deleted the fix-dask-predict branch on October 15, 2019 03:19
@lock lock bot locked as resolved and limited conversation to collaborators Jan 13, 2020