Fix dask prediction. #4941

Merged: 8 commits into dmlc:master from fix-dask-predict on Oct 15, 2019
Conversation

trivialfis
Member

No description provided.

@trivialfis requested a review from RAMitchell on October 13, 2019 10:18
@RAMitchell
Member

I am also coming up against this assertion on some training runs: `assert X_parts.shape[1] == 1`

@trivialfis
Member Author

Em... I can't reproduce what you said. Will keep looking.

@trivialfis
Member Author

Let me create some benchmark scripts for prediction.

@trivialfis
Member Author

trivialfis commented Oct 14, 2019

@RAMitchell

That's the reason I added so many assertions. Here are 3 assertions you might hit:

  • Inconsistent partitioning (the assertion added in this PR): this might be due to unbalanced data. You can call data.rechunk (or data.repartition if it's a DataFrame) on both X and y to make it work (see the sketch after this list), but that throws performance out of the window. Luckily this doesn't usually happen with real datasets, since you load the label and data together from one CSV. I added the suggestion to the error message.
  • cols in (0, c): the number of columns for each partition/block must be the same. Your input is not correct, possibly due to sparsity in your dataset. Added an error message for that.
  • X_parts.shape[1] == 1: no idea. Could you give a reproducible script so I can check?
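A minimal sketch of the rechunk/repartition workaround from the first bullet, assuming dask and NumPy are available; the shapes and chunk sizes below are made up for illustration and are not from this PR:

```python
import numpy as np
import dask.array as da

# Hypothetical X and y whose row chunks do not line up.
X = da.from_array(np.random.rand(1000, 32), chunks=(300, 32))
y = da.from_array(np.random.rand(1000), chunks=250)

# Align the label's row chunks with the data's row chunks so each worker
# sees matching (X, y) partitions. For dask DataFrames, the analogous
# call is `repartition`.
y = y.rechunk(X.chunks[0])

print(X.chunks[0])  # (300, 300, 300, 100)
print(y.chunks)     # ((300, 300, 300, 100),)
```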

@RAMitchell
Member

Here was the problem. I had the following:

partition_size = 1000
X = da.from_array(data.X_train, partition_size)

The reason this fails is that the dataset had 2000 features, more than the partition size. When you specify a single value for the partition size, it actually sets all dimensions to this value, not just the first.

This fixes it:

partition_size = 1000
X = da.from_array(data.X_train, (partition_size, data.X_train.shape[1]))

I think you can reproduce this bug in your demo (https://github.com/dmlc/xgboost/blob/master/demo/dask/gpu_training.py) by setting n = 1001.

I found this unintuitive; how can we make this better? We could coerce the data into the correct dimensions and log a warning instead, or just fail with a better message. The demos will also need updating.

Maybe something like:
"Warning: Data should be partitioned by row, re-partitioning. To avoid this specify the number of columns for your dask DataFrame/Array explicitly. e.g. chunks=(partition_size, X.shape[1])"

@codecov-io

Codecov Report

❗ No coverage uploaded for pull request base (master@05d4751).
The diff coverage is 0%.

@@            Coverage Diff            @@
##             master    #4941   +/-   ##
=========================================
  Coverage          ?   71.05%           
=========================================
  Files             ?       11           
  Lines             ?     2301           
  Branches          ?        0           
=========================================
  Hits              ?     1635           
  Misses            ?      666           
  Partials          ?        0
Impacted Files                   Coverage Δ
python-package/xgboost/dask.py   18.5% <0%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 05d4751...f4cc1df.

@trivialfis merged commit 2ebdec8 into dmlc:master on Oct 15, 2019
@trivialfis deleted the fix-dask-predict branch on October 15, 2019 03:19
@lock lock bot locked as resolved and limited conversation to collaborators Jan 13, 2020