[dask] use dictionaries instead of tuples for parts in Dask training #3795

jameslamb · 2021-01-20T06:51:57Z

Summary

The Dask interface allows you to train a LightGBM model on data stored in a Dask Array or Dask DataFrame. As part of this, it zips together corresponding chunks into something called "parts". So for example, if the training data are like this:

X: a Dask DataFrame with data for features
y: a Dask Array with labels

Then the interface might zip together the chunk of X holding the first 1000 rows with the chunk of y holding the labels for those same 1000 rows.

If this were just features + labels, using tuples would be ok. But a part can also include sample weights and, for learning-to-rank tasks, groups. In the future, it might include init_score as well.
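To make the problem concrete, here is a minimal sketch of positional parts. It is not the actual code in the Dask interface: plain Python lists stand in for Dask chunks, and `make_tuple_parts` and all the variable names are hypothetical.

```python
# Plain lists stand in for Dask chunks; every name here is hypothetical.
X_chunks = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]]
y_chunks = [[0, 1], [1]]
weight_chunks = [[0.5, 1.5], [1.0]]
group_chunks = [[2], [1]]


def make_tuple_parts(data, label, weight=None, group=None):
    """Zip corresponding chunks into tuples, skipping inputs that were not given."""
    columns = [data, label]
    if weight is not None:
        columns.append(weight)
    if group is not None:
        columns.append(group)
    return list(zip(*columns))


parts_with_weight = make_tuple_parts(X_chunks, y_chunks, weight=weight_chunks)
parts_with_group = make_tuple_parts(X_chunks, y_chunks, group=group_chunks)

# Both are 3-tuples, but index 2 means something different in each.
print(parts_with_weight[0][2])  # [0.5, 1.5]  (sample weights)
print(parts_with_group[0][2])   # [2]         (group sizes)
```

Any code that consumes these parts has to know which optional inputs were passed just to interpret a position, which is where mistakes can creep in.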

The Dask interface should stitch things together into dictionaries instead, keyed with understandable keys like "data", "labels", "group", etc.
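A rough sketch of what keyed parts could look like, again with plain Python lists standing in for Dask chunks. `make_dict_parts` is a hypothetical helper, and the "weight" key is an assumption added for illustration; only "data", "labels", and "group" are named above.

```python
# Plain lists stand in for Dask chunks; every name here is hypothetical.
X_chunks = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]]
y_chunks = [[0, 1], [1]]
group_chunks = [[2], [1]]


def make_dict_parts(data, label, weight=None, group=None):
    """Zip corresponding chunks into dicts keyed by what each chunk holds."""
    named = {"data": data, "labels": label}
    if weight is not None:
        named["weight"] = weight  # assumed key name, not taken from the issue
    if group is not None:
        named["group"] = group
    keys = list(named)
    return [dict(zip(keys, chunks)) for chunks in zip(*named.values())]


parts = make_dict_parts(X_chunks, y_chunks, group=group_chunks)

# Consumers look values up by name and can check for optional inputs directly.
first = parts[0]
print(first["group"])      # [2]
print("weight" in first)   # False
```

With this shape, adding another optional input later (init_score, for example) means adding a key rather than shifting positions.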

Motivation

This change would reduce the risk of mistakes in the Dask interface and would make the code easier to read and change.

References

@ffineis originally implemented this in #3708, but I asked him to pull it out into a separate PR. See #3708 (comment).

ffineis · 2021-01-22T14:44:19Z

Dibs please!

jameslamb · 2021-01-22T14:47:45Z

all yours!

StrikerRUS · 2021-01-25T21:06:37Z

Closed via #3853.

jameslamb added the "feature request", "dask", and "good first issue" labels Jan 20, 2021

jameslamb mentioned this issue Jan 20, 2021

Feature Requests & Voting Hub #2302

Open

ffineis mentioned this issue Jan 25, 2021

[dask] [python] Store co-local data parts as dicts instead of lists #3853

Merged

StrikerRUS closed this as completed Jan 25, 2021
