
[dask] use dictionaries instead of tuples for parts in Dask training #3795

Closed
jameslamb opened this issue Jan 20, 2021 · 3 comments

Comments

@jameslamb
Collaborator

Summary

The Dask interface allows you to train a LightGBM model on data stored in a Dask Array or Dask DataFrame. As part of this, it zips together corresponding chunks into something called "parts". So for example, if the training data are like this:

  • X: a Dask DataFrame with data for features
  • y: a Dask Array with labels

Then that interface might zip together the chunk of X with the first 1000 rows and the chunk of y with labels for those 1000 rows.
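To make the zipping concrete, here is a minimal sketch that uses plain Python lists as stand-ins for Dask chunks. It does not call Dask or LightGBM, and `split_into_chunks` is a hypothetical helper invented for illustration, not something in the codebase.

```python
# Plain lists stand in for Dask chunks; nothing here touches Dask or LightGBM.
def split_into_chunks(rows, chunk_size):
    """Hypothetical helper: cut a list into consecutive chunks of at most chunk_size rows."""
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]


# 2500 rows of made-up features and labels.
X = [[float(i), float(i) * 2.0] for i in range(2500)]
y = [i % 2 for i in range(2500)]

X_chunks = split_into_chunks(X, 1000)
y_chunks = split_into_chunks(y, 1000)

# Each "part" pairs a chunk of X with the chunk of y covering the same rows.
parts = list(zip(X_chunks, y_chunks))

first_X, first_y = parts[0]
print(len(parts), len(first_X), len(first_y))  # 3 1000 1000
```

With only two elements per part, unpacking by position like this is still easy to follow.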

If this were just features + labels, using tuples would be ok. But parts can also include sample weights and, for learning-to-rank tasks, groups. In the future, they might include init_score as well.

The Dask interface should stitch things together into dictionaries instead, keyed with understandable keys like "data", "labels", "group", etc.
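A rough sketch of what dictionary-based parts could look like, again with plain lists standing in for Dask chunks. `make_parts` and the `"weight"` key are assumptions made up for this example; this is not the code from #3708 or the eventual implementation.

```python
def make_parts(data_chunks, label_chunks, weight_chunks=None, group_chunks=None):
    """Hypothetical helper: build one dict per chunk, holding only the fields supplied."""
    fields = {"data": data_chunks, "labels": label_chunks}
    if weight_chunks is not None:
        fields["weight"] = weight_chunks  # assumed key name
    if group_chunks is not None:
        fields["group"] = group_chunks

    # Every field must be split into the same number of chunks.
    n_chunks = len(data_chunks)
    for name, chunks in fields.items():
        if len(chunks) != n_chunks:
            raise ValueError(f"'{name}' has {len(chunks)} chunks, expected {n_chunks}")

    return [
        {name: chunks[i] for name, chunks in fields.items()}
        for i in range(n_chunks)
    ]


# Made-up ranking-style data: two chunks, with groups but no sample weights.
data_chunks = [[[1.0], [2.0]], [[3.0]]]
label_chunks = [[0, 1], [1]]
group_chunks = [[2], [1]]

ranking_parts = make_parts(data_chunks, label_chunks, group_chunks=group_chunks)

# Lookups go by name, so a missing optional field cannot shift the others.
print(sorted(ranking_parts[0]))       # ['data', 'group', 'labels']
print("weight" in ranking_parts[0])   # False
print(ranking_parts[0]["group"])      # [2]
```

With tuples, the third position here would hold groups in this call but weights in a call that passes weights, so the consuming code has to know which optional inputs were given. Keyed access avoids that bookkeeping.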

Motivation

This change would reduce the risk of mistakes in the Dask interface and would make the code easier to read and change.

References

@ffineis originally implemented this in #3708, but I asked him to pull it out into a separate PR. See #3708 (comment).

@ffineis
Contributor

ffineis commented Jan 22, 2021

Dibs please!

@jameslamb
Collaborator Author

all yours!

@StrikerRUS
Collaborator

Closed via #3853.


3 participants