
Dask arrays without distributed? #6652

Closed · Hoeze opened this issue Jan 28, 2021 · 5 comments
Hoeze commented Jan 28, 2021

Hi, I'd like to run the XGBRegressor with dask arrays (from an xarray zarr dataset) on a Ray cluster.
However, when running:

```python
import xgboost

model = xgboost.XGBRegressor()
model.fit(x_train.data, y_train.data)
```

I just get this error message:
TypeError: Not supported type for data.<class 'dask.array.core.Array'>

Is there a simple way to achieve this?

  • I am using dask.config.set(scheduler=ray_dask_get) as the scheduler for dask (see the setup sketch below).
  • I want to avoid setting up a dask.distributed cluster.
  • I want to avoid loading the whole dataset into memory.
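
For context, a minimal sketch of my setup; the zarr path and variable names are placeholders:

```python
import dask
import xarray as xr
from ray.util.dask import ray_dask_get

# route all dask computations through the Ray scheduler
dask.config.set(scheduler=ray_dask_get)

# hypothetical zarr store and variable names
ds = xr.open_zarr("train.zarr")
x_train, y_train = ds["features"], ds["target"]
```
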
hcho3 (Collaborator) commented Jan 28, 2021

No, currently we do not support training with a Dask array unless you set up a distributed cluster. See the example at https://xgboost.readthedocs.io/en/latest/tutorials/dask.html
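
A minimal sketch of that route, using a local cluster as a stand-in for a real one:

```python
from dask.distributed import Client, LocalCluster
import xgboost as xgb

cluster = LocalCluster(n_workers=4)
client = Client(cluster)

# the dask estimators mirror the sklearn API but accept dask collections
model = xgb.dask.DaskXGBRegressor()
model.client = client
model.fit(x_train.data, y_train.data)
```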

trivialfis (Member) commented:
You will have to load the dataset into memory if using xgboost.
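
Concretely, that means materialising the dask arrays before fitting, e.g.:

```python
# compute() pulls the full arrays into memory as numpy arrays
X = x_train.data.compute()
y = y_train.data.compute()

model = xgboost.XGBRegressor()
model.fit(X, y)
```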

Hoeze (Author) commented Jan 28, 2021

Thanks for your answers and for turning this into a feature request :)

trivialfis (Member) commented:

We can probably integrate it with the internal IterativeDMatrix.

trivialfis (Member) commented:

> I want to avoid loading the whole dataset into memory

I looked into this issue again. When using the sklearn interface, loading the whole dataset is required.

The only way to avoid loading the whole dataset is to use external memory in XGBoost. For a quick start, see https://github.com/dmlc/xgboost/blob/master/demo/guide-python/external_memory.py. I wrote a simple wrapper for dask arrays during the development of this new interface. The idea is to get the chunks of data by calling to_delayed, then feed one chunk into XGBoost at each iteration of the data loading process (a sketch of such a wrapper is below).
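
A minimal sketch of that idea, assuming x_train.data and y_train.data share the same chunking along the first axis only; DaskChunkIter is a hypothetical name, not part of XGBoost:

```python
import xgboost

# Hypothetical wrapper: feed a dask array into XGBoost one chunk at a time
# through the DataIter external-memory interface.
class DaskChunkIter(xgboost.DataIter):
    def __init__(self, X, y):
        # to_delayed() yields one Delayed object per chunk; this assumes X
        # and y are chunked along the first axis only
        self._X_chunks = X.to_delayed().flatten().tolist()
        self._y_chunks = y.to_delayed().flatten().tolist()
        self._it = 0
        super().__init__(cache_prefix="cache")

    def next(self, input_data):
        if self._it == len(self._X_chunks):
            return 0  # all chunks consumed; end this pass over the data
        # compute() materialises only the current chunk in memory
        input_data(data=self._X_chunks[self._it].compute(),
                   label=self._y_chunks[self._it].compute())
        self._it += 1
        return 1

    def reset(self):
        self._it = 0  # rewind so XGBoost can iterate again

it = DaskChunkIter(x_train.data, y_train.data)
Xy = xgboost.DMatrix(it)  # builds the on-disk cache chunk by chunk
booster = xgboost.train({"tree_method": "hist"}, Xy)
```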

But external memory support remains experimental until #7214 is merged. Closing this issue, as the interface is now available.
