`AttributeError` with fitting model on Dask Array backed by `scipy.sparse.csr_matrix` #7454

jrbourbeau · 2021-11-19T22:17:49Z

I came across a use case where attempting to fit a DaskXGBClassifier on a Dask Array whose partitions are scipy.sparse.csr_matrix objects (as returned by Dask-ML's HashingVectorizer) results in an AttributeError: divisions not found error (full traceback included below).

From some initial debugging, it appears the underlying issue is that during fitting we end up passing a list of sparse matrices to Dask's dd.multi.concat here:

xgboost/python-package/xgboost/dask.py

Line 207 in d33854a

return dd.multi.concat(list(value), axis=0)

However, dd.multi.concat expects a list of Dask DataFrames, which is where the AttributeError: divisions not found is coming from (Dask DataFrames have a .divisions attribute which dd.multi.concat assumes exists).
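
To illustrate the kind of type dispatch that avoids this failure, here is a minimal sketch that stacks sparse partitions with scipy.sparse.vstack rather than handing them to a DataFrame-oriented concat. The helper name `concat_parts` is hypothetical and exists only for this example; it is not the code in xgboost/dask.py and not necessarily how the eventual fix is implemented.

```python
import numpy as np
import scipy.sparse


def concat_parts(parts):
    """Concatenate partitions row-wise, dispatching on the partition type.

    Hypothetical helper for illustration only; not taken from xgboost.
    """
    parts = list(parts)
    first = parts[0]
    if scipy.sparse.issparse(first):
        # Sparse matrices have no `.divisions`, so stack them with SciPy.
        return scipy.sparse.vstack(parts, format="csr")
    if isinstance(first, np.ndarray):
        return np.concatenate(parts, axis=0)
    raise TypeError(f"Unsupported partition type: {type(first)!r}")


# Two small CSR partitions with the same number of columns.
a = scipy.sparse.csr_matrix(np.eye(2))
b = scipy.sparse.csr_matrix(np.ones((3, 2)))
stacked = concat_parts([a, b])
```

The sketch only covers SciPy sparse matrices and NumPy arrays; any other partition type raises a TypeError instead of falling through to a concat routine that assumes DataFrame attributes.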

Here's an example code snippet which should reproduce the issue when using the latest xgboost (1.5.0) and dask (2021.11.2) / distributed (2021.11.2) releases:

import dask.dataframe as dd
import dask_ml.feature_extraction.text
import pandas as pd
import sklearn.datasets
from dask.distributed import Client
from xgboost.dask import DaskXGBClassifier

if __name__ == "__main__":

    with Client():
        # Create Dask DataFrame from sklearn 20newsgroups dataset
        bunch = sklearn.datasets.fetch_20newsgroups()
        df = dd.from_pandas(
            pd.DataFrame({"text": bunch.data, "target": bunch.target}), npartitions=25
        )

        # Create features with dask-ml's `HashingVectorizer`
        vect = dask_ml.feature_extraction.text.HashingVectorizer()
        X = vect.fit_transform(df["text"])

        # Format classification labels
        y = df["target"].to_dask_array()

        # Train XGBoost classifier
        clf = DaskXGBClassifier()
        print(f"{X = }")
        print(f"{y = }")
        clf.fit(X, y)  # Results in `AttributeError: divisions not found`

Full traceback:

Traceback (most recent call last):
  File "/Users/james/projects/coiled/evangelism-private/mongodb-with-coiled/test.py", line 28, in <module>
    clf.fit(X, y)
  File "/Users/james/mambaforge/envs/dask-nlp/lib/python3.9/site-packages/xgboost/dask.py", line 1817, in fit
    return self._client_sync(self._fit_async, **args)
  File "/Users/james/mambaforge/envs/dask-nlp/lib/python3.9/site-packages/xgboost/dask.py", line 1623, in _client_sync
    return self.client.sync(func, **kwargs, asynchronous=asynchronous)
  File "/Users/james/mambaforge/envs/dask-nlp/lib/python3.9/site-packages/distributed/client.py", line 865, in sync
    return sync(
  File "/Users/james/mambaforge/envs/dask-nlp/lib/python3.9/site-packages/distributed/utils.py", line 327, in sync
    raise exc.with_traceback(tb)
  File "/Users/james/mambaforge/envs/dask-nlp/lib/python3.9/site-packages/distributed/utils.py", line 310, in f
    result[0] = yield future
  File "/Users/james/mambaforge/envs/dask-nlp/lib/python3.9/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/Users/james/mambaforge/envs/dask-nlp/lib/python3.9/site-packages/xgboost/dask.py", line 1775, in _fit_async
    results = await self.client.sync(
  File "/Users/james/mambaforge/envs/dask-nlp/lib/python3.9/site-packages/xgboost/dask.py", line 931, in _train_async
    results = await client.gather(futures, asynchronous=True)
  File "/Users/james/mambaforge/envs/dask-nlp/lib/python3.9/site-packages/distributed/client.py", line 1842, in _gather
    raise exception.with_traceback(traceback)
  File "/Users/james/mambaforge/envs/dask-nlp/lib/python3.9/site-packages/xgboost/dask.py", line 867, in dispatched_train
    local_dtrain = _dmatrix_from_list_of_parts(**dtrain_ref)
  File "/Users/james/mambaforge/envs/dask-nlp/lib/python3.9/site-packages/xgboost/dask.py", line 800, in _dmatrix_from_list_of_parts
    return _create_dmatrix(**kwargs)
  File "/Users/james/mambaforge/envs/dask-nlp/lib/python3.9/site-packages/xgboost/dask.py", line 774, in _create_dmatrix
    _data = concat(data)
  File "/Users/james/mambaforge/envs/dask-nlp/lib/python3.9/site-packages/xgboost/dask.py", line 206, in concat
    return dd.multi.concat(list(value), axis=0)
  File "/Users/james/projects/dask/dask/dask/dataframe/multi.py", line 1237, in concat
    if all(
  File "/Users/james/projects/dask/dask/dask/dataframe/multi.py", line 1238, in <genexpr>
    dfs[i].divisions[-1] < dfs[i + 1].divisions[0]
  File "/Users/james/mambaforge/envs/dask-nlp/lib/python3.9/site-packages/scipy/sparse/base.py", line 687, in __getattr__
    raise AttributeError(attr + " not found")
AttributeError: divisions not found


trivialfis · 2021-11-20T09:19:17Z

Thank you for opening the issue. I will work on some tests for sparse and scipy.sparse with dask.

avriiil · 2022-03-30T11:45:52Z

I'm encountering the same issue as @jrbourbeau with the following package versions:
xgboost: 1.5.1
dask: 2022.02.0
distributed: 2022.02.0

The example code snippet above returns the same error: "AttributeError: divisions not found"

@trivialfis -- were your changes merged into 1.5.1?

avriiil · 2022-06-14T12:43:23Z

@trivialfis - any update on this? I am still encountering this issue while running xgboost 1.5.1

trivialfis · 2022-06-16T16:50:48Z

@rrpelgrim Please update to the latest XGBoost 1.6.1

trivialfis mentioned this issue Nov 21, 2021: Support scipy sparse in dask. #7457 (Merged)

trivialfis closed this as completed in #7457 Nov 23, 2021

avriiil mentioned this issue Jun 14, 2022: AttributeError when fitting DaskXGBClassifier on scipy.sparse matrix #7990 (Closed)
