AttributeError with fitting model on Dask Array backed by scipy.sparse.csr_matrix #7454

Closed
jrbourbeau opened this issue Nov 19, 2021 · 4 comments · Fixed by #7457

@jrbourbeau
Contributor

I came across a use case where attempting to fit a DaskXGBClassifier on a Dask Array whose partitions are scipy.sparse.csr_matrix objects (as returned by Dask-ML's HashingVectorizer) results in an AttributeError: divisions not found error (full traceback included below).

From some initial debugging, it appears the underlying issue is that during the fitting process we end up passing a list of sparse matrices to Dask's dd.multi.concat in the concat function of xgboost/dask.py:

return dd.multi.concat(list(value), axis=0)

However, dd.multi.concat expects a list of Dask DataFrames, which is where the AttributeError: divisions not found is coming from (Dask DataFrames have a .divisions attribute which dd.multi.concat assumes exists).
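To illustrate the mismatch, here is a minimal sketch of a concatenation helper that dispatches on the partition type instead of always calling dd.multi.concat. The name `concat_parts` is hypothetical and this is not the code in xgboost/dask.py nor necessarily how the eventual fix was implemented; it only assumes that sparse partitions can be stacked row-wise with scipy.sparse.vstack.

```python
import numpy as np
import scipy.sparse


def concat_parts(parts):
    """Concatenate a list of in-memory partitions along axis 0.

    Hypothetical helper for illustration only; not the function
    defined in xgboost/dask.py.
    """
    parts = list(parts)
    first = parts[0]
    if scipy.sparse.issparse(first):
        # Sparse partitions have no `.divisions`; stack them row-wise.
        return scipy.sparse.vstack(parts, format="csr")
    if isinstance(first, np.ndarray):
        return np.concatenate(parts, axis=0)
    raise TypeError(f"Unsupported partition type: {type(first)!r}")


# Two small CSR partitions, similar in kind to HashingVectorizer output.
a = scipy.sparse.csr_matrix(np.array([[1.0, 0.0], [0.0, 2.0]]))
b = scipy.sparse.csr_matrix(np.array([[0.0, 3.0]]))

# Sparse matrices carry no `divisions` attribute, which is why a
# DataFrame-only concat fails on them.
print(hasattr(a, "divisions"))

stacked = concat_parts([a, b])
print(stacked.shape)
```

The point of the sketch is only that a list of csr_matrix partitions needs a sparse-aware code path; a DataFrame-specific concat cannot handle them.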

Here's an example code snippet which should reproduce the issue when using the latest xgboost (1.5.0) and dask (2021.11.2) / distributed (2021.11.2) releases:

import dask.dataframe as dd
import dask_ml.feature_extraction.text
import pandas as pd
import sklearn.datasets
from dask.distributed import Client
from xgboost.dask import DaskXGBClassifier

if __name__ == "__main__":

    with Client():
        # Create Dask DataFrame from sklearn 20newsgroups dataset
        bunch = sklearn.datasets.fetch_20newsgroups()
        df = dd.from_pandas(
            pd.DataFrame({"text": bunch.data, "target": bunch.target}), npartitions=25
        )

        # Create features with dask-ml's `HashingVectorizer`
        vect = dask_ml.feature_extraction.text.HashingVectorizer()
        X = vect.fit_transform(df["text"])

        # Format classification labels
        y = df["target"].to_dask_array()

        # Train XGBoost classifier
        clf = DaskXGBClassifier()
        print(f"{X = }")
        print(f"{y = }")
        clf.fit(X, y)  # Results in `AttributeError: divisions not found`
Full traceback:
Traceback (most recent call last):
  File "/Users/james/projects/coiled/evangelism-private/mongodb-with-coiled/test.py", line 28, in <module>
    clf.fit(X, y)
  File "/Users/james/mambaforge/envs/dask-nlp/lib/python3.9/site-packages/xgboost/dask.py", line 1817, in fit
    return self._client_sync(self._fit_async, **args)
  File "/Users/james/mambaforge/envs/dask-nlp/lib/python3.9/site-packages/xgboost/dask.py", line 1623, in _client_sync
    return self.client.sync(func, **kwargs, asynchronous=asynchronous)
  File "/Users/james/mambaforge/envs/dask-nlp/lib/python3.9/site-packages/distributed/client.py", line 865, in sync
    return sync(
  File "/Users/james/mambaforge/envs/dask-nlp/lib/python3.9/site-packages/distributed/utils.py", line 327, in sync
    raise exc.with_traceback(tb)
  File "/Users/james/mambaforge/envs/dask-nlp/lib/python3.9/site-packages/distributed/utils.py", line 310, in f
    result[0] = yield future
  File "/Users/james/mambaforge/envs/dask-nlp/lib/python3.9/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/Users/james/mambaforge/envs/dask-nlp/lib/python3.9/site-packages/xgboost/dask.py", line 1775, in _fit_async
    results = await self.client.sync(
  File "/Users/james/mambaforge/envs/dask-nlp/lib/python3.9/site-packages/xgboost/dask.py", line 931, in _train_async
    results = await client.gather(futures, asynchronous=True)
  File "/Users/james/mambaforge/envs/dask-nlp/lib/python3.9/site-packages/distributed/client.py", line 1842, in _gather
    raise exception.with_traceback(traceback)
  File "/Users/james/mambaforge/envs/dask-nlp/lib/python3.9/site-packages/xgboost/dask.py", line 867, in dispatched_train
    local_dtrain = _dmatrix_from_list_of_parts(**dtrain_ref)
  File "/Users/james/mambaforge/envs/dask-nlp/lib/python3.9/site-packages/xgboost/dask.py", line 800, in _dmatrix_from_list_of_parts
    return _create_dmatrix(**kwargs)
  File "/Users/james/mambaforge/envs/dask-nlp/lib/python3.9/site-packages/xgboost/dask.py", line 774, in _create_dmatrix
    _data = concat(data)
  File "/Users/james/mambaforge/envs/dask-nlp/lib/python3.9/site-packages/xgboost/dask.py", line 206, in concat
    return dd.multi.concat(list(value), axis=0)
  File "/Users/james/projects/dask/dask/dask/dataframe/multi.py", line 1237, in concat
    if all(
  File "/Users/james/projects/dask/dask/dask/dataframe/multi.py", line 1238, in <genexpr>
    dfs[i].divisions[-1] < dfs[i + 1].divisions[0]
  File "/Users/james/mambaforge/envs/dask-nlp/lib/python3.9/site-packages/scipy/sparse/base.py", line 687, in __getattr__
    raise AttributeError(attr + " not found")
AttributeError: divisions not found
@trivialfis
Member

Thank you for opening the issue. I will work on some tests for sparse and scipy.sparse inputs with Dask.

@avriiil

avriiil commented Mar 30, 2022

I'm encountering the same issue as @jrbourbeau with the following package versions:
xgboost: 1.5.1
dask: 2022.02.0
distributed: 2022.02.0

The example code snippet above returns the same error: "AttributeError: divisions not found"

@trivialfis -- were your changes merged into 1.5.1?

@avriiil

avriiil commented Jun 14, 2022

@trivialfis - any update on this? I am still encountering this issue while running xgboost 1.5.1

@trivialfis
Member

@rrpelgrim Please update to the latest XGBoost 1.6.1


3 participants