
Support for building DMatrix from Apache Arrow data format #7283

Closed · wants to merge 15 commits

Conversation

@zhangzhang10 (Contributor) commented Oct 2, 2021

Apache Arrow defines a columnar memory format that allows zero-copy reads and lightning-fast data access. This PR adds support for building an xgboost DMatrix from the Apache Arrow data format.

This is a rework of #5667, which is now closed and all discussion should continue here.

Features:

  • A new adapter class in C++ that accepts input data in the format defined by the Arrow C data interface (see the struct sketch after this list).
  • A new template specialization for the SimpleDMatrix class constructor that creates a DMatrix instance from Arrow data.
  • No dependence on the Arrow API or the Arrow libraries. It only requires the input data to be in the Arrow format.
  • Works with any data source that can export data following the Arrow C data interface specification.
  • The xgboost Python interface supports building a DMatrix from a pyarrow data source, including pyarrow.Table and pyarrow.dataset. This gives better DMatrix construction performance than the existing pandas interface.
  • Supports a data iterator interface to allow input data to be provided in a batch-by-batch fashion. This helps to reduce peak memory consumption when reading large data. This also makes it easier to work with other components that expect input data in mini batches, for example, the JVM packages.
  • Tested with Arrow 3.0, 4.0, and 5.0.
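
For reference, the two entry-point structs defined by the Arrow C data interface specification (https://arrow.apache.org/docs/format/CDataInterface.html) are reproduced below; they are taken from the spec itself, with brief comments added:

#include <stdint.h>

// Type description of a column, as defined by the Arrow C data interface.
struct ArrowSchema {
  const char* format;       // type encoding, e.g. "f" for float32, "g" for float64
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  void (*release)(struct ArrowSchema*);  // producer's release callback
  void* private_data;
};

// The data buffers of a column or record batch, shared zero-copy.
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;     // validity bitmap, values, ...
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);   // producer's release callback
  void* private_data;
};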

Performance of building DMatrix (timing measured on an Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz with 56 cores):

| Dataset | File format | Data size | From pandas dataframe | From Arrow format (using pyarrow) |
| --- | --- | --- | --- | --- |
| HIGGS | csv | 2 million rows, 29 features | 8.43 sec | 0.96 sec |
| Mortgage | parquet | 36 million rows, 47 features | 14.31 sec | 2.61 sec |

Peak memory during DMatrix building (using a Mortgage dataset with 150 million rows as an example):

| Data source | Peak memory |
| --- | --- |
| pandas.dataframe | 170 GB |
| pyarrow.Table | 152 GB |
| pyarrow.dataset | 105 GB |

The final DMatrix memory size is 58 GB.

How to build:

  • Configure the build by passing -DUSE_ARROW=ON to the cmake command.

Usage (python):

A DMatrix can be built from either a pyarrow.Table or a pyarrow.dataset. For example,

import xgboost as xgb
from pyarrow import csv
import pyarrow.dataset as ds

# Reading CSV input as pyarrow.Table
table = csv.read_csv('/path/to/csv/input')
dmat = xgb.DMatrix(table)

# Reading parquet input as pyarrow.dataset
data = ds.dataset('/path/to/parquet/input', format='parquet')
dmat = xgb.DMatrix(data)

@zhangzhang10 (Contributor, Author)

@trivialfis, @hcho3, @FelixYBW, Can you please review this PR? Thanks.


#if defined(XGBOOST_BUILD_ARROW_SUPPORT)
template<>
SimpleDMatrix::SimpleDMatrix(RecordBatchesIterAdapter* adapter,
Member

@RAMitchell Could you please help review this PR? Personally, I don't want to have this new DMatrix constructor and would prefer to keep everything under the adapter if possible. Maybe you have a different perspective on this.

Contributor (Author)

@trivialfis I do use an adapter in the implementation. However, I also need a specialized DMatrix constructor because the other constructors do not support the threading strategy in my implementation. Specifically, the existing constructors distribute rows across threads, but my implementation distributes batches across threads, which is critical for good performance. Data in the Arrow format always arrives as a collection of chunks (batches).

Member

Thanks for the explanation.

However, I also need a specialized DMatrix constructor because the other constructors do not support the threading strategy in my implementation.

How critical is the performance of this DMatrix construction? Sorry for the inconvenience, but I have to better understand the necessity of the optimization. Aside from the constructor for SimpleDMatrix, this PR creates an entirely new route for DMatrix initialization, disregarding the existing CreateDMatrixFromXXX() + SetInfo. For instance, if we were to add a new meta info, wouldn't we have a breaking PR? Also, we are in the process of adding multi-output regression (#7309; I'm working on some abstractions like multi-array at the moment), so would I need to change anything in this PR later? Lastly, feature_weight is also part of the MetaInfo, which is used for column sampling.

Having said that, I'm excited to have arrow support in XGBoost; I just want to know whether the performance gain in this particular ctor is important enough to justify a completely new code path. For example, how bad is it if we just concatenate all the chunks in Python?

Contributor (Author)

The optimization we have in the new SimpleDMatrix ctor mainly consists of:

  • A multithreading strategy that is more suitable for arrow data (see the sketch after this list). Arrow data arrives in chunks, and the constructor distributes these chunks to threads, with every thread processing one chunk at a time. For example, if there are N cores on the CPU and N threads can be spawned, the constructor collects a batch of N arrow chunks and gives them to the N threads. After the first batch is done, the constructor collects the next batch of N chunks, and it keeps going until all chunks are consumed. Because arrow chunks are equal in size, this strategy achieves good load balance across all threads, leading to optimized performance. This strategy is different from the one used in the existing constructors, which process all rows of the data in parallel. That works well when data is available all at once in a single batch, but not in the arrow case where data is chunked.
  • Minimizing data copy. The implementation does not use a staging space to concatenate all chunks before filling data into DMatrix. Instead, the data is copied directly from arrow chunks into DMatrix pages.
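
Below is a minimal sketch of the chunk-per-thread dispatch described above; ArrowChunk, PageBuffer, and ConvertOneChunk are hypothetical stand-ins, not the PR's actual types:

#include <omp.h>
#include <algorithm>
#include <cstddef>
#include <vector>

struct ArrowChunk {};   // hypothetical: a view into one Arrow record batch
struct PageBuffer {};   // hypothetical: rows converted from one chunk

PageBuffer ConvertOneChunk(const ArrowChunk& chunk);  // assumed conversion routine

void ConvertChunks(const std::vector<ArrowChunk>& chunks,
                   std::vector<PageBuffer>* pages) {
  const std::size_t nthreads = static_cast<std::size_t>(omp_get_max_threads());
  pages->resize(chunks.size());
  // Consume chunks in waves of `nthreads`, one chunk per thread per wave;
  // equally sized chunks keep every wave well load-balanced.
  for (std::size_t begin = 0; begin < chunks.size(); begin += nthreads) {
    const std::ptrdiff_t end =
        static_cast<std::ptrdiff_t>(std::min(chunks.size(), begin + nthreads));
#pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = static_cast<std::ptrdiff_t>(begin); i < end; ++i) {
      // Each thread writes into its own page slot: data is copied straight
      // from the arrow chunk, with no shared staging buffer in between.
      (*pages)[i] = ConvertOneChunk(chunks[i]);
    }
  }
}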

This PR does not completely disregard the existing CreateDMatrixFromXXX + SetInfo route; that model is still supported. For example, the user can use the new constructor to ingest arrow data and then call SetInfo separately, as usual, to set metainfo such as labels. For situations where certain metainfo exists as columns of the arrow table, this PR allows a more convenient route: the user can simply specify the corresponding column names, and the metainfo columns and data columns will be processed at the same time. From what I see, it is not uncommon for labels to be part of the input data, but other types of metainfo may not be. For the purpose of simplifying future maintenance, I am OK with sticking to the original CreateDMatrixFromXXX + SetInfo approach. Or we could compromise: allow at least the label column to be processed together with the data columns, while all other metainfo must be set with SetInfo. What do you think?

Because the original route for metainfo is still supported, this PR does not break when new types of metainfo are added in the future. Similarly, this PR won't affect multi-output regression, because it is still possible to specify labels (once or multiple times) using SetInfo.

Contributor (Author)

Per discussion, I've revised this PR so that we stick with the existing CreateDMatrixFromXXX + SetInfo route for DMatrix initialization. When using the Python API, meta info such as labels and weights is set separately using SetInfo. This is consistent with the sklearn convention.

@zhangzhang10 (Contributor, Author)

@SmirnovEgorRu Can you also please review this PR? Thanks.

@zhangzhang10 (Contributor, Author)

@trivialfis I saw you marked this PR as "Need prioritize in Roadmap". Any estimate of when the prioritization will be done? By the way, some checks failed because of errors in building GPU code, even though this PR doesn't touch GPU code at all. What shall I do to make it pass these checks? Thanks.



@@ -640,6 +701,44 @@ def __init__(
if feature_types is not None:
self.feature_types = feature_types

def _init_from_arrow(
Member

Can this be part of data.py?

@zhangzhang10 (Contributor, Author) commented Oct 27, 2021

Please see my comment below.


if _is_iter(data):
self._init_from_iter(data, enable_categorical)
assert self.handle is not None
return

if _is_arrow(data):
Member

Can this be part of data.py?

Contributor (Author)

If this were made part of data.py, I would have to use dispatch_data_backend to handle arrow data. But dispatch_data_backend does not process metadata such as labels. For arrow data, however, it is beneficial to ingest both data and metadata (e.g. labels) at the same time, because:

  • It is common for some metadata, especially labels, to exist as columns in the arrow table. It is a more convenient and simpler usage model if the user only needs to specify column names for the metadata; the user does not need to create additional objects to separately ingest metadata columns.
  • Arrow data arrives in chunks. If I processed metadata columns and data columns separately, I would have to first collect metadata columns from every chunk and then use set_info to set them into the DMatrix. This is an extra bookkeeping cost. It also hurts performance, because I would no longer be processing all columns in parallel; instead, I would process data columns and then metadata columns in a sequential fashion.

For cases where metadata is indeed provided separately from the data, this PR doesn't change a thing about the existing usage model. It still allows the use of set_info to set metadata, which the user must provide as array-like objects.

@trivialfis (Member) commented Oct 28, 2021

Actually, when users use XGBoost, they often go with the scikit-learn interface for its simplicity and for the utilities provided by sklearn (like grid search, calibration, etc.). It would be great if you could keep the existing interface of data handling. I'm fine with converting the arrow table to a pandas dataframe for labels and weights, considering they are quite small compared to X.

Member

@zhangzhang10 Do you think we can work on a smaller subset of this PR first, like getting the predictor X?

Contributor (Author)

@zhangzhang10 Do you think we can work on a smaller subset of this PR first, like getting the predictor X?

By a smaller subset, do you mean that as the first step we only support the predictor X (i.e. the feature columns) in arrow format, and treat metainfo columns as array-like objects? If so, then yes, I agree. I'm working on this. Thanks.

Contributor (Author)

@trivialfis Done. This PR currently only builds the predictor DMatrix X from arrow data.

@@ -190,7 +190,11 @@ struct Entry {
/*! \brief feature value */
bst_float fvalue;
/*! \brief default constructor */
#if defined(XGBOOST_BUILD_ARROW_SUPPORT)
Member

I don't think this is necessary. A few seconds on a large data set is fine.

@trivialfis (Member)

I think for step 1 we need to decouple the predictor X from the rest of the meta info. Feel free to share your thoughts.

@trivialfis (Member)

I think for step 1 we need to decouple the predictor X from the rest of the meta info.

I think arrow will be increasingly important in the future given its performance advantage, and it is likely to be adopted by more distributed computing frameworks.

I'm fine with merging the new ctor of SimpleDMatrix that's optimized for batched data; it's unlikely that we will need to update that part of the code frequently, and the changes can be localized. But meta info needs to be separated.

some checks failed because of errors in building GPU code, even though this PR doesn't touch GPU code at all

If you meant this error:

/usr/local/cuda/bin/../targets/x86_64-linux/include/thrust/system/cuda/detail/scan_by_key.h(453): error: identifier "xgboost::Entry::Entry" is undefined in device code

Just remove the new ctor in Entry. Entry is instantiated in device code (by the thrust call above), so a constructor added to Entry for host code only is undefined inside CUDA kernels and breaks the GPU build.

src/data/simple_dmatrix.cc (resolved review thread)

XGB_DLL int XGDMatrixCreateFromArrowCallback(
XGDMatrixCallbackNext *next,
float missing,
Member

We are moving toward using JSON strings in the API to avoid breaking changes. I can help with that by creating a PR on your branch once we have the basic structure settled.

Contributor (Author)

Unifying the API toward JSON is a great idea. I'd love to have a discussion on this when ready.
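
For context, a hedged sketch of what such a JSON-configured entry point could look like; this signature is hypothetical and only illustrates the idea, not the API that was eventually settled on:

// Hypothetical sketch only: optional settings arrive in one JSON document,
// e.g. {"nthread": 16, "missing": 0.0}, so adding a new option later does
// not change the C function signature.
XGB_DLL int XGDMatrixCreateFromArrowCallback(XGDMatrixCallbackNext *next,
                                             char const *json_config,
                                             DMatrixHandle *out);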

const uint8_t* bitmap_;
};

// only columns of primitive types are supported
Member

Do we need inheritance if that's the only type we support?

Contributor (Author)

We probably don't need inheritance here. I will make a change.

Contributor (Author)

Do we need inheritance if that's the only type we support?

I thought about it again; actually, I do need this inheritance. class ArrowColumnarBatch contains a collection of pointers to PrimitiveColumns. PrimitiveColumn itself is a class template, which means these columns can be of different data types. I need these pointers to be polymorphic, and defining a base class for all PrimitiveColumns achieves this.
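
A minimal sketch of the pattern being described; the member names and the GetElement interface are illustrative, not the PR's exact code:

#include <cstddef>
#include <memory>
#include <vector>

// Non-template base class, so that one batch can hold columns of mixed
// primitive types behind a single pointer type.
class Column {
 public:
  virtual ~Column() = default;
  virtual float GetElement(std::size_t idx) const = 0;
  virtual std::size_t Size() const = 0;
};

// One concrete column per primitive type; reads directly from the Arrow
// buffer (a zero-copy view).
template <typename T>
class PrimitiveColumn : public Column {
 public:
  PrimitiveColumn(const T* data, std::size_t length)
      : data_{data}, length_{length} {}
  float GetElement(std::size_t idx) const override {
    return static_cast<float>(data_[idx]);
  }
  std::size_t Size() const override { return length_; }

 private:
  const T* data_;
  std::size_t length_;
};

// The batch stores type-erased columns polymorphically.
struct ArrowColumnarBatch {
  std::vector<std::unique_ptr<Column>> columns;
};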

Member

Thanks for the explanation. It would be great to have it in a code comment. ;-)

Contributor (Author)

I've added some comments for clarification.

Zhang Zhang added 12 commits October 28, 2021 18:21
* Integrate with Arrow C data API.
* Support Arrow dataset.
* Support Arrow table.
* Add arrow-cdi.h whose content is copied from
https://arrow.apache.org/docs/format/CDataInterface.html
* Release memory allocated by Arrow after exported C data is consumed.
- Set labels, weights, base_margins, and qids from Arrow table or
  dataset if available
- Fix lint issues
- Add pyarrow to ci_build Docker image for cpu_test
@zhangzhang10 (Contributor, Author)

I think for step 1 we need to decouple the predictor X from the rest of the meta info. Feel free to share your thoughts.

Yes. This step is done.

@trivialfis (Member)

Thanks! I will look into the details next week. Really appreciate your excellent work!

@trivialfis (Member)

I have restarted the CI.

@FelixYBW (Contributor) commented Dec 1, 2021

@trivialfis Is there any work we need to do?

@trivialfis (Member)

I haven't looked into your new code yet, but from the CI:

[2021-11-15T22:21:09.849Z] tests/python/test_updaters.py::TestTreeMethod::test_hist_degenerate_case PASSED

[2021-11-15T22:21:10.105Z] tests/python/test_with_arrow.py::TestArrowTable::test_arrow_dataset_from_csv FAILED

[2021-11-16T01:39:20.757Z] tests/python/test_with_arrow.py::TestArrowTable::test_arrow_survival Sending interrupt signal to process

[2021-11-16T01:39:31.604Z] script returned exit code 143

@xwu99 (Contributor) commented Dec 16, 2021

Hi @trivialfis, sorry for the late reply. @zhangzhang10 has handed the work over to me. I will create a separate PR to continue this work.

@pseudotensor (Contributor)

How's progress, @xwu99?

@hcho3 (Collaborator) commented Feb 2, 2022

@pseudotensor Work is being continued in #7512
