
Support for building DMatrix from Apache Arrow data format #7283

Closed · wants to merge 15 commits

Conversation

@zhangzhang10 (Contributor) commented Oct 2, 2021

Apache Arrow defines a columnar memory format that allows zero-copy reads and lightning-fast data access. This PR adds support for building an xgboost DMatrix from the Apache Arrow data format.

This is a rework of #5667, which is now closed and all discussion should continue here.

Features:

  • A new adapter class in C++ that accepts input data in the format defined by the Arrow C data interface (see the struct sketch after this list).
  • A new template specialization for the SimpleDMatrix class constructor that creates a DMatrix instance from Arrow data.
  • No dependence on the Arrow API or the Arrow libraries. It only requires the input data to be in the Arrow format.
  • Works with any data source that can export data following the Arrow C data interface specification.
  • The xgboost Python interface supports building a DMatrix from a pyarrow data source, including pyarrow.Table and pyarrow.dataset. This gives better DMatrix construction performance than the existing pandas interface.
  • Supports a data iterator interface to allow input data to be provided in a batch-by-batch fashion. This helps to reduce peak memory consumption when reading large data. This also makes it easier to work with other components that expect input data in mini batches, for example, the JVM packages.
  • Tested with Arrow 3.0, 4.0, and 5.0.
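
For reference, the two entry-point structs defined by the Arrow C data interface specification (https://arrow.apache.org/docs/format/CDataInterface.html) are reproduced below; they are taken from the spec itself, with brief comments added:

#include <stdint.h>

// Type description of a column, as defined by the Arrow C data interface.
struct ArrowSchema {
  const char* format;       // type encoding, e.g. "f" for float32, "g" for float64
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  void (*release)(struct ArrowSchema*);  // producer's release callback
  void* private_data;
};

// The data buffers of a column or record batch, shared zero-copy.
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;     // validity bitmap, values, ...
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);   // producer's release callback
  void* private_data;
};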

Performance of building DMatrix (timing measured on an Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz with 56 cores):

| Dataset | File format | Data size | From pandas dataframe | From Arrow format (using pyarrow) |
| --- | --- | --- | --- | --- |
| HIGGS | csv | 2 million rows, 29 features | 8.43 sec | 0.96 sec |
| Mortgage | parquet | 36 million rows, 47 features | 14.31 sec | 2.61 sec |

Peak memory during DMatrix building (using a Mortgage dataset with 150 million rows as an example):

| Data source | Peak memory |
| --- | --- |
| pandas.dataframe | 170 GB |
| pyarrow.Table | 152 GB |
| pyarrow.dataset | 105 GB |

The final DMatrix memory size is 58 GB.

How to build:

  • Configure the build by passing -DUSE_ARROW=ON to the cmake command.

Usage (python):

A DMatrix can be built from either a pyarrow.Table or a pyarrow.dataset. For example,

import xgboost as xgb
from pyarrow import csv
import pyarrow.dataset as ds

# Reading CSV input as pyarrow.Table
table = csv.read_csv('/path/to/csv/input')
dmat = xgb.DMatrix(table)

# Reading parquet input as pyarrow.dataset
data = ds.dataset('/path/to/parquet/input', format='parquet')
dmat = xgb.DMatrix(data)

@zhangzhang10 (Contributor, Author)

@trivialfis, @hcho3, @FelixYBW, Can you please review this PR? Thanks.


#if defined(XGBOOST_BUILD_ARROW_SUPPORT)
template<>
SimpleDMatrix::SimpleDMatrix(RecordBatchesIterAdapter* adapter,
Member

@RAMitchell Could you please help review this PR? Personally, I don't want to have this new DMatrix constructor and would prefer to keep everything under the adapter if possible. Maybe you have a different perspective on this.

Contributor (Author)

@trivialfis I do use an adapter in the implementation. However, I also need a specialized DMatrix constructor because the other constructors do not support the threading strategy in my implementation. Specifically, the existing constructors distribute rows across threads, but my implementation distributes batches across threads, which is critical for good performance. Data in the Arrow format always arrives as a collection of chunks (batches).

Member

Thanks for the explanation.

However, I also need a specialized DMatrix constructor because the other constructors do not support the threading strategy in my implementation.

How critical is the performance of this DMatrix construction? Sorry for the inconvenience, but I have to better understand the necessity of the optimization. Aside from the constructor for SimpleDMatrix, this PR creates an entirely new route for DMatrix initialization, disregarding the existing CreateDMatrixFromXXX() + SetInfo. For instance, if we were to add a new meta info, wouldn't we have a breaking PR? Also, we are in the process of adding multi-output regression (#7309; I'm working on some abstractions like multi-array at the moment), so would I need to change anything in this PR later? Lastly, feature_weight is also part of the MetaInfo, which is used for column sampling.

Having said that, I'm excited to have arrow support in XGBoost; I just want to know whether the performance gain in this particular ctor is important enough to justify a completely new code path. For example, how bad is it if we just concatenate all the chunks in Python?

Contributor (Author)

The optimization we have in the new SimpleDMatrix ctor mainly consists of:

  • A multithreading strategy that is more suitable for arrow data (see the sketch after this list). Arrow data arrives in chunks, and the constructor distributes these chunks to threads, with every thread processing one chunk at a time. For example, if there are N cores on the CPU and N threads can be spawned, the constructor collects a batch of N arrow chunks and gives them to the N threads. After the first batch is done, the constructor collects the next batch of N chunks, and it keeps going until all chunks are consumed. Because arrow chunks are equal in size, this strategy achieves good load balance across all threads, leading to optimized performance. This strategy is different from the one used in the existing constructors, which process all rows of the data in parallel. That works well when data is available all at once in a single batch, but not in the arrow case where data is chunked.
  • Minimizing data copy. The implementation does not use a staging space to concatenate all chunks before filling data into DMatrix. Instead, the data is copied directly from arrow chunks into DMatrix pages.
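
Below is a minimal sketch of the chunk-per-thread dispatch described above; ArrowChunk, PageBuffer, and ConvertOneChunk are hypothetical stand-ins, not the PR's actual types:

#include <omp.h>
#include <algorithm>
#include <cstddef>
#include <vector>

struct ArrowChunk {};   // hypothetical: a view into one Arrow record batch
struct PageBuffer {};   // hypothetical: rows converted from one chunk

PageBuffer ConvertOneChunk(const ArrowChunk& chunk);  // assumed conversion routine

void ConvertChunks(const std::vector<ArrowChunk>& chunks,
                   std::vector<PageBuffer>* pages) {
  const std::size_t nthreads = static_cast<std::size_t>(omp_get_max_threads());
  pages->resize(chunks.size());
  // Consume chunks in waves of `nthreads`, one chunk per thread per wave;
  // equally sized chunks keep every wave well load-balanced.
  for (std::size_t begin = 0; begin < chunks.size(); begin += nthreads) {
    const std::ptrdiff_t end =
        static_cast<std::ptrdiff_t>(std::min(chunks.size(), begin + nthreads));
#pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = static_cast<std::ptrdiff_t>(begin); i < end; ++i) {
      // Each thread writes into its own page slot: data is copied straight
      // from the arrow chunk, with no shared staging buffer in between.
      (*pages)[i] = ConvertOneChunk(chunks[i]);
    }
  }
}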

This PR does not completely disregard the existing CreateDMatrixFromXXX + SetInfo route; that model is still supported. For example, the user can use the new constructor to ingest arrow data and then call SetInfo separately, as usual, to set metainfo such as labels. For situations where certain metainfo exists as columns of the arrow table, this PR allows a more convenient route: the user can simply specify the corresponding column names, and the metainfo columns and data columns will be processed at the same time. From what I see, it is not uncommon for labels to be part of the input data, but other types of metainfo may not be. For the purpose of simplifying future maintenance, I am OK with sticking to the original CreateDMatrixFromXXX + SetInfo approach. Or we could compromise: allow at least the label column to be processed together with the data columns, while all other metainfo must be set with SetInfo. What do you think?

Because the original route for metainfo is still supported, this PR does not break when new types of metainfo are added in the future. Similarly, this PR won't affect multi-output regression, because it is still possible to specify labels (once or multiple times) using SetInfo.

Contributor (Author)

Per discussion, I've revised this PR so that we stick with the existing CreateDMatrixFromXXX + SetInfo route for DMatrix initialization. When using the Python API, meta info such as labels and weights is set separately using SetInfo. This is consistent with the sklearn convention.

@zhangzhang10 (Contributor, Author)

@SmirnovEgorRu Can you also please review this PR? Thanks.

@zhangzhang10 (Contributor, Author)

@trivialfis I saw you marked this PR as "Need prioritize in Roadmap". Any estimate of when the prioritization will be done? By the way, some checks failed because of errors in building GPU code, even though this PR doesn't touch GPU code at all. What shall I do to make it pass these checks? Thanks.



@@ -640,6 +701,44 @@ def __init__(
if feature_types is not None:
self.feature_types = feature_types

def _init_from_arrow(
Member

Can this be part of data.py?

@zhangzhang10 (Contributor, Author) commented Oct 27, 2021

Please see my comment below.


if _is_iter(data):
self._init_from_iter(data, enable_categorical)
assert self.handle is not None
return

if _is_arrow(data):
Member

Can this be part of data.py?

Contributor (Author)

If this were made part of data.py, I would have to use dispatch_data_backend to handle arrow data. But dispatch_data_backend does not process metadata such as labels. For arrow data, however, it is beneficial to ingest both data and metadata (e.g. labels) at the same time, because:

  • It is common for some metadata, especially labels, to exist as columns in the arrow table. It is a more convenient and simpler usage model if the user only needs to specify column names for the metadata; the user does not need to create additional objects to separately ingest metadata columns.
  • Arrow data arrives in chunks. If I processed metadata columns and data columns separately, I would have to first collect metadata columns from every chunk and then use set_info to set them into the DMatrix. This is an extra bookkeeping cost. It also hurts performance, because I would no longer be processing all columns in parallel; instead, I would process data columns and then metadata columns in a sequential fashion.

For cases where metadata is indeed provided separately from the data, this PR doesn't change a thing about the existing usage model. It still allows the use of set_info to set metadata, which the user must provide as array-like objects.

@trivialfis (Member) commented Oct 28, 2021

Actually, when users use XGBoost, they often go with the scikit-learn interface for its simplicity and for the utilities provided by sklearn (like grid search, calibration, etc.). It would be great if you could keep the existing interface of data handling. I'm fine with converting the arrow table to a pandas dataframe for labels and weights, considering they are quite small compared to X.

Member

@zhangzhang10 Do you think we can work on a smaller subset of this PR first, like getting the predictor X?

Contributor (Author)

@zhangzhang10 Do you think we can work on a smaller subset of this PR first, like getting the predictor X?

By a smaller subset, do you mean that as the first step we only support the predictor X (i.e. the feature columns) in arrow format, and treat metainfo columns as array-like objects? If so, then yes, I agree. I'm working on this. Thanks.

Contributor (Author)

@trivialfis Done. This PR currently only builds the predictor DMatrix X from arrow data.

@@ -190,7 +190,11 @@ struct Entry {
/*! \brief feature value */
bst_float fvalue;
/*! \brief default constructor */
#if defined(XGBOOST_BUILD_ARROW_SUPPORT)
Member

I don't think this is necessary. A few seconds on a large data set is fine.

@trivialfis (Member)

I think for step 1 we need to decouple the predictor X from the rest of the meta info. Feel free to share your thoughts.

@trivialfis (Member)

I think for step 1 we need to decouple the predictor X from the rest of the meta info.

I think arrow will be increasingly important in the future given its performance advantage, and it is likely to be adopted by more distributed computing frameworks.

I'm fine with merging the new ctor of SimpleDMatrix that's optimized for batched data; it's unlikely that we will need to update that part of the code frequently, and the changes can be localized. But meta info needs to be separated.

some checks failed because of errors in building GPU code, even though this PR doesn't touch GPU code at all

If you meant this error:

/usr/local/cuda/bin/../targets/x86_64-linux/include/thrust/system/cuda/detail/scan_by_key.h(453): error: identifier "xgboost::Entry::Entry" is undefined in device code

Just remove the new ctor in Entry. Entry is instantiated in device code (by the thrust call above), so a constructor added to Entry for host code only is undefined inside CUDA kernels and breaks the GPU build.

src/data/simple_dmatrix.cc (resolved review thread)

XGB_DLL int XGDMatrixCreateFromArrowCallback(
XGDMatrixCallbackNext *next,
float missing,
Member

We are moving toward using JSON strings in the API to avoid breaking changes. I can help with that by creating a PR on your branch once we have the basic structure settled.

Contributor (Author)

Unifying the API toward JSON is a great idea. I'd love to have a discussion on this when ready.
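
For context, a hedged sketch of what such a JSON-configured entry point could look like; this signature is hypothetical and only illustrates the idea, not the API that was eventually settled on:

// Hypothetical sketch only: optional settings arrive in one JSON document,
// e.g. {"nthread": 16, "missing": 0.0}, so adding a new option later does
// not change the C function signature.
XGB_DLL int XGDMatrixCreateFromArrowCallback(XGDMatrixCallbackNext *next,
                                             char const *json_config,
                                             DMatrixHandle *out);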

const uint8_t* bitmap_;
};

// only columns of primitive types are supported
Member

Do we need inheritance if that's the only type we support?

Contributor (Author)

We probably don't need inheritance here. I will make a change.

Contributor (Author)

Do we need inheritance if that's the only type we support?

I thought about it again; actually, I do need this inheritance. class ArrowColumnarBatch contains a collection of pointers to PrimitiveColumns. PrimitiveColumn itself is a class template, which means these columns can be of different data types. I need these pointers to be polymorphic, and defining a base class for all PrimitiveColumns achieves this.
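
A minimal sketch of the pattern being described; the member names and the GetElement interface are illustrative, not the PR's exact code:

#include <cstddef>
#include <memory>
#include <vector>

// Non-template base class, so that one batch can hold columns of mixed
// primitive types behind a single pointer type.
class Column {
 public:
  virtual ~Column() = default;
  virtual float GetElement(std::size_t idx) const = 0;
  virtual std::size_t Size() const = 0;
};

// One concrete column per primitive type; reads directly from the Arrow
// buffer (a zero-copy view).
template <typename T>
class PrimitiveColumn : public Column {
 public:
  PrimitiveColumn(const T* data, std::size_t length)
      : data_{data}, length_{length} {}
  float GetElement(std::size_t idx) const override {
    return static_cast<float>(data_[idx]);
  }
  std::size_t Size() const override { return length_; }

 private:
  const T* data_;
  std::size_t length_;
};

// The batch stores type-erased columns polymorphically.
struct ArrowColumnarBatch {
  std::vector<std::unique_ptr<Column>> columns;
};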

Member

Thanks for the explanation. It would be great to have it in a code comment. ;-)

Contributor (Author)

I've added some comments for clarification.

Zhang Zhang added 12 commits October 28, 2021 18:21
* Integrate with Arrow C data API.
* Support Arrow dataset.
* Support Arrow table.
* Add arrow-cdi.h whose content is copied from
https://arrow.apache.org/docs/format/CDataInterface.html
* Release memory allocated by Arrow after exported C data is consumed.
- Set labels, weights, base_margins, and qids from Arrow table or
  dataset if available
- Fix lint issues
- Add pyarrow to ci_build Docker image for cpu_test
@zhangzhang10 (Contributor, Author)

I think for step 1 we need to decouple the predictor X from the rest of the meta info. Feel free to share your thoughts.

Yes. This step is done.

@trivialfis (Member)

Thanks! I will look into the details next week. Really appreciate your excellent work!

@trivialfis (Member)

I have restarted the CI.

@FelixYBW (Contributor) commented Dec 1, 2021

@trivialfis Is there any work we need to do?

@trivialfis (Member)

I haven't looked into your new code yet, but from the CI:

[2021-11-15T22:21:09.849Z] tests/python/test_updaters.py::TestTreeMethod::test_hist_degenerate_case PASSED

[2021-11-15T22:21:10.105Z] tests/python/test_with_arrow.py::TestArrowTable::test_arrow_dataset_from_csv FAILED

[2021-11-16T01:39:20.757Z] tests/python/test_with_arrow.py::TestArrowTable::test_arrow_survival Sending interrupt signal to process

[2021-11-16T01:39:31.604Z] script returned exit code 143

@xwu99 (Contributor) commented Dec 16, 2021

Hi @trivialfis, sorry for the late reply. @zhangzhang10 has handed the work over to me. I will create a separate PR to continue this work.

@pseudotensor (Contributor)

How's progress, @xwu99?

@hcho3 (Collaborator) commented Feb 2, 2022

@pseudotensor Work is being continued in #7512
