Support for building DMatrix from Apache Arrow data format #7283
Conversation
@trivialfis, @hcho3, @FelixYBW, can you please review this PR? Thanks.
#if defined(XGBOOST_BUILD_ARROW_SUPPORT)
template<>
SimpleDMatrix::SimpleDMatrix(RecordBatchesIterAdapter* adapter,
@RAMitchell Could you please help review this PR? Personally, I don't want to have this new DMatrix constructor and want to keep everything under adapter if possible. Maybe you have a different perspective on this.
@trivialfis I do use an adapter in the implementation. However, I also need a specialized DMatrix constructor, because the other constructors do not support the threading strategy in my implementation. Specifically, the existing constructors distribute rows across threads, but my implementation distributes batches across threads, which is critical for getting good performance. Data in the Arrow format always arrives as a collection of chunks (batches).
Thanks for the explanation.
> However, I also need a specialized DMatrix constructor because the other constructors do not support the threading strategy in my implementation.

How critical is the performance of this DMatrix construction? Sorry for the inconvenience, but I have to understand the necessity of the optimization better. Aside from the constructor for `SimpleDMatrix`, this PR creates an entirely new route for DMatrix initialization, disregarding the existing `CreateDMatrixFromXXX()` + `SetInfo`. For instance, if we were to add a new meta info, we would have a breaking PR, right? Also, we are making progress on multi-output regression #7309 (I'm working on some abstractions like multi-array at the moment); do I need to change anything in this PR later? Lastly, `feature_weight` is also part of the `MetaInfo`, which is used for column sampling.

Having said that, I'm excited to have Arrow support in XGBoost; I just want to know whether the performance gain in this particular ctor is important enough to justify a completely new code path. For example, how bad is it if we just concatenate all the chunks in Python?
The optimization in the new `SimpleDMatrix` ctor mainly consists of:

- A multithreading strategy that is better suited to Arrow data. Arrow data arrives in chunks. The constructor distributes these chunks to threads, with every thread processing one chunk at a time. For example, if there are `N` cores on the CPU and `N` threads can be spawned, the constructor collects a batch of `N` Arrow chunks and gives them to the `N` threads. After the first batch is done, the constructor collects the next batch of `N` chunks, and it keeps going until all chunks are consumed. Because Arrow chunks are equal in size, this strategy achieves good load balance across all threads, leading to optimized performance. It differs from the strategy used in the existing constructors, which process all rows of the data in parallel. That works well when the data is available all at once in a single batch, but not for the Arrow case, where data is chunked.
- Minimizing data copies. The implementation does not use a staging space to concatenate all chunks before filling data into the DMatrix. Instead, data is copied directly from Arrow chunks into DMatrix pages.
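The wave-by-wave chunk scheduling described above can be sketched in Python as follows (the actual implementation is C++; the function and parameter names here are illustrative, not the PR's code):

```python
from concurrent.futures import ThreadPoolExecutor

def process_in_waves(chunks, process_chunk, n_threads):
    """Dispatch chunks to threads in waves of `n_threads`.

    Each thread takes exactly one chunk; the next wave starts only
    after the current wave finishes, so equally sized chunks keep
    all threads evenly loaded.
    """
    results = []
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        for start in range(0, len(chunks), n_threads):
            wave = chunks[start:start + n_threads]
            # pool.map preserves chunk order within the wave
            results.extend(pool.map(process_chunk, wave))
    return results
```

Contrast this with the row-parallel strategy of the existing constructors, which assumes the whole dataset is available as one contiguous batch.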
This PR does not completely disregard the existing `CreateDMatrixFromXXX` + `SetInfo` route. That model is still supported. For example, the user can use the new constructor to ingest Arrow data and then call `SetInfo` separately, as usual, to set meta info such as labels. For situations where certain meta info exists as columns of the Arrow table, this PR allows a more convenient route: the user simply specifies the corresponding column names, and the meta info columns and data columns are processed at the same time. From what I see, it is not uncommon for labels to be part of the input data, but other types of meta info may not be. For the purpose of simplifying future maintenance, I am OK with sticking with the original `CreateDMatrixFromXXX` + `SetInfo` approach. Or we could compromise: allow at least the label column to be processed together with the data columns, while all other meta info must be set with `SetInfo`. What do you think?

Because the original route for meta info is still supported, adding new types of meta info in the future will not break this PR. Similarly, this PR won't affect multi-output regression, because it is still possible to specify labels (once or multiple times) using `SetInfo`.
Per discussion, I've revised this PR so that we stick with the existing `CreateDMatrixFromXXX` + `SetInfo` route for DMatrix initialization. When using the Python API, meta info such as labels and weights are set separately using `SetInfo`. This is consistent with sklearn conventions.
@SmirnovEgorRu Can you also please review this PR? Thanks.
@trivialfis I saw you marked this PR as "Need prioritize in Roadmap". Any estimate of when the prioritization will be done? By the way, some checks failed because of errors in building GPU code, even though this PR doesn't touch GPU code at all. What shall I do to make it pass these checks? Thanks.
python-package/xgboost/core.py (outdated)
@@ -640,6 +701,44 @@ def __init__(
        if feature_types is not None:
            self.feature_types = feature_types

    def _init_from_arrow(
Can this be part of `data.py`?
Please see my comment below.
python-package/xgboost/core.py (outdated)

        if _is_iter(data):
            self._init_from_iter(data, enable_categorical)
            assert self.handle is not None
            return

        if _is_arrow(data):
Can this be part of `data.py`?
If this is made part of `data.py`, then I would have to use `dispatch_data_backend` to handle Arrow data. But `dispatch_data_backend` does not process metadata such as labels. However, for Arrow data it is beneficial to ingest both data and metadata (e.g. labels) at the same time, because:

- It is common for some metadata, especially labels, to exist as columns in the Arrow table. It is a more convenient and simpler usage model if the user only needs to specify column names for the metadata; the user does not need to create additional objects to separately ingest metadata columns.
- Arrow data arrives in chunks. If I process metadata columns and data columns separately, then I have to first collect the metadata columns from every chunk and then use `set_info` to set them into the DMatrix. This is an extra bookkeeping cost. It also affects performance, because the columns are no longer all processed in parallel; instead, data columns are processed first and metadata columns after, in a sequential fashion.

For cases where metadata is indeed provided separately from the data, this PR doesn't change a thing in the existing usage model. It still allows the use of `set_info` to set metadata, which the user must provide as array-like objects.
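The extra bookkeeping pass mentioned above — gathering one metadata column across all chunks before a single `set_info` call — can be sketched as follows (a plain-Python stand-in for Arrow record batches; the helper name is illustrative):

```python
def gather_column(chunks, col_index):
    """Concatenate one column across all chunks.

    `chunks` is a list of record batches, each modeled as a list of
    columns, each column a list of values. This sequential collection
    step is what handling metadata separately from data costs.
    """
    values = []
    for chunk in chunks:
        values.extend(chunk[col_index])
    return values
```

The gathered list would then be passed to `DMatrix.set_info(label=...)` as an ordinary array-like object.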
Actually, when users use XGBoost, they often go with the scikit-learn interface for its simplicity and the utilities provided by sklearn (like grid search, calibration, etc.). It would be great if you could keep the existing interface for data handling. I'm fine with converting the Arrow table to a pandas dataframe for labels and weights, considering they are quite small compared to `X`.
@zhangzhang10 Do you think we can work on a smaller subset of this PR first, like getting the predictor `X`?
> @zhangzhang10 Do you think we can work on a smaller subset of this PR first, like getting the predictor `X`?

By a smaller subset, do you mean that as the first step we only support the predictor `X` (i.e., the feature columns) in Arrow format, and treat meta info columns as array-like objects? If so, then yes, I agree. I'm working on this. Thanks.
@trivialfis Done. This PR currently only builds the predictor DMatrix `X` from Arrow data.
include/xgboost/data.h (outdated)
@@ -190,7 +190,11 @@ struct Entry {
  /*! \brief feature value */
  bst_float fvalue;
  /*! \brief default constructor */
#if defined(XGBOOST_BUILD_ARROW_SUPPORT)
I don't think this is necessary. A few seconds on a large data set is fine.
I think for step 1 we need to decouple the predictor
I think arrow is increasingly important in the future with its performance advantage and is likely to be adopted by more distributed computing frameworks. I'm fine with merging the new ctor of
If you meant this error:
Just remove the new ctor in
XGB_DLL int XGDMatrixCreateFromArrowCallback(
    XGDMatrixCallbackNext *next,
    float missing,
We are moving toward using JSON strings in the API to avoid breaking changes. I can help with that by creating a PR on your branch once we have the basic structure settled.
Unifying the API toward using JSON is a great idea. I'd love to have a discussion on this when it's ready.
src/data/adapter.h (outdated)

  const uint8_t* bitmap_;
};

// only columns of primitive types are supported
Do we need inheritance if that's the only type we support?
We probably don't need inheritance here. I will make a change.
> Do we need inheritance if that's the only type we support?

I thought about it again. Actually, I do need this inheritance. Class `ArrowColumnarBatch` contains a collection of pointers to `PrimitiveColumn`s. `PrimitiveColumn` itself is a class template, which means these columns can be of different data types. I need these pointers to be polymorphic, and defining a base class for all `PrimitiveColumn`s achieves this.
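A minimal sketch of the pattern described above — a non-template base class so that differently typed column templates can sit behind one pointer type (class and member names here are illustrative, not the PR's actual declarations):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Non-template base: gives every typed column a common interface.
class ColumnBase {
 public:
  virtual ~ColumnBase() = default;
  virtual float GetElement(std::size_t idx) const = 0;
  virtual std::size_t Size() const = 0;
};

// Class template: one instantiation per Arrow primitive type.
template <typename T>
class PrimitiveColumn : public ColumnBase {
 public:
  PrimitiveColumn(const T* data, std::size_t length)
      : data_{data}, length_{length} {}
  float GetElement(std::size_t idx) const override {
    return static_cast<float>(data_[idx]);
  }
  std::size_t Size() const override { return length_; }

 private:
  const T* data_;   // non-owning view of the Arrow buffer
  std::size_t length_;
};

// A batch owns heterogeneously typed columns through the common base.
using Batch = std::vector<std::unique_ptr<ColumnBase>>;
```

Without the base class, a `std::vector` could not hold `PrimitiveColumn<double>` and `PrimitiveColumn<int32_t>` elements side by side.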
Thanks for the explanation. It would be great to have it in a code comment. ;-)
I've added some comments for clarification.
* Integrate with the Arrow C data API.
* Support Arrow dataset.
* Support Arrow table.
* Add arrow-cdi.h, whose content is copied from https://arrow.apache.org/docs/format/CDataInterface.html
* Release memory allocated by Arrow after the exported C data is consumed.
* Set labels, weights, base_margins, and qids from an Arrow table or dataset if available.
* Fix lint issues.
* Add pyarrow to the ci_build Docker image for cpu_test.
Force-pushed 719d28b to f9ba6d9.
Yes. This step is done.
Thanks! I will look into the details next week. Really appreciate your excellent work!
I have restarted the CI.
@trivialfis any work we need to do?
I haven't looked into your new code yet, but from the CI:
Hi @trivialfis
How's progress @xwu99?
@pseudotensor Work is being continued in #7512
Apache Arrow defines a columnar memory format, which allows zero-copy reads and lightning-fast data access (link). This PR adds support for building an XGBoost DMatrix from the Apache Arrow data format.
This is a rework of #5667, which is now closed and all discussion should continue here.
Features:
- A `SimpleDMatrix` class constructor that creates a DMatrix instance from Arrow data.
- Python support for both `pyarrow.Table` and `pyarrow.dataset`. It provides better performance in building DMatrix than using the existing pandas interface.

Performance of building DMatrix (timing measured on an Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz with 56 cores):
Peak memory during DMatrix building (Using a Mortgage dataset with 150 million rows as an example):
The final DMatrix memory size is 58 GB.
How to build:
- Add `-DUSE_ARROW=ON` to the cmake command.

Usage (Python):
A DMatrix can be built from either a `pyarrow.Table` or a `pyarrow.dataset`. For example:
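The concrete example appears to have been dropped from this page; below is a minimal sketch of what the usage described above would look like (it assumes an XGBoost build with `-DUSE_ARROW=ON` from this PR, and the on-disk path is purely illustrative):

```python
import pyarrow as pa
import pyarrow.dataset as ds
import xgboost as xgb

# From a pyarrow.Table held in memory:
table = pa.table({"f0": [1.0, 2.0, 3.0], "f1": [0.1, 0.2, 0.3]})
dmat = xgb.DMatrix(table)

# From a pyarrow.dataset scanning files on disk:
dataset = ds.dataset("data/mortgage/", format="parquet")  # illustrative path
dmat = xgb.DMatrix(dataset)
```

Labels and weights would then be attached with the usual `set_info` call, per the discussion above.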