Currently, XGBoost has a GHistIndexMatrix which stores data in CSR format; however, if the data is completely dense (no missing values such as NaNs), it optimizes this away by not reserving the extra space for the index arrays. This is not true of EllpackPageImpl, which allocates the space and creates the indices regardless of whether `is_dense` is set.
We currently use XGBoost to build trees on large amounts of almost completely dense data, which can be quantized to 1 byte per float. This means that the sparsity optimizations actually balloon the memory required: GPU memory use is almost exactly 3 bytes per float, as one would expect when storing dense data in Ellpack. With GHistIndexMatrix alone, one can observe that memory consumption increases by 3-4x when a single NaN is added to the data, which makes `is_dense` false.
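To put numbers on this, here is a back-of-envelope sketch of the footprint. The matrix shape is hypothetical and the helper names are mine; the only figures taken from the above are 1 byte per quantized value and roughly 3 bytes per value observed on the GPU.

```python
def dense_bytes(n_rows, n_cols, bytes_per_bin=1):
    """Bytes needed to store exactly one quantized bin index per matrix entry."""
    return n_rows * n_cols * bytes_per_bin


def overhead_ratio(n_rows, n_cols, observed_bytes_per_entry, bytes_per_bin=1):
    """Ratio of an observed footprint to the ideal dense footprint."""
    observed = n_rows * n_cols * observed_bytes_per_entry
    return observed / dense_bytes(n_rows, n_cols, bytes_per_bin)


# Hypothetical shape, chosen only for illustration.
n_rows, n_cols = 10_000_000, 100

ideal = dense_bytes(n_rows, n_cols)          # 1 byte per entry
ratio = overhead_ratio(n_rows, n_cols, 3)    # ~3 bytes per entry, as observed

print(f"ideal dense: {ideal / 2**30:.2f} GiB")
print(f"observed:    {ideal * ratio / 2**30:.2f} GiB ({ratio:.0f}x)")
```

For data at this scale the difference decides whether the dataset fits on a single device at all.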
The feature request is to simply allow the user to turn off CSR/Ellpack format and store densely with a manual config option in QuantileDMatrix.
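For illustration only, this is roughly what I mean by storing densely even when a few values are missing: reserve one bin code as a "missing" marker so every entry still takes exactly one byte. This is a pure-Python sketch of the idea, not XGBoost code; `MISSING_BIN`, `quantize_dense`, and the cut points are all made up for the example.

```python
import math
from bisect import bisect_right

# Hypothetical sentinel: one of the 256 one-byte codes is reserved for "missing".
MISSING_BIN = 255


def quantize_dense(rows, cuts):
    """Quantize a row-major matrix into one byte per entry, NaNs included.

    rows: list of equal-length lists of floats (NaN marks a missing value).
    cuts: per-feature sorted cut points, at most 254 per feature so that
          real bin ids stay in 0..254 and never collide with MISSING_BIN.
    """
    n_cols = len(cuts)
    out = bytearray(len(rows) * n_cols)
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if math.isnan(value):
                out[i * n_cols + j] = MISSING_BIN
            else:
                # Bin id = number of cut points less than or equal to the value.
                out[i * n_cols + j] = bisect_right(cuts[j], value)
    return out


rows = [[0.1, 5.0], [float("nan"), 1.0]]
cuts = [[0.0, 1.0], [2.0]]
packed = quantize_dense(rows, cuts)
print(list(packed), len(packed), "bytes")
```

The trade-off in this sketch is one lost bin per feature in exchange for no index arrays; whether that maps cleanly onto the existing histogram code is something I can't judge.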
I've tried reading the code, and I don't understand it fully, so I can't be of much help with coming up with a roadmap.