Obtain CSR matrix from DMatrix. #8269

Merged
merged 9 commits into from
Sep 29, 2022

trivialfis
Member

  • Obtain CSR matrix from DMatrix.
  • Obtain gradient index from Quantile DMatrix.

This is mostly for testing at a higher level. Right now we rely on training a booster to infer that the DMatrix is correctly constructed. With this PR, we can ease the testing process. Also, this has been requested before.

For a Quantile DMatrix, the returned values are histogram bin indices rather than cut values. We shift the cut values to include min_values, so the cut values themselves are not very useful for language bindings.

Unlike most XGBoost C functions, the caller of this C API is required to allocate the memory itself instead of using thread-local memory from XGBoost. This avoids allocating a huge memory buffer that cannot be freed until the thread exits.
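
To illustrate the caller-allocates convention, here is a minimal Python sketch. It is not the XGBoost API: `ToyMatrix` and `export_csr` are hypothetical stand-ins, and the buffer types are assumptions chosen for the example. Only the pattern (query the sizes, allocate the buffers yourself, then let the library fill them) follows the description above.

```python
import math
from array import array


class ToyMatrix:
    """Hypothetical stand-in for a DMatrix: dense rows, NaN marks a missing value."""

    def __init__(self, rows):
        self.rows = rows

    def num_row(self):
        return len(self.rows)

    def num_non_missing(self):
        return sum(1 for row in self.rows for v in row if not math.isnan(v))


def export_csr(matrix, indptr, indices, data):
    """Fill caller-provided CSR buffers; this function allocates nothing itself."""
    nnz = matrix.num_non_missing()
    if len(indptr) != matrix.num_row() + 1:
        raise ValueError("indptr must have num_row + 1 entries")
    if len(indices) != nnz or len(data) != nnz:
        raise ValueError("indices and data must have one entry per non-missing value")
    k = 0
    indptr[0] = 0
    for i, row in enumerate(matrix.rows):
        for j, v in enumerate(row):
            if not math.isnan(v):
                indices[k] = j  # column index of the stored value
                data[k] = v
                k += 1
        indptr[i + 1] = k  # end offset of row i


nan = float("nan")
m = ToyMatrix([[1.0, nan, 2.0], [nan, nan, 3.0]])

# The caller sizes and owns the buffers, so it can free them whenever it likes.
nnz = m.num_non_missing()
indptr = array("Q", [0]) * (m.num_row() + 1)
indices = array("I", [0]) * nnz
data = array("f", [0.0]) * nnz

export_csr(m, indptr, indices, data)
```

Because the buffers belong to the caller, their lifetime is not tied to a library thread.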

External memory is not supported.

Close #4759.

XGB_DLL int XGDMatrixNumNonMissing(DMatrixHandle handle, bst_ulong *out);

/*!
* \brief Get the predictors from DMatrix as CSR matrix. If this is a quantized DMatrix,
Member

Could it be useful to return quantised float values instead of integers?

We could return values such that bst.predict(quantile_dmat) == bst.predict(xgb.DMatrix(quantile_dmat.get_data()))

Member Author

Yes, we can. That's a good argument. I skipped restoring the values because they contain artificially created min and max values, which might be confusing.
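
For readers following the thread, here is a toy sketch of why returning quantised float values can preserve predictions. It is not XGBoost's implementation: `to_bin`, `to_representative`, `stump_predict`, the cut points, and the choice of a bin's lower edge as its representative value are all assumptions made for illustration. The artificial `min_value` plays the role of the created minimum mentioned above.

```python
from bisect import bisect_right


def to_bin(value, cuts):
    """Bin index = position of the first cut strictly greater than value (clamped)."""
    return min(bisect_right(cuts, value), len(cuts) - 1)


def to_representative(bin_idx, cuts, min_value):
    """Pick a float that falls back into the same bin: the bin's lower edge.

    The first bin has no lower cut, so an artificial minimum value is used.
    """
    return min_value if bin_idx == 0 else cuts[bin_idx - 1]


def stump_predict(value, threshold):
    """Toy one-split tree; the threshold is assumed to be one of the cut values."""
    return -1.0 if value < threshold else 1.0


cuts = [1.0, 2.0, 3.0]  # hypothetical cut points for a single feature
min_value = 0.0         # artificial minimum, smaller than every cut
raw = [0.5, 1.0, 1.7, 2.9]

bins = [to_bin(v, cuts) for v in raw]
restored = [to_representative(b, cuts, min_value) for b in bins]

# Re-quantising the restored floats reproduces the original bin indices.
assert [to_bin(v, cuts) for v in restored] == bins

# A split on any cut value sends raw and restored values the same way.
for threshold in cuts:
    assert [stump_predict(v, threshold) for v in raw] == [
        stump_predict(v, threshold) for v in restored
    ]
```

In this toy model the restored values differ from the raw inputs (0.5 becomes the artificial 0.0, for example), which is the potential source of confusion, yet every split decision is unchanged.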

Member Author

@trivialfis trivialfis Sep 27, 2022

Changed the return value to cut values and added the suggested tests.

@trivialfis
Member Author

@RAMitchell Please take another look when you are available.

@trivialfis trivialfis merged commit 55cf24c into dmlc:master Sep 29, 2022
@trivialfis trivialfis deleted the to-csr branch September 29, 2022 12:41

2 participants