Initial support for one hot split. #5949

Closed

@trivialfis
Member

trivialfis commented Jul 28, 2020

This PR aims to provide a working pipeline for categorical data, but support is very limited in its current form. To run the tests, one needs to:

  • Use Python interface.
  • Use gbtree.
  • Use gpu_hist tree method. Other tree methods are coming.
  • Use DMatrix; DeviceQuantileDMatrix is not yet supported.
  • Specify enable_categorical for DMatrix.
  • Do not use weights.
  • Use pandas with categorical feature type.
  • Set gpu_predictor explicitly.
  • Use JSON for model persistence.
  • Specify enable_experimental_json_serialization even if you don't use pickle.
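
The checklist above can be sketched as a small configuration helper. This is a minimal sketch, not the PR's test code: the helper names `categorical_training_config` and `train_on_categorical` and the file name `model.json` are hypothetical, and the parameter names are the ones listed in the description, which may differ in later XGBoost releases.

```python
def categorical_training_config():
    """Return (DMatrix kwargs, booster params) matching the checklist above."""
    # Note: no `weight` argument, since weights are not supported here.
    dmatrix_kwargs = {"enable_categorical": True}
    booster_params = {
        "booster": "gbtree",
        "tree_method": "gpu_hist",
        "predictor": "gpu_predictor",  # set explicitly, as the list requires
        "enable_experimental_json_serialization": True,
    }
    return dmatrix_kwargs, booster_params


def train_on_categorical(df, label, num_boost_round=10):
    """Train on a pandas DataFrame whose categorical columns use dtype 'category'."""
    # Imported lazily so the config helper works without a GPU build installed.
    import xgboost as xgb

    dmatrix_kwargs, params = categorical_training_config()
    dtrain = xgb.DMatrix(df, label=label, **dmatrix_kwargs)
    booster = xgb.train(params, dtrain, num_boost_round=num_boost_round)
    booster.save_model("model.json")  # the .json extension selects the JSON format
    return booster
```

Keeping the parameters in one place makes it easier to drop individual restrictions as other tree methods and predictors gain support.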

Limitations

  • Support is limited to one-vs-rest categorical splits. Other categorical specializations are coming.
  • There is no mapping between categorical values and histogram bins, so memory usage may be sub-optimal when categories are sparse.
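
To illustrate the second limitation, here is a rough sketch that assumes the histogram bin index is simply the integer category code. Both function names are hypothetical and the numbers are made up for the example.

```python
def bins_without_mapping(category_codes):
    """Bins needed when the bin index is the category code itself."""
    return max(category_codes) + 1


def bins_with_mapping(category_codes):
    """Bins needed if codes were first remapped to 0..k-1 (not done in this PR)."""
    return len(set(category_codes))


codes = [0, 3, 1000, 3, 0]  # only three distinct categories, but sparse codes
print(bins_without_mapping(codes))  # 1001
print(bins_with_mapping(codes))     # 3
```

With dense codes the two counts coincide, so the overhead only shows up when the codes have large gaps.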

include/xgboost/feature_map.h (outdated)
python-package/xgboost/data.py (outdated)
src/common/categorical.h (outdated)
src/common/quantile.cu (outdated)
src/tree/gpu_hist/histogram.cu (outdated)
src/tree/updater_gpu_hist.cu (outdated)
tests/cpp/tree/gpu_hist/test_evaluate_splits.cu (outdated)
tests/cpp/predictor/test_predictor.cc
tests/python/testing.py (outdated)
@trivialfis
Member Author

trivialfis commented Jul 28, 2020

@hcho3 Right now I'm reusing the split cond in RegTree for categorical splits. But once we go beyond one-hot splits, the split condition can be a vector containing multiple categories, so a better structure with a JSON schema is required. We need more discussion around this; otherwise the model format might be subject to change.

@hcho3
Collaborator

hcho3 commented Jul 28, 2020

@trivialfis We will want to add additional fields to the JSON schema to indicate categorical splits. For example, LightGBM stores decision_type, cat_threshold and cat_boundaries fields. The decision_type[i] tells us whether the i-th internal node is a categorical or a numerical split. The cat_threshold and cat_boundaries together store categories associated with the left child of each categorical split.
https://github.com/dmlc/treelite/blob/7f01e631da8687189473ad6b177ba0615b19496b/src/frontend/lightgbm.cc#L516-L528

It should not be too difficult to add new fields to the current JSON schema. (I'm assuming that vector with multiple categories will be JSON only and won't support legacy binary serialization.)
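
As a rough illustration of the LightGBM-style layout described above, the sketch below assumes that `cat_boundaries` holds offsets into `cat_threshold` and that `cat_threshold` is a flat list of 32-bit words forming one bitset per categorical split. The function names are hypothetical and the data is made up.

```python
WORD_BITS = 32


def categories_in_left_child(split_index, cat_boundaries, cat_threshold):
    """List the categories whose bit is set for the given categorical split."""
    begin, end = cat_boundaries[split_index], cat_boundaries[split_index + 1]
    categories = []
    for word_index, word in enumerate(cat_threshold[begin:end]):
        for bit in range(WORD_BITS):
            if (word >> bit) & 1:
                categories.append(word_index * WORD_BITS + bit)
    return categories


def goes_left(category, split_index, cat_boundaries, cat_threshold):
    """Test a single category against the bitset of one split."""
    begin, end = cat_boundaries[split_index], cat_boundaries[split_index + 1]
    word_index, bit = divmod(category, WORD_BITS)
    if category < 0 or word_index >= end - begin:
        return False  # category lies outside the stored bitset
    return bool((cat_threshold[begin + word_index] >> bit) & 1)


# Two categorical splits: split 0 matches {1, 3}, split 1 matches {0, 33}.
cat_boundaries = [0, 1, 3]
cat_threshold = [0b1010, 0b1, 0b10]
print(categories_in_left_child(0, cat_boundaries, cat_threshold))  # [1, 3]
print(categories_in_left_child(1, cat_boundaries, cat_threshold))  # [0, 33]
```

A one-hot split is the special case where exactly one bit is set, so a schema like this would cover both the current PR and later multi-category splits.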

@trivialfis
Member Author

trivialfis commented Jul 28, 2020

@hcho3

will be JSON only and won't support legacy binary serialization

You are right. But at the same time we can't break the binary format. For example, we can't add anything to RegTree::Node. Also, before this PR is merged, I think we need to make JSON the default pickle format, because the Python interface goes through serialization at the end of training to release GPU memory.

It should not be too difficult to add new fields to the current JSON schema.

I think so. I just want to make sure that we have considered enough different cases.

@hcho3
Collaborator

hcho3 commented Jul 28, 2020

Got it. Let's discuss how to add the necessary fields without touching RegTree::Node. One idea is to relegate RegTree::Node to an external-facing interface and convert from RegTree::Node to the real node structure that includes the new fields. This will slow down deserialization of binary models.

Another idea is to set the info_ field of RegTree::Node to NaN to indicate a categorical split, and use the payload bits of the NaN to indicate where the extra information can be looked up.
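
The NaN-payload idea can be sketched on raw 32-bit patterns. This is only an illustration under the assumption that the field is an IEEE 754 single-precision float; the constants and function names are hypothetical and not part of XGBoost.

```python
import math
import struct

QUIET_NAN_BITS = 0x7FC00000  # exponent all ones plus the quiet bit
PAYLOAD_MASK = 0x003FFFFF    # the remaining 22 mantissa bits


def encode_categorical_marker(index):
    """Build the bit pattern of a quiet NaN whose payload is `index`."""
    if not 0 <= index <= PAYLOAD_MASK:
        raise ValueError("index does not fit in the 22-bit NaN payload")
    return QUIET_NAN_BITS | index


def is_categorical_marker(bits):
    """True if the pattern is a quiet NaN (the sign bit is ignored)."""
    return (bits & QUIET_NAN_BITS) == QUIET_NAN_BITS


def decode_categorical_marker(bits):
    """Recover the lookup index stored in the NaN payload."""
    return bits & PAYLOAD_MASK


def bits_to_float(bits):
    """Reinterpret a 32-bit pattern as a float, for inspection only."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


bits = encode_categorical_marker(5)
print(math.isnan(bits_to_float(bits)))   # True
print(decode_categorical_marker(bits))   # 5
```

One caveat with this scheme is that payload 0 is the same bit pattern as the default quiet NaN, so a real implementation would probably want to reserve it or offset the index.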

@codecov-commenter

codecov-commenter commented Jul 28, 2020

Codecov Report

Merging #5949 into master will increase coverage by 0.00%.
The diff coverage is 84.61%.

@@           Coverage Diff           @@
##           master    #5949   +/-   ##
=======================================
  Coverage   78.49%   78.49%           
=======================================
  Files          12       12           
  Lines        3013     3018    +5     
=======================================
+ Hits         2365     2369    +4     
- Misses        648      649    +1     
Impacted Files Coverage Δ
python-package/xgboost/core.py 77.73% <ø> (ø)
python-package/xgboost/data.py 58.56% <84.61%> (+0.25%) ⬆️
python-package/xgboost/dask.py 76.38% <0.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8599f87...a4795d1.

@trivialfis
Member Author

@hcho3 I changed the categories in the tree into a bitfield.

trivialfis force-pushed the categorical-split branch 2 times, most recently from bbb81dd to d8ac122, on August 6, 2020 14:53
@trivialfis
Member Author

I reverted the change to the min value to reduce the size of this PR.

@hcho3
Collaborator

hcho3 left a comment

The general approach looks good. I am looking forward to reviewing specifics once this PR gets broken up into smaller PRs.

include/xgboost/span.h (outdated)
src/tree/tree_model.cc (outdated, 2 threads)

auto is_cat = candidate.split.is_cat;
if (is_cat) {
  auto cat = common::AsCat(candidate.split.fvalue);
Collaborator

Note to myself: in the one-hot encoded setting, there is only one matching category in every categorical split. However, the split_categories_ structure can later store multiple matching categories per split.

Collaborator

Another reminder to myself: Treelite must support JSON format of XGBoost.

@trivialfis
Member Author

All merged.

trivialfis deleted the categorical-split branch on October 10, 2020 09:34