Add support for multiple data types (will break up into smaller pull requests) #196

Closed
wants to merge 41 commits into from

hcho3
Collaborator

@hcho3 hcho3 commented Aug 29, 2020

Addresses #95 and #111.

Trying again, since #130 failed. This time, I made the Model class polymorphic, which minimizes the amount of pointer indirection.

Summary: Model is an opaque container that wraps the polymorphic handle ModelImpl<ThresholdType, LeafOutputType>. The handle in turn stores the list of trees Tree<ThresholdType, LeafOutputType>. To unbox the Model container and obtain ModelImpl<ThresholdType, LeafOutputType>, use Model::Dispatch(<lambda expression>).

Also, upgrade to C++14 to access the generic lambda feature, which proved to be very useful in the dispatching logic for the polymorphic Model class.
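To illustrate the dispatching idea in the summary above, here is a minimal sketch, not the actual code in this PR. `TypeTag`, `Create`, `CountTrees`, and `ThresholdBytes` are hypothetical names, the classes are stripped-down stand-ins, and only two type combinations are handled. The container erases the template parameters and `Dispatch` restores them before calling a C++14 generic lambda.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <type_traits>
#include <vector>

// Simplified stand-ins for the classes named in the summary (assumptions).
enum class TypeInfo : std::uint8_t { kFloat32, kFloat64 };

// Hypothetical helper: map a C++ type to its run-time tag.
template <typename T> constexpr TypeInfo TypeTag();
template <> constexpr TypeInfo TypeTag<float>() { return TypeInfo::kFloat32; }
template <> constexpr TypeInfo TypeTag<double>() { return TypeInfo::kFloat64; }

template <typename ThresholdType, typename LeafOutputType>
struct Tree {
  std::vector<ThresholdType> thresholds;
  std::vector<LeafOutputType> leaf_outputs;
};

template <typename ThresholdType, typename LeafOutputType>
struct ModelImpl {
  using threshold_type = ThresholdType;
  std::vector<Tree<ThresholdType, LeafOutputType>> trees;
};

// Opaque container: remembers the type tags and hides the typed handle.
class Model {
 public:
  template <typename ThresholdType, typename LeafOutputType>
  static Model Create() {
    Model model;
    model.threshold_type_ = TypeTag<ThresholdType>();
    model.leaf_output_type_ = TypeTag<LeafOutputType>();
    model.handle_ = std::make_shared<ModelImpl<ThresholdType, LeafOutputType>>();
    return model;
  }

  // Unbox the handle and pass it to func with its concrete type restored.
  // func must return the same type for every supported combination.
  template <typename Func>
  auto Dispatch(Func func) {
    assert(handle_ != nullptr);
    if (threshold_type_ == TypeInfo::kFloat32 &&
        leaf_output_type_ == TypeInfo::kFloat32) {
      return func(*static_cast<ModelImpl<float, float>*>(handle_.get()));
    }
    if (threshold_type_ == TypeInfo::kFloat64 &&
        leaf_output_type_ == TypeInfo::kFloat64) {
      return func(*static_cast<ModelImpl<double, double>*>(handle_.get()));
    }
    throw std::runtime_error("unsupported type combination in this sketch");
  }

 private:
  Model() = default;
  TypeInfo threshold_type_ = TypeInfo::kFloat32;
  TypeInfo leaf_output_type_ = TypeInfo::kFloat32;
  std::shared_ptr<void> handle_;  // type-erased ModelImpl<...>
};

// The generic lambda is instantiated once per ModelImpl specialization.
std::size_t CountTrees(Model& model) {
  return model.Dispatch([](auto& impl) { return impl.trees.size(); });
}

std::size_t ThresholdBytes(Model& model) {
  return model.Dispatch([](auto& impl) {
    using Impl = std::decay_t<decltype(impl)>;
    return sizeof(typename Impl::threshold_type);
  });
}
```

Callers such as `CountTrees` never name a template parameter; the type-specific code lives entirely inside the lambda body.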

EDIT. I will break up this PR into smaller PRs, once I get the whole system working together correctly.

TODOs

  • Turn the Model and Tree classes into template classes
  • Revise the string templates so that correct data types are used in the generated C code
  • Rewrite the model builder class
  • Revise the zero-copy serializer
  • Create an abstract matrix class that supports multiple data types (float32, float64 for now).
  • Redesign the C runtime API, using the abstract matrix class.
  • Ensure accuracy of scikit-learn models. To achieve the best results, use float32 for the input matrix and float64 for the split thresholds and leaf outputs.
  • Revise the JVM runtime.
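The scikit-learn item in the list above comes down to how a split test rounds. The following is a small sketch with hypothetical function names, not code from this PR: it compares a float32 feature value against a threshold stored either as float32 or as float64. For a feature of `0.1f` and a trained threshold of `0.1`, the two variants take different branches.

```cpp
#include <cassert>

// Threshold narrowed to float32 before the comparison (lossy).
bool GoesLeftFloat32Threshold(float feature, double trained_threshold) {
  const float threshold = static_cast<float>(trained_threshold);
  return feature <= threshold;
}

// Threshold kept as float64; the float32 feature is widened to double,
// which is exact, before the comparison.
bool GoesLeftFloat64Threshold(float feature, double trained_threshold) {
  return static_cast<double>(feature) <= trained_threshold;
}
```

With `feature = 0.1f` and `trained_threshold = 0.1`, the first function returns true because both sides round to the same float32 value. The second returns false because `0.1f` widened to double is slightly larger than the double `0.1`. Keeping thresholds in float64 avoids this kind of disagreement with the model as it was trained.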

@hcho3 hcho3 marked this pull request as draft August 29, 2020 07:54
@hcho3 hcho3 force-pushed the multi_type_support2 branch 3 times, most recently from 4458ec7 to 322acc8 Compare August 29, 2020 08:06
@hcho3 hcho3 force-pushed the multi_type_support2 branch from 322acc8 to 2e823eb Compare August 29, 2020 22:24
@hcho3 hcho3 force-pushed the multi_type_support2 branch from ad3314a to c0d1abd Compare August 31, 2020 09:52
@hcho3 hcho3 force-pushed the multi_type_support2 branch from 5ec551a to acc20be Compare August 31, 2020 18:58
@hcho3 hcho3 force-pushed the multi_type_support2 branch from 3f3ec0d to 77477e7 Compare August 31, 2020 21:51
@hcho3 hcho3 force-pushed the multi_type_support2 branch from ab6277d to b846f6a Compare August 31, 2020 23:30
@hcho3 hcho3 force-pushed the multi_type_support2 branch from a658f01 to 4515d9d Compare September 2, 2020 04:43
@hcho3 hcho3 force-pushed the multi_type_support2 branch from 4515d9d to 914469b Compare September 3, 2020 02:37
@hcho3 hcho3 marked this pull request as ready for review September 10, 2020 07:40
@hcho3 hcho3 closed this Sep 10, 2020
@hcho3 hcho3 reopened this Sep 10, 2020
…mlc#196) (dmlc#199)

* New prediction runtime C API, to support multiple data types

* Address reviewer's feedback
) (dmlc#201)

* Upgrade C++ standard to C++14.
* Split struct Model into class Model and class ModelImpl. The ModelImpl class will soon become a template class in order to hold Tree objects with uint32, float32, or float64 type. The Model class will become an abstract class so as to avoid exposing ModelImpl to the external interface. (It's very hard to pass template classes through an FFI boundary.)
* Change signature of methods that return Model, since Model is now an abstract class. These functions now return std::unique_ptr<Model>.
* Move bodies of tiny methods from tree_impl.h to tree.h. This will reduce verbosity once ModelImpl becomes a template class.
hcho3 and others added 2 commits September 25, 2020 01:04
… data types (Part of dmlc#196) (dmlc#198)

* Create template classes for model representations to support multiple data types

* Update include/treelite/tree.h

Co-authored-by: William Hicks <wphicks@users.noreply.github.com>

* Address review comments from @canonizer

* Address more comments from @canonizer

Co-authored-by: William Hicks <wphicks@users.noreply.github.com>
Co-authored-by: Andy Adinets <aadinets@nvidia.com>
@hcho3 hcho3 force-pushed the multi_type_support2 branch from 65d83ee to d9173df Compare September 25, 2020 08:47
@hcho3
Collaborator Author

hcho3 commented Sep 25, 2020

#198, #199, #201, and #203 have been merged into a separate branch, multi_type_support_refactor. Once the XGBoost JSON parser (#202) is merged into the mainline, the refactor branch will also be merged into the mainline.

@hcho3 hcho3 closed this Sep 25, 2020
@hcho3 hcho3 deleted the multi_type_support2 branch September 25, 2020 09:37
hcho3 added a commit that referenced this pull request Oct 9, 2020
…196) (#199)

* The runtime now queries the data type for the model it loads, via QueryThresholdType() and QueryLeafOutputType(). Every compiled model now embeds the type information.
* Implement a full-fledged data matrix class DMatrix in the runtime, to replace DenseBatch and SparseBatch. The former *Batch classes assumed float32 data, whereas the new DMatrix class is able to handle both float32 and float64.
* Some API functions, such as PredictBatch(), now take void* pointers to accommodate multiple data types.

Co-authored-by: Yuta Hinokuma <higumachan@users.noreply.github.com>
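As a rough illustration of how a runtime can accept either float32 or float64 data through a `void*` pointer, here is a minimal sketch. `ElementType`, `DenseMatrixView`, `RowSumTyped`, and `RowSum` are hypothetical names chosen for this example and are not the runtime's actual DMatrix API. The matrix carries a type tag next to its untyped buffer, and a switch picks the typed implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical tag for the element type stored in the buffer.
enum class ElementType { kFloat32, kFloat64 };

// Hypothetical dense matrix view: untyped buffer plus a type tag.
struct DenseMatrixView {
  const void* data;  // row-major buffer; element type is given by `type`
  ElementType type;
  std::size_t num_row;
  std::size_t num_col;
};

// Typed worker: reinterprets the buffer once the element type is known.
template <typename T>
double RowSumTyped(const DenseMatrixView& m, std::size_t row) {
  const T* values = static_cast<const T*>(m.data);
  double sum = 0.0;
  for (std::size_t col = 0; col < m.num_col; ++col) {
    sum += static_cast<double>(values[row * m.num_col + col]);
  }
  return sum;
}

// Untyped entry point: checks the tag and forwards to the typed worker.
double RowSum(const DenseMatrixView& m, std::size_t row) {
  assert(m.data != nullptr);
  if (row >= m.num_row) {
    throw std::out_of_range("row index out of range");
  }
  switch (m.type) {
    case ElementType::kFloat32:
      return RowSumTyped<float>(m, row);
    case ElementType::kFloat64:
      return RowSumTyped<double>(m, row);
  }
  throw std::invalid_argument("unknown element type");
}
```

The same pattern extends to output buffers: the caller passes a `void*` and the callee casts it according to the type information embedded in the compiled model.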
hcho3 added a commit that referenced this pull request Oct 9, 2020
…#201)

* Upgrade C++ standard to C++14.
* Split struct Model into class Model and class ModelImpl. The ModelImpl class will soon become a template class in order to hold Tree objects with uint32, float32, or float64 type. The Model class will become an abstract class so as to avoid exposing ModelImpl to the external interface. (It's very hard to pass template classes through an FFI boundary.)
* Change signature of methods that return Model, since Model is now an abstract class. These functions now return std::unique_ptr<Model>.
* Move bodies of tiny methods from tree_impl.h to tree.h. This will reduce verbosity once ModelImpl becomes a template class.

Co-authored-by: Andy Adinets <aadinets@nvidia.com>
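The split described in this commit can be sketched as follows. This is an illustration under assumptions, not the code that was merged: `NumTrees`, `AddTree`, and the `ExampleModel*` C functions are hypothetical names, and `Tree` is a placeholder. The abstract base class is the only type that crosses the FFI boundary, and factory functions return `std::unique_ptr<Model>` because an abstract class cannot be returned by value.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

struct Tree { int num_nodes = 0; };  // placeholder for the real tree type

// Abstract interface: the only type exposed to external callers.
class Model {
 public:
  virtual ~Model() = default;
  virtual std::size_t NumTrees() const = 0;
  // Factory returns a pointer because Model itself cannot be instantiated.
  static std::unique_ptr<Model> Create(std::size_t num_trees);
};

// Concrete implementation, hidden behind the interface.
class ModelImpl : public Model {
 public:
  std::size_t NumTrees() const override { return trees_.size(); }
  void AddTree(const Tree& tree) { trees_.push_back(tree); }

 private:
  std::vector<Tree> trees_;
};

std::unique_ptr<Model> Model::Create(std::size_t num_trees) {
  auto impl = std::make_unique<ModelImpl>();
  for (std::size_t i = 0; i < num_trees; ++i) {
    impl->AddTree(Tree{});
  }
  return std::unique_ptr<Model>(std::move(impl));
}

// Hypothetical C-style wrappers: callers only ever see an opaque handle.
extern "C" {
void* ExampleModelCreate(std::size_t num_trees) {
  return Model::Create(num_trees).release();
}
std::size_t ExampleModelNumTrees(void* handle) {
  return static_cast<Model*>(handle)->NumTrees();
}
void ExampleModelFree(void* handle) {
  delete static_cast<Model*>(handle);
}
}
```

Because the C wrappers only touch `Model*`, ModelImpl can later become a template without changing the exported functions.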
hcho3 added a commit that referenced this pull request Oct 9, 2020
… data types (Part of #196) (#198)

* The ModelImpl class (created in #201) becomes the template class ModelImpl<ThresholdType, LeafOutputType>.
* Implement template classes ModelImpl<ThresholdType, LeafOutputType> and TreeImpl<ThresholdType, LeafOutputType> that contain the details of the tree ensemble model. The template classes are parameterized by the types of the thresholds and leaf outputs. Currently the following combinations are allowed:

| Threshold type | Leaf output type |
|----------------|------------------|
| float32        | float32          |
| float32        | uint32           |
| float64        | float64          |
| float64        | uint32           |

* Revise the zero-copy serialization protocol to prepend the type information (threshold_type, leaf_output_type) to the serialized data, so that the recipient can choose the correct ModelImpl<ThresholdType, LeafOutputType> to deserialize into.
* Add a run-time type dispatching system using the enum type TypeInfo. Users can dispatch the correct version of ModelImpl<ThresholdType, LeafOutputType> by specifying a pair of TypeInfo values. We also implement a set of convenience functions, such as InferTypeInfoOf<T>, which converts the template argument T into a TypeInfo enum value.
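A minimal sketch of such a dispatching system is shown here. `TypeInfo` and `InferTypeInfoOf<T>` are named in the commit message, but their exact definitions are assumptions in this example; `TypePair`, `DispatchWithTypes`, and `NodePayloadBytes` are hypothetical names. The dispatcher accepts only the four combinations listed in the table and rejects everything else at run time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Assumed shape of the enum; the real enumerators may differ.
enum class TypeInfo : std::uint8_t { kInvalid, kUInt32, kFloat32, kFloat64 };

// Convert a template argument into its TypeInfo value at compile time.
template <typename T>
constexpr TypeInfo InferTypeInfoOf() { return TypeInfo::kInvalid; }
template <>
constexpr TypeInfo InferTypeInfoOf<std::uint32_t>() { return TypeInfo::kUInt32; }
template <>
constexpr TypeInfo InferTypeInfoOf<float>() { return TypeInfo::kFloat32; }
template <>
constexpr TypeInfo InferTypeInfoOf<double>() { return TypeInfo::kFloat64; }

// Hypothetical tag object that carries the two types into a generic lambda.
template <typename ThresholdType, typename LeafOutputType>
struct TypePair {
  using threshold_type = ThresholdType;
  using leaf_output_type = LeafOutputType;
};

// Map a pair of run-time tags to one of the four allowed combinations.
// func must return the same type for every combination.
template <typename Func>
auto DispatchWithTypes(TypeInfo threshold, TypeInfo leaf_output, Func func) {
  if (threshold == TypeInfo::kFloat32) {
    if (leaf_output == TypeInfo::kFloat32) return func(TypePair<float, float>{});
    if (leaf_output == TypeInfo::kUInt32) return func(TypePair<float, std::uint32_t>{});
  } else if (threshold == TypeInfo::kFloat64) {
    if (leaf_output == TypeInfo::kFloat64) return func(TypePair<double, double>{});
    if (leaf_output == TypeInfo::kUInt32) return func(TypePair<double, std::uint32_t>{});
  }
  throw std::invalid_argument("unsupported (threshold, leaf output) combination");
}

// Example use: bytes needed for one threshold plus one leaf output.
std::size_t NodePayloadBytes(TypeInfo threshold, TypeInfo leaf_output) {
  return DispatchWithTypes(threshold, leaf_output, [](auto types) {
    using Types = decltype(types);
    return sizeof(typename Types::threshold_type) +
           sizeof(typename Types::leaf_output_type);
  });
}

static_assert(InferTypeInfoOf<float>() == TypeInfo::kFloat32,
              "float maps to kFloat32");
```

A deserializer could read the two tags from the front of the serialized data and pass them to a dispatcher like this one to select the matching ModelImpl specialization.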

Co-authored-by: William Hicks <wphicks@users.noreply.github.com>
Co-authored-by: Andy Adinets <aadinets@nvidia.com>
hcho3 added a commit that referenced this pull request Oct 9, 2020
Addresses #95 and #111.
Follow-up to #198, #199, #201

Trying again, since #130 failed. This time, I made the Model class polymorphic, which minimizes the amount of pointer indirection.

Summary: Model is an opaque container that wraps the polymorphic handle ModelImpl<ThresholdType, LeafOutputType>. The handle in turn stores the list of trees Tree<ThresholdType, LeafOutputType>. To unbox the Model container and obtain ModelImpl<ThresholdType, LeafOutputType>, use Model::Dispatch(<lambda expression>).

Also, upgrade to C++14 to access the generic lambda feature, which proved to be very useful in the dispatching logic for the polymorphic Model class.

* Turn the Model and Tree classes into template classes
* Revise the string templates so that correct data types are used in the generated C code
* Rewrite the model builder class
* Revise the zero-copy serializer
* Create an abstract matrix class that supports multiple data types (float32, float64 for now).
* Move the DMatrix class to the runtime.
* Extend the DMatrix class so that it can hold float32 and float64.
* Redesign the C runtime API using the DMatrix class.
* Ensure accuracy of scikit-learn models. To achieve the best results, use float32 for the input matrix and float64 for the split thresholds and leaf outputs.
* Revise the JVM runtime.
@hcho3 hcho3 mentioned this pull request Oct 9, 2020