TreeEnsemble speed up #17449
@xadupre, perhaps this is most of interest for you?
Is it possible to know the number of trees and the depth of the ensembles used in the benchmark?
The
NYC: 100 trees, maximum depth is 13.
/azp run Linux CPU CI Pipeline,Windows CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,ONNX Runtime Web CI Pipeline,Windows GPU CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline
Azure Pipelines successfully started running 7 pipeline(s).
It looks like the Linux arm64 build CI is not operational (https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1134836&view=logs&jobId=1e0830bb-fd74-5d0a-5029-1c63b4266d7b). Unfortunately, I don't have the ability to retrigger the CI.
CI looks green @xadupre. Would it be possible to add this to the merge queue?
The list of PRs scheduled to be merged into 1.16 was already frozen. But these changes are currently available through the nightly build.
Thanks for the clarification; albeit a disappointing one :(. Is there some public information about the logistics involving the upcoming release (code freeze dates, the mentioned list of PRs, etc.)?
### Description

This PR proposes a change that should speed up inference for the TreeEnsemble* kernels. Previously, when traversing a decision tree, the `TreeNodeElement` pointer would be incremented or decremented to the appropriate child node - I assume this was because the `truenode_inc_or_first_weight` and `falsenode_inc_or_n_weights` members were overloaded for two purposes.

In this PR, we now assign the true branch pointer. We also initialise `nodes_` in a pre-order traversal, which means that the false branch's position can be resolved statically and does not need to be stored.

I observe the following speed ups. The benchmarks used are derived from those in https://github.com/siboehm/lleaves/tree/master/benchmarks and the baseline is the main branch.

NYC Dataset
--------------

| Number of threads | Baseline | Pointer assignment | Pre-ordered initialisation | Pointer assignment % improvement | Pre-ordered initialisation % improvement |
|--------------------:|-----------:|---------------------:|-----------------------------:|-----------------------------------:|-------------------------------------------:|
| 1 | 176.539 | 155.709 | 145.119 | 11.7989 | 17.7976 |
| 4 | 59.9015 | 51.9652 | 50.0884 | 13.2488 | 16.382 |
| 8 | 34.5561 | 31.3024 | 28.2535 | 9.41581 | 18.2387 |

Airline Dataset
---------------

| Number of threads | Baseline | Pointer assignment | Pre-ordered initialisation | Pointer assignment % improvement | Pre-ordered initialisation % improvement |
|--------------------:|-----------:|---------------------:|-----------------------------:|-----------------------------------:|-------------------------------------------:|
| 1 | 2127.34 | 1389.7 | 920.373 | 34.6745 | 56.736 |
| 4 | 723.307 | 481.634 | 310.618 | 33.4122 | 57.0558 |
| 8 | 420.722 | 278.397 | 185.265 | 33.8286 | 55.9651 |

mtpl2 Dataset
--------------

| Number of threads | Baseline | Pointer assignment | Pre-ordered initialisation | Pointer assignment % improvement | Pre-ordered initialisation % improvement |
|--------------------:|-----------:|---------------------:|-----------------------------:|-----------------------------------:|-------------------------------------------:|
| 1 | 1143.62 | 1020.04 | 998.171 | 10.8055 | 13.0988 |
| 4 | 386.153 | 339.905 | 328.061 | 11.9764 | 14.3729 |
| 8 | 225.995 | 200.665 | 199.057 | 11.2084 | 13.4408 |

These were run using an M2 Pro with 16GB of RAM. All times are in milliseconds and are averages over 10 runs with a batch size of 100,000.

### Motivation and Context

Performance improvements.
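The "% improvement" columns in the benchmark tables appear to be the relative reduction in run time against the baseline. The sketch below shows that arithmetic; the function name `PercentImprovement` is hypothetical and is not part of the PR or of the benchmark scripts.

```cpp
#include <cassert>

// Hypothetical helper (not from the PR): relative reduction in run time,
// in percent, of an optimised timing against a baseline timing.
// Both timings must use the same unit (milliseconds in the tables).
inline double PercentImprovement(double baseline_ms, double optimised_ms) {
  assert(baseline_ms > 0.0);
  return (baseline_ms - optimised_ms) / baseline_ms * 100.0;
}
```

For example, the single-threaded NYC row gives `PercentImprovement(176.539, 155.709)`, which is roughly 11.8 and agrees with the reported 11.7989 up to rounding of the published timings.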
Cherry-pick the following PRs to the release branch:

- Fix: Fail to skip disabledmodel in winml (#17728)
- Move dotnet build and test into docker in Linux CPU CI (#17417)
- Run Nuget_Test_Linux_GPU in container (#17452)
- Run Final_Jar_Testing_Linux_GPU in docker (#17533)
- TreeEnsemble speed up (#17449)
- Remove onnxruntime extensions from list of gitmodules (#17615)
- Include onnxruntime_float16.h in the package. (#17637)
- Fix static quantization for QDQ and Percentile distribution (#17649)
- [TensorRT EP] Back out the PerThreadContext (#17690)
- Update nodejs to 18.x (#17657)
- Update linux-wasm-ci.yml: remove the ln command (#17735)
Description
This PR proposes a change that should speed up inference for the TreeEnsemble* kernels. Previously, when traversing a decision tree, the `TreeNodeElement` pointer would be incremented or decremented to the appropriate child node - I assume this was because the `truenode_inc_or_first_weight` and `falsenode_inc_or_n_weights` members were overloaded for two purposes.

In this PR, we now assign the true branch pointer. We also initialise `nodes_` in a pre-order traversal, which means that the false branch's position can be resolved statically and does not need to be stored; this should also lead to somewhat better cache locality.

I observe the following speed ups. The benchmarks used are derived from those in https://github.com/siboehm/lleaves/tree/master/benchmarks and the baseline is the main branch.
[Benchmark tables: NYC Dataset, Airline Dataset, mtpl2 Dataset]
These were run using an M2 Pro with 16GB of RAM. All times are in milliseconds and are averages over 10 runs with a batch size of 100,000.
Motivation and Context
Performance improvements.
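To illustrate the layout described above, here is a minimal sketch, assuming a simplified node type. `Node`, `Predict`, and `BuildExample` are hypothetical names for illustration only and are not the actual `TreeNodeElement` definition or kernel code. Nodes are stored in pre-order with the false subtree placed directly after its parent, so only the true branch needs a stored pointer.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified node; NOT the real onnxruntime TreeNodeElement.
struct Node {
  bool is_leaf;
  int feature;             // index of the feature tested at a branch node
  float threshold;         // branch is "true" when x[feature] <= threshold
  const Node* true_child;  // assigned pointer to the true branch
  float value;             // output stored at a leaf
};

// Walks one tree. Because nodes are laid out in pre-order with the false
// subtree first, the false child is always the next element: node + 1.
inline float Predict(const Node* node, const float* x) {
  assert(node != nullptr);
  while (!node->is_leaf) {
    node = (x[node->feature] <= node->threshold) ? node->true_child : node + 1;
  }
  return node->value;
}

// Builds a three-node example tree in pre-order:
//   [0] branch on x[0] <= 0.5
//   [1] leaf 20 (false child, stored at parent + 1)
//   [2] leaf 10 (true child, reached through the stored pointer)
inline void BuildExample(std::vector<Node>& nodes) {
  nodes = {
      {false, 0, 0.5f, nullptr, 0.0f},
      {true, 0, 0.0f, nullptr, 20.0f},
      {true, 0, 0.0f, nullptr, 10.0f},
  };
  // Assign the pointer only after the vector has its final size, so the
  // address stays valid.
  nodes[0].true_child = &nodes[2];
}
```

Calling `Predict(nodes.data(), x)` with `x[0] = 0.25f` follows the stored pointer to the leaf with value 10, while `x[0] = 0.75f` steps to the adjacent element and reaches the leaf with value 20. Which branch is stored and which is implicit is taken from the description above; the comparison operator used here is an assumption.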