TreeEnsemble speed up #17449

adityagoel4512 · 2023-09-07T12:39:05Z

Description

This PR proposes a change that should speed up inference for the TreeEnsemble* kernels. Previously, when traversing a decision tree, the TreeNodeElement pointer would be incremented or decremented to reach the appropriate child node. I assume this was because the truenode_inc_or_first_weight and falsenode_inc_or_n_weights members were each overloaded for two purposes.

In this PR, we now assign the true branch pointer directly. We also initialise nodes_ in a pre-order traversal, which means that the false branch's position can be resolved statically and does not need to be stored. This should also lead to somewhat better cache locality.
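To illustrate the traversal described above, here is a minimal sketch. It assumes a simplified, hypothetical `Node` struct (not the actual TreeNodeElement definition), a `<=` comparison, and a layout in which the false child is stored directly after its parent; `Predict` and `MakeExampleTree` are illustrative names only.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified node; NOT the real TreeNodeElement layout.
struct Node {
  bool is_leaf;
  int feature_id;             // feature tested at a branch node
  float threshold_or_value;   // threshold for branches, prediction for leaves
  const Node* true_child;     // assigned pointer to the true branch (null for leaves)
};

// Walks one tree stored in pre-order. The false child of a branch node is
// assumed to be the element right after it, so only the true child is stored.
inline float Predict(const Node* root, const std::vector<float>& x) {
  assert(root != nullptr);
  const Node* node = root;
  while (!node->is_leaf) {
    node = (x[node->feature_id] <= node->threshold_or_value) ? node->true_child
                                                             : node + 1;
  }
  return node->threshold_or_value;
}

// Builds a three-node tree in pre-order (false subtree first):
//   [0] branch: x[0] <= 0.5 ? go to [2] : go to [1]
//   [1] leaf with value 10
//   [2] leaf with value 20
inline std::vector<Node> MakeExampleTree() {
  std::vector<Node> nodes(3);
  nodes[0] = {false, 0, 0.5f, nullptr};
  nodes[1] = {true, 0, 10.0f, nullptr};
  nodes[2] = {true, 0, 20.0f, nullptr};
  nodes[0].true_child = &nodes[2];  // valid as long as the vector is not resized
  return nodes;
}
```

With this layout, taking the false branch is a fixed `node + 1` step and taking the true branch is a single pointer load; no per-node offset arithmetic is needed.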

I observe the following speed ups. The benchmarks used are derived from those in https://github.com/siboehm/lleaves/tree/master/benchmarks and the baseline is the main branch.

NYC Dataset

Number of threads	Baseline	Pointer assignment	Pre-ordered initialisation	Pointer assignment % improvement	Pre-ordered initialisation % improvement
1	176.539	155.709	145.119	11.7989	17.7976
4	59.9015	51.9652	50.0884	13.2488	16.382
8	34.5561	31.3024	28.2535	9.41581	18.2387

Airline Dataset

Number of threads	Baseline	Pointer assignment	Pre-ordered initialisation	Pointer assignment % improvement	Pre-ordered initialisation % improvement
1	2127.34	1389.7	920.373	34.6745	56.736
4	723.307	481.634	310.618	33.4122	57.0558
8	420.722	278.397	185.265	33.8286	55.9651

mtpl2 Dataset

Number of threads	Baseline	Pointer assignment	Pre-ordered initialisation	Pointer assignment % improvement	Pre-ordered initialisation % improvement
1	1143.62	1020.04	998.171	10.8055	13.0988
4	386.153	339.905	328.061	11.9764	14.3729
8	225.995	200.665	199.057	11.2084	13.4408

These were run on an M2 Pro with 16GB of RAM. All times are in milliseconds and are averages over 10 runs with a batch size of 100,000.
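The percentage columns appear to be consistent with the usual relative-improvement formula, (baseline - candidate) / baseline. That formula is an assumption here, since the PR does not state it; `PercentImprovement` is a hypothetical helper used only to check the arithmetic.

```cpp
#include <cassert>

// Relative improvement of `candidate_ms` over `baseline_ms`, in percent.
// Assumed formula; the PR does not spell out how the columns were computed.
inline double PercentImprovement(double baseline_ms, double candidate_ms) {
  assert(baseline_ms > 0.0);
  return 100.0 * (baseline_ms - candidate_ms) / baseline_ms;
}
```

For example, the single-threaded NYC row gives `PercentImprovement(176.539, 145.119)`, which is roughly 17.8, in line with the reported value up to rounding of the published timings.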

Motivation and Context

Performance improvements.

adityagoel4512 · 2023-09-07T12:39:40Z

@xadupre, perhaps this is of most interest to you?

xadupre · 2023-09-07T12:51:38Z

Is it possible to know the number of trees and the depth of the ensembles used in the benchmark?

adityagoel4512 · 2023-09-09T01:28:37Z

The TreeNodeElement struct is now the same size as before. We now initialise nodes_ in a pre-order fashion, so that the false child node can be resolved statically (it is in the next position in the array) and a reference to it doesn't need to be stored. This also gives some further performance gains. I have updated the PR with an additional column illustrating this.
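As a rough sketch of what such a pre-order initialisation could look like, the following flattens a pointer-based tree into an array so that each false child lands directly after its parent. `SrcNode`, `FlatNode`, and `Flatten` are hypothetical names for illustration, and the sketch stores the true child as an index rather than a pointer; it is not the code from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical input tree; names are illustrative only.
struct SrcNode {
  float value;
  std::unique_ptr<SrcNode> false_child;
  std::unique_ptr<SrcNode> true_child;
};

// Hypothetical flattened node: only the true child's position is stored.
struct FlatNode {
  float value;
  bool is_leaf;
  std::size_t true_index;  // unused for leaves
};

// Emits `src` in pre-order, visiting the false subtree first, so the false
// child of out[i] always sits at out[i + 1]. Returns the index of the node.
inline std::size_t Flatten(const SrcNode& src, std::vector<FlatNode>& out) {
  const std::size_t self = out.size();
  const bool is_leaf = !src.false_child && !src.true_child;
  out.push_back({src.value, is_leaf, 0});
  if (!is_leaf) {
    assert(src.false_child && src.true_child);
    Flatten(*src.false_child, out);  // emitted at index self + 1
    // Compute the index first: the recursive call may reallocate `out`.
    const std::size_t true_index = Flatten(*src.true_child, out);
    out[self].true_index = true_index;
  }
  return self;
}
```

Indices are used here because the vector may reallocate while it grows; once the array is final, they could be converted to pointers in a second pass.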

Is it possible to know the number of trees and the depth of the ensembles used in the benchmark?

NYC: 100 trees, maximum depth is 13.
Airline: 100 trees, maximum depth is 15. These trees have quite a few set memberships which get rolled out by ONNX.
mtpl2: 1000 trees, maximum depth is 20.

xadupre · 2023-09-10T10:40:59Z

/azp run Linux CPU CI Pipeline,Windows CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,ONNX Runtime Web CI Pipeline,Windows GPU CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline

azure-pipelines · 2023-09-10T10:41:28Z

Azure Pipelines successfully started running 7 pipeline(s).

xadupre · 2023-09-10T13:40:13Z

/azp run Linux CPU CI Pipeline,Windows CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,ONNX Runtime Web CI Pipeline,Windows GPU CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline

azure-pipelines · 2023-09-10T13:40:44Z

Azure Pipelines successfully started running 7 pipeline(s).

adityagoel4512 · 2023-09-10T18:52:30Z

It looks like the Linux arm64 build CI is not operational (https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1134836&view=logs&jobId=1e0830bb-fd74-5d0a-5029-1c63b4266d7b). I don't have the ability to retrigger the CI unfortunately.

xadupre · 2023-09-10T21:42:52Z

/azp run Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows GPU TensorRT CI Pipeline,orttraining-linux-ci-pipeline

xadupre · 2023-09-10T21:43:08Z

/azp run orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed

azure-pipelines · 2023-09-10T21:43:21Z

Azure Pipelines successfully started running 2 pipeline(s).

azure-pipelines · 2023-09-10T21:43:21Z

Azure Pipelines successfully started running 7 pipeline(s).

xadupre · 2023-09-10T22:01:14Z

/azp run Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows GPU TensorRT CI Pipeline,orttraining-linux-ci-pipeline

azure-pipelines · 2023-09-10T22:01:38Z

Azure Pipelines successfully started running 7 pipeline(s).

xadupre · 2023-09-10T22:04:28Z

/azp run Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline

azure-pipelines · 2023-09-10T22:04:42Z

Azure Pipelines successfully started running 4 pipeline(s).

xadupre · 2023-09-11T11:37:51Z

/azp run Linux CPU CI Pipeline,Windows CPU CI Pipeline

azure-pipelines · 2023-09-11T11:38:07Z

Azure Pipelines successfully started running 2 pipeline(s).

azure-pipelines · 2023-09-11T12:38:23Z

Azure Pipelines successfully started running 4 pipeline(s).

xadupre · 2023-09-11T12:39:04Z

/azp run Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed

azure-pipelines · 2023-09-11T12:39:29Z

Azure Pipelines successfully started running 5 pipeline(s).

xadupre · 2023-09-11T13:25:34Z

/azp run Linux CPU CI Pipeline,Windows CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,ONNX Runtime Web CI Pipeline,Windows GPU CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline

xadupre · 2023-09-11T13:25:43Z

/azp run MacOS CI Pipeline,ONNX Runtime Web CI Pipeline,Windows ARM64 QNN CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline

xadupre · 2023-09-11T13:25:52Z

/azp run Linux CPU CI Pipeline,Windows CPU CI Pipeline

azure-pipelines · 2023-09-11T13:26:02Z

Azure Pipelines successfully started running 4 pipeline(s).

azure-pipelines · 2023-09-11T13:26:04Z

Azure Pipelines successfully started running 2 pipeline(s).

xadupre · 2023-09-11T13:26:05Z

/azp run Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows GPU TensorRT CI Pipeline,orttraining-linux-ci-pipeline

azure-pipelines · 2023-09-11T13:26:07Z

Azure Pipelines successfully started running 7 pipeline(s).

azure-pipelines · 2023-09-11T13:26:30Z

Azure Pipelines successfully started running 7 pipeline(s).

xadupre · 2023-09-11T14:33:51Z

/azp run orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed

azure-pipelines · 2023-09-11T14:34:07Z

Azure Pipelines successfully started running 2 pipeline(s).

xadupre · 2023-09-11T16:27:49Z

/azp run Windows ARM64 QNN CI Pipeline

azure-pipelines · 2023-09-11T16:27:58Z

Azure Pipelines successfully started running 1 pipeline(s).

adityagoel4512 · 2023-09-12T08:25:20Z

CI looks green @xadupre. Would it be possible to add this to the merge queue?

cbourjau · 2023-09-14T08:01:55Z

Might it be possible to get these changes into the upcoming 1.16 release, @xadupre @snnn? The TreeEnsembleRegressor is one of the main workhorses in our deployments, and the changes in this PR seem isolated from the rest of the project.

xadupre · 2023-09-14T08:32:57Z

The list of PRs scheduled to be merged into 1.16 was already frozen. But these changes are currently available through the nightly build.

cbourjau · 2023-09-14T09:32:14Z

Thanks for the clarification; albeit a disappointing one :(. Is there some public information about the logistics involving the upcoming release (code freeze dates, the mentioned list of PRs, etc.)?

Cherry-pick the following PRs to the release branch:

Fix: Fail to skip disabledmodel in winml (#17728)
Move dotnet build and test into docker in Linux CPU CI (#17417)
Run Nuget_Test_Linux_GPU in container (#17452)
Run Final_Jar_Testing_Linux_GPU in docker (#17533)
TreeEnsemble speed up (#17449)
Remove onnxruntime extensions from list of gitmodules (#17615)
Include onnxruntime_float16.h in the package. (#17637)
Fix static quantization for QDQ and Percentile distribution (#17649)
[TensorRT EP] Back out the PerThreadContext (#17690)
Update nodejs to 18.x (#17657)
Update linux-wasm-ci.yml: remove the ln command (#17735)

unionv1

b0ba16b

Remove end if

5355a20

adityagoel4512 force-pushed the union branch from 978f650 to 5355a20 on September 7, 2023 12:42

xadupre reviewed Sep 7, 2023

onnxruntime/core/providers/cpu/ml/tree_ensemble_common.h

xadupre reviewed Sep 7, 2023

onnxruntime/core/providers/cpu/ml/tree_ensemble_aggregator.h

Initialise nodes_ with preorder traversal and reduce size of TreeNodeElement

eca75d8

Remove stack import

ff40a75

mszhanyi mentioned this pull request Sep 9, 2023

Move Linux python test into docker #17479

Merged

xadupre reviewed Sep 9, 2023

onnxruntime/core/providers/cpu/ml/tree_ensemble_common.h

xadupre reviewed Sep 9, 2023

onnxruntime/core/providers/cpu/ml/tree_ensemble_common.h

Use static_cast in error message

e8be1e0

Use size_t

82dcb34

adityagoel4512 force-pushed the union branch from 730e86e to 4150436 on September 11, 2023 12:46

Lint

4bf61f7

adityagoel4512 force-pushed the union branch from 4150436 to 4bf61f7 on September 11, 2023 13:23

xadupre approved these changes Sep 11, 2023

xadupre merged commit db558ef into microsoft:main Sep 12, 2023
66 checks passed

adityagoel4512 deleted the union branch September 12, 2023 08:27

faxu added the release:1.16.1 label Sep 26, 2023

snnn removed the release:1.16.1 label Oct 2, 2023
