TreeEnsemble speed up #17449

Merged
merged 7 commits into from
Sep 12, 2023

Conversation

adityagoel4512
Contributor

@adityagoel4512 adityagoel4512 commented Sep 7, 2023

Description

This PR proposes a change that should speed up inference for the `TreeEnsemble*` kernels. Previously, when traversing a decision tree, the `TreeNodeElement` pointer was incremented or decremented to reach the appropriate child node. I assume this was because the `truenode_inc_or_first_weight` and `falsenode_inc_or_n_weights` members were each overloaded for two purposes.

In this PR, we now assign the true branch pointer directly. We also initialise `nodes_` in pre-order, which means that the false branch's position can be resolved statically and does not need to be stored. This should also lead to somewhat better cache locality.
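
To illustrate the traversal described above, here is a minimal sketch. It is not the actual kernel code: `Node` and `Predict` are hypothetical names, the struct is a simplified stand-in for `TreeNodeElement`, and the `<=` branch rule is assumed purely for illustration. The point it shows is that only the true child needs a stored pointer, while the false child is always the next array element.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified node; NOT the real TreeNodeElement layout.
struct Node {
  bool is_leaf;
  int feature_id;          // feature tested at a branch node
  float threshold;         // assumed branch rule: x[feature_id] <= threshold
  float value;             // output stored at a leaf
  const Node* true_child;  // stored explicitly; the false child is implicit at (this + 1)
};

// Walk one tree whose nodes are stored in pre-order, with the false
// subtree placed immediately after its parent.
float Predict(const Node* node, const std::vector<float>& x) {
  while (!node->is_leaf) {
    node = (x[node->feature_id] <= node->threshold) ? node->true_child : node + 1;
  }
  return node->value;
}
```

For a three-node array laid out as `[root, false leaf, true leaf]`, the root's `true_child` would point at element 2, and a failed test simply steps to element 1.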

I observe the following speedups. The benchmarks are derived from those in https://github.com/siboehm/lleaves/tree/master/benchmarks, and the baseline is the main branch.

NYC Dataset

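| Number of threads | Baseline (ms) | Pointer assignment (ms) | Pre-ordered initialisation (ms) | Pointer assignment % improvement | Pre-ordered initialisation % improvement |
|------------------:|--------------:|------------------------:|--------------------------------:|---------------------------------:|-----------------------------------------:|
| 1 | 176.539 | 155.709 | 145.119 | 11.7989 | 17.7976 |
| 4 | 59.9015 | 51.9652 | 50.0884 | 13.2488 | 16.382 |
| 8 | 34.5561 | 31.3024 | 28.2535 | 9.41581 | 18.2387 |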

Airline Dataset

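| Number of threads | Baseline (ms) | Pointer assignment (ms) | Pre-ordered initialisation (ms) | Pointer assignment % improvement | Pre-ordered initialisation % improvement |
|------------------:|--------------:|------------------------:|--------------------------------:|---------------------------------:|-----------------------------------------:|
| 1 | 2127.34 | 1389.7 | 920.373 | 34.6745 | 56.736 |
| 4 | 723.307 | 481.634 | 310.618 | 33.4122 | 57.0558 |
| 8 | 420.722 | 278.397 | 185.265 | 33.8286 | 55.9651 |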

mtpl2 Dataset

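| Number of threads | Baseline (ms) | Pointer assignment (ms) | Pre-ordered initialisation (ms) | Pointer assignment % improvement | Pre-ordered initialisation % improvement |
|------------------:|--------------:|------------------------:|--------------------------------:|---------------------------------:|-----------------------------------------:|
| 1 | 1143.62 | 1020.04 | 998.171 | 10.8055 | 13.0988 |
| 4 | 386.153 | 339.905 | 328.061 | 11.9764 | 14.3729 |
| 8 | 225.995 | 200.665 | 199.057 | 11.2084 | 13.4408 |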

These were run on an M2 Pro with 16GB of RAM. All times are in milliseconds and are averages over 10 runs with a batch size of 100,000.
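
The "% improvement" columns appear consistent with the relative reduction in runtime against the baseline. The helper below is a small sketch under that assumption; `PercentImprovement` is a hypothetical name and is not part of the benchmark scripts.

```cpp
#include <cassert>
#include <cmath>

// Relative reduction in runtime versus the baseline, in percent.
// Assumed formula: (baseline - new) / baseline * 100.
double PercentImprovement(double baseline_ms, double new_ms) {
  return (baseline_ms - new_ms) / baseline_ms * 100.0;
}
```

For example, the single-threaded NYC row gives (176.539 - 155.709) / 176.539 * 100, which is roughly 11.8 and matches the reported value up to rounding of the displayed times.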

Motivation and Context

Performance improvements.

@adityagoel4512
Contributor Author

@xadupre, perhaps this is most of interest for you?

@xadupre
Member

xadupre commented Sep 7, 2023

Is it possible to know the number of trees and the depth of the ensembles used in the benchmark?

@adityagoel4512
Contributor Author

adityagoel4512 commented Sep 9, 2023

The `TreeNodeElement` struct is now the same size as before. We now initialise `nodes_` in pre-order, so that the false child node can be resolved statically (it is in the next position in the array) and a reference to it doesn't need to be stored. This also gives some further performance gains. I have updated the PR with an additional column illustrating this.
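
As a rough sketch of what a pre-order initialisation could look like (this is not the code in the PR; `SrcNode`, `FlatNode` and `Flatten` are hypothetical names chosen for illustration), the function below flattens a pointer-based tree into an array, emitting the false subtree first so that it always lands at the parent's index plus one. Only the true child's index is recorded.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical input tree, for illustration only.
struct SrcNode {
  int feature_id = 0;
  float threshold = 0.0f;
  float value = 0.0f;
  std::unique_ptr<SrcNode> false_child;  // both children null => leaf
  std::unique_ptr<SrcNode> true_child;
};

// Hypothetical flattened node; the false child is implicit at (own index + 1).
struct FlatNode {
  int feature_id;
  float threshold;
  float value;
  std::size_t true_index;  // 0 marks a leaf, since the root is never a child
};

// Append `src` and its subtrees to `out` in pre-order, false subtree first.
// Returns the index at which `src` was stored.
std::size_t Flatten(const SrcNode& src, std::vector<FlatNode>& out) {
  const std::size_t self = out.size();
  out.push_back({src.feature_id, src.threshold, src.value, 0});
  if (src.false_child && src.true_child) {
    Flatten(*src.false_child, out);  // stored at self + 1
    const std::size_t t = Flatten(*src.true_child, out);
    out[self].true_index = t;        // an index, because `out` may reallocate
  }
  return self;
}
```

Indices are used here instead of pointers because the vector can reallocate while it grows; once the array is complete, the indices could be converted to pointers in a single pass.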

> Is it possible to know the number of trees and the depth of the ensembles used in the benchmark?

NYC: 100 trees, maximum depth is 13.
Airline: 100 trees, maximum depth is 15. These trees have quite a few set memberships which get rolled out by ONNX.
mtpl2: 1000 trees, maximum depth is 20.

@xadupre
Member

xadupre commented Sep 10, 2023

/azp run Linux CPU CI Pipeline,Windows CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,ONNX Runtime Web CI Pipeline,Windows GPU CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline

@azure-pipelines

Azure Pipelines successfully started running 7 pipeline(s).

@xadupre
Member

xadupre commented Sep 10, 2023

/azp run Linux CPU CI Pipeline,Windows CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,ONNX Runtime Web CI Pipeline,Windows GPU CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline

@azure-pipelines

Azure Pipelines successfully started running 7 pipeline(s).

@adityagoel4512
Contributor Author

It looks like the Linux arm64 build CI is not operational (https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1134836&view=logs&jobId=1e0830bb-fd74-5d0a-5029-1c63b4266d7b). I don't have the ability to retrigger the CI unfortunately.

@xadupre
Member

xadupre commented Sep 10, 2023

/azp run Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows GPU TensorRT CI Pipeline,orttraining-linux-ci-pipeline

@xadupre
Member

xadupre commented Sep 10, 2023

/azp run orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

@azure-pipelines

Azure Pipelines successfully started running 7 pipeline(s).

@xadupre
Member

xadupre commented Sep 10, 2023

/azp run Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows GPU TensorRT CI Pipeline,orttraining-linux-ci-pipeline

@azure-pipelines

Azure Pipelines successfully started running 7 pipeline(s).

@xadupre
Member

xadupre commented Sep 10, 2023

/azp run Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@xadupre
Member

xadupre commented Sep 11, 2023

/azp run Linux CPU CI Pipeline,Windows CPU CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@xadupre
Member

xadupre commented Sep 11, 2023

/azp run Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed

@azure-pipelines

Azure Pipelines successfully started running 5 pipeline(s).

@xadupre
Member

xadupre commented Sep 11, 2023

/azp run Linux CPU CI Pipeline,Windows CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,ONNX Runtime Web CI Pipeline,Windows GPU CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline

@xadupre
Member

xadupre commented Sep 11, 2023

/azp run MacOS CI Pipeline,ONNX Runtime Web CI Pipeline,Windows ARM64 QNN CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline

@xadupre
Member

xadupre commented Sep 11, 2023

/azp run Linux CPU CI Pipeline,Windows CPU CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

@xadupre
Member

xadupre commented Sep 11, 2023

/azp run Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows GPU TensorRT CI Pipeline,orttraining-linux-ci-pipeline

@azure-pipelines

Azure Pipelines successfully started running 7 pipeline(s).

@xadupre
Member

xadupre commented Sep 11, 2023

/azp run orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

@xadupre
Member

xadupre commented Sep 11, 2023

/azp run Windows ARM64 QNN CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@adityagoel4512
Contributor Author

CI looks green @xadupre. Would it be possible to add this to the merge queue?

@xadupre xadupre merged commit db558ef into microsoft:main Sep 12, 2023
66 checks passed
@cbourjau
Contributor

Might it be possible to get these changes into the upcoming 1.16 release, @xadupre @snnn? The TreeEnsembleRegressor is one of the main workhorses in our deployments, and the changes in this PR seem isolated from the rest of the project.

@xadupre
Member

xadupre commented Sep 14, 2023

The list of PRs scheduled to be merged into 1.16 was already frozen. But these changes are currently available through the nightly build.

@cbourjau
Contributor

Thanks for the clarification; albeit a disappointing one :(. Is there some public information about the logistics involving the upcoming release (code freeze dates, the mentioned list of PRs, etc.)?

snnn pushed a commit that referenced this pull request Sep 29, 2023
snnn added a commit that referenced this pull request Oct 2, 2023
Cherry-pick the following PRs to the release branch:

Fix: Fail to skip disabledmodel in winml (#17728) 
Move dotnet build and test into docker in Linux CPU CI (#17417) 
Run Nuget_Test_Linux_GPU in container (#17452) 
Run Final_Jar_Testing_Linux_GPU in docker (#17533) 
TreeEnsemble speed up (#17449) 
Remove onnxruntime extensions from list of gitmodules (#17615) 
Include onnxruntime_float16.h in the package. (#17637) 
Fix static quantization for QDQ and Percentile distribution (#17649) 
[TensorRT EP] Back out the PerThreadContext (#17690) 
Update nodejs to 18.x (#17657) 
Update linux-wasm-ci.yml: remove the ln command (#17735)
kleiti pushed a commit to kleiti/onnxruntime that referenced this pull request Mar 22, 2024