Commit

[Doc] Add limitation about TLS optimization (taichi-dev#4877)
* [Doc] Add limitation about TLS optimization

* Add link to reduction sum benchmark

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Co-authored-by: Haidong Lan <turbo0628g@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
3 people authored and k-ye committed May 5, 2022
1 parent 18fd02e commit 71a9c84
Showing 1 changed file with 7 additions and 3 deletions.
10 changes: 7 additions & 3 deletions docs/lang/articles/advanced/performance.md
@@ -153,14 +153,18 @@ Additionally, the last atomic add to the global memory `s[None]` is optimized us
CUDA's warp-level intrinsics, further reducing the number of required atomic adds.

Currently, Taichi supports TLS optimization for these reduction operators: `add`,
-`sub`, `min` and `max`. [Here](https://github.com/taichi-dev/taichi/pull/2956) is
-a benchmark comparison when running a global max reduction on a 1-D Taichi field
+`sub`, `min` and `max` on **0D** scalar/vector/matrix `ti.field`s. It is not yet
+supported on `ti.ndarray`s. [Here](https://github.com/taichi-dev/taichi/pull/2956)
+is a benchmark comparison when running a global max reduction on a 1-D Taichi field
of 8M floats on an Nvidia GeForce RTX 3090 card:

* TLS disabled: 5.2 x 1e3 us
* TLS enabled: 5.7 x 1e1 us

-TLS has led to an approximately 100x speedup.
+TLS has led to an approximately 100x speedup. We also show that TLS reduction sum
+achieves comparable performance with CUDA implementations, see
+[benchmark](https://github.com/taichi-dev/taichi_benchmark/tree/main/reduce_sum) for
+details.

### Block Local Storage (BLS)
