[Doc] Add limitation about TLS optimization (taichi-dev#4877)

* [Doc] Add limitation about TLS optimization
* Add link to reduction sum benchmark
* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Co-authored-by: Haidong Lan <turbo0628g@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
k-ye · May 5, 2022 · 71a9c84
1 parent 18fd02e
Showing 1 changed file with 7 additions and 3 deletions.
diff --git a/docs/lang/articles/advanced/performance.md b/docs/lang/articles/advanced/performance.md
@@ -153,14 +153,18 @@ Additionally, the last atomic add to the global memory `s[None]` is optimized us
 CUDA's warp-level intrinsics, further reducing the number of required atomic adds.
 
 Currently, Taichi supports TLS optimization for these reduction operators: `add`,
-`sub`, `min` and `max`. [Here](https://github.com/taichi-dev/taichi/pull/2956) is
-a benchmark comparison when running a global max reduction on a 1-D Taichi field
+`sub`, `min` and `max` on **0D** scalar/vector/matrix `ti.field`s. It is not yet
+supported on `ti.ndarray`s. [Here](https://github.com/taichi-dev/taichi/pull/2956)
+is a benchmark comparison when running a global max reduction on a 1-D Taichi field
 of 8M floats on an Nvidia GeForce RTX 3090 card:
 
 * TLS disabled: 5.2 x 1e3 us
 * TLS enabled: 5.7 x 1e1 us
 
-TLS has led to an approximately 100x speedup.
+TLS has led to an approximately 100x speedup. We also show that TLS reduction sum
+achieves comparable performance with CUDA implementations, see
+[benchmark](https://github.com/taichi-dev/taichi_benchmark/tree/main/reduce_sum) for
+details.
 
 ### Block Local Storage (BLS)
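
Note on the TLS hunk above: the idea behind thread-local storage reduction can be illustrated with a minimal plain-Python sketch. This is not Taichi's generated code, and it does not use the Taichi API. The function name `tls_style_sum`, the `num_threads` parameter, and the use of a `threading.Lock` to stand in for an atomic add are all assumptions made for illustration. Each worker accumulates into a private local variable and touches the shared destination (the analogue of a 0D field such as `s[None]`) only once.

```python
import threading


def tls_style_sum(values, num_threads=4):
    """Sum `values` using one synchronized update per worker, not per element.

    Hypothetical illustration of the TLS idea; not Taichi's implementation.
    """
    total = [0.0]            # stands in for the 0D destination, e.g. s[None]
    lock = threading.Lock()  # stands in for an atomic add on the destination
    # Ceiling division so every element falls into exactly one chunk.
    chunk = (len(values) + num_threads - 1) // num_threads

    def worker(start):
        local = 0.0  # thread-local accumulator, never shared
        for v in values[start:start + chunk]:
            local += v  # no synchronization needed here
        with lock:
            total[0] += local  # a single synchronized update per worker

    threads = [
        threading.Thread(target=worker, args=(i * chunk,))
        for i in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```

With `n` elements and `k` workers, the sketch performs `k` synchronized updates instead of `n`, which is the kind of saving the benchmark numbers in the hunk reflect. The order of floating-point additions differs from a sequential loop, so results on non-integer data may differ in the last bits.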