This issue tracks potential performance improvements for the reduce-then-scan algorithm that came out of discussion on PRs #1762, #1763, #1764, and #1765.
Upgrade single workgroup scan (and copy_if) with ideas from reduce-then-scan. We lowered the threshold at which we select single workgroup scan because it does not perform as well as reduce-then-scan. However, there is no reason we should not be able to do better in a single kernel, when the input fits, by using strategies similar to reduce-then-scan. We should also be able to relax our requirement for a known identity and open up single workgroup scan to more input types / operations.
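For reference, the overall reduce-then-scan structure can be sketched on the host as the usual two-pass shape: pass 1 reduces each tile to a partial sum, the partials are scanned to give each tile its carry-in, and pass 2 re-scans each tile seeded with that carry. The function name and tile size below are illustrative, not oneDPL internals.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Host-side sketch of reduce-then-scan; each tile stands in for one
// workgroup's portion of the input.
std::vector<int> reduce_then_scan(const std::vector<int>& in, std::size_t tile)
{
    std::size_t n = in.size();
    std::size_t num_tiles = (n + tile - 1) / tile;

    // Pass 1: per-tile reduction (one partial sum per "workgroup").
    std::vector<int> partials(num_tiles, 0);
    for (std::size_t t = 0; t < num_tiles; ++t)
        for (std::size_t i = t * tile; i < std::min(n, (t + 1) * tile); ++i)
            partials[t] += in[i];

    // Exclusive scan of the partials yields each tile's carry-in.
    std::vector<int> carries(num_tiles, 0);
    std::exclusive_scan(partials.begin(), partials.end(), carries.begin(), 0);

    // Pass 2: inclusive scan within each tile, seeded with the carry.
    std::vector<int> out(n);
    for (std::size_t t = 0; t < num_tiles; ++t) {
        int acc = carries[t];
        for (std::size_t i = t * tile; i < std::min(n, (t + 1) * tile); ++i)
            out[i] = (acc += in[i]);
    }
    return out;
}
```

A single workgroup version would keep both passes in one kernel launch over one tile, which is where the threshold and known-identity relaxations above come in.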
For Make Unique family of APIs use reduce_then_scan #1765:
a) For each input element, we load both the element and the previous element from the input sequence to compare them for uniqueness. This amounts to two dereferences of the input sequence (global memory) per element. It may benefit us to load the input into SLM first and read from SLM while processing. We need to study whether this is already handled implicitly by the caching system or whether an explicit SLM load would be a benefit. We would also need special infrastructure in the kernel to support kernels that benefit from the SLM load, since the other algorithm families should not benefit from this change.
b) We shift the scan by 1 and handle the 0th element specially. This may cause issues when storing to unaligned output. We should investigate whether we can do better by masking the 0th overall element, but it may be difficult to avoid additional branches when doing so.
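Both points above can be illustrated with a host-side sketch; everything here is illustrative (names, tile size) rather than the oneDPL kernel code. For (a), each tile is copied once into a staging buffer (a stand-in for SLM) so that uniqueness flags compare neighbors out of the staged copy instead of dereferencing global input twice per element; only the tile boundary needs one extra global read. For (b), the kept elements are scattered through an exclusive scan of the flags, with the 0th overall element handled specially (always kept).

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Illustrative host sketch: staged neighbor compares plus scan-and-scatter
// compaction, mimicking a unique-style pattern per tile.
std::vector<int> unique_via_staged_flags(const std::vector<int>& in, std::size_t tile)
{
    std::size_t n = in.size();
    std::vector<int> flags(n);
    std::vector<int> slm(tile); // per-workgroup staging buffer (stand-in for SLM)
    for (std::size_t base = 0; base < n; base += tile) {
        std::size_t len = std::min(n - base, tile);
        for (std::size_t i = 0; i < len; ++i)
            slm[i] = in[base + i];                      // one global load per element
        for (std::size_t i = 0; i < len; ++i) {
            if (base + i == 0)
                flags[0] = 1;                           // 0th overall element: special case
            else if (i == 0)
                flags[base] = (slm[0] != in[base - 1]); // tile boundary: one extra global read
            else
                flags[base + i] = (slm[i] != slm[i - 1]);
        }
    }
    // Exclusive scan of the flags gives each kept element its output slot.
    std::vector<int> pos(n);
    std::exclusive_scan(flags.begin(), flags.end(), pos.begin(), 0);
    std::vector<int> out(n ? pos[n - 1] + flags[n - 1] : 0);
    for (std::size_t i = 0; i < n; ++i)
        if (flags[i]) out[pos[i]] = in[i];
    return out;
}
```

The masked formulation discussed in (b) would fold the `base + i == 0` branch into the flag expression itself, at the cost of the extra branching the text mentions.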
For Add reduce then scan algorithm for transform scan family #1762, we currently work around the in-place exclusive scan by copying the input to another buffer and then performing an out-of-place exclusive scan. This is only required for the multi-pass scan, not the single workgroup scan or the reduce-then-scan algorithms. We should push this workaround lower into the code and apply it only to the algorithms that require it.
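A minimal sketch of the workaround's shape, with an illustrative function name: the input is copied to a temporary buffer, and the scan then reads from the copy while writing over the original. Pushing this lower would mean only the multi-pass path pays for the extra buffer and copy, since the other algorithms read each element before overwriting it.

```cpp
#include <numeric>
#include <vector>

// Sketch of the current workaround for in-place exclusive scan: copy the
// input aside, then scan out-of-place from the copy back into the
// original storage. Only the multi-pass scan path needs this.
void exclusive_scan_in_place_workaround(std::vector<int>& data, int init)
{
    std::vector<int> tmp(data); // extra buffer: the cost we want to avoid where possible
    std::exclusive_scan(tmp.begin(), tmp.end(), data.begin(), init);
}
```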