This issue tracks potential performance improvements for the reduce-then-scan algorithm that came out of discussion on PRs #1762, #1763, #1764, and #1765.
Upgrade single workgroup scan (and copy_if) with ideas from reduce-then-scan. We lowered the threshold at which we select single workgroup scan because it does not perform as well as reduce-then-scan. However, there is no reason we should not be able to do better in a single kernel, when the input fits, by using strategies similar to reduce-then-scan. We should also be able to relax our requirement for a known identity and open up single workgroup scan to more input types / operations.
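For reference, the overall reduce-then-scan structure can be sketched on the host as the usual two-pass shape: pass 1 reduces each tile to a partial sum, the partials are scanned to give each tile its carry-in, and pass 2 re-scans each tile seeded with that carry. The function name and tile size below are illustrative, not oneDPL internals.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Host-side sketch of reduce-then-scan; each tile stands in for one
// workgroup's portion of the input.
std::vector<int> reduce_then_scan(const std::vector<int>& in, std::size_t tile)
{
    std::size_t n = in.size();
    std::size_t num_tiles = (n + tile - 1) / tile;

    // Pass 1: per-tile reduction (one partial sum per "workgroup").
    std::vector<int> partials(num_tiles, 0);
    for (std::size_t t = 0; t < num_tiles; ++t)
        for (std::size_t i = t * tile; i < std::min(n, (t + 1) * tile); ++i)
            partials[t] += in[i];

    // Exclusive scan of the partials yields each tile's carry-in.
    std::vector<int> carries(num_tiles, 0);
    std::exclusive_scan(partials.begin(), partials.end(), carries.begin(), 0);

    // Pass 2: inclusive scan within each tile, seeded with the carry.
    std::vector<int> out(n);
    for (std::size_t t = 0; t < num_tiles; ++t) {
        int acc = carries[t];
        for (std::size_t i = t * tile; i < std::min(n, (t + 1) * tile); ++i)
            out[i] = (acc += in[i]);
    }
    return out;
}
```

A single workgroup version would keep both passes in one kernel launch over one tile, which is where the threshold and known-identity relaxations above come in.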
For Make Unique family of APIs use reduce_then_scan #1765:
a) For each input element, we load both the element and the previous element from the input sequence to compare them for uniqueness. This amounts to two dereferences of the input sequence (global memory) per element. It may benefit us to load the input into SLM first and read from SLM while processing. We need to study whether this is already handled implicitly by the caching system or whether an explicit SLM load would be a benefit. We would also need special infrastructure in the kernel to support kernels that benefit from the SLM load, since the other algorithm families should not benefit from this change.
b) We shift the scan by 1 and handle the 0th element specially. This may cause issues when storing to unaligned output. We should investigate whether we can do better by masking the 0th overall element, but it may be difficult to avoid additional branches when doing so.
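Both points above can be illustrated with a host-side sketch; everything here is illustrative (names, tile size) rather than the oneDPL kernel code. For (a), each tile is copied once into a staging buffer (a stand-in for SLM) so that uniqueness flags compare neighbors out of the staged copy instead of dereferencing global input twice per element; only the tile boundary needs one extra global read. For (b), the kept elements are scattered through an exclusive scan of the flags, with the 0th overall element handled specially (always kept).

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Illustrative host sketch: staged neighbor compares plus scan-and-scatter
// compaction, mimicking a unique-style pattern per tile.
std::vector<int> unique_via_staged_flags(const std::vector<int>& in, std::size_t tile)
{
    std::size_t n = in.size();
    std::vector<int> flags(n);
    std::vector<int> slm(tile); // per-workgroup staging buffer (stand-in for SLM)
    for (std::size_t base = 0; base < n; base += tile) {
        std::size_t len = std::min(n - base, tile);
        for (std::size_t i = 0; i < len; ++i)
            slm[i] = in[base + i];                      // one global load per element
        for (std::size_t i = 0; i < len; ++i) {
            if (base + i == 0)
                flags[0] = 1;                           // 0th overall element: special case
            else if (i == 0)
                flags[base] = (slm[0] != in[base - 1]); // tile boundary: one extra global read
            else
                flags[base + i] = (slm[i] != slm[i - 1]);
        }
    }
    // Exclusive scan of the flags gives each kept element its output slot.
    std::vector<int> pos(n);
    std::exclusive_scan(flags.begin(), flags.end(), pos.begin(), 0);
    std::vector<int> out(n ? pos[n - 1] + flags[n - 1] : 0);
    for (std::size_t i = 0; i < n; ++i)
        if (flags[i]) out[pos[i]] = in[i];
    return out;
}
```

The masked formulation discussed in (b) would fold the `base + i == 0` branch into the flag expression itself, at the cost of the extra branching the text mentions.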
For Add reduce then scan algorithm for transform scan family #1762, we currently work around the in-place exclusive scan by copying the input to another buffer and then performing an out-of-place exclusive scan. This is only required for the multi-pass scan, not the single workgroup scan or the reduce-then-scan algorithms. We should push this workaround lower into the code and apply it only to the algorithms that require it.
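A minimal sketch of the workaround's shape, with an illustrative function name: the input is copied to a temporary buffer, and the scan then reads from the copy while writing over the original. Pushing this lower would mean only the multi-pass path pays for the extra buffer and copy, since the other algorithms read each element before overwriting it.

```cpp
#include <numeric>
#include <vector>

// Sketch of the current workaround for in-place exclusive scan: copy the
// input aside, then scan out-of-place from the copy back into the
// original storage. Only the multi-pass scan path needs this.
void exclusive_scan_in_place_workaround(std::vector<int>& data, int init)
{
    std::vector<int> tmp(data); // extra buffer: the cost we want to avoid where possible
    std::exclusive_scan(tmp.begin(), tmp.end(), data.begin(), init);
}
```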