Implement parallel prefix sum / parallel scan #262

mratsim · 2024-02-02T10:01:19Z

Discovered by Axiom while porting Scroll/Geometry LogUp (Multivariate Lookups) to their own library, and kindly pointed out to us as we were preparing to upstream the work.

Scroll+Geometry PRs:

Axiom / @jonathanpwang comment:

The MV Lookup PR assumes that parallelize splits the iteration domain into same-sized chunks.
That behavior was removed in #186 to fix load imbalance:

Currently if we take 40 items divided into 12 threads (AMD Ryzen 7800X, Apple M2 Pro or Intel i5-12600) the partitioning will lead to 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 7 = 3*11 + 7 = 40. The “remainder” thread will have 2.33x more work to do.

This can be quite extreme. For example, if we have 351 items to split across 32 cores, 351/32 = 10.97, truncated to 10 by integer division. 10*31 = 310, so the last core needs to process 41 items, 4.1x more than the others.
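To make the arithmetic above concrete, here is a minimal sketch comparing the two partitioning strategies. The function names `remainder_last_ranges` and `balanced_ranges` are hypothetical and are not taken from the codebase or from #186; the balanced variant is just one common way to spread the remainder, assuming `threads > 0`.

```rust
use std::ops::Range;

/// Fixed-size chunks of `n / threads` items, with the whole remainder
/// dumped on the last thread (the imbalanced scheme described above).
/// Hypothetical helper, assumes `threads > 0`.
fn remainder_last_ranges(n: usize, threads: usize) -> Vec<Range<usize>> {
    let chunk = n / threads;
    (0..threads)
        .map(|t| {
            let start = t * chunk;
            let end = if t == threads - 1 { n } else { start + chunk };
            start..end
        })
        .collect()
}

/// Balanced split: the first `n % threads` ranges get one extra item,
/// so range lengths differ by at most 1.
/// Hypothetical helper, assumes `threads > 0`.
fn balanced_ranges(n: usize, threads: usize) -> Vec<Range<usize>> {
    let base = n / threads;
    let extra = n % threads;
    let mut start = 0;
    (0..threads)
        .map(|t| {
            let len = base + usize::from(t < extra);
            let range = start..start + len;
            start += len;
            range
        })
        .collect()
}
```

With 40 items and 12 threads, the first function yields eleven ranges of 3 items and one of 7, while the second yields four ranges of 4 items and eight of 3. The catch is that code relying on every chunk having the same length (as the MV Lookup PR does) can no longer compute a chunk's start offset as `chunk_index * chunk_len`.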

Instead we need a load-balanced prefix sum algorithm / parallel scan.

This is a classic parallelization problem: it is well constrained yet difficult enough, with several viable approaches, that it is used to teach CS students about parallel algorithms and cache concerns.

There is an excellent overview of 3 different algorithms here: https://github.com/matteomazza91/parallel-prefix-sum/blob/master/report.pdf

Summing i and i+1, iterating i += 2 until N
Summing i and 2i, iterating i +=1 until N/2
Multi-stage processing that is a bit too complex to describe
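For illustration, here is a minimal sketch of a common chunked two-pass scan. It is not necessarily any of the three algorithms from the report, and it is not the codebase's implementation. It assumes `u64` with wrapping addition as a stand-in for field elements and uses only `std::thread::scope` (Rust 1.63+); the names `sequential_scan` and `parallel_prefix_sum` are hypothetical.

```rust
use std::thread;

/// Sequential in-place inclusive prefix sum (wrapping addition).
fn sequential_scan(data: &mut [u64]) {
    let mut acc = 0u64;
    for x in data.iter_mut() {
        acc = acc.wrapping_add(*x);
        *x = acc;
    }
}

/// In-place inclusive prefix sum using up to `threads` worker threads.
/// Hypothetical sketch: `u64` stands in for a field element.
fn parallel_prefix_sum(data: &mut [u64], threads: usize) {
    let n = data.len();
    if n == 0 || threads <= 1 {
        sequential_scan(data);
        return;
    }
    // Ceiling division: no chunk holds more than ceil(n / threads) items,
    // which is the same maximum load as a balanced split.
    let chunk = (n + threads - 1) / threads;

    // Pass 1: every thread scans its own chunk and reports the chunk total.
    let totals: Vec<u64> = thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks_mut(chunk)
            .map(|c| {
                s.spawn(move || {
                    sequential_scan(c);
                    *c.last().unwrap() // chunks_mut never yields empty chunks
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    // Short sequential step: exclusive scan of the per-chunk totals.
    let mut offsets = Vec::with_capacity(totals.len());
    let mut acc = 0u64;
    for t in &totals {
        offsets.push(acc);
        acc = acc.wrapping_add(*t);
    }

    // Pass 2: add each chunk's offset; the first chunk has offset 0.
    thread::scope(|s| {
        for (c, off) in data.chunks_mut(chunk).zip(offsets).skip(1) {
            s.spawn(move || {
                for x in c.iter_mut() {
                    *x = x.wrapping_add(off);
                }
            });
        }
    });
}
```

This does roughly 2n additions in total instead of n, so it only pays off when the per-element operation is expensive enough (field additions rather than plain integers) or when the input is large. Spawning OS threads per call is also only for the sake of a dependency-free example; a real implementation would presumably run on the existing thread pool.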

And another:

https://www.cs.columbia.edu/~kar/pubsk/ADMS2020.pdf

And High-Performance Computing courses:

And GPU techniques (which are quite relevant now that CPUs have 100+ cores)

Also, I'm not yet sure how MV-lookups are implemented, but it seems that the code currently builds a big table and does a prefix sum once everything is there. An alternative, streaming approach could be to use Fenwick trees so that the prefix sum is updated on the fly. Note that this only makes sense if the data isn't available all at once but is accumulated over many parts.
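As a rough sketch of the streaming idea, a Fenwick (binary indexed) tree supports point updates and prefix-sum queries in O(log n) each. The `Fenwick` type below is hypothetical, uses `u64` with wrapping addition as a stand-in for field elements, and says nothing about how MV-lookups would actually use it.

```rust
/// Fenwick (binary indexed) tree over `u64` with wrapping addition.
/// Hypothetical sketch: `u64` stands in for a field element.
struct Fenwick {
    /// 1-based internal storage; `tree[0]` is unused.
    tree: Vec<u64>,
}

impl Fenwick {
    /// Creates a tree for `n` elements, all initially zero.
    fn new(n: usize) -> Self {
        Fenwick { tree: vec![0; n + 1] }
    }

    /// Adds `delta` to the element at 0-based index `i`, in O(log n).
    fn add(&mut self, i: usize, delta: u64) {
        let mut idx = i + 1;
        while idx < self.tree.len() {
            self.tree[idx] = self.tree[idx].wrapping_add(delta);
            // Move to the next node that covers this index
            // by adding the lowest set bit.
            idx += idx & idx.wrapping_neg();
        }
    }

    /// Returns the sum of elements `0..=i`, in O(log n).
    fn prefix_sum(&self, i: usize) -> u64 {
        let mut idx = i + 1;
        let mut acc = 0u64;
        while idx > 0 {
            acc = acc.wrapping_add(self.tree[idx]);
            // Drop the lowest set bit to jump to the preceding block.
            idx -= idx & idx.wrapping_neg();
        }
        acc
    }
}
```

Producing all n prefix sums this way costs O(n log n), versus O(n) for a plain scan, which is why it only helps when values arrive incrementally and prefix sums are needed in between.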


mratsim · 2024-02-02T10:20:05Z

Looking into the rayon ecosystem and issues:

There is a solution via: https://crates.io/crates/rayon-scan
though after a quick glance at it I'm concerned about:

Using linked lists to store vectors, which is likely very cache-unfriendly: https://github.com/rayon-rs/rayon/pull/1036/files#diff-f5ebef52187fef7eca19eed303a5ea89c0156aec0ea28238feae78e3cdd5305dR61-R88. This probably explains the perf issues the author found with normal integers. It might be okay for field elements, though these numbers are disappointing on a 12-core machine:

With a delay time of 100ns, the parallel speedup is 2.99
With a delay time of 10000ns, the parallel speedup is 5.27
The rescanning part might have O(n²) behavior, and AFAIK it isn't mentioned in any of the algorithms from the literature or in the GPU implementations: https://github.com/rayon-rs/rayon/pull/1036/files#diff-f5ebef52187fef7eca19eed303a5ea89c0156aec0ea28238feae78e3cdd5305dR18-R35

So it can be used as a stopgap (but then again, we could re-add the old parallelize as a stopgap as well), but if this is a bottleneck we will likely still want an optimized implementation.
