Released by @shwestrick · commit 7f59464

MPL-v0.3 Release Notes (June 26, 2022)

Over the past year and a half, we have made a number of improvements to MPL. In addition to various bug-fixes and performance tuning, there are three major changes: entanglement detection, CGC-chaining, and a new block allocator.

Entanglement detection

In v0.3, the MPL runtime system now dynamically monitors reads and writes to check for entanglement. If entanglement is detected, then MPL terminates the program with an error message.

Entanglement detection ensures that MPL is safe for all programs: if it type-checks, then it is safe to run. Specifically, when you run a program, MPL v0.3 guarantees that either (a) the program runs to completion in a disentangled manner, or (b) the program terminates with an entanglement error. The entanglement detector is precise: no false alarms!
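
To make this concrete, here is a minimal sketch (using MPL's ForkJoin.par) of a program whose executions may or may not be entangled, depending on how a race resolves:

  val shared: int list ref = ref []

  val _ = ForkJoin.par
    ( fn () => shared := [1, 2, 3]       (* allocate a list and publish it *)
    , fn () => List.length (!shared) )   (* may read the sibling's allocation *)

If the right task's read returns the list allocated by the concurrent left task, the right task acquires a pointer into its sibling's heap; this is entanglement, and MPL v0.3 terminates with an error. If the read happens before the write, the right task sees the pre-fork [] and the execution completes normally.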

With entanglement detection, you can now safely use in-place updates to improve efficiency. You can even use in-place updates non-deterministically, which is essential for good performance in some cases. In particular, many well-known parallel programming techniques improve efficiency by interleaving atomic in-place updates and accesses to shared memory. These techniques are inherently non-deterministic, but desirable for performance. A key advantage of disentanglement over purely functional programming (where in-place updates are disallowed) is that it permits exactly this style of programming.

For example, with v0.3, you can safely use atomic read-modify-write operations to efficiently traverse a graph and reduce contention, or use deterministic reservations to ensure determinism via non-deterministic speculation, or implement lock-free data structures to allow threads to communicate on-the-fly. We have multiple benchmarks using these techniques. Our breadth-first search benchmark interleaves atomic CAS operations to visit vertices, and our low-diameter decomposition uses a similar technique to compute the boundary between adjacent clusters. Betweenness centrality aggregates sums of weights via priority updates. In Delaunay triangulation, triangles atomically reserve access to their neighbors to resolve update conflicts.
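
For a flavor of this style, here is a minimal sketch of CAS-based vertex visiting in the spirit of the bfs-tree benchmark. It assumes MPL's ForkJoin.parfor and an integer compare-and-swap on arrays (written here as MLton.Parallel.arrayCompareAndSwap; check your MPL version for the exact primitive), and the grain size 1000 is an arbitrary choice:

  (* Claim vertex v: returns true for exactly one concurrent caller. *)
  fun tryVisit (visited: int array) (v: int) : bool =
    MLton.Parallel.arrayCompareAndSwap (visited, v) (0, 1) = 0

  (* Visit each unclaimed vertex of a frontier in parallel. The CAS races are
     intentional and benign: all writes land in memory allocated before the
     fork, so the traversal is non-deterministic yet disentangled. *)
  fun visitFrontier (visited: int array) (frontier: int array) (visit: int -> unit) =
    ForkJoin.parfor 1000 (0, Array.length frontier) (fn i =>
      let val v = Array.sub (frontier, i)
      in if tryVisit visited v then visit v else () end)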

Using in-place updates for efficiency, our benchmarks are able to compete in performance with low-level code written in C/C++. For example, our breadth-first search and Delaunay triangulation benchmarks are only ~2x slower than state-of-the-art PBBS implementations, while using a similar amount of memory.

It is important to note that, because MPL allows for non-determinism, entanglement detection is execution-dependent. If an execution of a program exhibits entanglement, then MPL v0.3 will detect the entanglement and terminate. But on the exact same program—even on the same input—it is possible that a different execution might not exhibit entanglement, in which case MPL will allow the execution to complete. This is possible because the outcome of a race condition might determine whether or not entanglement occurs during execution. Race conditions are not necessarily bugs, however. All of the non-deterministic programming techniques we described above have race conditions, but these race conditions are "benign" in the sense of being intentional, useful for efficiency, and disentangled (with some care).

In terms of time and space, don't worry: entanglement detection has nearly zero overhead. See the table below. The space overhead is essentially zero across the board (the one exception is +20% for tinykaboom on 1 processor, but overall memory usage for this benchmark is low). The largest time overhead is 7%, a majority of benchmarks have 2% overhead or less, and the geomean is approximately 1% across all benchmarks.

When there is overhead, it is predictable and easy to optimize away. The only operations affected are mutable dereferences, e.g. Array.sub on arrays and ! on refs, so it is easy to avoid entanglement-detection overhead by removing unnecessary mutation. Purely functional code has no overhead.
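
For example, here is a minimal sketch of a reduction written both ways: the first version performs instrumented mutable reads (each ! and each array read) at every step, while the second uses an immutable vector and a pure fold, so the detector never runs:

  (* ref/array version: each ! and Array.sub is an instrumented mutable read *)
  fun sumArray (a: int array) : int =
    let val acc = ref 0
    in Array.app (fn x => acc := !acc + x) a; !acc end

  (* pure version: vectors are immutable, so their reads are never checked *)
  fun sumVector (v: int vector) : int =
    Vector.foldl (op +) 0 v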

Overhead of Entanglement Detection
                  TIME          SPACE
--------------------------------------------
     Benchmark   P=1    P=72   P=1     P=72
============================================
      bfs-tree  +3.0%  +2.1%  +0.0%   -0.1%
    centrality  -1.2%  -1.4%  -0.0%   -0.0%
 dedup-strings  +3.2%  +7.0%  -0.0%   +4.8%
      delaunay  +5.3%  +3.7%  -0.1%   +2.6%
  dense-matmul  +0.4%  +0.0%  -2.1%   +0.2%
  game-of-life  -4.4%  +0.0%  -0.0%   +0.1%
          grep  -1.9%  +0.0%  -0.0%   +0.1%
  low-d-decomp  +2.6%  -3.4%  -0.0%   -10.5%
 msort-strings  +2.6%  -3.3%  +0.1%   +1.5%
  nearest-nbrs  +0.0%  +2.2%  +0.0%   -0.1%
       nqueens  +0.6%  +3.6%  -0.3%   +0.6%
    palindrome  +4.3%  +3.2%  +2.5%   +0.2%
        primes  +1.9%  -2.4%  -0.3%   -0.0%
     quickhull  +5.6%  +6.8%  +0.0%   +2.9%
   range-query  +6.9%  -0.8%  -0.0%   +3.0%
     raytracer  -5.7%  +0.0%  +1.2%   +2.1%
        reverb  -2.2%  -2.4%  +0.0%   -0.1%
    seam-carve  +0.0%  +2.4%  -0.1%   -0.2%
       skyline  +3.6%  +0.4%  +0.1%   -1.2%
  suffix-array  +3.6%  +1.8%  -0.1%   -5.1%
    tinykaboom  +0.4%  +0.0%  +20.0%  +0.4%
        tokens  -5.1%  -2.3%  +0.0%   -0.0%
triangle-count  -3.6%  -0.5%  +0.0%   -2.2%

These benchmarks are available in our benchmark suite. To reproduce these experiments, see https://github.com/MPLLang/entanglement-detect/tree/icfp22-artifact

CGC-chaining

This change improves the CGC collector by moving collections off the critical path.

Some background: MPL's collector performs collections along spines in the task tree. Each spine collection is divided into two pieces, and the two pieces are collected with two different algorithms.

  • LGC: The lower portion of each spine is collected via a Cheney-style compacting copying collector. We call this "local GC" (LGC for short) because the lower portion of each spine is private to a single processor.
  • CGC: The upper portion of each spine might be shared between multiple processors. This portion is collected by a mark-and-sweep non-moving concurrent collector, so we refer to this algorithm as "concurrent GC", or CGC for short.

This approach requires almost no synchronization. Furthermore, GC is both parallel and concurrent: we get parallelism across spines, and we allow GC and mutators to run concurrently. To perform a single spine collection, we only have to pause one mutator, to facilitate LGC. In this way, MPL has no stop-the-world pauses. This is the case in both v0.2 and v0.3.

The spines of the program can be unbalanced in their memory sizes. In particular, the spine that owns the root heap is often significantly larger than the other spines (which are short-lived). In v0.2, because spine collections are scheduled on the critical path, a single large collection might cause a noticeable decrease in throughput.

In v0.3, MPL spawns CGC tasks and schedules them exactly like normal tasks. Therefore, within a spine, multiple CGCs might execute on different processors, if the scheduler deems it profitable. To keep CGC'ed heaps separate from the main spine heaps, we extended each heap with a "chain" of secondary heaps that are undergoing CGC. The chain can grow as multiple CGCs are spawned on different portions of the same heap (as the program forks and joins concurrently with ongoing CGCs). When a CGC task completes, the corresponding heap is merged back into the spine. CGC tasks are spawned at forks in the program, and are prioritized for shallow heaps (which are generally larger). LGC is still performed along the critical path, but each LGC is fast and only needs to pause a single mutator.

With this approach, in v0.3, the collector is better parallelized and has less impact on the critical path of the computation.
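
To illustrate the chaining idea, here is a hypothetical sketch, in SML purely for exposition (the real mechanism lives in MPL's C runtime; see #142). Each heap carries a chain of secondary heaps; spawning a CGC task snapshots the heap's current contents into a fresh link, and the link is merged back when the task completes:

  type chunk = Word8.word array   (* stand-in for a block of heap data *)

  datatype heap = Heap of
    { live: chunk list ref        (* data the mutator is currently working in *)
    , chain: heap list ref }      (* secondary heaps currently undergoing CGC *)

  (* Spawn a CGC for a heap: move its contents into a fresh secondary heap on
     the chain; the mutator continues in the (now empty) primary heap. *)
  fun spawnCGC (Heap {live, chain}) : heap =
    let val secondary = Heap {live = ref (!live), chain = ref []}
    in live := []; chain := secondary :: !chain; secondary end

  (* On completion, merge the survivors back into the primary heap. This is
     simplified: we assume the finished link is at the head of the chain. *)
  fun finishCGC (Heap {live, chain}) (survivors: chunk list) : unit =
    ( live := survivors @ !live
    ; chain := (case !chain of [] => [] | _ :: rest => rest) )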

Hoard-style block allocation

In #135, we developed a new block-allocation mechanism based in part on the Hoard algorithm. This is the central allocation mechanism in MPL: all major allocations performed by the rest of the system are backed by blocks.

The results are quite impressive, to say the least. Due to reduced mmap pressure and better reuse of memory, we get as much as 30% improvement in time and 80% improvement in max residency. See #135 for more details.

Summary of PRs and contributions

  • (#131) more efficient hierarchical heaps and union-find heap lookup
  • (#132) coins example (thanks @ckoparkar!)
  • (#135) Hoard-inspired block allocator
  • (#141) parallelization of "forgotten set" (snapshot reachability)
  • (#142) CGC-chaining (major performance improvement for CGC)
  • (#142) down-pointer pinning
  • (#153) upstream merge from MLton (thanks @MatthewFluet!)
  • (#147, #150, #160) CC work list
  • (#142, #145, #154, #159) entanglement detection

Most of the implementation work was done by @shwestrick and @typerSniper. @larry98 helped develop the order-maintenance component of the entanglement detector, and @ckoparkar ported a new example from Haskell. Big thanks to @umutacar for helping guide this project, and to @MatthewFluet for his constant support (and for keeping us in-sync with MLton!)

The entanglement detection algorithm was designed by Sam Westrick, Jatin Arora, and Umut Acar; details will be available in an upcoming publication. Entanglement detection is supported by the DePa algorithm for SP-order maintenance, which was designed and implemented by Sam Westrick, Larry Wang, and Umut Acar [5]. The MPL memory management system is based on the theory of disentanglement [1,2,3,4].

[1] Hierarchical Memory Management for Parallel Programs. Ram Raghunathan, Stefan K. Muller, Umut A. Acar, and Guy Blelloch. ICFP 2016.
[2] Hierarchical Memory Management for Mutable State. Adrien Guatto, Sam Westrick, Ram Raghunathan, Umut Acar, and Matthew Fluet. PPoPP 2018.
[3] Disentanglement in Nested-Parallel Programs. Sam Westrick, Rohan Yadav, Matthew Fluet, and Umut A. Acar. POPL 2020.
[4] Provably Space-Efficient Parallel Functional Programming. Jatin Arora, Sam Westrick, and Umut A. Acar. POPL 2021.
[5] DePa: Simple, Provably Efficient, and Practical Order Maintenance for Task Parallelism. Sam Westrick, Larry Wang, and Umut A. Acar. arXiv 2022.