
automatic parallelism management #180

Merged: 198 commits, merged into master on Feb 19, 2024
Conversation

@shwestrick (Collaborator) commented Nov 21, 2023

This is a (work-in-progress) merge of our POPL'24 work, Automatic Parallelism Management.

The version used in the paper is available here: shwestrick/mpl:heartbeat-joinstack-primitives. This was forked a while ago, off of MPL v0.3. The main challenge in this patch is to merge with entanglement management, which was developed concurrently and made it into mainline MPL between v0.3 and v0.4.

What is Automatic Parallelism Management?

To be brief: the gist is that we've made progress on the granularity control problem. We developed a version of the par primitive which embeds a "potentially parallel task" directly into the call-stack (using just two additional stack slots) and therefore avoids the cost of task creation by default. This new par primitive has nearly zero cost, allowing the programmer to use par liberally without worrying much about overhead. Then, during execution, the run-time system uses a heartbeat-based strategy to expose only as much parallelism as is actually needed.

In other words, the compiler and run-time system work together to "automatically manage parallelism", ensuring that the cost of parallelism (in particular the cost of task creation) does not outweigh its benefits. Please see our POPL'24 paper for more details!

Current Status and TODO

This initial merge compiles and runs successfully in a few cases, but several things are still broken.

The main items:

  • (basis-library/schedulers/par-pcall) add support for entanglement management, specifically "clearing suspects" which is managed by the scheduler rather than directly in the run-time system.
    • first step: sequential suspect clearing
    • generalize it to parallel clearing
  • Question: should we now ONLY support the pcall-based scheduler? I.e., are we dumping basis-library/schedulers/shh? Or, should we adapt this to continue to work properly, letting the user pick between the old and new schedulers?
    • I (Sam) have decided: let's only support the new scheduler, for now. We can revive the other scheduler if needed...

Some small compatibility things:

  • (basis-library/schedulers/par-pcall) implement ForkJoin.idleTimeSoFar
  • (basis-library/schedulers/par-pcall) implement ForkJoin.workTimeSoFar
  • (basis-library/schedulers/par-pcall) get rid of the need for passing -mlb-path-var 'PICK_FJ ...' at compile-time. This was used for some experiments, but isn't needed any more.

We can now control both (a) the heartbeat interval, and (b) the number of tokens generated per heartbeat.

These controls are named `heartbeat-us` and `heartbeat-tokens`.

Example usage:

```
$ ./program @mpl procs 32 heartbeat-us 500 heartbeat-tokens 30 --
```
I believe commit 378bfa1 introduced a subtle bug... a bad interaction
between (a) clearing suspects in parallel, and (b) CGC.

This commit fixes the bug (AFAICT), and appears to have performance
benefits as well.

I don't yet have a complete grasp of what the bug was, but I believe it
went something like this:
  - At a join point, we merge threads and then enter into the
    `maybeParClearSuspects...` call
  - This call starts by forking, and the fork spawns a CGC
  - While clearing suspects, the CGC simultaneously reclaims
    objects that are referenced by the suspect set!
  - The result is a dangling pointer; specifically, when clearing
    suspects, we get an error where the suspect objptr points to
    reclaimed bits.

To fix this, I changed `maybeParClearSuspects...` so that it cannot
spawn a CGC. Specifically, I revived the code for doing eager forks,
and created an option for eager forking while disallowing CGC.

After making this change, it occurred to me that the specialized
eager forking code is likely to be more efficient than the previous
method of eager forking. (Previously, to do an eager fork, we would
do a pcall and then immediately trigger a promotion.)

So, I modified `par` to drop into the specialized eager forking
code in the case where a promotion token is available. This seems
to have a mild performance benefit in some cases.
… of new threads

This has been a pain point: what decheck-id should we assign to the
initial chunks for a new runnable thread? These chunks store the thread
object itself, and the call-stack of that thread. We store them in a
pseudoheap immediately above the depth where the thread begins execution.

In 22152e6, I attempted a simplification:
just assign these chunks a bogus id. However, this has a problem: if we
ever return to these chunks to continue allocations, then the new allocations
will implicitly be assigned the bogus id; this in turn trips the entanglement
management system, and we get a crash.

So, in this patch, I'm trying a new approach which seems more robust. The
idea is to assign these chunks the same decheck-id as the parent task.
This makes sense in the hierarchical heap structure: because these chunks
are in a pseudoheap of the parent, it makes sense that they should have
a decheck-id which matches the parent.

AFAICT, this works. We'll see how it holds up moving forward.
Combines three separate runtime calls into one:
  - promoting chunks into the parent heap
  - updating the thread's current depth
  - updating the decheck state
Combines four separate runtime calls into one:
  - merge left-side and right-side threads
  - promote chunks into the parent heap
  - update the thread's current depth
  - update the decheck state
Note also that this patch may have fixed a bug...? I added decheck
forks and joins for spawnGC and syncGC. This ensures that the
decheck-id paths always correspond to hierarchical heap depths,
which is important for entanglement management, because we compute
unpin-depths using the decheck-ids (see lcaHeapDepth in decheck.c).
(1) don't use SIGALRMs when we have a relayer
(2) relayer no longer calls `nanosleep` between broadcasts
(3) when there is no relayer, instead of first redirecting
    SIGALRMs to proc 0, just immediately broadcast

I've measured that these changes seem to improve the consistency
of heartbeat deliveries. (Using `@mpl heartbeat-stats --`)
This is just a small thing that has bugged me for a while. We
have a max fork-depth constraint due to the current implementation
of the DePa-based decheck algorithm. The max fork depth is 31.
But, the deques have had a capacity of ~1000 for years, which is
just wasted space. This patch just brings the deque capacity
down to 64.
Previously, LGC was attempting to unpin objects and then unmark them
as suspects. This is not correct -- it's possible for a suspect to
be in scope of LGC, and it should remain a suspect, because LGC can
have one or more ancestor heaps in scope. (So, for example, an
ancestor object that contains a down-pointer would be marked as
suspect, could be in-scope of the LGC, and should remain a suspect.)
@shwestrick (Collaborator, Author) commented:

Current status: using this PR as an opportunity to clean up some of the entanglement management code and fix any bugs we can find. Auto par management is stressing new code paths which presents an opportunity to find nasty bugs :)

This patch appears to fix #156. The solution is different from what was
proposed in the discussion there, and also different from what was
proposed in #178. The idea here is to restrict what range of the
work-stealing deque is accessed by LGC, to ensure that the only slots
accessed are those that are in-scope of the LGC. To implement this,
I created a new `foreachObjptrInSequenceSlice` function which traces
the roots within an index range of a sequence object.
@shwestrick shwestrick changed the title (WIP) merge: automatic parallelism management automatic parallelism management Feb 19, 2024
@shwestrick shwestrick merged commit e5950f3 into master Feb 19, 2024
@shwestrick shwestrick deleted the auto-par-management branch February 19, 2024 03:42