
automatic parallelism management #180

Merged: 198 commits, merged into master on Feb 19, 2024
Conversation

@shwestrick (Collaborator) commented Nov 21, 2023

This is a (work-in-progress) merge of our POPL'24 work, Automatic Parallelism Management.

The version used in the paper is available here: shwestrick/mpl:heartbeat-joinstack-primitives. This was forked a while ago, off of MPL v0.3. The main challenge in this patch is to merge with entanglement management, which was developed concurrently and made it into mainline MPL between v0.3 and v0.4.

What is Automatic Parallelism Management?

To be brief: the gist is that we've made progress on the granularity control problem. We developed a version of the par primitive which embeds a "potentially parallel task" directly into the call-stack (using just two additional stack slots) and therefore avoids the cost of task creation by default. This new par primitive has nearly zero cost, allowing the programmer to use par liberally without worrying much about overhead. Then, during execution, the run-time system uses a heartbeat-based strategy to expose only as much parallelism as is actually needed.

In other words, the compiler and run-time system work together to "automatically manage parallelism", ensuring that the cost of parallelism (in particular the cost of task creation) does not outweigh its benefits. Please see our POPL'24 paper for more details!

Current Status and TODO

This initial merge compiles and runs successfully in a few cases, but several things are still broken.

The main items:

  • (basis-library/schedulers/par-pcall) add support for entanglement management, specifically "clearing suspects" which is managed by the scheduler rather than directly in the run-time system.
    • first step: sequential suspect clearing
    • generalize it to parallel clearing
  • Question: should we now ONLY support the pcall-based scheduler? I.e., are we dumping basis-library/schedulers/shh? Or, should we adapt this to continue to work properly, letting the user pick between the old and new schedulers?
    • I (Sam) have decided: let's only support the new scheduler, for now. We can revive the other scheduler if needed...

Some small compatibility things:

  • (basis-library/schedulers/par-pcall) implement ForkJoin.idleTimeSoFar
  • (basis-library/schedulers/par-pcall) implement ForkJoin.workTimeSoFar
  • (basis-library/schedulers/par-pcall) get rid of the need for passing -mlb-path-var 'PICK_FJ ...' at compile-time. This was used for some experiments, but isn't needed any more.

We can now control both (a) the heartbeat interval, and (b) the number of tokens generated per heartbeat.

These controls are named `heartbeat-us` and `heartbeat-tokens`.

Example usage:

```
$ ./program @mpl procs 32 heartbeat-us 500 heartbeat-tokens 30 --
```
I believe commit 378bfa1 introduced a subtle bug... a bad interaction
between (a) clearing suspects in parallel, and (b) CGC.

This commit fixes the bug (AFAICT), and appears to have performance
benefits as well.

I don't yet have a complete grasp of what the bug was, but I believe it
went something like this:
  - At a join point, we merge threads and then enter into the
    `maybeParClearSuspects...` call
  - This call starts by forking, and the fork spawns a CGC
  - While clearing suspects, the CGC simultaneously reclaims
    objects that are referenced by the suspect set!
  - The result is a dangling pointer; specifically, when clearing
    suspects, we get an error where the suspect objptr points to
    reclaimed bits.

To fix this, I changed `maybeParClearSuspects...` so that it cannot
spawn a CGC. Specifically, I revived the code for doing eager forks,
and created an option for eager forking while disallowing CGC.

After making this change, it occurred to me that the specialized
eager forking code is likely to be more efficient than the previous
method of eager forking. (Previously, to do an eager fork, we would
do a pcall and then immediately trigger a promotion.)

So, I modified `par` to drop into the specialized eager forking
code in the case where a promotion token is available. This seems
to have a mild performance benefit in some cases.
… of new threads

This has been a pain point: what decheck-id should we assign to the
initial chunks for a new runnable thread? These chunks store the thread
object itself, and the call-stack of that thread. We store them in a
pseudoheap immediately above the depth where the thread begins execution.

In 22152e6, I attempted a simplification:
just assign these chunks a bogus id. However, this has a problem: if we
ever return to these chunks to continue allocations, then the new allocations
will implicitly be assigned the bogus id; this in turn trips the entanglement
management system, and we get a crash.

So, in this patch, I'm trying a new approach which seems more robust. The
idea is to assign these chunks the same decheck-id as the parent task.
This makes sense in the hierarchical heap structure: because these chunks
are in a pseudoheap of the parent, it makes sense that they should have
a decheck-id which matches the parent.

AFAICT, this works. We'll see how it holds up moving forward.
Combines three separate runtime calls into one:
  - promoting chunks into the parent heap
  - updating the thread's current depth
  - updating the decheck state
Combines four separate runtime calls into one:
  - merge left-side and right-side threads
  - promote chunks into the parent heap
  - update the thread's current depth
  - update the decheck state
Note also that this patch may have fixed a bug...? I added decheck
forks and joins for spawnGC and syncGC. This ensures that the
decheck-id paths always correspond to hierarchical heap depths,
which is important for entanglement management, because we compute
unpin-depths using the decheck-ids (see lcaHeapDepth in decheck.c).
(1) don't use SIGALRMs when we have a relayer
(2) relayer no longer calls `nanosleep` between broadcasts
(3) when there is no relayer, instead of first redirecting
    SIGALRMs to proc 0, just immediately broadcast

I've measured that these changes seem to improve the consistency
of heartbeat deliveries. (Using `@mpl heartbeat-stats --`)
This is just a small thing that has bugged me for a while. We
have a max fork-depth constraint due to the current implementation
of the DePa-based decheck algorithm. The max fork depth is 31.
But, the deques have had a capacity of ~1000 for years, which is
just wasted space. This patch just brings the deque capacity
down to 64.
Previously, LGC was attempting to unpin objects and then unmark them
as suspects. This is not correct -- it's possible for a suspect to
be in scope of LGC, and it should remain a suspect, because LGC can
have one or more ancestor heaps in scope. (So, for example, an
ancestor object that contains a down-pointer would be marked as
suspect, could be in-scope of the LGC, and should remain a suspect.)
@shwestrick (Collaborator, Author) commented:

Current status: using this PR as an opportunity to clean up some of the entanglement management code and fix any bugs we can find. Auto par management is stressing new code paths which presents an opportunity to find nasty bugs :)

This patch appears to fix #156. The solution is different from what was
proposed in the discussion there, and also different from what was
proposed in #178. The idea here is to restrict what range of the
work-stealing deque is accessed by LGC, to ensure that the only slots
accessed are those that are in-scope of the LGC. To implement this,
I created a new `foreachObjptrInSequenceSlice` function which traces
the roots within an index range of a sequence object.
@shwestrick shwestrick changed the title (WIP) merge: automatic parallelism management automatic parallelism management Feb 19, 2024
@shwestrick shwestrick merged commit e5950f3 into master Feb 19, 2024
@shwestrick shwestrick deleted the auto-par-management branch February 19, 2024 03:42