automatic parallelism management #180
Merged
…ad, including in its heap
…ebugging sanity checks.
Now can control both (a) the heartbeat interval, and (b) the number of tokens generated per heartbeat. The names of these controls are `heartbeat-us` and `heartbeat-tokens`. Example usage:

```
$ ./program @mpl procs 32 heartbeat-us 500 heartbeat-tokens 30 --
```
I believe commit 378bfa1 introduced a subtle bug: a bad interaction between (a) clearing suspects in parallel, and (b) CGC. This commit fixes the bug (AFAICT), and appears to have performance benefits as well. I don't have a complete grasp on what the bug was, but I believe it went something like this:
- At a join point, we merge threads and then enter into the `maybeParClearSuspects...` call.
- This call starts by forking, and the fork spawns a CGC.
- While we are clearing suspects, the CGC simultaneously reclaims objects that are referenced by the suspect set!
- The result is a dangling pointer; specifically, when clearing suspects, we get an error where the suspect objptr points to reclaimed bits.

To fix this, I changed `maybeParClearSuspects...` so that it cannot spawn a CGC. Specifically, I revived the code for doing eager forks, and created an option for eager forking while disallowing CGC. After making this change, it occurred to me that the specialized eager forking code is likely to be more efficient than the previous method of eager forking. (Previously, to do an eager fork, we would do a pcall and then immediately trigger a promotion.) So, I modified `par` to drop into the specialized eager forking code in the case where a promotion token is available. This seems to have a mild performance benefit in some cases.
… of new threads

This has been a pain point: what decheck-id should we assign to the initial chunks for a new runnable thread? These chunks store the thread object itself, and the call-stack of that thread. We store them in a pseudoheap immediately above the depth where the thread begins execution.

In 22152e6, I attempted a simplification: just assign these chunks a bogus id. However, this has a problem: if we ever return to these chunks to continue allocations, then the new allocations will implicitly be assigned the bogus id; this in turn trips the entanglement management system, and we get a crash.

So, in this patch, I'm trying a new approach which seems more robust. The idea is to assign these chunks the same decheck-id as the parent task. This makes sense in the hierarchical heap structure: because these chunks are in a pseudoheap of the parent, it makes sense that they should have a decheck-id which matches the parent. AFAICT, this works. We'll see how it holds up moving forward.
Combines three separate runtime calls into one:
- promoting chunks into the parent heap
- updating the thread's current depth
- updating the decheck state
Combines four separate runtime calls into one:
- merge left-side and right-side threads
- promote chunks into the parent heap
- update the thread's current depth
- update the decheck state
Note also that this patch may have fixed a bug...? I added decheck forks and joins for spawnGC and syncGC. This ensures that the decheck-id paths always correspond to hierarchical heap depths, which is important for entanglement management, because we compute unpin-depths using the decheck-ids (see lcaHeapDepth in decheck.c).
Three changes:
1. don't use SIGALRMs when we have a relayer
2. the relayer no longer calls `nanosleep` between broadcasts
3. when there is no relayer, instead of first redirecting SIGALRMs to proc 0, just immediately broadcast

I've measured that these changes seem to improve the consistency of heartbeat deliveries. (Using `@mpl heartbeat-stats --`.)
This is just a small thing that has bugged me for a while. We have a max fork-depth constraint due to the current implementation of the DePa-based decheck algorithm. The max fork depth is 31. But, the deques have had a capacity of ~1000 for years, which is just wasted space. This patch just brings the deque capacity down to 64.
Previously, LGC was attempting to unpin objects and then unmark them as suspects. This is not correct -- it's possible for a suspect to be in scope of LGC, and it should remain a suspect, because LGC can have one or more ancestor heaps in scope. (So, for example, an ancestor object that contains a down-pointer would be marked as suspect, could be in-scope of the LGC, and should remain a suspect.)
Current status: using this PR as an opportunity to clean up some of the entanglement management code and fix any bugs we can find. Auto par management is stressing new code paths which presents an opportunity to find nasty bugs :)
This patch appears to fix #156. The solution is different from what was proposed in the discussion there, and also different from what was proposed in #178. The idea here is to restrict what range of the work-stealing deque is accessed by LGC, to ensure that the only slots accessed are those that are in-scope of the LGC. To implement this, I created a new `foreachObjptrInSequenceSlice` function which traces the roots within an index range of a sequence object.
shwestrick changed the title from "(WIP) merge: automatic parallelism management" to "automatic parallelism management" on Feb 19, 2024.
This is a (work-in-progress) merge of our POPL'24 work, Automatic Parallelism Management.
The version used in the paper is available here: shwestrick/mpl:heartbeat-joinstack-primitives. This was forked a while ago, off of MPL v0.3. The main challenge in this patch is to merge with entanglement management, which was developed concurrently and made it into mainline MPL between v0.3 and v0.4.
What is Automatic Parallelism Management?
To be brief: the gist is that we've made progress on the granularity control problem. We developed a version of the `par` primitive which embeds a "potentially parallel task" directly into the call-stack (using just two additional stack slots) and therefore avoids the cost of task creation by default. This new `par` primitive has nearly zero cost, allowing the programmer to use `par` liberally without worrying much about overhead. Then, during execution, the run-time system uses a heartbeat-based strategy to expose only as much parallelism as is actually needed.

In other words, the compiler and run-time system work together to "automatically manage parallelism", ensuring that the cost of parallelism (in particular the cost of task creation) does not outweigh its benefits. Please see our POPL'24 paper for more details!
Current Status and TODO
This initial merge compiles, and it can successfully run in a few cases, but there are still a few things broken.
The main items:
- (`basis-library/schedulers/par-pcall`) add support for entanglement management, specifically "clearing suspects", which is managed by the scheduler rather than directly in the run-time system.
- Should we drop the old scheduler (`basis-library/schedulers/shh`)? Or, should we adapt it to continue to work properly, letting the user pick between the old and new schedulers?

Some small compatibility things:
- (`basis-library/schedulers/par-pcall`) implement `ForkJoin.idleTimeSoFar`
- (`basis-library/schedulers/par-pcall`) implement `ForkJoin.workTimeSoFar`
- (`basis-library/schedulers/par-pcall`) get rid of the need for passing `-mlb-path-var 'PICK_FJ ...'` at compile-time. This was used for some experiments, but isn't needed any more.