[WIP] Parallel garbage collector #5362
base: master
Conversation
Crashes frequently in FOT (:183,:256), so use `-H:-UseRememberedSet`
Simple test works on 4 threads, HyperAlloc on 1 thread
…and aligned chunk
@petermz : thanks for looking into that. Your latest changes improve the situation a bit but I am still seeing a lot of failures, e.g.:
@christianhaeubl: Crazy. I've run the complete DaCapo & Renaissance suites on four different machines overnight, and I don't see a single GC failure (there are a few unrelated ones, however). Do you use

@SergejIsbrecht: I'm now running those original benchmarks plus a bunch of other ones
@petermz: assertions and heap verification tend to hide race conditions. So, I would suggest running without assertions/verification and with all optimizations enabled. It can also help if you add some artificial load to your machine (e.g., via

Some of the crash logs look as if there is still some race in the reference handling (at least, I see old gen

A lot of the failures only happen with specific workloads (e.g., internal test cases) or specific build-time or run-time options, but I don't see any conclusive pattern, so it is likely that those errors are just caused by a general race condition that is fairly sensitive to timing or that needs a particular heap shape to occur. Usually, multiple GC worker threads fail during the same GC (either with a segfault or
@christianhaeubl I've identified some potential races and fixed them in commit 59c56b0. They could indeed affect reference processing and lead to uninitialized object headers being read. I was never able to reproduce any crash myself, but hopefully these changes improve the situation.
@petermz: thanks - I ran some of the tests with your latest fixes and stability looks good so far. I will let you know if I run into any issues.
@petermz: I ran more tests overnight and encountered a few more crashes. I spent some time debugging those issues and fixed the following crashes:
I will run the tests again with my fixes in place and I will let you know if I see any further crashes.
@petermz: seems that this solved most of the issues. I am still seeing crashes on AArch64 though: after a GC, there are invalid references in the heap (most likely references that were missed or not properly adjusted). I would assume that this error is only visible on AArch64 because of the different memory model, or it is a race that is just very hard to reproduce on AMD64.
@christianhaeubl I've made one more fix, this time in the promotion of unaligned chunks. The remembered set bit was out of sync, which could lead to an assertion failing in
I ran the tests again. Unfortunately, the AArch64 crashes that I mentioned in my last comment are still happening. |
I've been reading the Parallel GC implementation. Thank you, @petermz, for your efforts! I've got a few comments. (1) When copying an object between two spaces, the parallel copy synchronizes only when installing the forwarding header bytes. The contents of the object are copied later to the new space by the thread that successfully installed the forwarding header. (2) The entries of the array of
Hi @koutheir, thank you for reviewing the code!

(1) I believe the situation you describe should never happen. Worker threads only scan chunks popped from the chunk queue, and chunks are queued when they are full (or once scan is complete). The thread-local allocation chunk of the winning thread cannot be queued before copying finishes, because we know it does have room for the object being copied (the copy memory is allocated prior to forwarding header installation). The thread must finish copying object data first, then proceed to the next object, then it might figure out there's no room to copy this new object, at which time the chunk is queued and becomes available for other worker threads to scan. OTOH worker threads that encounter an object with a forwarding pointer installed never scan it. They just use the pointer value to update the reference to the object.

(2) I actually had this false sharing problem in mind and I even did some experimenting, I think I was using 2048 bytes instead of
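The winner-copies protocol discussed in (1) can be sketched as a compare-and-set on the object header. This is a hypothetical Java model, not the PR's code: the class and method names are mine, and an `AtomicLongArray` stands in for raw header words in the heap. The winner of the CAS copies the object bits into memory it already reserved in its TLAB; losers retract their allocation and just read the installed pointer.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Illustrative sketch of racing to install a forwarding pointer.
// Headers are modeled as slots in an AtomicLongArray; the real GC
// operates on object header words in raw memory.
class ForwardingSketch {
    static final long UNFORWARDED = 0L;
    final AtomicLongArray headers;

    ForwardingSketch(int objectCount) {
        headers = new AtomicLongArray(objectCount);
    }

    /**
     * Try to forward object {@code obj} to {@code newAddress}.
     * Returns the address the object ends up forwarded to.
     */
    long tryForward(int obj, long newAddress) {
        if (headers.compareAndSet(obj, UNFORWARDED, newAddress)) {
            // Winner: the memory at newAddress was reserved in this
            // thread's TLAB beforehand, so copying the object body now
            // is safe -- no other thread can scan this chunk yet.
            return newAddress;
        }
        // Loser: another worker forwarded the object first. Retract the
        // TLAB allocation and use the winner's forwarding pointer.
        return headers.get(obj);
    }
}
```

Because the allocation happens before the CAS, a losing thread only wastes a bump-pointer reservation, which it can roll back without synchronization.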
2048 bytes is too large. Cache line sizes are usually 64 or 128 bytes in today's ARM machines. |
I agree. My goal was just to measure the effects of false sharing, if any, without caring about memory consumption. I was running on Intel, BTW.
Hi --
I'm submitting this work in progress in the hope that someone would take a look and provide some feedback, ideas etc.
Description
The "parallel" GC currently runs a single phase -- scanning grey objects -- in parallel. It is enabled using the `-XX:+UseParallelGC` option at image build time, and the number of worker threads can be set at runtime using the `-XX:ParallelGCWorkers` option. Worker threads are started early during application startup. They are marked as `THREAD_CRASHED` so that they ignore safepoint requests and can run during a safepoint. The number of threads started is actually one less than `ParallelGCWorkers`, because the thread that caused the GC is reused as one of the workers.

Worker threads allocate memory in TLABs for speed. A TLAB is an aligned chunk (1 MB by default). When a TLAB fills up, it is pushed to a `ChunkBuffer`, which is a stack of chunks to be scanned. When a whole chunk, aligned or unaligned, is promoted, it is also pushed to the chunk buffer.

Worker threads pop chunks from the chunk buffer (this is a synchronized operation, but because the number of chunks is rather small, it should not cause much contention) and scan them for pointers to live objects to be promoted. For each object being promoted, they allocate memory in their TLABs, then compete to install a forwarding pointer in it. The winning thread then proceeds to copy the object bits. Losing threads retract the allocated memory and proceed to the next object.
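The push/pop flow described above can be modeled as a synchronized stack of chunk addresses. This is an illustrative sketch, not the actual `ChunkBuffer` implementation (names and the `long`-as-chunk-pointer abstraction are assumptions):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative model of a ChunkBuffer: a synchronized LIFO of chunks
// awaiting scanning. A chunk is represented here by its base address.
class ChunkBufferSketch {
    private final Deque<Long> chunks = new ArrayDeque<>();

    /** Called when a TLAB fills up or a whole chunk is promoted. */
    synchronized void push(long chunkBase) {
        chunks.push(chunkBase);
    }

    /**
     * Workers pop chunks to scan for references to live objects.
     * Returns -1 when the buffer is empty (terminate or steal elsewhere).
     */
    synchronized long pop() {
        return chunks.isEmpty() ? -1L : chunks.pop();
    }
}
```

Coarse `synchronized` blocks are plausible here precisely because, as noted above, the number of chunks (and hence the number of push/pop operations per collection) is small relative to the scanning work done per chunk.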
Limitations
Only the `Completely` collection policy is enforced. To lift this restriction, I have to make remembered sets thread-safe.
Performance
Measured by natively compiling and running the BigRamTester and HyperAlloc benchmarks. Numbers in the charts below are GC pause times in ms. All benchmarks were run on Ubuntu on an 8-core i7, with 8 worker threads and incremental collection turned off.