
[WIP] Parallel garbage collector #5362

Open · petermz wants to merge 137 commits into master
Conversation

petermz (Contributor) commented Nov 4, 2022

Hi,
I'm submitting this work in progress in the hope that someone will take a look and provide feedback, ideas, etc.

Description

The "parallel" GC currently runs single phase -- scanning grey objects -- in parallel. It is enabled using the -XX:+UseParallelGC option at image build time, and the number of worker threads can be set at runtime using the -XX:ParallelGCWorkers option. Worker threads are started early during application startup. They are marked as THREAD_CRASHED so that they ignore safepoint requests and can run during a safepoint. The number of threads started is actually one less than ParallelGCWorkers because the thread that has caused GC is reused as one of the workers.

Worker threads allocate memory in TLABs for speed. A TLAB is an aligned chunk (1 MB by default). When a TLAB fills up, it is pushed to a ChunkBuffer, which is a stack of chunks to be scanned. When a whole chunk, aligned or unaligned, is promoted, it is also pushed to the chunk buffer.
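
A minimal sketch of the ChunkBuffer idea (the real implementation works on raw chunk pointers rather than Java objects, so this is only an illustration):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative model of the ChunkBuffer: a synchronized stack of chunks
// waiting to be scanned. Workers push full TLABs and promoted chunks,
// and pop chunks to scan.
final class ChunkBufferSketch<C> {
    private final Deque<C> stack = new ArrayDeque<>();

    synchronized void push(C chunk) {
        stack.push(chunk);
    }

    synchronized C pop() {
        return stack.poll(); // null when the buffer is empty
    }
}
```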

Worker threads pop chunks from the chunk buffer (this is a synchronized operation, but because the number of chunks is rather small, it should not cause much contention) and scan them for pointers to live objects to be promoted. For each object being promoted, they allocate memory in their TLABs and then compete to install a forwarding pointer in the object. The winning thread then proceeds to copy the object's bits. Losing threads retract the memory they allocated and proceed to the next object.
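
As a toy model of this promotion race, assuming an AtomicReference in place of the object's header word (the real code CASes the header in raw memory, and all names here are illustrative):

```java
import java.util.concurrent.atomic.AtomicReference;

// Toy model of parallel promotion: threads compete to install a
// forwarding reference; only the winner copies the object's bits.
final class PromotionSketch {
    static final class Obj {
        final AtomicReference<Obj> forwarding = new AtomicReference<>();
        byte[] payload;
        Obj(byte[] payload) { this.payload = payload; }
    }

    /** Returns the promoted copy, whether or not this thread won the race. */
    static Obj promote(Obj original) {
        Obj copy = new Obj(null); // "allocate" space for the copy up front
        if (original.forwarding.compareAndSet(null, copy)) {
            copy.payload = original.payload.clone(); // winner copies the bits
            return copy;
        }
        // Loser: in the real GC the speculative TLAB allocation is
        // retracted here; then the winner's copy is used instead.
        return original.forwarding.get();
    }
}
```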

Limitations

  • Incremental collections are not yet supported, so effectively the OnlyCompletely collection policy is enforced. To lift this restriction, I have to make the remembered sets thread safe.
  • Only a single GC phase runs in parallel. Other phases could potentially benefit as well, such as root scanning and post-GC space cleanup.
  • The HyperAlloc benchmark runs for hours with heap verification enabled without any problems, but I haven't done much testing of corner cases such as reference queues, very large objects, or long linked lists.

Performance

Performance was measured by natively compiling and running the BigRamTester and HyperAlloc benchmarks. The numbers in the charts below are GC pause times in ms. All benchmarks were run on Ubuntu on an 8-core i7, with 8 worker threads and incremental collection turned off.

[Charts: bigram -Xmx12g, bigram -Xmx32g, hyperalloc -Xmx2g, hyperalloc -Xmx32g, hyperalloc -Xmx512m]

  • Crashes frequently in FOT (:183,:256), so use `-H:-UseRememberedSet`
  • Simple test works on 4 threads, HyperAlloc on 1 thread
christianhaeubl (Member) commented May 17, 2023

@petermz: thanks for looking into that. Your latest changes improve the situation a bit, but I am still seeing a lot of failures, e.g.:

  • NullPointerExceptions in fj-kmeans (renaissance)
  • Random segfaults (e.g., while an object is getting promoted by a GC worker thread)
  • VMError.shouldNotReachHere() in InteriorObjRefWalker.java:91 because objects have an invalid hub
  • SIGBUS errors (on macOS/AMD64 and Linux/AArch64)

petermz (Contributor, Author) commented May 19, 2023

@christianhaeubl: Crazy. I've run the complete DaCapo and Renaissance suites on four different machines overnight, and I don't see a single GC failure (there are a few unrelated ones, however). Do you use mx benchmark to build and run the benchmarks? Any custom settings such as VerifyHeap? I'm just enabling assertions and parallel GC.

@SergejIsbrecht: I'm now running those original benchmarks plus a bunch of other ones.

christianhaeubl (Member) commented May 24, 2023

@petermz: assertions and heap verification tend to hide race conditions, so I would suggest running without assertions/verification and with all optimizations enabled. It can also help to add some artificial load to your machine (e.g., via stress -c ...). Besides that, you can try reducing the Java heap size to increase the number of GCs.

Some of the crash logs look as if there is still some race in the reference handling (at least, I see old-generation java.lang.ref.WeakReference objects in the registers when the crash happens), so you might want to do some stress testing of that code path. Other crashes are just related to pretty arbitrary objects in the old generation.

A lot of the failures only happen with specific workloads (e.g., internal test cases) or with specific build-time or run-time options, but I don't see any conclusive pattern, so it is likely that those errors are caused by a general race condition that is fairly sensitive to timing or that needs a particular heap shape to occur.

Usually, multiple GC worker threads fail during the same GC (either with a segfault or Fatal error: Object with invalid hub type.). Here is the relevant information from one of the crash logs:

  SubstrateSegfaultHandler caught a segfault

  RAX 0x0000000000000000 
  RBX 0x0000000118000000 is the heap base
  RCX 0x0000000000000000 
  RDX 0x000000000007f7d0 is an unknown value
  RBP 0x000000011dd80830 is an unknown value
  RSI 0x0000000000b60900 is an unknown value
  RDI 0x000000011dd80000 is an unknown value
  RSP 0x000070000acba940 is an unknown value
  R8  0x00000000000030a0 is an unknown value
  R9  0x0000000000000000 
  R10 0x00007fff6ffd1bf5 is an unknown value
  R11 0x0000000000000246 is an unknown value
  R12 0x000000011acfffe8 points into the old generation
    is an object of type java.lang.ref.WeakReference
  R13 0x0000000000000000 
  R14 0x0000000118000000 is the heap base
  R15 0x0000000000000000 
  EFL 0x0000000000010202 is an unknown value
  RIP 0x00000001006fe283 points into AOT compiled code 

Stacktrace for the failing thread 0x000000010670b080 (A=AOT compiled, J=JIT compiled, D=deoptimized, i=inlined):
  A  SP 0x000070000acba940 IP 0x00000001006fe283 size=1200  [image code] missing metadata
  i  SP 0x000070000acbadf0 IP 0x000000010057790f size=288   [image code] com.oracle.svm.core.genscavenge.GCImpl.promoteObject(GCImpl.java:1136)
  i  SP 0x000070000acbadf0 IP 0x000000010057790f size=288   [image code] com.oracle.svm.core.genscavenge.GreyToBlackObjRefVisitor.visitObjectReferenceInline(GreyToBlackObjRefVisitor.java:112)
  i  SP 0x000070000acbadf0 IP 0x000000010057790f size=288   [image code] com.oracle.svm.core.hub.InteriorObjRefWalker.callVisitor(InteriorObjRefWalker.java:166)
  i  SP 0x000070000acbadf0 IP 0x000000010057790f size=288   [image code] com.oracle.svm.core.hub.InteriorObjRefWalker.walkObjectArray(InteriorObjRefWalker.java:155)
  i  SP 0x000070000acbadf0 IP 0x000000010057790f size=288   [image code] com.oracle.svm.core.hub.InteriorObjRefWalker.walkObjectInline(InteriorObjRefWalker.java:88)
  i  SP 0x000070000acbadf0 IP 0x000000010057790f size=288   [image code] com.oracle.svm.core.genscavenge.GreyToBlackObjectVisitor.visitObjectInline(GreyToBlackObjectVisitor.java:62)
  i  SP 0x000070000acbadf0 IP 0x000000010057790f size=288   [image code] com.oracle.svm.core.genscavenge.HeapChunk.callVisitor(HeapChunk.java:324)
  i  SP 0x000070000acbadf0 IP 0x000000010057790f size=288   [image code] com.oracle.svm.core.genscavenge.HeapChunk.walkObjectsFromInline(HeapChunk.java:314)
  i  SP 0x000070000acbadf0 IP 0x000000010057790f size=288   [image code] com.oracle.svm.core.genscavenge.UnalignedHeapChunk.walkObjectsInline(UnalignedHeapChunk.java:150)
  A  SP 0x000070000acbadf0 IP 0x000000010057790f size=288   [image code] com.oracle.svm.core.genscavenge.parallel.ParallelGC.scanChunk(ParallelGC.java:329)
  A  SP 0x000070000acbaf10 IP 0x000000010057a79e size=64    [image code] com.oracle.svm.core.genscavenge.parallel.ParallelGC.work0(ParallelGC.java:319)
  A  SP 0x000070000acbaf50 IP 0x000000010057a5ff size=32    [image code] com.oracle.svm.core.genscavenge.parallel.ParallelGC.work(ParallelGC.java:293)
  i  SP 0x000070000acbaf70 IP 0x0000000100500430 size=80    [image code] com.oracle.svm.core.genscavenge.parallel.ParallelGC.gcWorkerRun(ParallelGC.java:282)
  A  SP 0x000070000acbaf70 IP 0x0000000100500430 size=80    [image code] com.oracle.svm.core.code.IsolateEnterStub.ParallelGC_gcWorkerRun_9884278c303fd1db3c85aacc047c185ed9f7552d(IsolateEnterStub.java:0)

petermz (Contributor, Author) commented Jun 20, 2023

@christianhaeubl I've identified some potential races and fixed them in commit 59c56b0. They could indeed affect reference processing and lead to uninitialized object headers being read. I was never able to reproduce any crash myself, but hopefully these changes improve the situation.

@fniephaus fniephaus added this to the Planned for the Future milestone Jun 21, 2023
christianhaeubl (Member) commented
@petermz : thanks - I ran some of the tests with your latest fixes and stability looks good so far. I will let you know if I run into any issues.

christianhaeubl (Member) commented
@petermz: I ran more tests overnight and encountered a few more crashes. I spent some time debugging those issues and fixed the following crashes:

  • A GC worker thread could encounter a broken DynamicHub while scanning the already promoted objects in its own allocation chunk: it turned out that retractAllocationParallel(...) used an incorrect size (originalSize instead of copySize).
  • A GC worker thread could fail to promote objects: with compressed references enabled, the forwarding pointer could destroy the array length, which broke the object size computation (see the sketch below).
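
To illustrate the second hazard, a toy sketch of the safe ordering, assuming a layout where the header word sits next to the array length (the actual header layout is an assumption here):

```java
// Toy illustration of the fix's idea: read the array length (and hence
// the object size) before installing the forwarding word. In the real
// GC with compressed references, installing the forwarding pointer could
// overwrite the length field; this toy only models the safe ordering.
final class ArrayLengthSketch {
    // words[0] = header slot, words[1] = array length (assumed layout)
    static long promoteArray(long[] words, long forwardingWord) {
        long length = words[1];     // capture the length first
        words[0] = forwardingWord;  // then install the forwarding word
        return length;              // later size computations use the saved value
    }
}
```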

I will run the tests again with my fixes in place and I will let you know if I see any further crashes.

christianhaeubl (Member) commented Jun 29, 2023

@petermz: It seems that this solved most of the issues.

I am still seeing crashes on AArch64, though: after a GC, there are invalid references in the heap (most likely references that were missed or not properly adjusted). I would assume that this error is only visible on AArch64 because of its weaker memory model, or that it is a race that is just very hard to reproduce on AMD64.
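
For context, a generic sketch of the kind of ordering that weakly ordered CPUs expose; this is a standard publication pattern using a real JDK API (VarHandle.releaseFence), not a claim about what this PR's fix will look like:

```java
import java.lang.invoke.VarHandle;

// Generic publication pattern: make an object's contents visible before
// the store that publishes a reference to it. On AArch64's weaker memory
// model, missing such ordering can surface as "invalid references".
final class PublishSketch {
    static void publish(Runnable copyContents, Runnable publishReference) {
        copyContents.run();
        VarHandle.releaseFence(); // contents happen-before the publication
        publishReference.run();
    }
}
```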

petermz (Contributor, Author) commented Jul 20, 2023

@christianhaeubl I've made one more fix, this time in the promotion of unaligned chunks. The remembered-set bit was out of sync, which could lead to an assertion failing in YoungGeneration::contains and to a failure to mark a card. I've also cleaned up Space::promoteAlignedChunk since it doesn't actually need a parallel counterpart.

christianhaeubl (Member) commented
I ran the tests again. Unfortunately, the AArch64 crashes that I mentioned in my last comment are still happening.

koutheir (Contributor) commented Sep 6, 2023

I've been reading the parallel GC implementation. Thank you, @petermz, for your efforts! I've got a few comments.

(1) When copying an object between two spaces, the parallel copy synchronizes only when installing the forwarding header bytes. The contents of the object are copied to the new space later, by the thread that successfully installed the forwarding header.
Another thread that observes a forwarded object and then immediately scans the object's contents might observe invalid data, because the winning thread was not yet done copying and synchronizing the object's contents. Is this a condition that can never happen, or should the synchronization point be moved to after the object copy finishes?

(2) The entries of the array of GCWorkerThreadState are read and written by multiple threads concurrently. Each entry is smaller than a cache line, so this will probably cause cache contention and data dependencies between GC threads even when there is logically no need for that to happen (an artifact of cache coherence mechanisms).
Separating these entries, or padding them to the cache line size (usually 64 bytes), should avoid this situation.
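
A minimal sketch of such padding, assuming a 64-byte line and illustrative field names (the JDK also has the internal @jdk.internal.vm.annotation.Contended annotation for this purpose, but manual padding is portable):

```java
// Per-worker GC state padded out to at least one 64-byte cache line of
// fields, so adjacent entries in a worker-state array never share a line.
// Field names are illustrative, not the PR's actual layout.
final class PaddedWorkerStateSketch {
    long allocChunk;   // hot, per-worker fields (8 bytes each)
    long scannedBytes;
    // 6 * 8 = 48 bytes of padding; 16 + 48 = 64 bytes of fields in total.
    long p0, p1, p2, p3, p4, p5;
}
```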

petermz (Contributor, Author) commented Sep 20, 2023

Hi @koutheir, thank you for reviewing the code!

(1) I believe the situation you describe can never happen. Worker threads only scan chunks popped from the chunk queue, and chunks are queued only when they are full (or once the scan is complete). The winning thread's thread-local allocation chunk cannot be queued before copying finishes, because we know it has room for the object being copied (the copy memory is allocated prior to installing the forwarding header). The thread must first finish copying the object data and proceed to the next object; only then might it find that there is no room to copy that new object, at which point the chunk is queued and becomes available for other worker threads to scan.

On the other hand, worker threads that encounter an object with a forwarding pointer installed never scan it; they just use the pointer value to update the reference to the object.
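
A toy model of this invariant, with illustrative names and no claim to match the PR's code: a worker's current chunk escapes to the shared buffer only when a new allocation does not fit, which is necessarily after all earlier copies into it have completed on that thread.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model: the current chunk is published only when full, so a chunk
// that another worker can pop never contains a partially copied object.
final class WorkerCopyLoopSketch {
    static final int CHUNK_SIZE = 1 << 20; // 1 MB, as in the description
    final Deque<byte[]> chunkBuffer = new ArrayDeque<>(); // synchronized in real code
    byte[] currentChunk = new byte[CHUNK_SIZE];
    int top = 0;

    void copy(byte[] objectBits) {
        if (top + objectBits.length > CHUNK_SIZE) {
            chunkBuffer.push(currentChunk); // only fully written chunks escape
            currentChunk = new byte[CHUNK_SIZE];
            top = 0;
        }
        System.arraycopy(objectBits, 0, currentChunk, top, objectBits.length);
        top += objectBits.length;
    }
}
```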

(2) I actually had this false-sharing problem in mind and even did some experimenting; I think I was using 2048 bytes instead of sizeof(thread data). Curiously enough, that noticeably worsened performance in my testing, and I haven't figured out why. Maybe I should check again at some point.

koutheir (Contributor) commented

> Hi @koutheir, thank you for reviewing the code!
> (2) I actually had this false-sharing problem in mind and even did some experimenting; I think I was using 2048 bytes instead of sizeof(thread data). Curiously enough, that noticeably worsened performance in my testing, and I haven't figured out why. Maybe I should check again at some point.

2048 bytes is far too large; cache line sizes are usually 64 or 128 bytes on today's ARM machines.

petermz (Contributor, Author) commented Sep 20, 2023 via email

Labels: OCA Verified (All contributors have signed the Oracle Contributor Agreement), redhat-interest
Project status: In Progress
8 participants