
[WIP] Parallel garbage collector #5362

Open · petermz wants to merge 137 commits into master
Conversation

petermz (Contributor) commented Nov 4, 2022

Hi,
I'm submitting this work in progress in the hope that someone will take a look and provide feedback, ideas, etc.

Description

The "parallel" GC currently runs single phase -- scanning grey objects -- in parallel. It is enabled using the -XX:+UseParallelGC option at image build time, and the number of worker threads can be set at runtime using the -XX:ParallelGCWorkers option. Worker threads are started early during application startup. They are marked as THREAD_CRASHED so that they ignore safepoint requests and can run during a safepoint. The number of threads started is actually one less than ParallelGCWorkers because the thread that has caused GC is reused as one of the workers.

Worker threads allocate memory in TLABs for speed. A TLAB is an aligned chunk (1 MB by default). When a TLAB fills up, it is pushed to a ChunkBuffer, which is a stack of chunks to be scanned. When a whole chunk, aligned or unaligned, is promoted, it is also pushed to the chunk buffer.
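
A minimal sketch of the ChunkBuffer idea (the real implementation works on raw chunk pointers rather than Java objects, so this is only an illustration):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative model of the ChunkBuffer: a synchronized stack of chunks
// waiting to be scanned. Workers push full TLABs and promoted chunks,
// and pop chunks to scan.
final class ChunkBufferSketch<C> {
    private final Deque<C> stack = new ArrayDeque<>();

    synchronized void push(C chunk) {
        stack.push(chunk);
    }

    synchronized C pop() {
        return stack.poll(); // null when the buffer is empty
    }
}
```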

Worker threads pop chunks from the chunk buffer (this is a synchronized operation, but because the number of chunks is rather small, it should not cause much contention) and scan them for pointers to live objects to be promoted. For each object being promoted, they allocate memory in their TLABs and then compete to install a forwarding pointer in the object. The winning thread then proceeds to copy the object's bits. Losing threads retract the memory they allocated and proceed to the next object.
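
As a toy model of this promotion race, assuming an AtomicReference in place of the object's header word (the real code CASes the header in raw memory, and all names here are illustrative):

```java
import java.util.concurrent.atomic.AtomicReference;

// Toy model of parallel promotion: threads compete to install a
// forwarding reference; only the winner copies the object's bits.
final class PromotionSketch {
    static final class Obj {
        final AtomicReference<Obj> forwarding = new AtomicReference<>();
        byte[] payload;
        Obj(byte[] payload) { this.payload = payload; }
    }

    /** Returns the promoted copy, whether or not this thread won the race. */
    static Obj promote(Obj original) {
        Obj copy = new Obj(null); // "allocate" space for the copy up front
        if (original.forwarding.compareAndSet(null, copy)) {
            copy.payload = original.payload.clone(); // winner copies the bits
            return copy;
        }
        // Loser: in the real GC the speculative TLAB allocation is
        // retracted here; then the winner's copy is used instead.
        return original.forwarding.get();
    }
}
```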

Limitations

  • Incremental collections are not yet supported, so effectively the OnlyCompletely collection policy is enforced. To lift this restriction, I have to make the remembered sets thread safe.
  • Only a single GC phase runs in parallel. Other phases could potentially benefit as well, such as root scanning and post-GC space cleanup.
  • The HyperAlloc benchmark runs for hours with heap verification enabled without any problems, but I haven't done much testing of corner cases such as reference queues, very large objects, or long linked lists.

Performance

Performance was measured by natively compiling and running the BigRamTester and HyperAlloc benchmarks. The numbers in the charts below are GC pause times in ms. All benchmarks were run on Ubuntu on an 8-core i7, with 8 worker threads and incremental collection turned off.

[Charts: bigram -Xmx12g, bigram -Xmx32g, hyperalloc -Xmx2g, hyperalloc -Xmx32g, hyperalloc -Xmx512m]

  • Crashes frequently in FOT (:183,:256), so use `-H:-UseRememberedSet`
  • Simple test works on 4 threads, HyperAlloc on 1 thread
christianhaeubl (Member) commented May 17, 2023

@petermz: thanks for looking into that. Your latest changes improve the situation a bit, but I am still seeing a lot of failures, e.g.:

  • NullPointerExceptions in fj-kmeans (renaissance)
  • Random segfaults (e.g., while an object is getting promoted by a GC worker thread)
  • VMError.shouldNotReachHere() in InteriorObjRefWalker.java:91 because objects have an invalid hub
  • SIGBUS errors (on macOS/AMD64 and Linux/AArch64)

petermz (Contributor, Author) commented May 19, 2023

@christianhaeubl: Crazy. I've run the complete DaCapo and Renaissance suites on four different machines overnight, and I don't see a single GC failure (there are a few unrelated ones, however). Do you use mx benchmark to build and run the benchmarks? Any custom settings such as VerifyHeap? I'm just enabling assertions and parallel GC.

@SergejIsbrecht: I'm now running those original benchmarks plus a bunch of other ones.

christianhaeubl (Member) commented May 24, 2023

@petermz: assertions and heap verification tend to hide race conditions, so I would suggest running without assertions/verification and with all optimizations enabled. It can also help to add some artificial load to your machine (e.g., via stress -c ...). Besides that, you can try reducing the Java heap size to increase the number of GCs.

Some of the crash logs look as if there is still some race in the reference handling (at least, I see old-generation java.lang.ref.WeakReference objects in the registers when the crash happens), so you might want to do some stress testing of that code path. Other crashes are just related to pretty arbitrary objects in the old generation.

A lot of the failures only happen with specific workloads (e.g., internal test cases) or with specific build-time or run-time options, but I don't see any conclusive pattern, so it is likely that those errors are caused by a general race condition that is fairly sensitive to timing or that needs a particular heap shape to occur.

Usually, multiple GC worker threads fail during the same GC (either with a segfault or Fatal error: Object with invalid hub type.). Here is the relevant information from one of the crash logs:

  SubstrateSegfaultHandler caught a segfault

  RAX 0x0000000000000000 
  RBX 0x0000000118000000 is the heap base
  RCX 0x0000000000000000 
  RDX 0x000000000007f7d0 is an unknown value
  RBP 0x000000011dd80830 is an unknown value
  RSI 0x0000000000b60900 is an unknown value
  RDI 0x000000011dd80000 is an unknown value
  RSP 0x000070000acba940 is an unknown value
  R8  0x00000000000030a0 is an unknown value
  R9  0x0000000000000000 
  R10 0x00007fff6ffd1bf5 is an unknown value
  R11 0x0000000000000246 is an unknown value
  R12 0x000000011acfffe8 points into the old generation
    is an object of type java.lang.ref.WeakReference
  R13 0x0000000000000000 
  R14 0x0000000118000000 is the heap base
  R15 0x0000000000000000 
  EFL 0x0000000000010202 is an unknown value
  RIP 0x00000001006fe283 points into AOT compiled code 

Stacktrace for the failing thread 0x000000010670b080 (A=AOT compiled, J=JIT compiled, D=deoptimized, i=inlined):
  A  SP 0x000070000acba940 IP 0x00000001006fe283 size=1200  [image code] missing metadata
  i  SP 0x000070000acbadf0 IP 0x000000010057790f size=288   [image code] com.oracle.svm.core.genscavenge.GCImpl.promoteObject(GCImpl.java:1136)
  i  SP 0x000070000acbadf0 IP 0x000000010057790f size=288   [image code] com.oracle.svm.core.genscavenge.GreyToBlackObjRefVisitor.visitObjectReferenceInline(GreyToBlackObjRefVisitor.java:112)
  i  SP 0x000070000acbadf0 IP 0x000000010057790f size=288   [image code] com.oracle.svm.core.hub.InteriorObjRefWalker.callVisitor(InteriorObjRefWalker.java:166)
  i  SP 0x000070000acbadf0 IP 0x000000010057790f size=288   [image code] com.oracle.svm.core.hub.InteriorObjRefWalker.walkObjectArray(InteriorObjRefWalker.java:155)
  i  SP 0x000070000acbadf0 IP 0x000000010057790f size=288   [image code] com.oracle.svm.core.hub.InteriorObjRefWalker.walkObjectInline(InteriorObjRefWalker.java:88)
  i  SP 0x000070000acbadf0 IP 0x000000010057790f size=288   [image code] com.oracle.svm.core.genscavenge.GreyToBlackObjectVisitor.visitObjectInline(GreyToBlackObjectVisitor.java:62)
  i  SP 0x000070000acbadf0 IP 0x000000010057790f size=288   [image code] com.oracle.svm.core.genscavenge.HeapChunk.callVisitor(HeapChunk.java:324)
  i  SP 0x000070000acbadf0 IP 0x000000010057790f size=288   [image code] com.oracle.svm.core.genscavenge.HeapChunk.walkObjectsFromInline(HeapChunk.java:314)
  i  SP 0x000070000acbadf0 IP 0x000000010057790f size=288   [image code] com.oracle.svm.core.genscavenge.UnalignedHeapChunk.walkObjectsInline(UnalignedHeapChunk.java:150)
  A  SP 0x000070000acbadf0 IP 0x000000010057790f size=288   [image code] com.oracle.svm.core.genscavenge.parallel.ParallelGC.scanChunk(ParallelGC.java:329)
  A  SP 0x000070000acbaf10 IP 0x000000010057a79e size=64    [image code] com.oracle.svm.core.genscavenge.parallel.ParallelGC.work0(ParallelGC.java:319)
  A  SP 0x000070000acbaf50 IP 0x000000010057a5ff size=32    [image code] com.oracle.svm.core.genscavenge.parallel.ParallelGC.work(ParallelGC.java:293)
  i  SP 0x000070000acbaf70 IP 0x0000000100500430 size=80    [image code] com.oracle.svm.core.genscavenge.parallel.ParallelGC.gcWorkerRun(ParallelGC.java:282)
  A  SP 0x000070000acbaf70 IP 0x0000000100500430 size=80    [image code] com.oracle.svm.core.code.IsolateEnterStub.ParallelGC_gcWorkerRun_9884278c303fd1db3c85aacc047c185ed9f7552d(IsolateEnterStub.java:0)

petermz (Contributor, Author) commented Jun 20, 2023

@christianhaeubl I've identified some potential races and fixed them in commit 59c56b0. They could indeed affect reference processing and lead to uninitialized object headers being read. I was never able to reproduce any crash myself, but hopefully these changes improve the situation.

@fniephaus fniephaus added this to the Planned for the Future milestone Jun 21, 2023
christianhaeubl (Member) commented
@petermz : thanks - I ran some of the tests with your latest fixes and stability looks good so far. I will let you know if I run into any issues.

christianhaeubl (Member) commented
@petermz: I ran more tests overnight and encountered a few more crashes. I spent some time debugging those issues and fixed the following crashes:

  • A GC worker thread could encounter a broken DynamicHub while scanning the already promoted objects in its own allocation chunk: it turned out that retractAllocationParallel(...) used an incorrect size (originalSize instead of copySize).
  • A GC worker thread could fail to promote objects: with compressed references enabled, the forwarding pointer could destroy the array length, which broke the object size computation (see the sketch below).
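
To illustrate the second hazard, a toy sketch of the safe ordering, assuming a layout where the header word sits next to the array length (the actual header layout is an assumption here):

```java
// Toy illustration of the fix's idea: read the array length (and hence
// the object size) before installing the forwarding word. In the real
// GC with compressed references, installing the forwarding pointer could
// overwrite the length field; this toy only models the safe ordering.
final class ArrayLengthSketch {
    // words[0] = header slot, words[1] = array length (assumed layout)
    static long promoteArray(long[] words, long forwardingWord) {
        long length = words[1];     // capture the length first
        words[0] = forwardingWord;  // then install the forwarding word
        return length;              // later size computations use the saved value
    }
}
```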

I will run the tests again with my fixes in place and I will let you know if I see any further crashes.

christianhaeubl (Member) commented Jun 29, 2023

@petermz: It seems that this solved most of the issues.

I am still seeing crashes on AArch64, though: after a GC, there are invalid references in the heap (most likely references that were missed or not properly adjusted). I would assume that this error is only visible on AArch64 because of its weaker memory model, or that it is a race that is just very hard to reproduce on AMD64.
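
For context, a generic sketch of the kind of ordering that weakly ordered CPUs expose; this is a standard publication pattern using a real JDK API (VarHandle.releaseFence), not a claim about what this PR's fix will look like:

```java
import java.lang.invoke.VarHandle;

// Generic publication pattern: make an object's contents visible before
// the store that publishes a reference to it. On AArch64's weaker memory
// model, missing such ordering can surface as "invalid references".
final class PublishSketch {
    static void publish(Runnable copyContents, Runnable publishReference) {
        copyContents.run();
        VarHandle.releaseFence(); // contents happen-before the publication
        publishReference.run();
    }
}
```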

petermz (Contributor, Author) commented Jul 20, 2023

@christianhaeubl I've made one more fix, this time in the promotion of unaligned chunks. The remembered-set bit was out of sync, which could lead to an assertion failing in YoungGeneration::contains and to a failure to mark a card. I've also cleaned up Space::promoteAlignedChunk since it doesn't actually need a parallel counterpart.

christianhaeubl (Member) commented
I ran the tests again. Unfortunately, the AArch64 crashes that I mentioned in my last comment are still happening.

koutheir (Contributor) commented Sep 6, 2023

I've been reading the parallel GC implementation. Thank you, @petermz, for your efforts! I've got a few comments.

(1) When copying an object between two spaces, the parallel copy synchronizes only when installing the forwarding header bytes. The contents of the object are copied to the new space later, by the thread that successfully installed the forwarding header.
Another thread that observes a forwarded object and then immediately scans the object's contents might observe invalid data, because the winning thread was not yet done copying and synchronizing the object's contents. Is this a condition that can never happen, or should the synchronization point be moved to after the object copy finishes?

(2) The entries of the array of GCWorkerThreadState are read and written by multiple threads concurrently. Each entry is smaller than a cache line, so this will probably cause cache contention and data dependencies between GC threads even when there is logically no need for that to happen (an artifact of cache coherence mechanisms).
Separating these entries, or padding them to the cache line size (usually 64 bytes), should avoid this situation.
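
A minimal sketch of such padding, assuming a 64-byte line and illustrative field names (the JDK also has the internal @jdk.internal.vm.annotation.Contended annotation for this purpose, but manual padding is portable):

```java
// Per-worker GC state padded out to at least one 64-byte cache line of
// fields, so adjacent entries in a worker-state array never share a line.
// Field names are illustrative, not the PR's actual layout.
final class PaddedWorkerStateSketch {
    long allocChunk;   // hot, per-worker fields (8 bytes each)
    long scannedBytes;
    // 6 * 8 = 48 bytes of padding; 16 + 48 = 64 bytes of fields in total.
    long p0, p1, p2, p3, p4, p5;
}
```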

petermz (Contributor, Author) commented Sep 20, 2023

Hi @koutheir, thank you for reviewing the code!

(1) I believe the situation you describe can never happen. Worker threads only scan chunks popped from the chunk queue, and chunks are queued only when they are full (or once the scan is complete). The winning thread's thread-local allocation chunk cannot be queued before copying finishes, because we know it has room for the object being copied (the copy memory is allocated prior to installing the forwarding header). The thread must first finish copying the object data and proceed to the next object; only then might it find that there is no room to copy that new object, at which point the chunk is queued and becomes available for other worker threads to scan.

On the other hand, worker threads that encounter an object with a forwarding pointer installed never scan it; they just use the pointer value to update the reference to the object.
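
A toy model of this invariant, with illustrative names and no claim to match the PR's code: a worker's current chunk escapes to the shared buffer only when a new allocation does not fit, which is necessarily after all earlier copies into it have completed on that thread.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model: the current chunk is published only when full, so a chunk
// that another worker can pop never contains a partially copied object.
final class WorkerCopyLoopSketch {
    static final int CHUNK_SIZE = 1 << 20; // 1 MB, as in the description
    final Deque<byte[]> chunkBuffer = new ArrayDeque<>(); // synchronized in real code
    byte[] currentChunk = new byte[CHUNK_SIZE];
    int top = 0;

    void copy(byte[] objectBits) {
        if (top + objectBits.length > CHUNK_SIZE) {
            chunkBuffer.push(currentChunk); // only fully written chunks escape
            currentChunk = new byte[CHUNK_SIZE];
            top = 0;
        }
        System.arraycopy(objectBits, 0, currentChunk, top, objectBits.length);
        top += objectBits.length;
    }
}
```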

(2) I actually had this false-sharing problem in mind and even did some experimenting; I think I was using 2048 bytes instead of sizeof(thread data). Curiously enough, that noticeably worsened performance in my testing, and I haven't figured out why. Maybe I should check again at some point.

koutheir (Contributor) commented

> Hi @koutheir, thank you for reviewing the code!
> (2) I actually had this false-sharing problem in mind and even did some experimenting; I think I was using 2048 bytes instead of sizeof(thread data). Curiously enough, that noticeably worsened performance in my testing, and I haven't figured out why. Maybe I should check again at some point.

2048 bytes is far too large; cache line sizes are usually 64 or 128 bytes on today's ARM machines.

petermz (Contributor, Author) commented Sep 20, 2023 via email

Labels: OCA Verified (All contributors have signed the Oracle Contributor Agreement), redhat-interest
Project status: In Progress
8 participants