implement kernel-side GC #3298

warner · 2021-06-11T08:40:41Z

This finally implements kernel-side garbage collection. There are six commits here: they are not independent (e.g. tests would not pass with only the first one landed), but they implement the new functionality in mostly non-overlapping pieces, so I'd suggest reviewing the commits one at a time (and looking at the individual commit comments for a description of what each one does).

The overall picture is:

add reference counts to all kernel objects
- not kernel promises: those already have a basic refcount, and we leave that alone
- the object refcounts include two numbers: "reachable" and "recognizable". See docs/garbage-collection.md for the distinction
- each vat c-list entry includes a isReachable flag, with a slightly different meanings for importing vats and the one exporting vat
- the kernel object gets one reachable/recognizable count for each importing c-list (both refcounts are incremented if that entry has isReachable = true, only recognizable otherwise)
- the export of an object doesn't count
- the kernel object also gets one reachable+recognizable count for each:
  - run-queue entry that references the object
  - promise resolution data that references the object (for resolved promises)
  - promise queue message that references the object (for unresolved promises)
- device c-lists can also reference the object, which pins it in place forever
when c-list entries are deleted, or marked as "unreachable", refcounts are decremented
refcounts that touch zero (however briefly) cause their krefs to be added to an ephemeral Set named maybeFreeKrefs
after each delivery, before committing the crank, maybeFreeKrefs is scanned (in a function named processRefcounts), and anything which is still zero may cause a "GC Action" to be added to a durable Set (stored in the database), which is then committed along with the rest of the crank's DB changes
each time the kernel would check the run-queue for the next delivery to make, now it first checks the GC Actions set for any actions that need to be run. gc-actions.js has an algorithm for choosing the order of actions to perform. kernel.js has a new function named processGCMessage() to execute these actions
- executing a GC action means (maybe) deleting a kernel object, translating the action into a VatDeliveryObject (and modifying/deleting c-list entries during translation), then delivering the action to the vat in its own crank.
- liveslots receives this delivery and updates its own tables
when a vat notices a Presence or Remotable has been collected, it may make a syscall to inform the kernel
these are translated into kref space as usual, and the c-lists may be modifed in the process (e.g. syscall.dropImports clears the isReachable flag in the c-list entry, and syscall.retireImports deletes the entry entirely)
the translated KernelSyscallObject is then executed, however for most of these, all the work was done during translation

The patch series includes a new set of increment-refcounts to retain some objects that would otherwise be collected (mostly during unit tests, which don't follow quite the same pathways as usual). It also includes a large batch of unit tests that attempt to cover all the interleavings I could think of in the available time.

This patch series does not attempt to do the following:

change the comms vat (so any object sent over the comms connection will remain reachable and uncollected on the exporting side)
release virtual objects at all
release Presences or Remotables which were once used in virtualized data, but were later removed

However, I think the following features should be complete when this lands:

Presences (not used in virtual objects or comms) will signal the kernel (drop+retire) when they are collected within a vat
the kernel will signal the exporting vat to drop the exported Remotable
liveslots will drop the Remotable, allowing anything it is keeping alive to be dropped too
the kernel data structures (c-lists, object table entries) will be deleted

refs #3106 , and might even close it

warner · 2021-06-11T08:42:02Z

@mhofman I know you may not have a lot of the context on this yet, but I'd appreciate your eyes on this too, if only for general coding issues and "why the heck are you doing it that way" sorts of feedback.

packages/SwingSet/docs/garbage-collection.md

FUDCo · 2021-06-14T23:23:33Z

packages/SwingSet/docs/garbage-collection.md


 We have several subtle challenges to keep in mind:

 * we cannot sense the difference between REACHABLE and UNREACHABLE, although we know it can only happen as userspace runs
-* the transition from UNREACHABLE to COLLECTED can happen spontaneously, at any moment, whenever the engine experience memory pressure and decide to run GC
+* the transition from UNREACHABLE to COLLECTED can happen spontaneously, at any moment, whenever the engine experiences memory pressure and decides to run GC
 * a new delivery might re-import an object that was already on its way out
  * the liveslots `slotToVal` WeakRef will re-use the old Presence if present (REACHABLE or UNREACHABLE)
  * if we're in COLLECTED, then a finalizer callback is already queued, and will run sooner or later, so the callback must not clobber a re-import


Reimportation messes with your model, because it's kind of like a transition from COLLECTED to REACHABLE. Except that from the kernel or vat-to-vat perspective, it's the same object being resurrected, going from COLLECTED to REACHABLE, but from the internal-to-the-vat perspective it's an entirely new object going from UNKNOWN1 to REACHABLE. Since what you're implementing (in fact, what's motivating this model entirely) is the vat-to-vat part, it's easy to overlook that we've got these two perspectives coexisting.

Yeah, I kinda need a way to explain what JS gives us (WeakRef .deref() results, finalizer callbacks) first, in isolation, to build up the notion that we can't observe certain transitions. Then, in a second section, talk about how this relates to vats (REACHABLE<->UNREACHABLE is a userspace transition, so it can't happen spontaneously after userspace is finished). Then tie it in to imports/exports. I'm definitely handwaving here.

packages/SwingSet/docs/garbage-collection.md

FUDCo · 2021-06-15T17:37:05Z

packages/SwingSet/src/kernel/state/vatKeeper.js

  getReachableAndVatSlot,
+  addMaybeFreeKref,


For some reason I'm thinking of the famous Alan Perlis quote: "If you have a procedure with 10 parameters, you probably missed some". I'm wondering if some refactoring is called for here; 16 parameters seems awfully unwieldy.

oh yeah. after this PR though :)

FUDCo · 2021-06-15T17:41:23Z

packages/SwingSet/src/kernel/state/vatKeeper.js

+    const { isReachable, vatSlot } = parseReachableAndVatSlot(
+      kvStore.get(kernelKey),
+    );
+    const { allocatedByVat } = parseVatSlot(vatSlot);


Should there be something that combines parseReachableAndVatSlot and parseVatSlot? It seems weird to parse something to get a thing which you then parse. Not a deep concern, just wondering about hygiene.

maybe.. they're nested, though. I'd have to look at the callers of parseReachableAndVatSlot to see how many of them also decompose the vref further, I'd bet it's like 50%

FUDCo · 2021-06-15T17:50:06Z

packages/SwingSet/src/kernel/state/vatKeeper.js

+      isReachable &&
+      type === 'object' &&
+      !allocatedByVat &&
+      kernelObjectExists(kernelSlot)


Pondering this kernelObjectExists check: it's present here but not in setReachableFlag. On the one hand, I suppose you couldn't be setting the flag on something that doesn't exist. On the other hand, why would you be trying to clear it if it doesn't exist? My intuition is that the check might actually still be right (e.g., don't decrement below 0), but I think the asymmetry needs some explanation.

Good observation. Yes, your first point is correct: we call setReachableFlag only from within mapVatSlotToKernelSlot/mapKernelSlotToVatSlot, just after we've added a thing to the c-list or seen an existing thing, so the kernel object exists. (Except when the setReachable flag is false, which we do when translating e.g. dispatch.retireExport into a vat, which means we've already deleted the kernel object and we need a vref to tell the vat about it, but don't want to resurrect the object).

We deliberately clear the reachable flag in three cases:

the vat does syscall.dropImports (i.e. a Presence was dropped)

the kernel does dispatch.dropExports (i.e. "hey vat please drop your Remotable")

we delete a c-list entry, because of one of five cases:

the vat does syscall.retireExports on a previously-dropped vref (a Remotable was dropped)

the vat does syscall.retireImports on a previously-dropped (i.e. unreachable) vref (i.e. a Presence was dropped even harder)

the kernel does dispatch.retireExports on a previously-dropped vref ("hey vat, nobody can recognize your old Remotable")

the kernel does dispatch.retireImports on a previously-dropped vref ("hey vat, that Presence you think you would still recognize, it's gone, forget about it")

a vat is terminated, and we deleteCListEntry everything it used to have (reachable or not)

In the drop cases, we expect isReachable to be set upon entry, and we're clearing it. In the retire cases, we'd expect it to be clear upon entry (there might be a race which could allow something to re-set it first, I'm not sure right now). In the terminate-vat case, isReachable could be in either state. It seemed safest/simplest to use a clearReachable() that could handle either state, and only decrement the reachable counter if it was set upon entry, rather than have each of these cases call a different function.

In the dispatch.retire cases, the kernel object will have been deleted a moment earlier, so we need to tolerate the lack of a reachahle/recognizable count. Which means deleteCListEntry needs to tolerate a a missing kernel object. deleteCListEntry is responsible for the decref in the vat-termination case, so it must call clearReachable(), which means clearReachableFlag() must tolerate a missing kernel object (or deleteCListEntry must sense the missing kernel object and then refrain from calling clearReachableFlag(), but that seemed worse).

I'm working on adding counters to track all the ups and downs of reachable/recognizable states, and I'm running into the same sort of questions. The increment wants to live in one place, but the decrement happens somewhere seemlngly-unrelated. I'll take another look at the code and see if there's a better way to arrange it that would make it more symmetric.

FUDCo

Whew!

Approving, but probably still worth a 1:1 live walkthrough before actually merging.

warner · 2021-06-15T20:42:27Z

Aye. I'll do some cleanups (and rebase across that mega-change-every-import commit), and then we can do a walkthrough together.

This adds a new DB key for each kernel object, which contains a comma-separated pair of (reachable, recognizable) reference counts. The vatKeeper functions that add/remove clist entries now update these counts: * the 'reachable' count is incremented whenever the reachable flag in the clist entry is changed from false to true, and decremented in the other direction * increment/decrementRefCount take options to indicate whether the kref is being uses as an import, or an export (since exports don't count towards object refcounts), and whether both reachable+recognizable counts are being changed or merely the recognizable count * deleteCListEntry takes a flag to disable decref, which will be used when deleting an object export (since the refcounts will be zero by that point) `deadKernelPromises` and `purgeDeadKernelPromises` were generalized into `maybeFreeKrefs` and `processRefcounts()`. A few other cleanups/improvements were made: * `kernel.dump()` adds `.objects` with refcounts * `kernelObjectExists()` was added as a stable way to ask the question refs #3109

This updates vatTranslator.js to translate syscalls and deliveries (GC Actions), and kernelSyscall.js to execute the translated syscalls. GC-related syscalls do most of their work during the translation phase (clearing the `isReachable` flag in the c-list entry, or deleting the entry altogether). The remaining `kernelSyscallObject` is delivered to `kernelSyscall.js`, but only `retireExports` has anything left to do (it deletes the kernel object table entry, and some day will need to delete any auxdata). GC actions (which are being sent *into* a vat, just like `dispatch.deliver`) also do significant work during translation, nearly the same as the corresponding syscalls. When the translated result is delivered to the vat (beginning a new crank), liveslots executes its half of the operation.

* change `getNextMessage` to check the GC Actions queue before considering the regular run-queue: those actions will always have higher priority * add `processGCMessage` to execute a single GC Action this simply deletes * the kernel object entry (for `retireExports`), then delivers the action to the given vat One test was updated to accomodate the new definition of "empty queue".

This increments the refcount in several cases that would otherwise allow an object to be dropped prematurely: * deviceKeeper: pin objects that pass through the device. Device c-lists lack the "reachable" flag that was recently added to vat c-lists. To prevent devices from losing access to any object that was passed in (temporarily held only by the device, and no vats or run-queue messages), we need to forcibly increment the refcount during device c-list kernel-to-vat translation. This pins those krefs forever. * `kernel.addExport()` is used to pretend that a vat has exported some object, so we can get a kref on it and e.g. submit exomessages with `kernel.queueToKref`. If a real vat had imported this kref, that vat's c-list would maintain a reference to it, keeping the refcount non-zero and inhibiting GC. But if it's merely a unit test that remembers the kref, there is no importing c-list to provide the refcount, and the object is in danger of being collected before the test can finish interacting with it. So this commit changes `kernel.addExport()` to increment the refcount by one, replacing the missing importing vat. * it also increments the refcount on slots of `kpResolution`-retrieved promise resolution data, for the same reason.

This now handles imports, exports, and promises, all in separate clauses. Exports of a dead vat must be orphaned (`.owner` cleared), imports must have their c-list entries deleted (which decrements any remaining reachable and/or recognizable refcount appropriately), and promises controlled/decided by the late vat must be rejected. Deletion of all vat-specific keys is done in a second pass, because the refcount changes made for imports probably depend upon the other vat state still being intact.

This exercises all sorts of state transitions for the new GC actions: * five interleavings of an importing vat doing drop/retire (in the same crank or separate ones) and the exporting vat reacting to the drop with a retire (again, in the same crank or some later one) * it is an error for a vat to retire before doing a drop, confirm that * two importers, dropping and maybe retiring, in various orders * confirm the object survives a promise resolution holding the only reference, and then gets dropped when that lifeline goes away * likewise for a promise queue holding the only reference, both where the enclosing message gets delivered, and when it gets rejected * likewise for the run-queue holding the only reference * likewise for a message being sent back to the exporting vat, without touching any other vat along the way * likewise when a device holds the only reference * finally, check that a terminated vat drops+retires all its imports, instead of retaining a deathgrip on them forever Most of these are tested with raw (`setup()`) vats, but the tests that need devices and vat termination use a real swingset instead.

test-controller.js uses `c.step()` to examine the kernel one crank at a time, but was written before some of those cranks are really GC actions. XS seems to do a pretty thorough job of GC when asked, so the kernel has several to do, and the test did not see the regular vat delivery progress it was expecting. The fix is to step through all GC actions before taking the one vat-delivery `c.step()` it intended to do. test-promises.js "refcount while queued" failed because the bootstrap method under test did not pin all vats in place. As a result, by the time the test tried to `c.queueToVatExport()` a message to vat-right, vat-right's root object had been collected. The fix is easy: make `bootstrap()` pin `vats.right` in place. The interesting question is why this did not fail under Node.js (`managerType='local'`). Apparently Node.js is less inclined to release objects, even when a full `gc()` is invoked. I suspect some pattern of definition hoisting or optimization is keeping the `vats` variable around longer than the source code would suggest. Might be related to #3240, but I haven't been able to find a `gc()` mode that causes Node.js to collect it.

Includes some PNG diagrams, and the OmniGraffle source for them. refs #3106

hush logs, use unmetered doomed vat

if a vat calls it before first doing syscall.dropImports

warner · 2021-06-15T21:25:39Z

force-pushing (25bff4b) a version of this branch that is rebased on top of the mega-change-all-the-imports commit, but shouldn't change any other behavior

Both `clearReachableFlag` and `decrementRefCount` are tolerant of the kernel object being missing, so we don't need the extra conditional in `deleteCListEntry`, which lets us remove the extra `decref` argument.

warner added the SwingSet package: SwingSet label Jun 11, 2021

warner added this to the Testnet: Stress Test Phase milestone Jun 11, 2021

warner requested a review from FUDCo June 11, 2021 08:40

warner self-assigned this Jun 11, 2021

warner requested a review from mhofman June 11, 2021 08:42

warner mentioned this pull request Jun 11, 2021

gcAndFinalize might be insufficient on Node.js in test-transcript.js #3240

Open

warner force-pushed the 3109-retire-exports branch from 75b41c0 to c367db8 Compare June 11, 2021 22:53

warner force-pushed the 3109-refcounting branch from e82c5bc to 5436e67 Compare June 11, 2021 22:53

warner force-pushed the 3109-retire-exports branch from c367db8 to 7a0e550 Compare June 11, 2021 23:30

warner force-pushed the 3109-refcounting branch 4 times, most recently from 5bc71d6 to dc62579 Compare June 14, 2021 22:15

Base automatically changed from 3109-retire-exports to master June 14, 2021 22:40

warner force-pushed the 3109-refcounting branch from dc62579 to 36bf06a Compare June 14, 2021 22:43

FUDCo reviewed Jun 14, 2021

View reviewed changes

FUDCo reviewed Jun 15, 2021

View reviewed changes

packages/SwingSet/docs/garbage-collection.md Show resolved Hide resolved

warner requested a review from FUDCo June 15, 2021 01:20

FUDCo reviewed Jun 15, 2021

View reviewed changes

FUDCo approved these changes Jun 15, 2021

View reviewed changes

warner added 6 commits June 15, 2021 14:07

warner added 6 commits June 15, 2021 14:09

chore(swingset): enhance docs/garbage-collection.md

f5fe99f

Includes some PNG diagrams, and the OmniGraffle source for them. refs #3106

chore(swingset): GC doc/diagram edits

b050e4e

chore(swingset): update .graffle source

25bff4b

chore(swingset): improve test-gc-kernel

53e0ea2

hush logs, use unmetered doomed vat

chore(swingset): improve error message in syscall.retireImports

32544e1

if a vat calls it before first doing syscall.dropImports

warner force-pushed the 3109-refcounting branch from 176e546 to 25bff4b Compare June 15, 2021 21:25

warner added 2 commits June 15, 2021 14:34

chore(swingset): make tests less sensitive to number of DB keys per vat

0033c3c

chore(swingset): remove deleteCListEntry() 'decref' flag

0ff0caf

Both `clearReachableFlag` and `decrementRefCount` are tolerant of the kernel object being missing, so we don't need the extra conditional in `deleteCListEntry`, which lets us remove the extra `decref` argument.

warner merged commit d05eeb2 into master Jun 16, 2021

warner deleted the 3109-refcounting branch June 16, 2021 00:51

warner mentioned this pull request Jun 16, 2021

The SwingSet GC Plan (save RAM for unreachable objects) #3106

Closed

9 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

implement kernel-side GC #3298

implement kernel-side GC #3298

warner commented Jun 11, 2021

warner commented Jun 11, 2021

FUDCo Jun 14, 2021

warner Jun 15, 2021

FUDCo Jun 15, 2021

warner Jun 15, 2021

FUDCo Jun 15, 2021

warner Jun 15, 2021

FUDCo Jun 15, 2021

warner Jun 15, 2021

FUDCo left a comment

warner commented Jun 15, 2021

warner commented Jun 15, 2021

implement kernel-side GC #3298

implement kernel-side GC #3298

Conversation

warner commented Jun 11, 2021

warner commented Jun 11, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

FUDCo left a comment

Choose a reason for hiding this comment

warner commented Jun 15, 2021

warner commented Jun 15, 2021