Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement kernel-side GC #3298

Merged
merged 14 commits into from
Jun 16, 2021
Merged

implement kernel-side GC #3298

merged 14 commits into from
Jun 16, 2021

Conversation

warner
Copy link
Member

@warner warner commented Jun 11, 2021

This finally implements kernel-side garbage collection. There are six commits here: they are not independent (e.g. tests would not pass with only the first one landed), but they implement the new functionality in mostly non-overlapping pieces, so I'd suggest reviewing the commits one at a time (and looking at the individual commit comments for a description of what each one does).

The overall picture is:

  • add reference counts to all kernel objects
    • not kernel promises: those already have a basic refcount, and we leave that alone
    • the object refcounts include two numbers: "reachable" and "recognizable". See docs/garbage-collection.md for the distinction
    • each vat c-list entry includes a isReachable flag, with a slightly different meanings for importing vats and the one exporting vat
    • the kernel object gets one reachable/recognizable count for each importing c-list (both refcounts are incremented if that entry has isReachable = true, only recognizable otherwise)
    • the export of an object doesn't count
    • the kernel object also gets one reachable+recognizable count for each:
      • run-queue entry that references the object
      • promise resolution data that references the object (for resolved promises)
      • promise queue message that references the object (for unresolved promises)
    • device c-lists can also reference the object, which pins it in place forever
  • when c-list entries are deleted, or marked as "unreachable", refcounts are decremented
  • refcounts that touch zero (however briefly) cause their krefs to be added to an ephemeral Set named maybeFreeKrefs
  • after each delivery, before committing the crank, maybeFreeKrefs is scanned (in a function named processRefcounts), and anything which is still zero may cause a "GC Action" to be added to a durable Set (stored in the database), which is then committed along with the rest of the crank's DB changes
  • each time the kernel would check the run-queue for the next delivery to make, now it first checks the GC Actions set for any actions that need to be run. gc-actions.js has an algorithm for choosing the order of actions to perform. kernel.js has a new function named processGCMessage() to execute these actions
    • executing a GC action means (maybe) deleting a kernel object, translating the action into a VatDeliveryObject (and modifying/deleting c-list entries during translation), then delivering the action to the vat in its own crank.
    • liveslots receives this delivery and updates its own tables
  • when a vat notices a Presence or Remotable has been collected, it may make a syscall to inform the kernel
  • these are translated into kref space as usual, and the c-lists may be modifed in the process (e.g. syscall.dropImports clears the isReachable flag in the c-list entry, and syscall.retireImports deletes the entry entirely)
  • the translated KernelSyscallObject is then executed, however for most of these, all the work was done during translation

The patch series includes a new set of increment-refcounts to retain some objects that would otherwise be collected (mostly during unit tests, which don't follow quite the same pathways as usual). It also includes a large batch of unit tests that attempt to cover all the interleavings I could think of in the available time.

This patch series does not attempt to do the following:

  • change the comms vat (so any object sent over the comms connection will remain reachable and uncollected on the exporting side)
  • release virtual objects at all
  • release Presences or Remotables which were once used in virtualized data, but were later removed

However, I think the following features should be complete when this lands:

  • Presences (not used in virtual objects or comms) will signal the kernel (drop+retire) when they are collected within a vat
  • the kernel will signal the exporting vat to drop the exported Remotable
  • liveslots will drop the Remotable, allowing anything it is keeping alive to be dropped too
  • the kernel data structures (c-lists, object table entries) will be deleted

refs #3106 , and might even close it

@warner warner added the SwingSet package: SwingSet label Jun 11, 2021
@warner warner added this to the Testnet: Stress Test Phase milestone Jun 11, 2021
@warner warner requested a review from FUDCo June 11, 2021 08:40
@warner warner self-assigned this Jun 11, 2021
@warner
Copy link
Member Author

warner commented Jun 11, 2021

@mhofman I know you may not have a lot of the context on this yet, but I'd appreciate your eyes on this too, if only for general coding issues and "why the heck are you doing it that way" sorts of feedback.

packages/SwingSet/docs/garbage-collection.md Outdated Show resolved Hide resolved
packages/SwingSet/docs/garbage-collection.md Show resolved Hide resolved
packages/SwingSet/docs/garbage-collection.md Outdated Show resolved Hide resolved
packages/SwingSet/docs/garbage-collection.md Show resolved Hide resolved
packages/SwingSet/docs/garbage-collection.md Show resolved Hide resolved

We have several subtle challenges to keep in mind:

* we cannot sense the difference between REACHABLE and UNREACHABLE, although we know it can only happen as userspace runs
* the transition from UNREACHABLE to COLLECTED can happen spontaneously, at any moment, whenever the engine experience memory pressure and decide to run GC
* the transition from UNREACHABLE to COLLECTED can happen spontaneously, at any moment, whenever the engine experiences memory pressure and decides to run GC
* a new delivery might re-import an object that was already on its way out
* the liveslots `slotToVal` WeakRef will re-use the old Presence if present (REACHABLE or UNREACHABLE)
* if we're in COLLECTED, then a finalizer callback is already queued, and will run sooner or later, so the callback must not clobber a re-import
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reimportation messes with your model, because it's kind of like a transition from COLLECTED to REACHABLE. Except that from the kernel or vat-to-vat perspective, it's the same object being resurrected, going from COLLECTED to REACHABLE, but from the internal-to-the-vat perspective it's an entirely new object going from UNKNOWN1 to REACHABLE. Since what you're implementing (in fact, what's motivating this model entirely) is the vat-to-vat part, it's easy to overlook that we've got these two perspectives coexisting.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I kinda need a way to explain what JS gives us (WeakRef .deref() results, finalizer callbacks) first, in isolation, to build up the notion that we can't observe certain transitions. Then, in a second section, talk about how this relates to vats (REACHABLE<->UNREACHABLE is a userspace transition, so it can't happen spontaneously after userspace is finished). Then tie it in to imports/exports. I'm definitely handwaving here.

packages/SwingSet/docs/garbage-collection.md Show resolved Hide resolved
@warner warner requested a review from FUDCo June 15, 2021 01:20
getReachableAndVatSlot,
addMaybeFreeKref,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason I'm thinking of the famous Alan Perlis quote: "If you have a procedure with 10 parameters, you probably missed some". I'm wondering if some refactoring is called for here; 16 parameters seems awfully unwieldy.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh yeah. after this PR though :)

Comment on lines +128 to +131
const { isReachable, vatSlot } = parseReachableAndVatSlot(
kvStore.get(kernelKey),
);
const { allocatedByVat } = parseVatSlot(vatSlot);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be something that combines parseReachableAndVatSlot and parseVatSlot? It seems weird to parse something to get a thing which you then parse. Not a deep concern, just wondering about hygiene.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe.. they're nested, though. I'd have to look at the callers of parseReachableAndVatSlot to see how many of them also decompose the vref further, I'd bet it's like 50%

isReachable &&
type === 'object' &&
!allocatedByVat &&
kernelObjectExists(kernelSlot)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pondering this kernelObjectExists check: it's present here but not in setReachableFlag. On the one hand, I suppose you couldn't be setting the flag on something that doesn't exist. On the other hand, why would you be trying to clear it if it doesn't exist? My intuition is that the check might actually still be right (e.g., don't decrement below 0), but I think the asymmetry needs some explanation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good observation. Yes, your first point is correct: we call setReachableFlag only from within mapVatSlotToKernelSlot/mapKernelSlotToVatSlot, just after we've added a thing to the c-list or seen an existing thing, so the kernel object exists. (Except when the setReachable flag is false, which we do when translating e.g. dispatch.retireExport into a vat, which means we've already deleted the kernel object and we need a vref to tell the vat about it, but don't want to resurrect the object).

We deliberately clear the reachable flag in three cases:

  • the vat does syscall.dropImports (i.e. a Presence was dropped)
  • the kernel does dispatch.dropExports (i.e. "hey vat please drop your Remotable")
  • we delete a c-list entry, because of one of five cases:
    • the vat does syscall.retireExports on a previously-dropped vref (a Remotable was dropped)
    • the vat does syscall.retireImports on a previously-dropped (i.e. unreachable) vref (i.e. a Presence was dropped even harder)
    • the kernel does dispatch.retireExports on a previously-dropped vref ("hey vat, nobody can recognize your old Remotable")
    • the kernel does dispatch.retireImports on a previously-dropped vref ("hey vat, that Presence you think you would still recognize, it's gone, forget about it")
    • a vat is terminated, and we deleteCListEntry everything it used to have (reachable or not)

In the drop cases, we expect isReachable to be set upon entry, and we're clearing it. In the retire cases, we'd expect it to be clear upon entry (there might be a race which could allow something to re-set it first, I'm not sure right now). In the terminate-vat case, isReachable could be in either state. It seemed safest/simplest to use a clearReachable() that could handle either state, and only decrement the reachable counter if it was set upon entry, rather than have each of these cases call a different function.

In the dispatch.retire cases, the kernel object will have been deleted a moment earlier, so we need to tolerate the lack of a reachahle/recognizable count. Which means deleteCListEntry needs to tolerate a a missing kernel object. deleteCListEntry is responsible for the decref in the vat-termination case, so it must call clearReachable(), which means clearReachableFlag() must tolerate a missing kernel object (or deleteCListEntry must sense the missing kernel object and then refrain from calling clearReachableFlag(), but that seemed worse).

I'm working on adding counters to track all the ups and downs of reachable/recognizable states, and I'm running into the same sort of questions. The increment wants to live in one place, but the decrement happens somewhere seemlngly-unrelated. I'll take another look at the code and see if there's a better way to arrange it that would make it more symmetric.

Copy link
Contributor

@FUDCo FUDCo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whew!

Approving, but probably still worth a 1:1 live walkthrough before actually merging.

@warner
Copy link
Member Author

warner commented Jun 15, 2021

Aye. I'll do some cleanups (and rebase across that mega-change-every-import commit), and then we can do a walkthrough together.

This adds a new DB key for each kernel object, which contains a
comma-separated pair of (reachable, recognizable) reference counts. The
vatKeeper functions that add/remove clist entries now update these counts:

* the 'reachable' count is incremented whenever the reachable flag in the
clist entry is changed from false to true, and decremented in the other
direction
* increment/decrementRefCount take options to indicate whether the kref is
being uses as an import, or an export (since exports don't count towards
object refcounts), and whether both reachable+recognizable counts are being
changed or merely the recognizable count
* deleteCListEntry takes a flag to disable decref, which will be used when
deleting an object export (since the refcounts will be zero by that point)

`deadKernelPromises` and `purgeDeadKernelPromises` were generalized into
`maybeFreeKrefs` and `processRefcounts()`.

A few other cleanups/improvements were made:
* `kernel.dump()` adds `.objects` with refcounts
* `kernelObjectExists()` was added as a stable way to ask the question

refs #3109
This updates vatTranslator.js to translate syscalls and deliveries (GC Actions),
and kernelSyscall.js to execute the translated syscalls.

GC-related syscalls do most of their work during the translation
phase (clearing the `isReachable` flag in the c-list entry, or deleting the
entry altogether). The remaining `kernelSyscallObject` is delivered to
`kernelSyscall.js`, but only `retireExports` has anything left to do (it
deletes the kernel object table entry, and some day will need to delete any
auxdata).

GC actions (which are being sent *into* a vat, just like `dispatch.deliver`)
also do significant work during translation, nearly the same as the
corresponding syscalls. When the translated result is delivered to the
vat (beginning a new crank), liveslots executes its half of the operation.
* change `getNextMessage` to check the GC Actions queue before considering
the regular run-queue: those actions will always have higher priority
* add `processGCMessage` to execute a single GC Action this simply deletes
  * the kernel object entry (for `retireExports`), then delivers the action
    to the given vat

One test was updated to accomodate the new definition of "empty queue".
This increments the refcount in several cases that would otherwise allow an
object to be dropped prematurely:

* deviceKeeper: pin objects that pass through the device. Device c-lists lack
the "reachable" flag that was recently added to vat c-lists. To prevent
devices from losing access to any object that was passed in (temporarily held
only by the device, and no vats or run-queue messages), we need to forcibly
increment the refcount during device c-list kernel-to-vat translation. This
pins those krefs forever.

* `kernel.addExport()` is used to pretend that a vat has exported some
object, so we can get a kref on it and e.g. submit exomessages with
`kernel.queueToKref`. If a real vat had imported this kref, that vat's c-list
would maintain a reference to it, keeping the refcount non-zero and
inhibiting GC. But if it's merely a unit test that remembers the kref, there
is no importing c-list to provide the refcount, and the object is in danger
of being collected before the test can finish interacting with it. So this
commit changes `kernel.addExport()` to increment the refcount by one,
replacing the missing importing vat.

* it also increments the refcount on slots of `kpResolution`-retrieved
promise resolution data, for the same reason.
This now handles imports, exports, and promises, all in separate clauses.
Exports of a dead vat must be orphaned (`.owner` cleared), imports must have
their c-list entries deleted (which decrements any remaining reachable and/or
recognizable refcount appropriately), and promises controlled/decided by the
late vat must be rejected.

Deletion of all vat-specific keys is done in a second pass, because the
refcount changes made for imports probably depend upon the other vat state
still being intact.
This exercises all sorts of state transitions for the new GC actions:

* five interleavings of an importing vat doing drop/retire (in the same crank
or separate ones) and the exporting vat reacting to the drop with a
retire (again, in the same crank or some later one)
* it is an error for a vat to retire before doing a drop, confirm that
* two importers, dropping and maybe retiring, in various orders
* confirm the object survives a promise resolution holding the only
reference, and then gets dropped when that lifeline goes away
* likewise for a promise queue holding the only reference, both where the
enclosing message gets delivered, and when it gets rejected
* likewise for the run-queue holding the only reference
* likewise for a message being sent back to the exporting vat, without
touching any other vat along the way
* likewise when a device holds the only reference
* finally, check that a terminated vat drops+retires all its imports, instead
of retaining a deathgrip on them forever

Most of these are tested with raw (`setup()`) vats, but the tests that need
devices and vat termination use a real swingset instead.
test-controller.js uses `c.step()` to examine the kernel one crank at a time,
but was written before some of those cranks are really GC actions. XS seems
to do a pretty thorough job of GC when asked, so the kernel has several to
do, and the test did not see the regular vat delivery progress it was
expecting. The fix is to step through all GC actions before taking the one
vat-delivery `c.step()` it intended to do.

test-promises.js "refcount while queued" failed because the bootstrap method
under test did not pin all vats in place. As a result, by the time the test
tried to `c.queueToVatExport()` a message to vat-right, vat-right's root
object had been collected. The fix is easy: make `bootstrap()` pin
`vats.right` in place.

The interesting question is why this did not fail under Node.js
(`managerType='local'`). Apparently Node.js is less inclined to release
objects, even when a full `gc()` is invoked. I suspect some pattern of
definition hoisting or optimization is keeping the `vats` variable around
longer than the source code would suggest. Might be related to #3240, but I
haven't been able to find a `gc()` mode that causes Node.js to collect it.
Includes some PNG diagrams, and the OmniGraffle source for them.

refs #3106
hush logs, use unmetered doomed vat
if a vat calls it before first doing syscall.dropImports
@warner
Copy link
Member Author

warner commented Jun 15, 2021

force-pushing (25bff4b) a version of this branch that is rebased on top of the mega-change-all-the-imports commit, but shouldn't change any other behavior

Both `clearReachableFlag` and `decrementRefCount` are tolerant of the kernel
object being missing, so we don't need the extra conditional in
`deleteCListEntry`, which lets us remove the extra `decref` argument.
@warner warner merged commit d05eeb2 into master Jun 16, 2021
@warner warner deleted the 3109-refcounting branch June 16, 2021 00:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
SwingSet package: SwingSet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants