-
Notifications
You must be signed in to change notification settings - Fork 205
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
implement kernel-side GC #3298
implement kernel-side GC #3298
Conversation
@mhofman I know you may not have a lot of the context on this yet, but I'd appreciate your eyes on this too, if only for general coding issues and "why the heck are you doing it that way" sorts of feedback. |
75b41c0
to
c367db8
Compare
c367db8
to
7a0e550
Compare
5bc71d6
to
dc62579
Compare
|
||
We have several subtle challenges to keep in mind: | ||
|
||
* we cannot sense the difference between REACHABLE and UNREACHABLE, although we know it can only happen as userspace runs | ||
* the transition from UNREACHABLE to COLLECTED can happen spontaneously, at any moment, whenever the engine experience memory pressure and decide to run GC | ||
* the transition from UNREACHABLE to COLLECTED can happen spontaneously, at any moment, whenever the engine experiences memory pressure and decides to run GC | ||
* a new delivery might re-import an object that was already on its way out | ||
* the liveslots `slotToVal` WeakRef will re-use the old Presence if present (REACHABLE or UNREACHABLE) | ||
* if we're in COLLECTED, then a finalizer callback is already queued, and will run sooner or later, so the callback must not clobber a re-import |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reimportation messes with your model, because it's kind of like a transition from COLLECTED to REACHABLE. Except that from the kernel or vat-to-vat perspective, it's the same object being resurrected, going from COLLECTED to REACHABLE, but from the internal-to-the-vat perspective it's an entirely new object going from UNKNOWN1 to REACHABLE. Since what you're implementing (in fact, what's motivating this model entirely) is the vat-to-vat part, it's easy to overlook that we've got these two perspectives coexisting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I kinda need a way to explain what JS gives us (WeakRef .deref()
results, finalizer callbacks) first, in isolation, to build up the notion that we can't observe certain transitions. Then, in a second section, talk about how this relates to vats (REACHABLE<->UNREACHABLE is a userspace transition, so it can't happen spontaneously after userspace is finished). Then tie it in to imports/exports. I'm definitely handwaving here.
getReachableAndVatSlot, | ||
addMaybeFreeKref, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For some reason I'm thinking of the famous Alan Perlis quote: "If you have a procedure with 10 parameters, you probably missed some". I'm wondering if some refactoring is called for here; 16 parameters seems awfully unwieldy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh yeah. after this PR though :)
const { isReachable, vatSlot } = parseReachableAndVatSlot( | ||
kvStore.get(kernelKey), | ||
); | ||
const { allocatedByVat } = parseVatSlot(vatSlot); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should there be something that combines parseReachableAndVatSlot
and parseVatSlot
? It seems weird to parse something to get a thing which you then parse. Not a deep concern, just wondering about hygiene.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe.. they're nested, though. I'd have to look at the callers of parseReachableAndVatSlot
to see how many of them also decompose the vref further, I'd bet it's like 50%
isReachable && | ||
type === 'object' && | ||
!allocatedByVat && | ||
kernelObjectExists(kernelSlot) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pondering this kernelObjectExists
check: it's present here but not in setReachableFlag
. On the one hand, I suppose you couldn't be setting the flag on something that doesn't exist. On the other hand, why would you be trying to clear it if it doesn't exist? My intuition is that the check might actually still be right (e.g., don't decrement below 0), but I think the asymmetry needs some explanation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good observation. Yes, your first point is correct: we call setReachableFlag
only from within mapVatSlotToKernelSlot
/mapKernelSlotToVatSlot
, just after we've added a thing to the c-list or seen an existing thing, so the kernel object exists. (Except when the setReachable
flag is false, which we do when translating e.g. dispatch.retireExport
into a vat, which means we've already deleted the kernel object and we need a vref to tell the vat about it, but don't want to resurrect the object).
We deliberately clear the reachable flag in three cases:
- the vat does
syscall.dropImports
(i.e. a Presence was dropped) - the kernel does
dispatch.dropExports
(i.e. "hey vat please drop your Remotable") - we delete a c-list entry, because of one of five cases:
- the vat does
syscall.retireExports
on a previously-dropped vref (a Remotable was dropped) - the vat does
syscall.retireImports
on a previously-dropped (i.e. unreachable) vref (i.e. a Presence was dropped even harder) - the kernel does
dispatch.retireExports
on a previously-dropped vref ("hey vat, nobody can recognize your old Remotable") - the kernel does
dispatch.retireImports
on a previously-dropped vref ("hey vat, that Presence you think you would still recognize, it's gone, forget about it") - a vat is terminated, and we
deleteCListEntry
everything it used to have (reachable or not)
- the vat does
In the drop
cases, we expect isReachable
to be set upon entry, and we're clearing it. In the retire
cases, we'd expect it to be clear upon entry (there might be a race which could allow something to re-set it first, I'm not sure right now). In the terminate-vat case, isReachable
could be in either state. It seemed safest/simplest to use a clearReachable()
that could handle either state, and only decrement the reachable
counter if it was set upon entry, rather than have each of these cases call a different function.
In the dispatch.retire
cases, the kernel object will have been deleted a moment earlier, so we need to tolerate the lack of a reachahle/recognizable count. Which means deleteCListEntry
needs to tolerate a a missing kernel object. deleteCListEntry
is responsible for the decref in the vat-termination case, so it must call clearReachable()
, which means clearReachableFlag()
must tolerate a missing kernel object (or deleteCListEntry
must sense the missing kernel object and then refrain from calling clearReachableFlag()
, but that seemed worse).
I'm working on adding counters to track all the ups and downs of reachable/recognizable states, and I'm running into the same sort of questions. The increment wants to live in one place, but the decrement happens somewhere seemlngly-unrelated. I'll take another look at the code and see if there's a better way to arrange it that would make it more symmetric.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whew!
Approving, but probably still worth a 1:1 live walkthrough before actually merging.
Aye. I'll do some cleanups (and rebase across that mega-change-every-import commit), and then we can do a walkthrough together. |
This adds a new DB key for each kernel object, which contains a comma-separated pair of (reachable, recognizable) reference counts. The vatKeeper functions that add/remove clist entries now update these counts: * the 'reachable' count is incremented whenever the reachable flag in the clist entry is changed from false to true, and decremented in the other direction * increment/decrementRefCount take options to indicate whether the kref is being uses as an import, or an export (since exports don't count towards object refcounts), and whether both reachable+recognizable counts are being changed or merely the recognizable count * deleteCListEntry takes a flag to disable decref, which will be used when deleting an object export (since the refcounts will be zero by that point) `deadKernelPromises` and `purgeDeadKernelPromises` were generalized into `maybeFreeKrefs` and `processRefcounts()`. A few other cleanups/improvements were made: * `kernel.dump()` adds `.objects` with refcounts * `kernelObjectExists()` was added as a stable way to ask the question refs #3109
This updates vatTranslator.js to translate syscalls and deliveries (GC Actions), and kernelSyscall.js to execute the translated syscalls. GC-related syscalls do most of their work during the translation phase (clearing the `isReachable` flag in the c-list entry, or deleting the entry altogether). The remaining `kernelSyscallObject` is delivered to `kernelSyscall.js`, but only `retireExports` has anything left to do (it deletes the kernel object table entry, and some day will need to delete any auxdata). GC actions (which are being sent *into* a vat, just like `dispatch.deliver`) also do significant work during translation, nearly the same as the corresponding syscalls. When the translated result is delivered to the vat (beginning a new crank), liveslots executes its half of the operation.
* change `getNextMessage` to check the GC Actions queue before considering the regular run-queue: those actions will always have higher priority * add `processGCMessage` to execute a single GC Action this simply deletes * the kernel object entry (for `retireExports`), then delivers the action to the given vat One test was updated to accomodate the new definition of "empty queue".
This increments the refcount in several cases that would otherwise allow an object to be dropped prematurely: * deviceKeeper: pin objects that pass through the device. Device c-lists lack the "reachable" flag that was recently added to vat c-lists. To prevent devices from losing access to any object that was passed in (temporarily held only by the device, and no vats or run-queue messages), we need to forcibly increment the refcount during device c-list kernel-to-vat translation. This pins those krefs forever. * `kernel.addExport()` is used to pretend that a vat has exported some object, so we can get a kref on it and e.g. submit exomessages with `kernel.queueToKref`. If a real vat had imported this kref, that vat's c-list would maintain a reference to it, keeping the refcount non-zero and inhibiting GC. But if it's merely a unit test that remembers the kref, there is no importing c-list to provide the refcount, and the object is in danger of being collected before the test can finish interacting with it. So this commit changes `kernel.addExport()` to increment the refcount by one, replacing the missing importing vat. * it also increments the refcount on slots of `kpResolution`-retrieved promise resolution data, for the same reason.
This now handles imports, exports, and promises, all in separate clauses. Exports of a dead vat must be orphaned (`.owner` cleared), imports must have their c-list entries deleted (which decrements any remaining reachable and/or recognizable refcount appropriately), and promises controlled/decided by the late vat must be rejected. Deletion of all vat-specific keys is done in a second pass, because the refcount changes made for imports probably depend upon the other vat state still being intact.
This exercises all sorts of state transitions for the new GC actions: * five interleavings of an importing vat doing drop/retire (in the same crank or separate ones) and the exporting vat reacting to the drop with a retire (again, in the same crank or some later one) * it is an error for a vat to retire before doing a drop, confirm that * two importers, dropping and maybe retiring, in various orders * confirm the object survives a promise resolution holding the only reference, and then gets dropped when that lifeline goes away * likewise for a promise queue holding the only reference, both where the enclosing message gets delivered, and when it gets rejected * likewise for the run-queue holding the only reference * likewise for a message being sent back to the exporting vat, without touching any other vat along the way * likewise when a device holds the only reference * finally, check that a terminated vat drops+retires all its imports, instead of retaining a deathgrip on them forever Most of these are tested with raw (`setup()`) vats, but the tests that need devices and vat termination use a real swingset instead.
test-controller.js uses `c.step()` to examine the kernel one crank at a time, but was written before some of those cranks are really GC actions. XS seems to do a pretty thorough job of GC when asked, so the kernel has several to do, and the test did not see the regular vat delivery progress it was expecting. The fix is to step through all GC actions before taking the one vat-delivery `c.step()` it intended to do. test-promises.js "refcount while queued" failed because the bootstrap method under test did not pin all vats in place. As a result, by the time the test tried to `c.queueToVatExport()` a message to vat-right, vat-right's root object had been collected. The fix is easy: make `bootstrap()` pin `vats.right` in place. The interesting question is why this did not fail under Node.js (`managerType='local'`). Apparently Node.js is less inclined to release objects, even when a full `gc()` is invoked. I suspect some pattern of definition hoisting or optimization is keeping the `vats` variable around longer than the source code would suggest. Might be related to #3240, but I haven't been able to find a `gc()` mode that causes Node.js to collect it.
Includes some PNG diagrams, and the OmniGraffle source for them. refs #3106
hush logs, use unmetered doomed vat
if a vat calls it before first doing syscall.dropImports
force-pushing (25bff4b) a version of this branch that is rebased on top of the mega-change-all-the-imports commit, but shouldn't change any other behavior |
Both `clearReachableFlag` and `decrementRefCount` are tolerant of the kernel object being missing, so we don't need the extra conditional in `deleteCListEntry`, which lets us remove the extra `decref` argument.
This finally implements kernel-side garbage collection. There are six commits here: they are not independent (e.g. tests would not pass with only the first one landed), but they implement the new functionality in mostly non-overlapping pieces, so I'd suggest reviewing the commits one at a time (and looking at the individual commit comments for a description of what each one does).
The overall picture is:
isReachable
flag, with a slightly different meanings for importing vats and the one exporting vatisReachable = true
, onlyrecognizable
otherwise)maybeFreeKrefs
maybeFreeKrefs
is scanned (in a function namedprocessRefcounts
), and anything which is still zero may cause a "GC Action" to be added to a durable Set (stored in the database), which is then committed along with the rest of the crank's DB changesgc-actions.js
has an algorithm for choosing the order of actions to perform.kernel.js
has a new function namedprocessGCMessage()
to execute these actionsVatDeliveryObject
(and modifying/deleting c-list entries during translation), then delivering the action to the vat in its own crank.syscall.dropImports
clears theisReachable
flag in the c-list entry, andsyscall.retireImports
deletes the entry entirely)KernelSyscallObject
is then executed, however for most of these, all the work was done during translationThe patch series includes a new set of increment-refcounts to retain some objects that would otherwise be collected (mostly during unit tests, which don't follow quite the same pathways as usual). It also includes a large batch of unit tests that attempt to cover all the interleavings I could think of in the available time.
This patch series does not attempt to do the following:
reachable
and uncollected on the exporting side)However, I think the following features should be complete when this lands:
refs #3106 , and might even close it