Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

syscall.retireImports translation error (when deleting WeakSet) #9939

Closed
warner opened this issue Aug 22, 2024 · 13 comments · Fixed by #9942
Closed

syscall.retireImports translation error (when deleting WeakSet) #9939

warner opened this issue Aug 22, 2024 · 13 comments · Fixed by #9942
Assignees
Labels
liveslots requires vat-upgrade to deploy changes SwingSet package: SwingSet

Comments

@warner
Copy link
Member

warner commented Aug 22, 2024

Michael observed an intermittent CI failure with:

----- RemoteChainFacade.5  2 making a RemoteChainFacade
error during syscall translation: (Error#7)
Error#7: syscall.retireImports but o-54 is still reachable
    at eval (.../swingset-vat/src/kernel/vatTranslator.js:437:7)

I was able to reproduce it locally (and capture slogfiles) by re-running yarn test test/orchestration/restart-contracts.test.ts in packages/boot on that PR's branch a dozen times.

We're still analyzing the slogs, but it looks like liveslots misbehaved, and did a retireImports without doing a dropImports first.

@mhofman suspects that our use of WeakRef probing (rather than relying upon the FinalizationRegistry's notification) is causing problems under V8 which we wouldn't see under XS, and this test is using local workers, not xsnap.

@warner warner added liveslots requires vat-upgrade to deploy changes SwingSet package: SwingSet labels Aug 22, 2024
@warner
Copy link
Member Author

warner commented Aug 22, 2024

I captured two different slogfiles of non-crashing runs, and they show different GC behavior, which explains why the failure is intermittent: V8 is being weird about GC (our old friend #3240), and some runs see different WeakRef probes or finalization notifications than others. But liveslots is supposed to be tolerant of that, as long as the engine is consistent, so this points to either a bug in liveslots, or some interesting failure mode of the V8 GC sensors and how liveslots is using them.

@mhofman
Copy link
Member

mhofman commented Aug 22, 2024

@michaelfig
Copy link
Member

I saw the same failure in CI under Node.js v18, ruling out that v20 is the only culprit:

----- RemoteChainFacade.5  2 making a RemoteChainFacade
error during syscall translation: (Error#7)
Error#7: syscall.retireImports but o-54 is still reachable
    at eval (.../swingset-vat/src/kernel/vatTranslator.js:437:7)
    at Array.map (<anonymous>)
    at translateRetireImports (.../swingset-vat/src/kernel/vatTranslator.js:431:19)
    at Object.vatSyscallToKernelSyscall (.../swingset-vat/src/kernel/vatTranslator.js:620:8)
    at vatSyscallHandler (.../swingset-vat/src/kernel/kernel.js:1563:17)
    at syscallHandler (.../swingset-vat/src/kernel/vat-warehouse.js:42:12)
    at syscallFromWorker (.../swingset-vat/src/kernel/vat-loader/manager-helper.js:164:11)
    at doSyscall (.../swingset-vat/src/supervisors/supervisor-helper.js:72:3)
    at Object.retireImports (.../swingset-vat/src/supervisors/supervisor-helper.js:104:24)
    at scanForDeadObjects (.../swingset-liveslots/src/liveslots.js:338:9)
    at async bringOutYourDead (.../swingset-liveslots/src/liveslots.js:1549:1)
    at async Object.deliver (.../swingset-vat/src/kernel/vat-loader/manager-helper.js:146:14)
    at async Object.deliverToVat (.../swingset-vat/src/kernel/vat-warehouse.js:537:22)
    at async deliverAndLogToVat (.../swingset-vat/src/kernel/kernel.js:425:22)
    at async processBringOutYourDead (.../swingset-vat/src/kernel/kernel.js:666:14)
    at async deliverRunQueueEvent (.../swingset-vat/src/kernel/kernel.js:1298:15)
    at async processDeliveryMessage (.../swingset-vat/src/kernel/kernel.js:1337:20)
    at async tryProcessMessage (.../swingset-vat/src/kernel/kernel.js:1438:14)
    at async Object.run (.../swingset-vat/src/kernel/kernel.js:1964:19)

error during vat dispatch() of bringOutYourDead (Error#8)
Error#8: syscall.retireImports failed: syscall translation error: prepare to die
    at doSyscall (.../swingset-vat/src/supervisors/supervisor-helper.js:87:7)
    at Object.retireImports (.../swingset-vat/src/supervisors/supervisor-helper.js:104:24)
    at scanForDeadObjects (.../swingset-liveslots/src/liveslots.js:338:9)
    at async bringOutYourDead (.../swingset-liveslots/src/liveslots.js:1549:1)
    at async Object.deliver (.../swingset-vat/src/kernel/vat-loader/manager-helper.js:146:14)
    at async Object.deliverToVat (.../swingset-vat/src/kernel/vat-warehouse.js:537:22)
    at async deliverAndLogToVat (.../swingset-vat/src/kernel/kernel.js:425:22)
    at async processBringOutYourDead (.../swingset-vat/src/kernel/kernel.js:666:14)
    at async deliverRunQueueEvent (.../swingset-vat/src/kernel/kernel.js:1298:15)
    at async processDeliveryMessage (.../swingset-vat/src/kernel/kernel.js:1337:20)
    at async tryProcessMessage (.../swingset-vat/src/kernel/kernel.js:1438:14)
    at async Object.run (.../swingset-vat/src/kernel/kernel.js:1964:19)

kernel terminating vat v15 (failure=true)
##### KERNEL PANIC: critical vat v15 failed #####
controller.run() failure (Error#9)
Error#9: #{"#error":"syscall translation error: prepare to die","name":"Error"}
    at terminateVat (.../swingset-vat/src/kernel/kernel.js:304:39)
    at processDeliveryMessage (.../swingset-vat/src/kernel/kernel.js:1401:7)
    at async tryProcessMessage (.../swingset-vat/src/kernel/kernel.js:1438:14)
    at async Object.run (.../swingset-vat/src/kernel/kernel.js:1964:19)

not ok 12 - orchestration › restart-contracts › basicFlows %ava-dur=17379ms
#   start basicFlows
#   making offer
#   REJECTED from ava test.serial("basicFlows"): (Error#9)
#   Error#9: #{"#error":"syscall translation error: prepare to die","name":"Error"}
#       at terminateVat (.../swingset-vat/src/kernel/kernel.js:304:39)
#       at processDeliveryMessage (.../swingset-vat/src/kernel/kernel.js:1401:7)
#       at async tryProcessMessage (.../swingset-vat/src/kernel/kernel.js:1438:14)
#       at async Object.run (.../swingset-vat/src/kernel/kernel.js:1964:19)

  ---
    name: AssertionError
    message: Rejected promise returned by test
    values:
      'Rejected promise returned by test. Reason:': |-
        Error {
          message: '#{"#error":"syscall translation error: prepare to die","name":"Error"}',
        }
    at: |-
      terminateVat (.../swingset-vat/src/kernel/kernel.js:304:39)
      processDeliveryMessage (.../swingset-vat/src/kernel/kernel.js:1401:7)
      async tryProcessMessage (.../swingset-vat/src/kernel/kernel.js:1438:14)
      async Object.run (.../swingset-vat/src/kernel/kernel.js:1964:19)
  ...

@warner
Copy link
Member Author

warner commented Aug 22, 2024

ok I think I see a pathway by which a voAwareWeakMap getting deleted could put us in a situation where a Presence vref (o-54) is in the possiblyRetiredSet despite there still being a vdata pillar.

The slog trace of the failing run shows v15 doing a BOYD which does three passes through the scanForDeadObjects loop. On the third pass, we do a recognizer check for o-54 (ir.o-54) without first doing a refcount check, which I think is the possiblyRetiredSet scan (the second of the three clauses). It didn't show any previous recognizer checks.

For o-54 to have gotten into possiblyRetiredSet after the second pass cleared the set, but before the third pass did the check, would require the droppedCollectionRegistry callback to fire during the gcAndFinalize()at the top of the third pass. That would call finalizeDroppedCollection(), which would be routed to voAwareWeakMapDelete(), which would walk all keys of the weakmap and add their vrefs to possiblyRetiredSet.

The scenario I'm visualizing is where the vat has a voAwareWeakMap, this o-54 Presence is added as a key, then userspace drops the weakmap. On the next BOYD, the first two passes do their gcAndFinalize(), but for some weird V8 reason, that's not enough to shake the thing loose. Only on the third pass does it finally get collected, and the finalizer runs. The third pass' possiblyRetiredSet scan notices that o-54 indeed lacks any recognizers (neither in-RAM weakmap/weakset, nor durable ir.o-54 keys), and adds it to importsToRetire, despite the fact that o-54 is still reachable (from the o+d11/19 VOM state).

If the first pass was able to shake it loose, I'd expect finalizeDroppedCollection to be called during the first gcAndFinalize(), which would cause a recognizer check in the first pass, just before we start deleting the first batch of VOMs. It would still be buggy (we'd be retiring something without first dropping it), but the syscall trace would be different than what I captured from the failing run.

So maybe this condition could be reached without V8 being weird, which raises the concerning possibility that it could be reached from XS. It would require a Presence to be reachable from virtual data and also recognizable from a WeakMap or WeakSet, but not have a RAM pillar. The RAM pillar should be dropped, then a BOYD should be done. Then the WeakMap should be deleted, then we do a second BOYD. (If a single BOYD both processed the RAM-pillar drop and the possiblyRetiredSet scan, then the !deadSet.has(vref) clause would skip the isVrefRecognizable() and the retirement).

I think the fix will be for the possiblyRetiredSet scan to include a reachability check: if the vref is reachable, it must not be added to importsToRetire.

I don't understand the !deadSet.has(vref) clause in that loop. My vague memory is that it exists to avoid duplicating work, but it smells like it's also meant to guard against the not-dead-yet case.

@warner
Copy link
Member Author

warner commented Aug 22, 2024

I was able to write a unit test that triggers this.. defintely a liveslots bug.

warner added a commit that referenced this issue Aug 22, 2024
@warner warner self-assigned this Aug 22, 2024
@warner warner changed the title syscall.retireImports translation error (under v8) syscall.retireImports translation error (when deleting WeakSet) Aug 22, 2024
@warner
Copy link
Member Author

warner commented Aug 22, 2024

Ok I think the requirements to trigger the bug are:

  • a Presence is used as the key of a WeakMap/WeakSet, and also held in a virtual/durable object state (or collection)
  • the Presence is dropped, and a BOYD collects it
  • then, the WeakMap/Set is deleted
  • then, a second BOYD notices the recognizable decref from the deleted weak collection, and emits the retireImports (which is bogus because the vref is still reachable, from vdata, which is why the vat did not previously emit a dropImports)

I don't yet know if using the vref in a (virtual/durable) WeakStore would trigger the problem.

@michaelfig tells me that @agoric/vow uses short-lived WeakSets to detect resolution cycles, and the new (as of upgrade16) vat-network uses @agoric/vow extensively. That probably explains why we didn't see this before.

I think it might also be exacerbated by the reapDirtThreshold.gcKrefs that I landed on trunk last week (for the benefit of slow-vat-deletion). That probably causes BOYDs to happen faster than before, which might increase the chance that the Presence is collected in a different BOYD than the WeakSet being collected.

@mhofman
Copy link
Member

mhofman commented Aug 22, 2024

More specifically, Vow watch uses a WeakMap of WeakSet, keyed on a promise watcher (an exo object so virtualized) with the WeakSet containing presences or representiatives of vows.

The "then, the WeakMap/Set is deleted" would be caused by liveslots collecting the watcher in the first BOYD, which would cause the WeakSet to subsequently become collectible by the engine.

// We use an ephemeral WeakSet for the previously seen vows in a watch operation
// While watch is durable, it suffices to detect the cycle in a single incarnation
/** @type {WeakMap<PromiseWatcher, WeakSet<any>>} */
const watcherSeenPayloads = new WeakMap();
/** @param {PromiseWatcher} watcher */
const getSeenPayloads = watcher => {
let seenPayloads = watcherSeenPayloads.get(watcher);
if (!seenPayloads) {
seenPayloads = new WeakSet();
watcherSeenPayloads.set(watcher, seenPayloads);
}
return seenPayloads;
};

@warner
Copy link
Member Author

warner commented Aug 22, 2024

So the way I want to fix this is to make the liveslots scanForDeadObjects loop be safely idempotent: it should be safe, just horribly inefficient, to put every vref (or at least every type: object vref, so every void) into both possiblyDeadSet and possiblyRetiredSet, and have scanForDeadObjects recognize the ones that aren't actually dead or retired. Part of the required behavior is that we only retire something if it was already dropped (or marked for drop by the end of the BOYD).

@erights
Copy link
Member

erights commented Aug 22, 2024

For these short-lived WeakSets and/or WeakMaps as used by vows, what are the keys? What are the expected lifetimes of those keys? Do we expect such keys to have been used as keys in many short-lived WeakSets/WeakMaps?

I ask because that falls into the one case where XS weak gc is expensive, and is expected to remain expensive. Attn @phoddie @patrick-soquet

In general, when we know the WeakSet/WeakMap is expected to have a shorter lifetime than its keys, and its keys will be reused across several such WeakSets/WeakMaps, then we should ask if we should just use a short-lived Set/Map instead. If we can, we should. That's why I am relaxed about the remaining expense of that one case.

@warner
Copy link
Member Author

warner commented Aug 22, 2024

So the way I want to fix this is to make the liveslots scanForDeadObjects loop be safely idempotent

I realized I can't do that, at least not for imports (Presences).

If we'd included an additional "import status" key (e.g. vNN.vc.vom.is.o-123), where we track the vref's state relative to the kernel (so reachable vs merely-recognizable vs nothing), then we could compare "has RAM pillar" (slotToVal.get(vref)) OR "has vdata pillar" (vom.rc.${vref}) against the import status. If the vref was merely-recognizable or missing entirely, but the import status says "reachable", then we emit a syscall.dropImports and change the import status to "merely-recognizable". If the pillars are missing and we have no recognizers for the vref, but the import status is "merely-recognizable", then we emit a syscall.retireImports and delete the import status.

We can't really retrofit that on now, and it would cost an extra DB key per vref so I'm not positive we'd be willing to pay that cost anyways. So we've got a constraint to manage: you must not add a vref to possiblyDeadSet that is not currently reachable. When we arrive in scanForDeadObjects(), we can rely upon every vref in possiblyDeadSet having an (implicit) import status of "reachable", and if our reachability check says "no", then we can safely dropImports it without contradicting any earlier syscalls. Likewise, you must not add a vref to possiblyRetiredSet that is not at least recognizable, so a recognizable===false result can safely do a syscall.retireImports.

One other improvement we can make is to remove some of the duplicate refcount checking. For a virtual/durable object (whose RAM pillar is a Representative), the first phase does a .rc.${vref} and .es.${vref} check (to see if the object is really dead), and the third phase deletes the VOM (which starts with an assertion of unreachability, which does a second .rc and .es check).

So I'm trying to sketch out a clean way to organize scanForDeadObjects that will avoid some of the duplication and also lead to a cleaner code path. I'm wondering if I can process one vref at a time, instead of handling all of possiblyDeadSet in one phase, then all of possiblyRetiredSet in a second phase, then all of deadSet in the third phase. And I'd like to separate the codepaths for our three kinds of trackable objects (with RAM pillars of Remotables, Representatives, and Presences, which map to vref shapes of o+NN, o+dKK/NN, and o-NN).

It may help to cache the reachability/recognizability check, but we have to be careful about cache invalidation. We could share the check between the first and second phases, but the moment we delete a VOM, we might invalidate the refcounts.

@mhofman
Copy link
Member

mhofman commented Aug 23, 2024

I ask because that falls into the one case where XS weak gc is expensive, and is expected to remain expensive.

These are liveslots virtualized WeakSets used most likely (but not always) with virtual or durable objects as entries, as such the GC profile of XS does not come into play, but instead it's the liveslots gc behavior that does.

@mhofman
Copy link
Member

mhofman commented Aug 23, 2024

If we'd included an additional "import status" key (e.g. vNN.vc.vom.is.o-123), where we track the vref's state relative to the kernel (so reachable vs merely-recognizable vs nothing)

This sounds related to my suggested fix for #9338: track in durable storage the full status of presences and representatives.

warner added a commit that referenced this issue Aug 26, 2024
refs #9939

WIP: more precise failing test
warner added a commit that referenced this issue Aug 26, 2024
These tests will fail without the fixes in the next commit

One new test is disabled because of additional pending bugs that
interfere with the test setup (a combo of #9956 and #9959).
warner added a commit that referenced this issue Aug 26, 2024
This rewrites scanForDeadObjects(), which is called during
dispatch.bringOutYourDead to process possiblyDeadSet and
possiblyRetiredSet. The new flow should be easier to review and
understand.

The main behavioral difference is to fix a bug (#9939) in which a vref
that appears in possiblyRetiredSet (because e.g. a weak collection was
deleted, which was using that vref as a key), but which 1: lacks a RAM
pillar (Presence object) and 2: was not dropped in this BOYD (e.g. it
has a vdata pillar), used to be sent to the kernel in a bogus
`syscall.retireImports()` call. Because this vref was not previously
dropped by the vat (syscall.dropImports()), this was a vat-fatal
error.

The new code will only retire such a Presence vref if it was not
reachable by the vat.

fixes #9939
@michaelfig
Copy link
Member

@warner the above commit that landed on master (064ff1a) is the one I'd like you to revert while closing this issue (#9939).

mergify bot pushed a commit that referenced this issue Aug 30, 2024
These tests will fail without the fixes in the next commit

One new test is disabled because of additional pending bugs that
interfere with the test setup (a combo of #9956 and #9959), which will
be re-enabled in PR #9961 (to address #8756 and others)
@mergify mergify bot closed this as completed in #9942 Aug 30, 2024
mergify bot pushed a commit that referenced this issue Aug 30, 2024
This rewrites scanForDeadObjects(), which is called during
dispatch.bringOutYourDead to process possiblyDeadSet and
possiblyRetiredSet. The new flow should be easier to review and
understand.

The main behavioral difference is to fix a bug (#9939) in which a vref
that appears in possiblyRetiredSet (because e.g. a weak collection was
deleted, which was using that vref as a key), but which 1: lacks a RAM
pillar (Presence object) and 2: was not dropped in this BOYD (e.g. it
has a vdata pillar), used to be sent to the kernel in a bogus
`syscall.retireImports()` call. Because this vref was not previously
dropped by the vat (syscall.dropImports()), this was a vat-fatal
error.

The new code will only retire such a Presence vref if it was not
reachable by the vat.

The new tests are marked as expected to pass again.

thanks @mhofman and @gibson042 for recommendations

fixes #9939
mergify bot pushed a commit that referenced this issue Aug 30, 2024
This reverts commit 064ff1a.

Now that the underlying issue is fixed, we can re-enable this
formerly-flaky test. Thanks @michaelfig for your patience.
mergify bot added a commit that referenced this issue Aug 30, 2024
Rewrite scanForDeadObjects(), which is called during
dispatch.bringOutYourDead to process possiblyDeadSet and
possiblyRetiredSet. The new flow should be easier to review and
understand.

The main behavioral difference is to fix a bug (#9939) in which a vref
that appears in possiblyRetiredSet (because e.g. a weak collection was
deleted, which was using that vref as a key), but which 1: lacks a RAM
pillar (Presence object) and 2: was not dropped in this BOYD (e.g. it
has a vdata pillar), used to be sent to the kernel in a bogus
`syscall.retireImports()` call. Because this vref was not previously
dropped by the vat (syscall.dropImports()), this was a vat-fatal
error.

The new code will only retire such a Presence vref if it was not
reachable by the vat.

fixes #9939
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
liveslots requires vat-upgrade to deploy changes SwingSet package: SwingSet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants