Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add kernel API to restart all vats at this/next startup #8954

Open
warner opened this issue Feb 20, 2024 · 0 comments
Open

add kernel API to restart all vats at this/next startup #8954

warner opened this issue Feb 20, 2024 · 0 comments
Labels
enhancement New feature or request SwingSet package: SwingSet

Comments

@warner
Copy link
Member

warner commented Feb 20, 2024

What is the Problem Being Solved?

To restart all vats at kernel reboot time (e.g. to deploy a new version of xsnap), we need an API to drive the restart. The kernel needs to know, at startup time, but before preloading any vat workers, that all vats should be restarted, using the latest platform facilities (xsnap, SES/Endo/lockdown bundle, liveslots bundle) as present in the kernel source tree at that time.

The host application will drive this: it needs to know that it is being restarted with facilities/bundles that are sufficiently-incompatible to warrant the cost/disruption/trauma of restarting all vats (the kernel cannot figure this out for itself, or at least we don't yet want the kernel to make this decision on its own, given the disruption it will cause).

So we need a way for the host application to inform the kernel, early, that this restart is needed. And we need to make sure it only happens when needed, not e.g. on every reboot after that point.

Description of the Design

I'm not sure yet. The two main approaches would be:

  • 1: Add a new state-changing API to the controller object, tentatively named controller.restartAllVats(). This is meant to be called just after the controller is created (with makeSwingsetController()), but any controller.run() calls.
  • 2: Add an option to makeSwingsetController() which means "please do this restart on this startup"

and/or changing the host app's responsibilities to add a call to a new controller.start(), after createVatController() but before any calls to controller.run().

The problem is that vat preload currently happens during the kernel.start() call that makeSwingsetController() does, and we need to restart all the vats before we do this preload. We could remove the preload, but that would add surprising stalls at random times in the future, like the next block that uses Zoe somehow. We could defer the preload to the first controller.run(), which might still happen during the chain upgrade reboot, but would be an odd API (not really a "preload" at that point). We could also add an option to makeSwingsetController() that disables the preload, but then A: when should the preload happen, B: how do we tell the controller when the preload should happen, and C: if we're adding both an option (to disable preload) and an API call (to trigger the upgrade-all-vats), that's double the work (for host-app authors), and an opportunity for failure (asking for upgrade-all-vats but leaving preload enabled).

My concern with using a makeSwingsetController() option is that I want upgrade-all-vats to be an event, not a state. We expect this to be triggered by specific host-app upgrades (as in "upgrade-16 requires a restart-all-vats", so the cosmic-swingset upgrade-16 handler does something specific, exactly once, the very first time it starts). Expressing this with an option requires the host app to track some sort of flag to remind itself that the restart-all-vats has already been done, and I imagine it would make it too easy to accidentally leave the option set all the time.

Introducing a new controller.start() would enable a better API. The host app would then either do:

controller = await makeSwingsetController();
await controller.start();

or:

controller = await makeSwingsetController();
controller.restartAllVats();
await controller.start();

OTOH, depending upon how cosmic-swingset is organized, that conditional call might have to be tracked with the same kind of reminder flag that I was trying to avoid.

Security Considerations

This API is only available to the host environment, not to vats, so the disruption/upgrade trauma it causes can only be triggered by code which already has the power to cause problems anyways.

Scaling Considerations

Restarting all vats is obviously O(N) in the number of defined vats, and the work to be done includes the ZCF startVat (buildRootObject), and all contract startup/restart code. We can use historical data to measure how long this takes, but I think it was on the order of a few seconds per vat.

Currently this is about N=100, but over time we expect to have far more vats, and this may cease to be a viable approach.

If we implement "suspended vats" (as described in an #8405 comment, but TODO update this link when it gets its own ticket), then this becomes O(N_not_suspended), which may be a lot smaller. Also, we can arrange to suspend all vats at startup, and then only do immediate upgrade/unsuspend of the preloaded vats (accepting the suprising stall for other vats at random times in the future, as someone talks to them). This would reduce the cost down O(N_preloaded_vats), which is probably a smaller (and fixed) set: Zoe, wallet, bridge, the support vats, but only the most important contract vats.

Test Plan

Kernel unit tests to exercise the new API. Some sort of cosmic-swingset test to exercise invoking it (maybe an a3p upgrade step).

If we introduce a new (mandatory) controller.start(), then we'll need to update a lot of tests, as it represents a breaking change to the kernel API (so host applications must change too, like cosmic-swingset).

Upgrade Considerations

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request SwingSet package: SwingSet
Projects
None yet
Development

No branches or pull requests

1 participant