
add runPolicy object to controller.run() #3460

Closed
warner opened this issue Jul 9, 2021 · 3 comments · Fixed by #3580
Labels
enhancement New feature or request SwingSet package: SwingSet

Comments

@warner
Member

warner commented Jul 9, 2021

What is the Problem Being Solved?

The host applications that use swingset break up their computation into "blocks". Each block finishes with a state commitment to durable storage, followed by the release of all embargoed outbound messages. The externally-visible latency of a message is lower-bounded by the time it takes to perform all the computation in a block ("P").

Solo machines are free to end a block any time they like. Consensus machines must end the block the same way on all validator nodes, and additionally perform significant non-swingset work for each block (e.g. they run a consensus algorithm over the contents and consequences of the block). This extra work takes time: the default Cosmos SDK settings cause 6 seconds of voting to occur between each stretch of functional transaction-processing time (P), leading to a block time (and minimum latency) of 6+P. Cosmos/Tendermint does not currently make it easy to do any transaction processing during the voting time, leading to a CPU utilization of P/(6+P). P is therefore a tunable parameter which trades throughput off against latency.
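To make the tradeoff concrete, here is the utilization arithmetic from the paragraph above as a small (illustrative, not part of any SwingSet API) helper:

```javascript
// Utilization = P / (voting + P), per the formula above. A larger P buys
// more throughput at the cost of a longer block time (voting + P), which
// is also the minimum externally-visible latency.
function utilization(P, votingSeconds = 6) {
  return P / (votingSeconds + P);
}

// P = 2s: 25% utilization, 8-second blocks.
// P = 18s: 75% utilization, but 24-second blocks.
const lowLatency = utilization(2);
const highThroughput = utilization(18);
```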

To achieve whatever target tradeoff the machine operator chooses, we'd like to stop processing cranks when their cumulative runtime has (roughly) reached P. However, wallclock runtime is not a deterministic function of the machine state (it depends upon CPU speed, among dozens of other uncontrolled factors), so a consensus-based swingset cannot use it to make this decision. (A solo machine can and should, though.) While we cannot measure wallclock time, we do get metering data for each crank, and we can feed this into an externally-developed model to estimate what the elapsed wallclock time would be on the slowest acceptable validator. When the model tells us that we've probably reached the target P time, we stop running cranks.

Currently, cosmic-swingset crudely approximates this model by simply running at most 1000 cranks before ending the block, by calling controller.step() up to 1000 times. We want to replace this with a controller.run(runPolicy) invocation. This runPolicy object can incorporate the model and tell SwingSet when to stop.

Description of the Design

controller.run(runPolicy) delegates directly to kernel.run(runPolicy). The runPolicy is fed information about each delivery, just after it finishes execution. For now, we'll just give it the metering results (computrons consumed) in a call to runPolicy.deliveryComplete(computrons). The return value will be a boolean: true to keep going, false to stop. controller.run checks the policy after each delivery and exits the loop when it says stop, or when there is no more work left to do.
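The loop described above might look like the following sketch. All names here are illustrative, not the actual kernel internals, and the real loop is async; `doDelivery` stands in for executing one crank.

```javascript
// Sketch of the kernel.run(runPolicy) loop: execute deliveries until the
// run-queue is empty or the policy says stop. `doDelivery` is a hypothetical
// stand-in that returns the computrons consumed by one crank, or null when
// there is no more work to do.
function run(runPolicy, doDelivery) {
  let cranks = 0;
  for (;;) {
    const computrons = doDelivery();
    if (computrons === null) break; // no more work left to do
    cranks += 1;
    // Consult the policy after each delivery; false means stop.
    if (!runPolicy.deliveryComplete(computrons)) break;
  }
  return cranks; // number of deliveries performed this run
}
```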

The current 1000-crank behavior will be replaced with a runPolicy that simply counts deliveryComplete invocations. Once a suitable model (#3459) is derived from the testnet slogfile corpus, we'll switch to a more sophisticated policy object that watches the cumulative computron count and is configured with a target P time.
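Both policies described above can be sketched in a few lines (these are illustrations of the shape, not the shipped implementations; computrons are shown as plain numbers for simplicity):

```javascript
// Reproduces the old behavior: stop after at most maxCranks deliveries.
function makeCrankCounterPolicy(maxCranks = 1000) {
  let cranks = 0;
  return {
    deliveryComplete(_computrons) {
      cranks += 1;
      return cranks < maxCranks; // true: keep going
    },
  };
}

// The more sophisticated shape: stop once a cumulative computron budget
// (derived from the target P time by the externally-developed model) is spent.
function makeComputronBudgetPolicy(budget) {
  let used = 0;
  return {
    deliveryComplete(computrons) {
      used += computrons;
      return used < budget;
    },
  };
}
```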

A solo machine will use a runPolicy that gets to look at a real clock, and simply runs until a target wallclock time is reached. Consensus machines must use a deterministic runPolicy (and the configured target P must also be part of consensus, perhaps controlled by some kind of governance mechanism).
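A solo-machine wallclock policy might look like this sketch (the clock is injectable here only so the example is testable; a real solo policy would just consult `Date.now`):

```javascript
// Run until a wallclock deadline. Only valid on a solo machine: a
// consensus machine cannot base this decision on a real clock.
function makeWallclockPolicy(seconds, now = Date.now) {
  const deadline = now() + seconds * 1000;
  return {
    deliveryComplete(_computrons) {
      return now() < deadline; // keep going until the deadline passes
    },
  };
}
```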

Security Considerations

Test Plan

In addition to unit tests that show the kernel respecting the policy's decisions, we also want to build a simulator. This simulator should take a policy object and a slogfile-derived table of all the deliveries that took place on a testnet run. From this, we want to see how the cranks would have been broken up into blocks if the chain had been using that policy, and then look at metrics like average block time and externally-visible latency.

I'm not sure we have enough information to actually build that simulator, though:

  • We don't record solo-to-chain delivery events or their times (however everything coming out of vat-vattp is the direct result of such a delivery, so we can infer their timing, at least to within the block time)
  • For any given operation (AMM trade, etc), the first solo-to-client message was driven by the solo machine, but all the subsequent messages will be reactions to things happening on the chain. We can model different chain behaviors, which result in different block timing, but those simulations won't accurately model how the solo machine would have responded to the results coming back at different times. A proper simulator would need to figure out which inbound deliveries are actually responses to new blocks, and adjust their delivery times to match.

However, spontaneous activity (such as a timer wakeup event triggering block rewards), where no external machine is immediately interacting with the chain as a result of that activity, can be modeled accurately.
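The core of the simulator described above could be sketched as a replay loop that feeds a slogfile-derived delivery table through a candidate policy and records the resulting block boundaries (a hypothetical sketch; the delivery shape and `makePolicy` factory are assumptions):

```javascript
// Replay a table of deliveries through a policy and group them into the
// blocks that policy would have produced. A fresh policy instance is made
// for each block, since policy state (crank counts, budgets) resets at
// block boundaries.
function simulateBlocks(deliveries, makePolicy) {
  const blocks = [];
  let current = [];
  let policy = makePolicy();
  for (const delivery of deliveries) {
    current.push(delivery);
    if (!policy.deliveryComplete(delivery.computrons)) {
      blocks.push(current); // policy said stop: end the block here
      current = [];
      policy = makePolicy();
    }
  }
  if (current.length) blocks.push(current); // trailing partial block
  return blocks;
}
```

From the resulting `blocks` array, metrics like average block time and externally-visible latency could then be derived, subject to the caveats above about inferring solo-machine response timing.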

@warner
Member Author

warner commented Jul 30, 2021

My current API uses a runPolicy object with separate methods, as described here:

https://github.com/Agoric/agoric-sdk/blob/3460-run-policy/packages/SwingSet/docs/run-policy.md

I'm wondering about extensibility, though: how to let the app-provided runPolicy object keep working well enough when the kernel is updated (and has more information to provide). If we have more data about existing events (e.g. we start to report syscall counts along with each crankCompleted), that's easy enough to add to an options bag, which old policy objects will ignore. But if we add new event types, the current API would want to invoke a missing method, which is kinda awkward to do in a backwards-compatible way.

I'm wondering what experience other folks have here. I could change the runPolicy API to have a single function (or maybe be a single function) which receives an array whose first element is an event-type string, with instructions for implementers to ignore any string they don't understand. Or I could keep using distinct methods but catch and ignore any TypeErrors (seems bad), or do a preemptive Object.getOwnPropertyDescriptor check (seems unwise).
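The single-function alternative mentioned above might look like this sketch (event names and shape are hypothetical, illustrating the "ignore unknown strings" convention):

```javascript
// A policy as a single function receiving [eventType, ...details].
// Unknown event types are ignored (keep going), so an old policy keeps
// working when a newer kernel emits events it has never heard of.
function makeEventPolicy(maxCranks) {
  let cranks = 0;
  return ([eventType, ..._details]) => {
    switch (eventType) {
      case 'deliveryComplete': {
        cranks += 1;
        return cranks < maxCranks;
      }
      default:
        return true; // forward compatibility: ignore unrecognized events
    }
  };
}
```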

@dckc
Member

dckc commented Jul 30, 2021

...

I'm wondering about extensibility, though: how to let the app-provided runPolicy object keep working well enough when the kernel is updated (and has more information to provide).

I'm not aware of any requirement to do so.

cosmic-swingset and solo are the only clients; if/when we need to change them, we can change them, no?

I'm not a fan of trying to predict the future. Let's not try to generalize until we have 2 or 3 examples of extending the API.

@dckc
Member

dckc commented Jul 30, 2021

...

I'm wondering what experience other folks have here. I could change the runPolicy API to have a single function (or maybe be a single function) which receives an array whose first element is an event-type string, with instructions for implementers to ignore any string they don't understand.

No, let's avoid stringly-typed stuff, please. Let's get all the help we feasibly can from static checks.

warner added a commit that referenced this issue Aug 2, 2021
This allows the host application to control how much work `c.run()` does
before it returns. Depending upon the policy chosen, this can be a count of
cranks, or cumulative computrons used, with more details to be added in the
future.

closes #3460
warner added a commit that referenced this issue Aug 3, 2021