
add runPolicy object to controller.run() #3460

Closed
warner opened this issue Jul 9, 2021 · 3 comments · Fixed by #3580
Labels
enhancement New feature or request SwingSet package: SwingSet

Comments

@warner
Member

warner commented Jul 9, 2021

What is the Problem Being Solved?

The host applications that use swingset break up their computation into "blocks". Each block finishes with a state commitment to durable storage, followed by the release of all embargoed outbound messages. The externally-visible latency of a message is lower-bounded by the time it takes to perform all the computation in a block ("P").

Solo machines are free to end a block any time they like. Consensus machines must end the block the same way on all validator nodes, and additionally perform significant non-swingset work for each block (e.g. they run a consensus algorithm over the contents and consequences of the block). This extra work takes time: the default Cosmos SDK settings cause 6 seconds of voting to occur between each stretch of functional transaction-processing time (P), leading to a block time (and minimum latency) of 6+P. Cosmos/Tendermint does not currently make it easy to do any transaction processing during the voting time, leading to a CPU utilization of P/(6+P). P is therefore a tunable parameter which trades throughput off against latency.
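To make the tradeoff concrete, here is the utilization arithmetic from the paragraph above as a small (illustrative, not part of any SwingSet API) helper:

```javascript
// Utilization = P / (voting + P), per the formula above. A larger P buys
// more throughput at the cost of a longer block time (voting + P), which
// is also the minimum externally-visible latency.
function utilization(P, votingSeconds = 6) {
  return P / (votingSeconds + P);
}

// P = 2s: 25% utilization, 8-second blocks.
// P = 18s: 75% utilization, but 24-second blocks.
const lowLatency = utilization(2);
const highThroughput = utilization(18);
```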

To achieve whatever target tradeoff the machine operator chooses, we'd like to stop processing cranks when their cumulative runtime has (roughly) reached P. However, wallclock runtime is not a deterministic function of the machine state (it depends upon CPU speed, among dozens of other uncontrolled factors), so a consensus-based swingset cannot use it to make this decision. (A solo machine can and should, though.) While we cannot measure wallclock time, we do get metering data for each crank, and we can feed this into an externally-developed model to estimate what the elapsed wallclock time would be on the slowest acceptable validator. When the model tells us that we've probably reached the target P time, we stop running cranks.

Currently, cosmic-swingset crudely approximates this model by simply running at most 1000 cranks before ending the block, by calling controller.step() up to 1000 times. We want to replace this with a controller.run(runPolicy) invocation. This runPolicy object can incorporate the model and tell SwingSet when to stop.

Description of the Design

controller.run(runPolicy) delegates directly to kernel.run(runPolicy). The runPolicy is fed information about each delivery, just after it finishes execution. For now, we'll just give it the metering results (computrons consumed) in a call to runPolicy.deliveryComplete(computrons). The return value will be a boolean: true to keep going, false to stop. controller.run checks the policy after each delivery and exits the loop when it says stop, or when there is no more work left to do.
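The loop described above might look like the following sketch. All names here are illustrative, not the actual kernel internals, and the real loop is async; `doDelivery` stands in for executing one crank.

```javascript
// Sketch of the kernel.run(runPolicy) loop: execute deliveries until the
// run-queue is empty or the policy says stop. `doDelivery` is a hypothetical
// stand-in that returns the computrons consumed by one crank, or null when
// there is no more work to do.
function run(runPolicy, doDelivery) {
  let cranks = 0;
  for (;;) {
    const computrons = doDelivery();
    if (computrons === null) break; // no more work left to do
    cranks += 1;
    // Consult the policy after each delivery; false means stop.
    if (!runPolicy.deliveryComplete(computrons)) break;
  }
  return cranks; // number of deliveries performed this run
}
```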

The current 1000-crank behavior will be replaced with a runPolicy that simply counts deliveryComplete invocations. Once a suitable model (#3459) is derived from the testnet slogfile corpus, we'll switch to a more sophisticated policy object that watches the cumulative computron count and is configured with a target P time.
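Both policies described above can be sketched in a few lines (these are illustrations of the shape, not the shipped implementations; computrons are shown as plain numbers for simplicity):

```javascript
// Reproduces the old behavior: stop after at most maxCranks deliveries.
function makeCrankCounterPolicy(maxCranks = 1000) {
  let cranks = 0;
  return {
    deliveryComplete(_computrons) {
      cranks += 1;
      return cranks < maxCranks; // true: keep going
    },
  };
}

// The more sophisticated shape: stop once a cumulative computron budget
// (derived from the target P time by the externally-developed model) is spent.
function makeComputronBudgetPolicy(budget) {
  let used = 0;
  return {
    deliveryComplete(computrons) {
      used += computrons;
      return used < budget;
    },
  };
}
```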

A solo machine will use a runPolicy that gets to look at a real clock, and simply runs until a target wallclock time is reached. Consensus machines must use a deterministic runPolicy (and the configured target P must also be part of consensus, perhaps controlled by some kind of governance mechanism).
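A solo-machine wallclock policy might look like this sketch (the clock is injectable here only so the example is testable; a real solo policy would just consult `Date.now`):

```javascript
// Run until a wallclock deadline. Only valid on a solo machine: a
// consensus machine cannot base this decision on a real clock.
function makeWallclockPolicy(seconds, now = Date.now) {
  const deadline = now() + seconds * 1000;
  return {
    deliveryComplete(_computrons) {
      return now() < deadline; // keep going until the deadline passes
    },
  };
}
```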

Security Considerations

Test Plan

In addition to unit tests that show the kernel respecting the policy's decisions, we also want to build a simulator. This simulator should take a policy object and a slogfile-derived table of all the deliveries that took place on a testnet run. From this, we want to see how the cranks would have been broken up into blocks if the chain had been using that policy, and then look at metrics like average block time and externally-visible latency.

I'm not sure we have enough information to actually build that simulator, though:

  • We don't record solo-to-chain delivery events or their times (however everything coming out of vat-vattp is the direct result of such a delivery, so we can infer their timing, at least to within the block time)
  • For any given operation (AMM trade, etc), the first solo-to-client message was driven by the solo machine, but all the subsequent messages will be reactions to things happening on the chain. We can model different chain behaviors, which result in different block timing, but those simulations won't accurately model how the solo machine would have responded to the results coming back at different times. A proper simulator would need to figure out which inbound deliveries are actually responses to new blocks, and adjust their delivery times to match.

However, spontaneous activity (such as a timer wakeup event triggering block rewards), where no external machine is immediately interacting with the chain as a result of that activity, can be modeled accurately.
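The core of the simulator described above could be sketched as a replay loop that feeds a slogfile-derived delivery table through a candidate policy and records the resulting block boundaries (a hypothetical sketch; the delivery shape and `makePolicy` factory are assumptions):

```javascript
// Replay a table of deliveries through a policy and group them into the
// blocks that policy would have produced. A fresh policy instance is made
// for each block, since policy state (crank counts, budgets) resets at
// block boundaries.
function simulateBlocks(deliveries, makePolicy) {
  const blocks = [];
  let current = [];
  let policy = makePolicy();
  for (const delivery of deliveries) {
    current.push(delivery);
    if (!policy.deliveryComplete(delivery.computrons)) {
      blocks.push(current); // policy said stop: end the block here
      current = [];
      policy = makePolicy();
    }
  }
  if (current.length) blocks.push(current); // trailing partial block
  return blocks;
}
```

From the resulting `blocks` array, metrics like average block time and externally-visible latency could then be derived, subject to the caveats above about inferring solo-machine response timing.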

@warner
Member Author

warner commented Jul 30, 2021

My current API uses a runPolicy object with separate methods, as described here:

https://github.com/Agoric/agoric-sdk/blob/3460-run-policy/packages/SwingSet/docs/run-policy.md

I'm wondering about extensibility, though: how to let the app-provided runPolicy object keep working well enough when the kernel is updated (and has more information to provide). If we have more data about existing events (e.g. we start to report syscall counts along with each crankCompleted), that's easy enough to add to an options bag, which old policy objects will ignore. But if we add new event types, the current API would want to invoke a missing method, which is kinda awkward to do in a backwards-compatible way.

I'm wondering what experience other folks have here. I could change the runPolicy API to have a single function (or maybe be a single function) which receives an array whose first element is an event-type string, with instructions for implementers to ignore any string they don't understand. Or I could keep using distinct methods but catch and ignore any TypeErrors (seems bad), or do a preemptive Object.getOwnPropertyDescriptor check (seems unwise).
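The single-function alternative mentioned above might look like this sketch (event names and shape are hypothetical, illustrating the "ignore unknown strings" convention):

```javascript
// A policy as a single function receiving [eventType, ...details].
// Unknown event types are ignored (keep going), so an old policy keeps
// working when a newer kernel emits events it has never heard of.
function makeEventPolicy(maxCranks) {
  let cranks = 0;
  return ([eventType, ..._details]) => {
    switch (eventType) {
      case 'deliveryComplete': {
        cranks += 1;
        return cranks < maxCranks;
      }
      default:
        return true; // forward compatibility: ignore unrecognized events
    }
  };
}
```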

@dckc
Member

dckc commented Jul 30, 2021

...

I'm wondering about extensibility, though: how to let the app-provided runPolicy object keep working well enough when the kernel is updated (and has more information to provide).

I'm not aware of any requirement to do so.

cosmic-swingset and solo are the only clients; if/when we need to change them, we can change them, no?

I'm not a fan of trying to predict the future. Let's not try to generalize until we have 2 or 3 examples of extending the API.

@dckc
Member

dckc commented Jul 30, 2021

...

I'm wondering what experience other folks have here. I could change the runPolicy API to have a single function (or maybe be a single function) which receives an array whose first element is an event-type string, with instructions for implementers to ignore any string they don't understand.

No, let's avoid stringly-typed stuff, please. Let's get all the help we feasibly can from static checks.

warner added a commit that referenced this issue Aug 2, 2021
This allows the host application to control how much work `c.run()` does
before it returns. Depending upon the policy chosen, this can be a count of
cranks, or cumulative computrons used, with more details to be added in the
future.

closes #3460
warner added a commit that referenced this issue Aug 3, 2021