This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Allow separate max-heap-pages for offchain (non-consensus) execution context. #8885

Closed
wants to merge 34 commits into from

Conversation

@kianenigma (Contributor) commented May 22, 2021

  • The previous ExecutionStrategy is now wrapped in ExecutionConfig, which pairs the strategy with an ExecutionContext. This propagated into a lot of strategy -> config replacements; I like the outcome. (A rough sketch of the resulting types follows this list.)

  • polkadot companion: Companion for /substrate/pull/8885 polkadot#3378
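
To make the shape of the change concrete, here is a minimal sketch of the types described above, assuming the names used in this PR; the exact fields, variants and derives are illustrative, not the final diff:

// Sketch only: the strategy enum already exists in the client; shown here for context.
#[derive(Copy, Clone, Eq, PartialEq, Debug)]
pub enum ExecutionStrategy {
    NativeWhenPossible,
    AlwaysWasm,
    Both,
    NativeElseWasm,
}

// Sketch: the execution context distinguishing consensus-critical calls from
// offchain (non-consensus) ones, each of which can carry its own heap-pages default.
#[derive(Copy, Clone, Eq, PartialEq, Debug)]
pub enum ExecutionContext {
    Consensus,
    Offchain,
}

// Sketch: the old `ExecutionStrategy` wrapped together with a context.
#[derive(Copy, Clone, Eq, PartialEq, Debug)]
pub struct ExecutionConfig {
    pub strategy: ExecutionStrategy,
    pub context: ExecutionContext,
}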

@github-actions github-actions bot added the A3-in_progress Pull request is in progress. No review needed at this stage. label May 22, 2021
@kianenigma kianenigma requested review from bkchr and tomusdrw May 22, 2021 21:52
@kianenigma kianenigma changed the title Allow separate max-heap-pages for offchain execution context. Allow separate max-heap-pages for offchain (non-consensus) execution context. May 22, 2021
@bkchr (Member) commented May 23, 2021

We cache the wasm instances based on the wasm method, runtime version and the heap pages. Meaning we probably need to increase the cache size, which also leads to a higher memory usage.

(Just a note on something we should not forget.)
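
For context, a rough sketch of the kind of keying described above; the names are purely illustrative, and as the diff later in this PR shows, the real cache in sc-executor is split per code context rather than keyed in a single map:

use std::{collections::HashMap, sync::Arc};

// Illustrative only: the properties that distinguish one cached wasm runtime
// from another. Introducing a second heap-pages value for offchain execution
// adds new distinct keys, hence more instantiated runtimes held in memory.
#[derive(Clone, PartialEq, Eq, Hash)]
struct RuntimeCacheKey {
    wasm_method: String,   // e.g. "compiled" vs "interpreted"
    runtime_version: u32,  // stands in for the full runtime version
    heap_pages: u64,
}

// Placeholder for an instantiated runtime; each cached entry costs real memory.
type CachedRuntime = Arc<Vec<u8>>;

struct RuntimeCache {
    entries: HashMap<RuntimeCacheKey, CachedRuntime>,
}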

@kianenigma kianenigma marked this pull request as ready for review May 25, 2021 09:24
@kianenigma kianenigma requested a review from andresilva as a code owner May 25, 2021 09:24
@github-actions github-actions bot added A0-please_review Pull request needs code review. and removed A3-in_progress Pull request is in progress. No review needed at this stage. labels May 25, 2021
@kianenigma (Contributor, Author):

This might not be needed, in favour of #8893.

@gavofyork gavofyork added B5-clientnoteworthy C3-medium PR touches the given topic and has a medium impact on builders. D5-nicetohaveaudit ⚠️ PR contains trivial changes to logic that should be properly reviewed. labels May 25, 2021
@kianenigma (Contributor, Author):

@cheme @bkchr this can be given a look now.

@kianenigma (Contributor, Author):

I'm kinda lost and uncertain if this PR is really needed.

This only adds support for setting the offchain heap pages on-chain, after which it can be used. Deploying this today would not help at all unless we also set that value. And setting that value has its own caveats: once it is set on-chain, we cannot replace it again via a client upgrade.

@cheme (Contributor) commented Jun 28, 2021

I'm kinda lost and uncertain if this PR is really needed.

This only adds support for setting the offchain heap pages on-chain, after which it can be used. Deploying this today would not help at all unless we also set that value. And setting that value has its own caveats: once it is set on-chain, we cannot replace it again via a client upgrade.

I am not sure what the plans are for heap pages on-chain, but if it stays, having an offchain one seems OK to me.
Generally I think of offchain as a default execution context.
I mean, if I were a validator I would like to be able to switch it at any time, so I would probably hack up a way to override it.
Maybe ExecutionContext could contain a heap-page override value from a CLI parameter (we have that for the runtime (wasm-runtime-override), so it would also make sense for the heap-page value).
We could also probably allow overriding the runtime only for offchain, which could help in some cases (when you don't want to risk breaking consensus with a custom wasm build).

@@ -626,22 +626,24 @@ pub enum Profile {
 }

 impl Profile {
-	fn into_execution_strategies(self) -> ExecutionStrategies {
+	fn into_execution_configs(self) -> ExecutionConfigs {
@cheme (Contributor) suggested a change:

-	fn into_execution_configs(self) -> ExecutionConfigs {
+	fn into_offchain_execution_configs(self) -> ExecutionConfigs {

@kianenigma (Contributor, Author):

why?

@cheme (Contributor) commented Jul 8, 2021:

The code is only calling new_offchain, so this is just being more precise.

@cheme (Contributor):

Actually shouldn't it be new_offchain only for 'offchain_worker'?

frame/system/src/lib.rs (review thread outdated, resolved)
primitives/state-machine/src/backend.rs (review thread outdated, resolved)

/// The context of the execution.
#[derive(Copy, Clone, Eq, PartialEq, Debug)]
pub enum ExecutionContext {
@cheme (Contributor):

There is an existing ExecutionContext enum in sp_core, so I would suggest renaming this one (but I have no naming idea).

@kianenigma (Contributor, Author):

Yeah, I noticed that as well, but I can't find a better alternative for now.

primitives/state-machine/src/lib.rs (review thread outdated, resolved)
primitives/state-machine/src/lib.rs (review thread outdated, resolved)
primitives/storage/src/lib.rs (review thread resolved)
test-utils/client/src/lib.rs (review thread outdated, resolved)
kianenigma and others added 2 commits July 2, 2021 09:43
Co-authored-by: cheme <emericchevalier.pro@gmail.com>
@kianenigma (Contributor, Author):

[…] memory usage of a node running your branch vs master? Wasmtime memory usage per instance is quite high AFAIR.

Yeah, I'll check on the burn-in. Note that now we create ONLY one additional instance with this higher number of heap pages.

@kianenigma (Contributor, Author):

Okay, I've done the burn-in. Two observations:

  1. One wasm instance for OCW is too little and will cause a lot of warnings around each election, because the election OCW thread runs for longer than a typical block time. Instead, what we want is to set the instances to 2, where the second one acts as a backup for the duration of the election.

  2. With 1 wasm instance, I could not see any observable memory increase on our burn-in nodes. I'm doing a new burn-in now with 2 instances, and I hope the answer is still the same.

@kianenigma (Contributor, Author) commented Aug 6, 2021

Burn-in with two instances for offchain:
[screenshot: Screenshot 2021-08-06 at 12 02 07]

@cheme (Contributor) left a review:

Changes look good; I am just not sure whether the additional memory usage is fine, but otherwise I approve.

Generally I would find it good for the offchain worker to be able to configure the maximum number of instances and the number of running instances, and to override the page size from the CLI, but that is a separate question. It would make it easier to handle memory issues in the future.

Also, the Offchain context currently only changes the heap size (and the pool). Leaving the pool aside, this could also just have been an 'Override' or 'Custom' heap-size option that offchain specifically uses with the executor, which would make the changes smaller, but I guess it is good to distinguish the offchain case (using a different wasm for offchain could be nice, but that is also a separate question).


 	let exec = &self.execution_strategies;
-	let exec_all_or = |strat: Option<ExecutionStrategy>, default: ExecutionStrategy| {
+	let exec_all_or = |start: Option<ExecutionStrategy>, default: ExecutionStrategy| {
@cheme (Contributor):

Why the renaming? ('strat' was for strategy I guess)

@@ -262,12 +269,13 @@ impl sp_core::traits::ReadRuntimeVersion for WasmExecutor {
 		true,
 		"Core_version",
 		&[],
+		DEFAULT_HEAP_PAGES_CONSENSUS,
@cheme (Contributor):

Could it be hardcoded to some lower value (it is just reading a version)? Or maybe that is the worst idea, as the instance could then not be reused 🤔

max_consensus_runtime_instances: usize,
/// Same as `consensus_runtimes`, but used in offchain code context. Their size is always bound
/// by 1 (can easily be configurable, but we don't need this right now).
offchain_runtimes: Mutex<[Option<Arc<VersionedRuntime>>; MAX_RUNTIMES]>,
@cheme (Contributor):

I guess offchain is still using the same wasm as the runtime, so we can now use twice as much memory for instantiated wasm. I guess that is fine.

@cheme (Contributor):

Could it make sense to have a MAX_RUNTIMES_OFFCHAIN? Did we consider only one runtime with two instances?

@cheme (Contributor):

Afterthought: one instance would be really bad, with im-online potentially stuck behind phragmen.

// this must be released prior to calling f.
let mut runtimes = match runtime_code.context {
sp_core::traits::CodeContext::Consensus => self.consensus_runtimes.lock(),
sp_core::traits::CodeContext::Offchain => self.offchain_runtimes.lock(),
@cheme (Contributor):

So offchain cannot use a consensus instance; since the heap pages differ, I guess it would not be a good idea to try anyway. I do like that there is a clean separation.

-	let version = WasmOverride::runtime_version(&executor, &wasm, Some(128))
-		.expect("should get the `RuntimeVersion` of the test-runtime wasm blob");
+	let version =
+		WasmOverride::runtime_version(&executor, &wasm, Some(128), CodeContext::Consensus)
@cheme (Contributor):

Does the wasm override apply to offchain? If so, will it run with the consensus context?

/// Default num of pages for the heap of the runtime, when used for offchain operations.
///
/// 256mb per instance.
const DEFAULT_HEAP_PAGES_OFFCHAIN: u64 = DEFAULT_HEAP_PAGES_CONSENSUS * 2;
@cheme (Contributor):

Having a bigger requirement for offchain is a bit awkward. Most offchain processes should be light (I am only aware of im-online, and its requirements are certainly small), so we use a big memory profile mostly for the offchain phragmen. This makes me think the offchain worker's default heap pages could depend on the offchain method being called; the same applies if we consider overriding from the CLI. (I am not sure the pool allows switching instances depending on heap pages, but we could also use a single-use instance for this special case.)
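
For reference, the page arithmetic behind the constant above, as a small sketch (64 KiB per wasm page; a consensus default of 2048 pages is assumed here purely to match the "256mb per instance" doc comment):

const WASM_PAGE_SIZE: u64 = 64 * 1024;          // 64 KiB per wasm page
const DEFAULT_HEAP_PAGES_CONSENSUS: u64 = 2048; // assumed: 2048 * 64 KiB = 128 MiB
const DEFAULT_HEAP_PAGES_OFFCHAIN: u64 = DEFAULT_HEAP_PAGES_CONSENSUS * 2;

fn main() {
    let offchain_heap_mib = DEFAULT_HEAP_PAGES_OFFCHAIN * WASM_PAGE_SIZE / (1024 * 1024);
    // 4096 pages * 64 KiB = 256 MiB, matching the doc comment on the constant.
    assert_eq!(offchain_heap_mib, 256);
    println!("offchain heap budget: {} MiB per instance", offchain_heap_mib);
}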

@stale (bot) commented Sep 5, 2021

Hey, is anyone still working on this? Due to the inactivity this issue has been automatically marked as stale. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the A5-stale Pull request did not receive any updates in a long time. No review needed at this stage. Close it. label Sep 5, 2021
@kianenigma (Contributor, Author):

we will probably kill this in favour of whatever @bkchr is cooking.

@stale stale bot removed the A5-stale Pull request did not receive any updates in a long time. No review needed at this stage. Close it. label Sep 7, 2021
@stale (bot) commented Oct 7, 2021

Hey, is anyone still working on this? Due to the inactivity this issue has been automatically marked as stale. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the A5-stale Pull request did not receive any updates in a long time. No review needed at this stage. Close it. label Oct 7, 2021
@bkchr (Member) commented Oct 7, 2021

What @kianenigma said :P

@bkchr bkchr closed this Oct 7, 2021
@bkchr bkchr deleted the kiz-ocw-max-heap-pages branch October 7, 2021 09:49
@kianenigma (Contributor, Author):

Wen dynamic heap pages @bkchr?

@bkchr (Member) commented Oct 12, 2021

++++++++++

@kianenigma (Contributor, Author):

honestly, I still come back to this and think it is a good interim feature to have. WDYT @paritytech/sdk-node?

@bkchr (Member) commented Apr 19, 2022

🙈

@koute (Contributor) commented Apr 20, 2022

honestly, I still come back to this and think it is a good interim feature to have. WDYT @paritytech/sdk-node?

The heap_pages parameter doesn't actually change the maximum number of pages the runtime's memory is allowed to grow to, but the minimum number of pages which will be allocated by default. Basically the original code did this ("did" because I've merged steps 1 and 2 in a subsequent refactoring, but the behavior's still the same; also, I'm talking about the wasmtime executor, not sure if others have exactly the same behavior):

  1. Instantiate a new runtime with whatever default number of pages is specified inside the WASM blob. (This can be changed when the runtime is compiled by passing an appropriate flag to the linker, so in theory it can be anything, although in our case we don't change it.)
  2. Grow the memory by the extra heap_pages specified. Now we have N + heap_pages pages preallocated for our memory.
  3. (N + heap_pages) * PAGE_SIZE - __heap_base bytes of memory are now available for any subsequent heap allocation (where __heap_base is also determined at compile time so it can also be anything, and PAGE_SIZE is 64K).

Just for your reference, here's the memory map of a current-ish Kusama runtime I have laying around:

0x000000 - 0x100000: stack
0x100000 - 0x159778: rodata
0x159778 - 0x159910: data
0x159910 - 0x15991c: bss
0x15991c - 0x159920: padding
0x159920 - 0x160000: heap

So basically any dynamic allocation made inside of the runtime is served by that heap chunk at the end, and the heap_pages enlarges that chunk by heap_pages * PAGE_SIZE.
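
Plugging the memory map above into the formula from step 3, as a small sketch (the initial linear-memory size of 0x160000 bytes, i.e. 22 wasm pages, is read off the end of the map; the 2048 extra heap pages is just an assumed example value):

const PAGE_SIZE: u64 = 64 * 1024; // 64 KiB wasm page

fn available_heap_bytes(initial_pages: u64, heap_pages: u64, heap_base: u64) -> u64 {
    (initial_pages + heap_pages) * PAGE_SIZE - heap_base
}

fn main() {
    let initial_pages = 0x160000 / PAGE_SIZE; // 22 pages in the map above
    let heap_base = 0x159920;                 // __heap_base from the map above

    let bytes = available_heap_bytes(initial_pages, 2048, heap_base);
    // Roughly 128 MiB of heap available for runtime allocations in this example.
    println!("available heap: {} bytes (~{} MiB)", bytes, bytes / (1024 * 1024));
}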

Now, our runtime doesn't do this, but the heap can already be dynamically grown from within the runtime (using this intrinsic) and from what I can see our allocator supports this! And before you ask, yes, there is actually a limit: the max_memory_size specified in the sc_executor_wasmtime::Config struct caps how big the memory can grow, however for the normal runtime this is None, so essentially you could (in theory) allocate as much memory as you want. (And for PVFs this is capped at 2080 pages.)

Of course this is not very convenient, since it's not automatic. The allocator host function will panic outside of the runtime when running out of memory instead of automatically growing the heap, and you need to grow the heap from within the runtime, so essentially you'd have to either blindly grow the memory or somehow know that the next allocation will fail. We could easily modify the allocator to automatically try to grow the memory if we wanted to though.

Wen dynamic heap pages @bkchr?

TLDR: Technically it's already here due to how the system works, although it's not very convenient, and it's debatable whether it's intended to be actually used or not... 🤷

@bkchr (Member) commented Apr 20, 2022

Point here being that I have a branch that has all this already prepared. I just need to clean it up and fix two tests 🙈

@kianenigma (Contributor, Author):

@koute thanks for the explainer, and I think you basically unwrapped what basti has implemented in his head and WIP branch.

Please keep me up to date with further changes regarding this, and do bear in mind that allowing some OCWs to allocate more memory (both DEFAULT_HEAP_PAGES and MAX_POSSIBLE_ALLOCATION, both of which are somewhat arbitrary constants in our codebase dictated to the whole world) would be a huge advantage.

If you happen to know another workaround to achieve this in the short term, please do let me know.

@kianenigma (Contributor, Author):

FYI, we're still running into this on a semi-random basis in the offchain workers of Westend, in a rather unpredictable fashion: https://grafana.parity-mgmt.parity.io/goto/DF1cyNX7z?orgId=1

@louismerlin louismerlin removed the D5-nicetohaveaudit ⚠️ PR contains trivial changes to logic that should be properly reviewed. label Oct 13, 2022