Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Fix WASM executor without instance reuse; cleanups and refactoring #10313

Conversation

koute
Copy link
Contributor

@koute koute commented Nov 19, 2021

The current codepath for the WASM executor without instance reuse is broken. We're constantly reusing the same Store, which according to the wasmtime docs we're not supposed to do:

A Store is intended to be a short-lived object in a program. No form of GC is implemented at this time so once an instance is created within a Store it will not be deallocated until the Store itself is dropped. This makes Store unsuitable for creating an unbounded number of instances in it because Store will never release this memory.

This results in a memory leak, and after 10k calls into the runtime the executor will get permabricked and will panic on each call since it won't be able to spawn any new instances anymore. (Fortunately this is not a problem in practice since the codepath which reuses the instances is the default, and that one is unaffected.)

So this PR fixes the problem by bundling the Store along with each instance of the executor.

Also, since this code was originally written wasmtime grew a bunch of new APIs which we can use to clean up the code a little, so that's what I did. Multiple Rcs and RefCells were removed which makes it easier to reason about the lifetimes now. We also had a blanket unsafe impl Send for WasmtimeInstance {} which was also removed. Unfortunately we still need an unsafe impl Send for the sandbox store, but it's still an improvement since the impl now touches strictly less code. (It'd be nice to get rid of that one too, but unfortunately that would necessitate even bigger changes.)

@koute koute added A0-please_review Pull request needs code review. I3-bug The node fails to follow expected behavior. I7-refactor Code needs refactoring. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit labels Nov 19, 2021
@crystalin
Copy link
Contributor

@koute should we expect some performance impact changing this ?

@koute
Copy link
Contributor Author

koute commented Nov 19, 2021

@koute should we expect some performance impact changing this ?

Nope. It might change the performance of the executor without instance reuse, however:

  1. as I've explained that codepath is currently broken,
  2. and it's currently unused anyway as far as I know.

Copy link
Contributor

@pepyakin pepyakin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I welcome the clean up


// Scan all imports, find the matching host functions, and create stubs that adapt arguments
// and results.
let imports = crate::imports::resolve_imports(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:

do you think it would be better to push imports to be a submodule of instance_wrapper?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm.... to be honest I don't think it's worth it in this case.

We could either make it a proper submodule (as in - dedicated directory with a dedicated file in it), but it seems silly for only a single 300 line file, or we could just inline it into instance_wrapper.rs, but it's kinda nice that the imports is in its own dedicated file which only does one thing and exports only one thing (the Imports struct + a function to create it).

client/executor/wasmtime/src/instance_wrapper.rs Outdated Show resolved Hide resolved
self.host_state_mut().sandbox_store.0 = Some(store);

let instance_idx_or_err_code =
match result.expect("instantiating the sandbox does not panic") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing is, the sandbox module or runtimes can actually panic!

When state-machine runs, it needs to have access to storage backend. This storage backend can be trusted, i.e. the on-disk storage. We assume it never panics and if it does, then that's an unrecoverable error.

However, when running a light-client, a backend could be based on a witness - basically data, and a merkle proof that that data actually comes from where it is claimed to come from. The caveat is that we do not eagerly check this proof, as far as I understand, and there are situations when the runtime may request a storage entry that actually does not present in the witness. In such case, a panic is emitted. You can get more details and my complaints here under the heading "panics". I assumed we could remove this code by now I am not sure → we will need at least some way to communicating such conditions through the sandbox layer (cc @athei) (assuming we will continue to support sandbox API)

To tie back it to this specific case, I assume the following can happen:

  1. on a light client we execute some call with a proof. The proof is corrupted and the backend will panic if X is accessed.
  2. the runtime is spawned.
  3. the runtime instantiates a new sandbox with some wasm code.
  4. In the start function calls back into the runtime and makes the runtime fetch X from storage and thus panics.
  5. it unwinds here and here we panic due to unwrap losing the context of the original panic.

Soi I am wondering, if were better to resume_unwind or not catching at all?

cc @bkchr

This comment was marked as off-topic.

This comment was marked as off-topic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's good to know! Okay, so since this is technically outside of the scope of this PR I'll just do a resume_unwind to keep the original behavior. (I don't know, this might not be be necessary and maybe we could just ignore the panic altogether, however I think it's just simpler to catch it and restore the sandbox store anyway even if might not be necessary now. One less thing you have to think about whether it'll break something or not.)

@koute
Copy link
Contributor Author

koute commented Nov 22, 2021

One more minor cleanup - I've moved the utility functions for reading/writing memory from instance_wrapper.rs to util.rs. (Those were previously instance methods of InstanceWrapper, but now they're freestanding functions, and they're not used in instance_wrapper.rs itself, so it's kinda awkward to leave them there.)

@koute koute requested a review from bkchr November 22, 2021 08:22
client/executor/wasmtime/src/tests.rs Outdated Show resolved Hide resolved
client/executor/wasmtime/src/runtime.rs Outdated Show resolved Hide resolved
client/executor/wasmtime/src/host.rs Outdated Show resolved Hide resolved
client/executor/wasmtime/src/host.rs Outdated Show resolved Hide resolved
client/executor/wasmtime/src/host.rs Outdated Show resolved Hide resolved
@koute
Copy link
Contributor Author

koute commented Nov 23, 2021

bot merge

@paritytech-processbot paritytech-processbot bot merged commit 4de8ee2 into paritytech:master Nov 23, 2021
grishasobol pushed a commit to gear-tech/substrate that referenced this pull request Mar 28, 2022
…aritytech#10313)

* Fix WASM executor without instance reuse; cleanups and refactoring

* Align to review comments

* Move the functions for reading/writing memory to `util.rs`

* Only `#[ignore]` the test in debug builds

* More review comments and minor extra comments
ark0f pushed a commit to gear-tech/substrate that referenced this pull request Feb 27, 2023
…aritytech#10313)

* Fix WASM executor without instance reuse; cleanups and refactoring

* Align to review comments

* Move the functions for reading/writing memory to `util.rs`

* Only `#[ignore]` the test in debug builds

* More review comments and minor extra comments
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit I3-bug The node fails to follow expected behavior. I7-refactor Code needs refactoring.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants