This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Get rid of wasm-gc #1262

Closed
pepyakin opened this issue Dec 12, 2018 · 62 comments · Fixed by #12280
Labels
I7-refactor Code needs refactoring.

@pepyakin
Contributor

wasm-gc has been deprecated, but we still use it in our build pipeline.

The author mentions that:

The Rust compiler now natively supports --gc-sections when linking wasm executables, which means wasm executables already have 90% of their garbage removed when coming out of the compiler.

But for us this is not entirely true; the size of the runtime is:

  • 882K baseline
  • 636K with wasm-gc

And

The wasm-pack (and wasm-bindgen) project will already run this by default for you, so there's no need to run it again.

But we use neither of these.

We need to figure out what we want to use.

@pepyakin pepyakin added this to the As-and-when milestone Dec 12, 2018
@pepyakin
Contributor Author

According to twiggy diff node_runtime.wasm node_runtime.compact.wasm:

 Delta Bytes │ Item
─────────────┼─────────────────────────────────
     -252601 ┊ <total>
      -68707 ┊ custom section '.debug_str'
      -56447 ┊ custom section '.debug_line'
      -52165 ┊ custom section '.debug_info'
      -16040 ┊ custom section '.debug_ranges'
      -10261 ┊ custom section '.debug_pubnames'
       -6072 ┊ custom section '.debug_pubtypes'
       -1432 ┊ custom section '.debug_abbrev'
         -20 ┊ "function names" subsection
         -18 ┊ custom section '.debug_macinfo'
          -3 ┊ __wasm_call_ctors
          -1 ┊ elem[0]
          -1 ┊ func[357]

So the major source of the cruft is the .debug_* sections.

There is a tool called wasm-strip in the wabt suite, which strips all custom sections.

882K baseline
643K after wasm-strip
636K after wasm-gc

However, if we decide to add more tooling to our build pipeline, we could also use wasm-opt (from the binaryen suite), which can optimize the binary (including pruning dead symbols) and produces a binary of 561K.
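For illustration, a minimal sketch of how a wasm-strip invocation could be wired into the build script, mirroring the existing wasm-gc call. The function and file names here are assumptions for illustration, not actual wasm-builder code:

```rust
use std::process::Command;

// Hypothetical sketch: build the wasm-strip invocation the same way the
// build script currently shells out to wasm-gc. wasm-strip rewrites the
// file in place, removing all custom sections (including the .debug_*
// ones listed in the twiggy diff above).
fn strip_command(wasm_file: &str) -> Command {
    let mut cmd = Command::new("wasm-strip");
    cmd.arg(wasm_file);
    cmd
}

fn main() {
    // Running this for real would require wabt's wasm-strip on PATH:
    // strip_command("node_runtime.wasm").status()
    let cmd = strip_command("node_runtime.wasm");
    println!("{:?}", cmd);
}
```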

@bkchr
Member

bkchr commented Dec 14, 2018

Why do we have debug sections in a release build? Shouldn't they be removed?

@pepyakin
Contributor Author

I'd say they should be removed! But for now this is not the case (and it has been this way for a while).
I'm not sure why it is this way, but I've asked.

@gavofyork gavofyork added the I7-refactor Code needs refactoring. label Dec 18, 2018
@pepyakin
Contributor Author

pepyakin commented Oct 2, 2019

So, a quick recap:

  • Substrate uses a wasm file to encode the logic of the blockchain. We call it the wasm runtime.
  • An example wasm runtime is written in Rust and built with cargo. Rust uses LLVM's LLD linker to link the Rust libraries and produce the final wasm blob.
  • By default, rustc puts quite a lot of unnecessary junk into the final binary.
  • To strip this unnecessary stuff, we use wasm-gc.

There is a chance that things have changed since we last checked. Notably, there is an LLD flag, -s, which can be used to strip all unnecessary sections.

To check whether that actually helps we need to:

  1. Disable wasm-gc. For the experiment, we can just comment out the invocation below:

    let res = Command::new("wasm-gc")
        .arg(&wasm_file)
        .arg(&wasm_compact_file)
        .status()
        .map(|s| s.success());

  2. Try building the wasm runtime with this flag passed to LLD and wasm-gc disabled. This can be achieved by building Substrate with the following command:

    WASM_BUILD_RUSTFLAGS='-C link-arg=-s' cargo build --release
    

You should be able to find the runtime files under the target/release/wbuild directory. Keep in mind that if you skip wasm-gc there won't be a .compact.wasm file.

If the results turn out to be satisfying, we can then get rid of wasm-gc entirely.
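Step 1 can be sketched as a stub in the build script: the wasm-gc invocation is commented out and replaced with a successful no-op result, so downstream checks of the result still pass. This is a hypothetical illustration, not the actual wasm-builder code:

```rust
use std::io;

// Hypothetical stub for step 1: pretend wasm-gc ran successfully without
// invoking it. The real invocation, commented out for the experiment:
//     let res = Command::new("wasm-gc")
//         .arg(&wasm_file)
//         .arg(&wasm_compact_file)
//         .status()
//         .map(|s| s.success());
fn wasm_gc_stub() -> Result<bool, io::Error> {
    Ok(true)
}

fn main() {
    let res = wasm_gc_stub();
    println!("wasm-gc stubbed out: {:?}", res);
}
```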

@pepyakin pepyakin changed the title wasm-gc is deprecated Get rid of wasm-gc Oct 2, 2019
@kianenigma
Contributor

kianenigma commented Oct 3, 2019

with wasm-gc

// compact 
→ pwd
/Users/kianenigma/Desktop/Parity/substrate/target/release/wbuild/node-runtime
 → du -sk node_runtime.compact.wasm 
1156	node_runtime.compact.wasm

// Original
→ pwd                      
.../substrate/target/release/wbuild/target/wasm32-unknown-unknown/release
→ du -sk node_runtime.wasm        
1480	node_runtime.wasm

without wasm-gc

// run with `let res: Result<bool, std::io::Error> = Ok(true);`
→ pwd
/substrate/target/release/wbuild/target/wasm32-unknown-unknown/release
→ du -sk node_runtime.wasm
1168	node_runtime.wasm

for which, after a cargo clean, I saved the compact file of the previous build and pasted it back to silence this error:

error: couldn't read /Users/kianenigma/Desktop/Parity/substrate/target/release/wbuild/node-runtime/node_runtime.compact.wasm: No such file or directory (os error 2)
 --> /Users/kianenigma/Desktop/Parity/substrate/target/release/build/node-runtime-de07cf4da2bc533b/out/wasm_binary.rs:2:36
  |
2 |                 pub const WASM_BINARY: &[u8] = include_bytes!("/Users/kianenigma/Desktop/Parity/substrate/target/release/wbuild/node-runtime/node_runtime.compact.wasm");
  |                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

and then ran the command with the linker flag, but I am not 100% sure I have done it correctly. I pasted the cwd as well so someone can check.

@Swader
Contributor

Swader commented Oct 4, 2019

Just a note that the test suite depends on wasm-gc and the prerequisites installation script doesn't pull it in. To reduce new-user friction, it should either be added to that script or removed from the tests, imo.

@bkchr
Member

bkchr commented Oct 4, 2019

Which script? The init script installs it.
Nevertheless, it is documented in the readme, and having clear install instructions is always better than scripts that only work on one distribution.

@Swader
Contributor

Swader commented Oct 4, 2019

You're right, I've been relying too much on the getsubstrate.io script, which does not pull it in, but the README does document it.

@bkchr
Member

bkchr commented Oct 4, 2019

@shawntabrizi ^^^

@bkchr
Member

bkchr commented Oct 4, 2019

(That is also a script I really would like to get rid of!)

@Swader
Contributor

Swader commented Oct 4, 2019

I ran into it from the Substrate Kitties workshop and here.

As a tech ed I'm all for simplification and abstracting bootstrapping away from the user, but I do agree that if we cannot provide a coherent experience to newcomers, such a script shouldn't be a thing. Maybe something like a Vagrant box with all the prerequisites installed would be better.

@shawntabrizi
Member

We can and should update the getsubstrate.io script for the time being.

@bkchr I am 100% with you; let's get rid of all the scripts. For now, though, I don't see a good solution for installing all these prerequisites.

Maybe the best thing is literally an install page with instructions for each operating system?

@Swader vagrant is a virtual env?

@shawntabrizi
Member

Here is a PR to update the getsubstrate.io script for now: paritytech/scripts#101

@Swader
Contributor

Swader commented Oct 4, 2019

@shawntabrizi Vagrant is a VM tool for composing and running reproducible headless VMs with a small size and resource footprint. For a good while I was maintaining https://github.com/Swader/homestead_improved

Basically it lets you pre-install stuff with provisioning scripts (shell, Ansible, etc.) into a headless distro like a 200MB Ubuntu, similar to what you might find in WSL, but with all the prereqs already in place. So no downloading Rust, running updates, etc. The organization or author keeps the box up to date; users merely run vagrant box update once in a while, though this is optional because once they run an instance of the VM they can just update it like any other headless VM.

Here's a better explained rundown of mine from ages ago: https://www.sitepoint.com/re-introducing-vagrant-right-way-start-php/ - mentions PHP but has nothing to do with it specifically.

In a nutshell, all OSes then have the exact same thing. On Windows it works exactly the same as on OS X or Linux. So suddenly all of your users have to follow exactly one set of instructions which always works (download Vagrant, oraclevm, and this box), and they all have the same env from which to report bugs, which is coincidentally identical to the env on which you can then reproduce them.

@kianenigma
Contributor

@Swader if you are recommending that we maintain Vagrant boxes (or similar IaC tools), that'd be cool. We already have docker builds, which are mostly used to play with the UI and less to hack on Substrate itself.

But such things will never be enough on their own. We will always need manual installation instructions as well.

@Swader
Contributor

Swader commented Oct 5, 2019

I agree. Let me see if I can put together a box; I could use one myself, so I'm sure I could dedicate part of my time to maintaining it. We can evaluate once it's done.

@DrSensor
Copy link

Hi, is it okay if I work on this issue?

@bkchr
Member

bkchr commented Oct 19, 2019

You can work on this, but our experiments show that you probably will not find anything that achieves the same result.

@shawntabrizi
Member

@pepyakin this should be closed right?

@pepyakin
Contributor Author

I haven't been following what's happening on the wasm-builder side lately, but a quick glance reveals that it still invokes wasm-gc. So apparently it is still being used.

@athei
Member

athei commented Jan 21, 2022

I don't get it. AFAIK wasm-builder already builds the runtime with lto in the release profile. How do you see those differences with only lto enabled:

release_profile.insert("lto".into(), true.into());

@koute
Contributor

koute commented Jan 21, 2022

I don't get it. AFAIK wasm-builder already builds the runtime with lto in the release profile. How do you see those differences with only lto enabled:

Just guessing: probably because of the extra codegen-units=1, and also AFAIK by default lto runs in ThinLTO mode, doesn't it? (So it'd be different from lto=fat.)

The build time metric is a bit daunting.

Hmm.... indeed, although extra 40s doesn't seem that bad considering how long everything else takes to build.

Maybe we should have separate cargo profiles for a proper release build and a "development" release build?

@pepyakin
Contributor Author

pepyakin commented Jan 21, 2022

Maybe we should have separate cargo profiles for a proper release build and a "development" release build?

Yes, I was thinking about something along these lines too. Alternatively, maybe finally get debug builds working (with lots of opt-level overrides for performance-critical crates).

Hmm.... indeed, although extra 40s doesn't seem that bad considering how long everything else takes to build.

The problem is that it's 40 seconds for one runtime. In polkadot we have 4 ༼ ༎ຶ ෴ ༎ຶ༽

I envision that with #7288 we will be able to split the runtimes out so they can be built one by one with a separate cargo xbuild-runtime, or something along these lines, which would bring us back to only 40s.

@koute
Contributor

koute commented Jan 21, 2022

Hmm.... indeed, although extra 40s doesn't seem that bad considering how long everything else takes to build.

The problem is that it's 40 seconds for one runtime. In polkadot we have 4 ༼ ༎ຶ ෴ ༎ຶ༽

Yes but, wouldn't they be compiled in parallel?

@athei
Member

athei commented Jan 21, 2022

Just guessing - probably because of the extra codegen-units=1

But @nazar-pc reports a difference with lto="fat" only. This is what confuses me.

and also AFAIK by default lto runs in thinlto mode, doesn't it? (So it'd be different than lto=fat.)

Not according to this documentation. AFAIK true == fat.

@pepyakin
Contributor Author

Yes but, wouldn't they be compiled in parallel?

Oh, right, that's true.

@nazar-pc
Contributor

But @nazar-pc reports a difference for lto="fat" only. This is what confuses me.

I think this was caused by incremental compilation; I did not wipe target between those initial tests until I got to measuring time.

Yes but, wouldn't they be compiled in parallel?

Assuming you have enough CPU cores of course 😉

@athei
Member

athei commented Jan 21, 2022

I think this is caused by incremental compilation, I did not wipe target in between those initial tests until I got to measure time.

You also reported different code sizes.

@nazar-pc
Contributor

You also reported different code sizes.

I just re-ran it a few times, and I reproducibly get different sizes with and without .append_to_rust_flags("-C lto=fat") added to the WasmBuilder call.

I don't think Rust's incremental compilation guarantees that builds from scratch are fully identical to incremental rebuilds, but that's not what's happening here regardless.

@nazar-pc
Contributor

nazar-pc commented Jan 21, 2022

I checked the commands being generated, and with release_profile.insert("lto".into(), true.into()); you only get -C lto, which explains the difference.

I guess you need "true".into() to get what you expected. I don't like those weakly and mixed-typed parameters; they are confusing 😞
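The confusion can be illustrated with a small sketch. This is a hypothetical helper, not the actual wasm-builder code: a boolean profile value serializes to the bare -C lto, while a string value serializes to the explicit -C lto=&lt;mode&gt; form:

```rust
// Hypothetical sketch of serializing a Cargo profile `lto` value into a
// rustc flag; not the actual wasm-builder implementation.
enum ProfileValue {
    Bool(bool),
    Str(String),
}

fn lto_flag(value: &ProfileValue) -> Option<String> {
    match value {
        // A boolean `true` (as produced by `true.into()`) yields only
        // the bare `-C lto` flag...
        ProfileValue::Bool(true) => Some("-C lto".to_string()),
        ProfileValue::Bool(false) => None,
        // ...while a string value like "fat" yields `-C lto=fat`.
        ProfileValue::Str(mode) => Some(format!("-C lto={}", mode)),
    }
}

fn main() {
    println!("{:?}", lto_flag(&ProfileValue::Bool(true)));
    println!("{:?}", lto_flag(&ProfileValue::Str("fat".into())));
}
```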

@koute
Contributor

koute commented Jan 21, 2022

Yes but, wouldn't they be compiled in parallel?

Assuming you have enough CPU cores of course 😉

Well, indeed I do, whole 32 of them. (:

Anyhow, this seems like something we should measure. Maybe you could also compile polkadot (so all of the runtimes get compiled) while limiting compilation to a certain number of cores (on Linux you should be able to do this through taskset) and compare how the compile time changes as a function of N cores?

@nazar-pc
Contributor

Polkadot compilation on the same machine as above; also attaching timing results.

Status quo:

   Completed rococo-runtime v0.9.13 build script (run) in 143.0s
   Completed westend-runtime v0.9.13 build script (run) in 148.2s
   Completed polkadot-runtime v0.9.13 build script (run) in 150.6s
   Completed kusama-runtime v0.9.13 build script (run) in 151.9s
...
    Finished release [optimized] target(s) in 6m 15s

cargo-timing-before.html.zip

-C lto=fat -C codegen-units=1:

   Completed rococo-runtime v0.9.13 build script (run) in 169.4s
   Completed westend-runtime v0.9.13 build script (run) in 187.0s
   Completed polkadot-runtime v0.9.13 build script (run) in 193.7s
   Completed kusama-runtime v0.9.13 build script (run) in 212.5s
...
    Finished release [optimized] target(s) in 7m 06s

cargo-timing-after.html.zip

Runtime compilation into WASM is a single-threaded process, so with 4+ cores the time overhead doesn't add up and compilation happens in parallel.
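For reference, the arithmetic behind those timings: the per-runtime build-script overhead of -C lto=fat -C codegen-units=1 ranges from roughly 26s to 61s, yet the wall-clock difference is only 51s (7m06s vs 6m15s) because the runtimes build in parallel:

```rust
// Build-script timings quoted above, in seconds:
// rococo, westend, polkadot, kusama.
const BEFORE: [f64; 4] = [143.0, 148.2, 150.6, 151.9];
const AFTER: [f64; 4] = [169.4, 187.0, 193.7, 212.5];

// Per-runtime overhead of `-C lto=fat -C codegen-units=1`.
fn overheads() -> Vec<f64> {
    BEFORE.iter().zip(AFTER.iter()).map(|(b, a)| a - b).collect()
}

// Wall-clock difference: 7m06s - 6m15s = 51s, much less than the sum of
// the per-runtime overheads, since the runtimes build in parallel.
fn wall_clock_delta_secs() -> f64 {
    (7.0 * 60.0 + 6.0) - (6.0 * 60.0 + 15.0)
}

fn main() {
    println!("per-runtime overheads: {:?}", overheads());
    println!("wall-clock delta: {}s", wall_clock_delta_secs());
}
```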

@bkchr
Member

bkchr commented Jan 21, 2022

We could just remove wasm-gc and do the optimal build only when we prepare a release. We already do this for stuff like log, where we disable it completely for the on-chain build to reduce the final binary size.

@athei
Member

athei commented Jan 28, 2022

We could just remove wasm-gc and do the optimal build only when we prepare a release. We already do this for stuff like log, where we disable it completely for the on-chain build to reduce the final binary size.

#10747

However, I kept wasm-gc around for now, as it still reduces the file size. wasm-opt needs to be applied first, but I didn't want to conflate that into one PR.

I am not sure how important code size is, but we could also look into compiling the runtime with opt-level = z.

@bkchr
Member

bkchr commented Jan 28, 2022

could also look into compiling the runtime with opt-level = z.

I did this about two years ago and it didn't bring any good results. I think it was even slower.

@athei
Member

athei commented Jan 28, 2022

could also look into compiling the runtime with opt-level = z.

I did this about two years ago and it didn't bring any good results. I think it was even slower.

This isn't surprising. It optimizes for smaller rather than faster code, though smaller isn't necessarily slower. I am not sure how important a smaller runtime would be and whether it makes sense to sacrifice some performance. opt-level = s could be a good middle ground.

@ggwpez
Member

ggwpez commented Jan 28, 2022

This isn't surprising. It optimizes for smaller rather than faster code, though smaller isn't necessarily slower. I am not sure how important a smaller runtime would be and whether it makes sense to sacrifice some performance. opt-level = s could be a good middle ground.

There are cases where performance improves, since a smaller binary causes fewer cache misses when fetching instructions. We would have to test it on the exact hardware.
Rust also supports profile-guided optimization, which can provide some automatic cache optimization.
But I'm not sure this is actually a good idea, since we aim for good worst-case performance, and profile-guided optimization makes assumptions about the normal program flow.
This could lead to good benchmarking performance but worse performance when we hit a branch that was not optimized for.

@bkchr
Member

bkchr commented Jan 29, 2022

IMO we should not over-complicate it. Just keep the compiler optimizing for performance, remove the stuff we don't need from the final binary, and then compress it as we already do.

@nazar-pc
Contributor

nazar-pc commented Apr 1, 2022

-C lto=fat -C codegen-units=1 is already used in the production profile; does that mean there is nothing preventing wasm-gc from being removed?

@athei
Member

athei commented Apr 1, 2022

I think it still reduces the code size a bit. I didn't check by how much, though.

@nazar-pc
Contributor

nazar-pc commented Apr 1, 2022

The difference right now is roughly this:

  • zstd polkadot_runtime.wasm 1_182_538 bytes
  • zstd polkadot_runtime.compact.wasm 1_120_459 bytes

So it is only a ~5.2% increase, which is significant but not a deal-breaker, especially considering that things got smaller after -C lto=fat -C codegen-units=1 was introduced.
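For reference, the arithmetic behind the ~5.2% figure, from the compressed sizes quoted above (5.2% is the delta as a fraction of the plain build; expressed as an increase over the compact build it comes out closer to 5.5%):

```rust
// Compressed runtime sizes quoted above, in bytes.
const ORIGINAL: f64 = 1_182_538.0; // zstd polkadot_runtime.wasm
const COMPACT: f64 = 1_120_459.0; // zstd polkadot_runtime.compact.wasm

// Size saved by wasm-gc, as a percentage of the plain build (~5.2%).
fn reduction_pct() -> f64 {
    (ORIGINAL - COMPACT) / ORIGINAL * 100.0
}

// The same delta expressed as an increase over the compact build (~5.5%).
fn increase_pct() -> f64 {
    (ORIGINAL - COMPACT) / COMPACT * 100.0
}

fn main() {
    println!("{:.1}% reduction / {:.1}% increase", reduction_pct(), increase_pct());
}
```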

@pepyakin
Contributor Author

pepyakin commented Apr 5, 2022

I wonder where this difference comes from.

Can you run twiggy diff over those two?

@nazar-pc
Contributor

nazar-pc commented Apr 5, 2022

Uncompressed size:

  • polkadot_runtime.wasm is 4_954_045 bytes
  • polkadot_runtime.compact.wasm is 4_527_986 bytes
$ twiggy diff polkadot_runtime.wasm polkadot_runtime.compact.wasm 
 Delta Bytes │ Item
─────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
     +300764 ┊ "function names" subsection
      -51119 ┊ custom section '.debug_str'
      -23698 ┊ custom section '.debug_info'
      -22742 ┊ custom section '.debug_line'
      -16368 ┊ custom section '.debug_pubnames'
       -8873 ┊ <polkadot_runtime::Call as frame_support::traits::dispatch::UnfilteredDispatchable>::dispatch_bypass_filter::hb866e76c7c626a49
       -7840 ┊ custom section '.debug_ranges'
       -7772 ┊ polkadot_runtime::Runtime::metadata::h169c0e3519f6dcf9
       -3199 ┊ polkadot_runtime::_::<impl parity_scale_codec::codec::Encode for polkadot_runtime::Call>::encode_to::h4d530ea869712805
       -3018 ┊ <xcm_executor::XcmExecutor<Config> as xcm::v2::traits::ExecuteXcm<<Config as xcm_executor::config::Config>::Call>>::execute_xcm_in_credit::h3f8e6465ee293b4b
       -2832 ┊ <sp_runtime::generic::era::Era as scale_info::TypeInfo>::type_info::h1a1dae6524123429
       -2685 ┊ polkadot_runtime::_::<impl parity_scale_codec::codec::Encode for polkadot_runtime::Event>::encode_to::h783a8bffe709a93e
       -2395 ┊ <(TupleElement0,TupleElement1) as frame_support::traits::hooks::OnRuntimeUpgrade>::on_runtime_upgrade::he126fdfea7233115
       -2321 ┊ polkadot_runtime_parachains::initializer::<impl polkadot_runtime_parachains::initializer::pallet::Pallet<T>>::apply_new_session::hdd9528bf14999f39
       -2194 ┊ <pallet_staking::pallet::pallet::Call<T> as frame_support::traits::dispatch::UnfilteredDispatchable>::dispatch_bypass_filter::h1f225e22d27b0cf4
       -2128 ┊ pallet_session::<impl pallet_session::pallet::Pallet<T>>::rotate_session::h4ec98104757f4ffe
       -1841 ┊ <xcm::v1::multilocation::MultiLocation as core::convert::TryFrom<xcm::v0::multi_location::MultiLocation>>::try_from::h7793c69f1b61502b
       -1817 ┊ custom section '.debug_abbrev'
       -1731 ┊ pallet_election_provider_multi_phase::unsigned::<impl pallet_election_provider_multi_phase::pallet::Pallet<T>>::prepare_election_result::he31bdc622f13337b
       -1435 ┊ <(TupleElement0,TupleElement1) as frame_support::traits::hooks::OnInitialize<BlockNumber>>::on_initialize::hf592aba116c5ad05
     -200248 ┊ ... and 2740 more.
     -426059 ┊ Σ [2760 Total Rows]
