Performance criteria & conflicts #108

Closed
kg opened this issue Jun 3, 2015 · 13 comments

@kg
Contributor

kg commented Jun 3, 2015

Good performance (by some set of metrics) is an absolute requirement for us to label a set of decisions as v1, and for us to ship a polyfill and native implementation, as I understand it.

We need to get some clarity on what those metrics are, and try to arrive at an understanding about which metrics matter the most to us. There is a balancing act necessary here: it will not be possible to achieve great results on all of these metrics at once - optimizing really well for wire size can hurt decode speed, memory usage during decoding, streaming decode performance, etc.

There are also some existing assumptions that we're basing our performance planning on here. Most of these appear to be based on evidence, and we just need to document it. I'll enumerate the ones I've heard:

  • It's assumed that there is an efficient way to hook into the module/executable loading pipeline with user-authored JS or asm.js and use that to efficiently do transforms like macro expansion or applying deltas for updates.
  • It's assumed that reliable caching of some form is available, either by caching wasm executables, or writing some code that stores them into IndexedDB and loads them later (a rough sketch follows this list). This is particularly implied by any strategy that involves complex loader code/pre-filtering.
  • (edited for clarity) It's assumed that VMs and polyfills will be able to do streaming compilation of a wasm executable, and that they will want to do this before the file is fully downloaded and without keeping the full executable resident in memory.
  • (edited for clarity) It's assumed that even on mobile, the cost of keeping the wasm executable in memory doesn't cause any functional problems and is insignificant.
  • It's assumed that a polyfill needs to produce asm.js equivalent to what would be shipped now, which means producing a single module and compiling it in one go, and we assume that this is the best path for running these applications on mobile.
  • It's assumed that implementations will want (need?) to AOT compile an entire executable (many of the above assumptions strongly imply this on their own.)
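
A minimal sketch of what that IndexedDB caching could look like (the database, store, and key names here are made up for illustration, and a real loader would also need versioning/invalidation):

```js
// Sketch only: persist the downloaded executable's bytes so later runs
// can skip the network. "wasm-cache" / "modules" / "app.wasm" are
// invented names.
function cacheExecutable(bytes, onDone) {
  var open = indexedDB.open("wasm-cache", 1);
  open.onupgradeneeded = function () {
    open.result.createObjectStore("modules");
  };
  open.onsuccess = function () {
    var tx = open.result.transaction("modules", "readwrite");
    tx.objectStore("modules").put(bytes, "app.wasm");
    tx.oncomplete = onDone;
  };
}
```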

I'm sure there is more I've overlooked, and I suspect a couple of these are partial understandings on my part. Most of them are things I've heard multiple times, however.

As far as performance criteria go, here are the criteria as I generally understand them:

  • Startup time is very important to us. We want excellent first-run startup time AND even better startup time on later runs (cached compilation, etc)
  • Memory usage to compile the decoded representation is extremely important to us. For existing asm.js applications this is a major issue and has already required extensive changes to JS runtimes.
  • Wire size for first run is very important to us (this feeds into first-run startup time). We want the asm to be small over the wire after transport compression. We don't care about pre-compression file size (I think?)
  • (edited for precision) Application run-time performance/throughput must be equal to or superior to asm.js in the mid term.
  • Being able to efficiently debug wasm applications is important to us (i.e. a userspace debugger isn't sufficient, we need a fast native one)
  • Interacting with code outside of the wasm executable must be reasonably efficient. It is unclear whether we are okay with eating some short/mid-term performance hits here in order to improve our design (i.e. having to copy data into/out of the wasm heap instead of aliasing the heap)
  • The decoding process needs to avoid performance/risk landmines - O(N*N*N) complexity, multiple-pass decoding, complex computation, etc.
  • Network traffic and I/O for later runs (downloading updates/patches off the network, loading quickly from cache storage) matters to us... but it's unclear how much it matters, and whether we're okay with letting that work itself out via userspace code & browser improvements.
  • Memory usage to decode the wire format matters to us: We want it to be reasonable.
  • We are okay with sacrificing performance on the compile side of the pipeline in order to meet our criteria on the decode/runtime side. At some point this may have to change (for JIT, etc) but we don't care about it right now.

Once we have a general consensus on all this I'll create a PR to document it.

@lukewagner
Member

That's a great list and on first scan I think the only issue I have is with:

It's assumed that implementations will want (need?) to AOT compile an entire executable (many of the above assumptions strongly imply this on their own.)

WebAssembly clearly enables a simple AOT story. Realistically, I think engines are going to want to use a bunch of variations, mixing baseline compilers (that run AOT, while fully-optimized compilers execute in the background, swapping in when ready), profile-based recompilation, JITish compilation etc to optimize cold load time. (I expect a lot of experimentation in this space as it would be an area to compete on quality-of-implementation.) Many of the not-fully-AOT strategies will trade time-to-first-frame for occasional janks while running. This may be fine for many applications (e.g., that spend the first N seconds in a main menu), but some applications will need full performance on their first frame (imagine a game that starts with an intense rendered scene in the intro and doesn't want the first 20 seconds to be stuttery). It'd be nice to have some knob/option/flag/queryable-state (open for discussion) that lets an application request or test for "full" performance. In the limit, we could standardize a way for applications to annotate individual functions (based, e.g., on PGO) as being cold or requiring full-performance AOT (well after v.1, of course).

So perhaps we could say "It is assumed that engines will provide (always, or under program control) AOT compilation of high-performance code."

@kg
Contributor Author

kg commented Jun 3, 2015

So perhaps we could say "It is assumed that engines will provide (always, or under program control) AOT compilation of high-performance code."

I've been thinking along these lines exactly. Maybe we (post-v1, so to speak, but early) provide a simple intrinsic that lets you opt into up-front AOT compilation of a function, or a module, or some other unit of granularity - i.e. don't force an AOT compile before you even hit your loading screen, but be able to ensure that all your hot functions are compiled before you start playing a movie or a game to avoid jank and dropped frames. I think a small hook like that is probably all that would be needed to tackle those scenarios, and it's likely not a blocker.
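
To make the idea concrete, a purely hypothetical sketch (nothing like this exists or is being proposed here; the intrinsic name and function names are invented):

```js
// Hypothetical only: an invented intrinsic that asks the engine to
// finish full AOT compilation of specific hot functions before they
// are needed, so the first frames of gameplay don't jank.
function prepareForGameplay(module, onReady) {
  // wasmRequestFullCompile is a made-up name for the kind of hook
  // discussed above; it does not exist in any proposal.
  wasmRequestFullCompile(module, ["renderFrame", "physicsStep"], onReady);
}
```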

A similar practice here is that many modern games will front-load compilation of pixel shaders to avoid janking when they're first used. In the old days, games would try to force texture uploads onto the GPU to avoid jank from lazy on-demand uploads.

@lukewagner
Member

Yeah, definitely sounds like we're thinking along the same lines. I agree we should probably wait to think about this post-v.1, since it'll be better informed by everyone's experience implementing v.1, and we'll want some time to experiment with the minimal set of knobs/hooks/annotations that gives developers all the necessary control.

@titzer

titzer commented Jun 3, 2015

From the V8 perspective, the two policies that would be easiest to implement to get a native implementation up and running are 1) AOT-compile everything, or 2) compile optimized code on first invocation. Our policy for asm.js is basically to treat it like JS: compile unoptimized (super fast compilation) and then optimize hot code with TurboFan (slow compilation). That works great since there is already an unoptimized engine to run all the cold code. But for WebAsm we'd have to build something else to run the cold code, probably generating machine code by hand, porting, etc. Yuck.

That said, I think the future is really bright for dynamic optimization of WebAsm. E.g. it would be really great if engines did some profiling and inlined methods based on hotness rather than static heuristics like LLVM's. It'd be really great if engines profiled the receivers of indirect calls and did guarded inlining. It'd be really great if engines profiled branches and adjusted register allocation and other code generation decisions around that. There are some really crazy corner cases in floating point that probably only very rarely occur, and some checked fast paths make sense there. We want to make that a very non-intrusive and non-janky thing, though. E.g. we want to run mostly optimized, with only a little instrumentation to pick up opportunities where static knowledge hasn't given us enough. For users this basically means there isn't any jank, but somehow the application just magically (and smoothly) gets faster the longer it runs :-)
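
To illustrate what guarded inlining of an indirect call means, a source-level sketch (the engine would do this in generated machine code; the names here are invented):

```js
// Sketch only: suppose profiling shows table[index] is almost always
// hotCallee. The engine can emit a guard plus the inlined body, and
// fall back to a real indirect call when the guard fails.
function hotCallee(x) { return (x * 2 + 1) | 0; }   // hypothetical hot target

function callIndirect(table, index, x) {
  var target = table[index];
  if (target === hotCallee) {
    return (x * 2 + 1) | 0;   // guard passed: inlined body of hotCallee
  }
  return target(x);           // guard failed: ordinary indirect call
}
```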

@cosinusoidally

Memory usage to compile the decoded representation is extremely important to us. For existing asm.js applications this is a major issue and has already required extensive changes to JS runtimes.

I assume this is one of the major motivating reasons behind wasm? If so, I think it would be useful if this was explained in the FAQ.

I think it would also be useful to explain why JavaScript virtual machines consume so much memory when loading large asm.js codebases. For example, if I load the AngryBots-asm.js code into Chrome then memory consumption jumps up by about a gigabyte (and then drops back down again).

@jfbastien
Member

@cosinusoidally this is what we mean in the high-level goals when we mention "load-time-efficient". The FAQ does discuss memory usage as it pertains to the polyfill (it compares usage to regular asm.js).

The Chrome issue you mention isn't inherent to asm.js, it's the compiler being silly (known V8 bug in this case). It's a difficult problem to fix, but similar kinds of issues could also occur for WebAssembly.

Where WebAssembly wins is in simplicity of the format, and shedding some of JavaScript's oddness. That's hard to quantify without getting into nitpicky details that folks argue over, so we've just avoided playing point-the-finger :-)

@trevnorris

if I load the AngryBots-asm.js code into Chrome then memory consumption jumps up by about a gigabyte (and then drops back down again)

I commonly see V8 using more heap than is technically necessary, but from what I understand it uses the heap however necessary to achieve better performance. If you were to decrease the max old space size it would load using less memory, but take longer. This practice is fairly common for devs who run node on a raspberry pi and don't have the memory.
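
For example, capping the old space at 256 MB (the script name is just a placeholder):

```sh
node --max-old-space-size=256 app.js
```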

In regards to wasm, it wouldn't be surprising if it uses more memory than expected. It's not uncommon for node to get issues filed about potential memory leaks, complete with graphs of large heap usage, when in reality only half of that is actually in use by objects.

@titzer

titzer commented Jul 9, 2015

Wasm will not have the parser problem referenced in the chromium issue above. We're being careful to design for high decode and compilation efficiency, not only time but space.

@cosinusoidally

The Chrome issue you mention isn't inherent to asm.js, it's the compiler being silly (known V8 bug in this case). It's a difficult problem to fix, but similar kinds of issues could also occur for WebAssembly.

I did a bit of experimenting and I managed to find a workaround. It turns out you get the huge memory spike if you define all your functions inside the asm.js module closure. When I moved all the function definitions outside the closure, the spike went away. I did this by moving all the functions outside the closure, splitting the module into chunks, converting all the main closure vars into globals, and loading the whole thing inside an iframe. This works, but unfortunately it regresses performance quite a bit.

The hacked-together proof of concept is here: https://github.com/cosinusoidally/angrybots-chunked-js . As mentioned above, it turns out that loading the module in chunks wasn't the main issue (so the repo's name is a bit misleading). Chunked loading could prove to be handy though.

With this proof of concept the memory consumption over time goes from this:

(screenshot: memory consumption before)

to this:

(screenshot: memory consumption after)

Which is quite a significant difference in peak startup memory usage (around a gig, I think).
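
For anyone curious, a rough sketch of the shape of the change (illustrative only, heavily simplified from what the repo actually does):

```js
// Before (sketch): one huge asm.js module closure; V8 has to hold the
// AST for the whole thing while parsing/analyzing it.
function BigModule(stdlib, foreign, heap) {
  "use asm";
  var HEAP32 = new stdlib.Int32Array(heap);
  function f1(x) { x = x | 0; return (x + 1) | 0; }
  // ...thousands more functions...
  return { f1: f1 };
}

// After (sketch): the functions become plain globals loaded in separate
// chunks, and the closure's vars become globals. No single giant closure
// has to be parsed at once, so the startup spike goes away, but the
// asm.js optimizations are lost and throughput regresses.
var HEAP32;
function f1(x) { x = x | 0; return (x + 1) | 0; }
```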

With regards to the polyfill, are there any plans to use something like an interpreter/baseline compiler (written in JavaScript) in order to reduce startup memory usage for browsers without wasm support?

@kripken
Member

kripken commented Jul 10, 2015

Interesting about the memory usage reduction, but you are probably decreasing throughput that way.

Yes, an interpreter for the polyfill is something worth experimenting with. A baseline compiler is actually what the polyfill is - it is just going to write out the wasm into asm.js in a simple way like a baseline compiler would. Any true optimization would have been done by the compiler emitting wasm, or will be done by the JS VM's JIT.
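
Concretely, "baseline compiling" here just means emitting the corresponding asm.js text for each decoded wasm operation, with no optimization attempted. A toy illustration (not actual polyfill code; the helper name and input shape are invented):

```js
// Toy emitter: turn a decoded 32-bit integer add into asm.js source.
function emitAddInt32(lhsText, rhsText) {
  return "((" + lhsText + ") + (" + rhsText + ")) | 0";
}

// e.g. emitAddInt32("x", "1") === "((x) + (1)) | 0"
```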

@titzer

titzer commented Jul 11, 2015

The memory usage spike is due to the scope analysis in V8 needing the AST for the entire function to be in memory at once. Thus if you split up one large module into smaller ones, the spikes will be less severe. We've explored the idea of reducing this memory usage spike in V8, but it runs into a tangle of legacy code, so it wasn't straightforward.

@Deamon87

Sorry for getting into the middle of the discussion, but I'd like to leave a few points.

There is a well-known equation:

(performance / memory usage) = constant

It basically says that by increasing memory usage you can increase the theoretical maximum performance, and vice versa: by decreasing memory usage you decrease the theoretical maximum performance.

At the code level, trading memory for performance usually means using tables of pre-computed data, so CPU cycles are not spent on those computations and performance goes up.
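
For example, a classic precomputed table (a generic sketch, not tied to wasm or any particular engine):

```js
// Spend ~2.8 KB of memory once...
var SIN_TABLE = new Float64Array(360);
for (var d = 0; d < 360; d++) {
  SIN_TABLE[d] = Math.sin(d * Math.PI / 180);
}

// ...so that a whole-degree sine later costs one array read instead of
// a Math.sin call, at the price of keeping the table resident.
function fastSinDeg(deg) {
  return SIN_TABLE[((deg % 360) + 360) % 360];
}
```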

Also, the concept of WebAssembly so far is more similar to JVM/CLR virtual machines than to JavaScript. As you know, both the JVM and the CLR use their own assembly-like bytecode, so some optimization techniques can be derived from their implementations.

@binji
Member

binji commented Oct 23, 2015

Closing for now, please create new issues for new discussion.

@binji binji closed this as completed Oct 23, 2015