This repository has been archived by the owner on Dec 22, 2021. It is now read-only.
This is open-ended. The problem is that many key use cases, such as matrix multiplication kernels, need to know how many SIMD vector registers they can count on using. In practice, the number of available architectural registers tends to be just large enough to hit peak performance, so matrix multiplication kernels tend to use all available registers. Here is an example.
In theory, a language higher-level than raw asm, such as WebAsm, abstracts away this fixed number of architectural registers, offering infinitely many variables instead. In practice, register-intensive SIMD kernels are one area where this abstraction has not been working well. The abstraction relies on spilling registers as necessary, which has only a marginal performance impact on most code but often has a catastrophic impact on register-intensive SIMD kernels (performance degradations > 2x, sometimes 10x).
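To make the register pressure concrete, here is a minimal sketch of the bookkeeping, under an assumed simple model: an outer-product style micro-kernel that keeps an MR x NR tile of accumulators in vector registers, plus one row of B vectors and one broadcast A value. The function names `regs_needed` and `fits_budget` are hypothetical and only illustrate the arithmetic; real kernels and real register allocators may differ.

```c
#include <assert.h>

/* Assumed model (not taken from any particular library): a micro-kernel that
 * computes an mr x nr tile of C keeps these vector values live in its inner loop:
 *   - mr * (nr / lanes) accumulators,
 *   - nr / lanes vectors holding the current row of B,
 *   - 1 vector holding a broadcast element of A.
 * `lanes` is the number of scalar elements per vector (4 for f32 in 128 bits). */
static int regs_needed(int mr, int nr, int lanes) {
    assert(mr > 0 && lanes > 0 && nr > 0 && nr % lanes == 0);
    int vecs_per_row = nr / lanes;
    return mr * vecs_per_row + vecs_per_row + 1;
}

/* Returns nonzero if the tile fits in `budget` vector registers, i.e. the
 * inner loop would not need to spill under the model above. */
static int fits_budget(int mr, int nr, int lanes, int budget) {
    return regs_needed(mr, nr, lanes) <= budget;
}
```

Under this model, with 4 float lanes per 128-bit vector, a 6x8 tile needs 15 registers and just fits the 16 vector registers of x86-64 (SSE/AVX), while an 8x8 tile needs 19, which would spill there but fit in the 32 vector registers of AArch64 NEON. A kernel author who cannot know which of these budgets applies cannot pick the tile shape.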
This prompts a few questions for someone trying to write WebAsm matrix multiplication kernels:
Can the programmer query the number of architecture registers?
Can the programmer make assumptions about the correspondence between the number of SIMD vector variables used in a part of the program, and the register usage of the generated code?
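As a rough illustration of what the second question is about, here is a sketch of a 4x8 micro-kernel written in plain C. The `vec4` struct and its helpers are hypothetical stand-ins for a 128-bit SIMD type and its intrinsics, chosen so the sketch stays portable; they are not the API of any real SIMD header. The point is only the count of live vector variables in the inner loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a 128-bit SIMD value holding 4 floats. */
typedef struct { float v[4]; } vec4;

static vec4 vec4_load(const float *p) {
    vec4 r;
    for (int i = 0; i < 4; ++i) r.v[i] = p[i];
    return r;
}

static void vec4_store(float *p, vec4 a) {
    for (int i = 0; i < 4; ++i) p[i] = a.v[i];
}

static vec4 vec4_splat(float x) {
    vec4 r;
    for (int i = 0; i < 4; ++i) r.v[i] = x;
    return r;
}

/* Returns acc + a * b, lane by lane (multiply then add, not a fused op). */
static vec4 vec4_madd(vec4 acc, vec4 a, vec4 b) {
    for (int i = 0; i < 4; ++i) acc.v[i] += a.v[i] * b.v[i];
    return acc;
}

/* C[4x8] += A[4xk] * B[kx8], all row-major with strides lda, ldb, ldc.
 * Live vector values in the inner loop: 8 accumulators + 2 B vectors
 * + 1 broadcast A value = 11. */
static void kernel_4x8(size_t k, const float *a, size_t lda,
                       const float *b, size_t ldb, float *c, size_t ldc) {
    assert(a && b && c);
    vec4 acc[4][2];
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 2; ++j)
            acc[i][j] = vec4_load(c + i * ldc + 4 * j);

    for (size_t p = 0; p < k; ++p) {
        vec4 b0 = vec4_load(b + p * ldb);
        vec4 b1 = vec4_load(b + p * ldb + 4);
        for (int i = 0; i < 4; ++i) {
            vec4 ai = vec4_splat(a[i * lda + p]);
            acc[i][0] = vec4_madd(acc[i][0], ai, b0);
            acc[i][1] = vec4_madd(acc[i][1], ai, b1);
        }
    }

    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 2; ++j)
            vec4_store(c + i * ldc + 4 * j, acc[i][j]);
}
```

The source uses 11 vector variables in the hot loop. Whether the generated code keeps all 11 in registers, or spills some of them on every iteration, is exactly what the programmer would like to be able to rely on.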
These issues have also been severely affecting C/C++ with intrinsics, and are the main reason why many people prefer to write assembly instead. However, in C/C++ with intrinsics, at least:
One knows the target architecture.
One can "massage" the compiler into generating the expected code. Compilation is AOT and one gets a chance to look at the generated code before shipping.
I'm afraid that these issues, which are bad enough in C/C++ intrinsics to halfway kill this programming model for critical use cases, will affect WebAsm SIMD even more severely, because the client device and browser are abstracted away and compilation is JIT.
bjacob changed the title from "Support use cases that need to target a specific number of registers." to "Support register-tight use cases" on May 12, 2020
Can the programmer query the number of architecture registers?
No, exposing underlying architectural details would introduce platform-specific behavior and violate WebAssembly's determinism. Although this kind of nondeterminism might be considered for a future proposal, it is out of scope for this SIMD proposal.
Can the programmer make assumptions about the correspondence between the number of SIMD vector variables used in a part of the program, and the register usage of the generated code?
No, different engines may make different register allocation decisions and may optimize or otherwise transform the code however they deem fit, so programmers should not be making these sorts of assumptions. It may be possible to make assumptions about codegen for a particular engine, but it should not be assumed that those assumptions will generalize to other engines.
The low-level, portable SIMD instructions in this proposal have proven to be useful for a wide variety of workloads, but we are aware that there are also many workloads that depend on non-portable instructions. Keep an eye out for future proposals meant to address this problem.