Relaxed BFloat16 Dot Product instruction #77
@yurydelendik @dtig

(Typo to fix: in the reference lowering code …)
I also wanted to comment on what we discussed about bf16 conversions to and from fp32 (correct me if anything is incorrect): @Maratyszcza is proposing that the native conversion instructions (e.g., …

BFloat16 numbers are usually used only for storage and up-cast to IEEE FP32 for computations. The proposed …
I would like to add some implementer feedback from JSC/WebKit. There is a lot of interesting feedback in this thread, and I have a few questions. Some of this was mentioned in the meeting, but do we have any data about:
In general, we are concerned about instructions that might make deterministic fingerprinting much easier, and also about how they might cause compatibility/interoperability issues for users who browse the web on less-common CPU architectures. An important part of the Relaxed-SIMD proposal for us in general is the fact that it justifies this risk with very compelling performance data. We would not consider implementing BFloat16 unless we have proof that this would speed up an entire class of reasonably popular programs. Overall, since Relaxed-SIMD is in stage 3, we think that this change should be put in a separate proposal.
Thanks for the feedback @justinmichaud (and great to see JSC/WebKit being more active in WAsm SIMD)!
Performance on x86 CPUs without the AVX512-BF16 extension would be similar to software emulation of the …

On Galaxy S22 with Snapdragon 8 Gen 1 I see a throughput of 4 …
This instruction is known to be helpful for neural network computations (both for inference and training). Background Effects in Google Meet and Teachable Machine are examples of applications using such functionality.
The native
Unlike native platforms, WebAssembly doesn't have the capability to detect supported WebAssembly ISA extensions at run time: if a WAsm engine doesn't support any instruction in a WAsm module, the whole module will be rejected, even if the instruction is never executed at runtime. Unfortunately, this means that WebAssembly extensions can't be as granular as native ISA extensions: developers have to rebuild the whole webapp for each set of WAsm extensions they target, and having an extension with a single instruction in it is an undue burden on developers. Please note that it is permissible to add new instructions at Stage 3 (WebAssembly SIMD added dozens of instructions while at Stage 3); it is Stage 4 where the spec is frozen.
Thanks for the fast response!
Thanks! Is there a real benchmark or library that is available to be compiled with this instruction vs just the normal set of relaxed-simd instructions? The two examples that you linked seem interesting, but I don't see how they could be ported today. I am also wondering if there is an implementation using BFMLALB/BFMLALT that is complete enough to collect real-world performance numbers, since that is the only implementation we would consider on ARM. We really need to see data in the context of a full benchmark or application here, because on our CPUs instruction throughput doesn't usually indicate general performance.
This is good to know. So if I am understanding things correctly, the current state of the world is that this is software emulated on every single chip that is available today except for ARMv8+, and ARMv8+ can only be distinguished from emulation by performance (which is fine)?
This makes sense, but I would just like to note that this instruction seems like a pretty big departure from the others in the proposal. In addition, it seems like there are almost no desktop CPUs that natively support this instruction yet, so maybe it is too soon to standardize?
Alibaba's MNN and Tencent's NCNN support BF16 computations (via software conversions to/from FP32), but AFAICT neither uses the ARM BF16 extension yet. It is also worth noting that native software targeting ARM BF16 is unlikely to use …

My plan is to implement BF16 GEMM (matrix-matrix multiplication) kernels in XNNPACK using generic NEON / …
This is correct. To be specific, the BF16 extension on ARM was introduced as optional in ARMv8.2-A and became mandatory in ARMv8.6-A.
I see little risk in adding this instruction, as the BF16 extension is mandatory in ARMv8.6-A (and ARMv9), and thus eventually all ARM-based computers will have it. The latest Windows-based laptops (e.g. the Lenovo ThinkPad X13s) already support the BF16 extension. In the desktop x86 space, AMD Zen 4 is reported to support AVX512-BF16, so Intel will have to match it to compete. However, if we don't add this instruction, WebAssembly SIMD will miss out on the nearly 2X speedup for AI models, which would put privacy-focused technologies like Federated Learning at risk.
To reiterate an important point to make sure it doesn't get lost in a long discussion: we need to hear back from engine maintainers before we conclude that those instructions are available to the engines. Implementing support for a new ISA extension requires implementing new decoding, operands, etc., which comes with maintenance costs. I have brought this up before and it did not look like there was interest (I personally would welcome this changing 😉).
I opened the issue at https://bugzilla.mozilla.org/show_bug.cgi?id=1778751 with the intent to implement the reference lowering in SpiderMonkey. We feel that there is a need to support BFloat16 for matrix multiplication / ML use cases. In the future we will adjust the solution for the ARM (and AVX) BF16 extensions as they become popular.
There is an inconsistency in argument ordering. For … (The reference lowering also has a typo related to this.)
@yurydelendik Good point! However, it is
Let's have this discussion in the FMA proposal #27.
I implemented BF16 GEMM microkernels using
As expected,
@justinmichaud @yurydelendik @dtig @penzn @abrown Do you have further concerns about the benefits of these instructions? If not, I'd like to move on to voting on the proposal.
Voting SGTM; no specific concerns, but I don't expect that we'd be supporting the BFDOT/AVX512-BF16 lowerings right now (we will support BFMLALB/BFMLALT).
From the bug that @yurydelendik filed, they are not planning to support either of the extensions; this doesn't make for good native support, particularly on x86. What is maybe more concerning to me, though, is that there is no consensus that it is feasible to add AVX512 support to engines in the near future - let's say hardware support does appear in a few years; it would still require at least the basics of AVX512 to be implemented in engines. I would like to second @justinmichaud's point about "future proofing" and maybe expand on it a bit. There is the danger that we won't add semantics that will be introduced in the future, and there is also inertia in changing existing instruction sets; that's why it is probably a better strategy to target what is already shipping (one can argue that AVX512-BF16 is already established, but that hits the issue described above). If we do go forward with this, we have to make it explicit somewhere in the proposal which instruction sets this would possibly target. I think there is a discussion of using FMA as the baseline; maybe also mention that the BF16 extensions can be targeted if available.
Just to note: the initial implementation does not reflect the future intent. Optimizations for the ARM or AVX512 extensions will be added as soon as there is confidence that they will become popular on client hardware.
@dtig may correct me, but AFAIU the only roadblock to AVX512 support in the engines is the currently low market penetration of the technology, and it would be re-evaluated as AVX512 becomes more widely available.
On x86 this instruction can be lowered to SSE2 (separate FP MUL + FP ADD), AVX (same, but 3-operand forms), FMA3/FMA4 (using FMA instructions instead of FP MUL + FP ADD), or AVX512-BF16. Of course, the further down this list, the better the performance.
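To make those options concrete, here is a rough sketch of what the SSE2 and FMA3 variants of such a lowering could look like using x86 intrinsics. This is an illustration only, not engine code; the helper and function names are mine.

```c
#include <immintrin.h>

/* Even BF16 elements of each 32-bit lane become FP32 by shifting them into
   the high 16 bits; odd elements by masking off the low 16 bits. */
static inline __m128 bf16_even_to_f32(__m128i v) {
  return _mm_castsi128_ps(_mm_slli_epi32(v, 16));
}
static inline __m128 bf16_odd_to_f32(__m128i v) {
  return _mm_castsi128_ps(_mm_and_si128(v, _mm_set1_epi32((int) 0xFFFF0000)));
}

/* SSE2 variant: separate multiplies and adds. */
static inline __m128 dot_bf16_add_f32_sse2(__m128i a, __m128i b, __m128 c) {
  __m128 acc = _mm_add_ps(c, _mm_mul_ps(bf16_even_to_f32(a), bf16_even_to_f32(b)));
  return _mm_add_ps(acc, _mm_mul_ps(bf16_odd_to_f32(a), bf16_odd_to_f32(b)));
}

/* FMA3 variant (requires -mfma): fused multiply-adds instead of MUL + ADD. */
static inline __m128 dot_bf16_add_f32_fma3(__m128i a, __m128i b, __m128 c) {
  __m128 acc = _mm_fmadd_ps(bf16_even_to_f32(a), bf16_even_to_f32(b), c);
  return _mm_fmadd_ps(bf16_odd_to_f32(a), bf16_odd_to_f32(b), acc);
}
```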
What I mean is that we would need the BF16 extension in hardware to get a speedup with this instruction, and also that the extension enables more variability (a single instruction on Arm vs. the other lowerings). By the way, thank you for the rundown of the proposal. I have two naive follow-up questions. Would we need other bfloat16 operations at some point? Also, what is the relationship between BF16 and FP16, and do we expect to go to FP16 at some point?
The only other commonly supported bfloat16 operation is conversion from FP32 to BF16 with rounding to nearest-even. However, this operation is potentially expensive to emulate when not supported in hardware.
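For reference, a scalar sketch of that FP32-to-BF16 round-to-nearest-even conversion using the usual bias-and-truncate bit trick. This is my own illustration, not code from the proposal, and it ignores NaN handling, which is part of what makes a faithful emulation more expensive.

```c
#include <stdint.h>
#include <string.h>

/* FP32 -> BF16 with round-to-nearest-even: add a bias of 0x7FFF, plus 1 if
   the bit that will become the BF16 LSB is set (so ties round to even),
   then truncate to the high 16 bits. NaNs need separate handling, since the
   bias can carry into the exponent field. */
static uint16_t f32_to_bf16_rne(float f) {
  uint32_t bits;
  memcpy(&bits, &f, sizeof bits);
  const uint32_t lsb = (bits >> 16) & 1;
  return (uint16_t) ((bits + 0x7FFF + lsb) >> 16);
}
```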
The two formats are unrelated. Native instruction sets don't even offer conversion operations between these two floating-point formats (although all conversions can be done through an FP32 intermediate without loss of accuracy).
Do you think we would need to support IEEE half precision at some point? If I understand this correctly, as with bfloat16, there are FP16 extensions available or proposed, at least technically: an AVX512 extension on x86, an Arm v7 extension, and Arm v8.
I think we'd need to support IEEE half-precision conversion, as hardware coverage for that is very good and efficient software emulation is available for SSE4/AVX1 processors. I don't expect support for half-precision arithmetic in WAsm any time soon, because it is well supported only on recent ARM64 processors (x86 has an AVX512-FP16 specification but no implementations; RISC-V doesn't even have a specification), and software emulation is very inefficient.
Please vote on the inclusion of the `f32x4.relaxed_dot_bf16x8_add_f32x4` instruction.

👍 For including the BFloat16 Dot Product instruction
A more nuanced vote, as the option isn't presented in the vote: I vote neutral. If merged into the spec, we'd support the BFMLALB/BFMLALT lowering on Arm at this time, because it exposes the same level of entropy as exposing FMA. Not ruling out supporting the BFDOT or the AVX512 lowering in the future, but there are no plans to do so at this time.
Given the votes on #77 (comment) and #77 (comment) (I take the "thumbs-up" on Deepti's comment as a neutral vote), we have a majority in support of adding this instruction. @justinmichaud, if you wish, please expand more (beyond what you have mentioned above) on your vote against. #82 will be canceled given the heads-up on holidays. At the next sync I can present a summary of the instruction set and its current status, as preparation for presenting to the CG on phase advancement.
I thought I would have a chance to vote tomorrow and talk about this, but since that is cancelled, I would like to discuss it here. My vote depends on whether the x86 version will ever be implementable on major engines like V8. IIRC, there is no way to encode EVEX instructions (the ones specified in AVX512-related features) in V8, and I sensed that encoding was not a real possibility. If that is actually the case, then I would vote no 👎 on this instruction, since there is no feasible path to implement it in V8, a major engine. If in fact the lack of support for encoding EVEX instructions was only a temporary circumstance and V8 would accept a patch with the EVEX encoding for this instruction, then I would vote yes 👍. Can someone from the V8 team help me understand what the situation is with EVEX support?
@abrown There are two separate issues: the first is the entropy exposed by the native dot product extensions (both VDPBF16PS and BFDOT), and the second is that these instructions are only available on a small subset of newer hardware.
I'd like to explicitly second @Maratyszcza's point above: when there are enough devices in the wild that use them, that would be a good reason to support generating these instructions in V8. From an engine perspective, though, I'd like to make sure we do this consistently instead of adding ad-hoc encodings for some subset of SIMD operations. So the tl;dr is that though we might not add these encodings at this time (both AVX512-BF16 and ARM BF16), I wouldn't rule out supporting them in the future.
As proposed in WebAssembly/relaxed-simd#77. Only an LLVM intrinsic and a clang builtin are implemented. Since there is no bfloat16 type, use u16 to represent the bfloats in the builtin function arguments. Differential Revision: https://reviews.llvm.org/D133428
This is now implemented in LLVM/Clang.
Regarding the ARM BF16 extension on Apple processors: according to swiftlang/llvm-project@677da09, the BF16 extension is already supported on the Apple A15, A16, and M2 processors.
Introduction
BFloat16 is a 16-bit floating-point format that represents IEEE FP32 numbers truncated to the high 16 bits. BFloat16 numbers have the same exponent range as IEEE FP32 numbers but far fewer mantissa bits, and the format is primarily used in neural network computations, which are very tolerant of reduced precision. Despite being a non-IEEE format, BFloat16 computations are supported in recent and upcoming processors from Intel (Cooper Lake, Alder Lake, Sapphire Rapids), AMD (Zen 4), and ARM (Cortex-A510, Cortex-A710, Cortex-X2), and can be efficiently emulated on older hardware by padding BFloat16 numbers with zeroes to convert them to IEEE FP32.
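As a concrete illustration (not from the proposal text), the two cheap scalar conversions implied here are plain bit manipulations; the helper names are mine.

```c
#include <stdint.h>
#include <string.h>

/* FP32 -> BF16 by truncation: keep the high 16 bits (sign, 8-bit exponent,
   top 7 mantissa bits) of the FP32 bit pattern. */
static uint16_t f32_to_bf16_truncate(float f) {
  uint32_t bits;
  memcpy(&bits, &f, sizeof bits);
  return (uint16_t) (bits >> 16);
}

/* BF16 -> FP32 by zero-padding the low 16 bits; this direction is exact. */
static float bf16_to_f32(uint16_t h) {
  uint32_t bits = (uint32_t) h << 16;
  float f;
  memcpy(&f, &bits, sizeof f);
  return f;
}
```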
Both the x86 and ARM BFloat16 extensions provide instructions to compute an FP32 dot product of two-element BFloat16 sub-vectors with addition to an FP32 accumulator. However, there are some differences in the details:

- x86 introduced support for BFloat16 computations with the AVX512-BF16 extension, and the BFloat16 dot product functionality is exposed via the `VDPBF16PS` (Dot Product of BF16 Pairs Accumulated into Packed Single Precision) instruction. The instruction `VDPBF16PS dest, src1, src2` computes `dest.fp32[i] = dest.fp32[i] + cast<fp32>(src1.bf16[2*i]) * cast<fp32>(src2.bf16[2*i]) + cast<fp32>(src1.bf16[2*i+1]) * cast<fp32>(src2.bf16[2*i+1])` for each 32-bit lane `i` in the accumulator register `dest`. Additionally, denormalized numbers in inputs are treated as zeroes and denormalized numbers in outputs are replaced with zeroes.
- ARM added support for BFloat16 computations with the ARMv8.2-A `BFDOT` instruction (mandatory with ARMv8.6-A), which implements the same per-lane computation (`Vd.fp32[i] = Vd.fp32[i] + cast<fp32>(Vn.bf16[2*i]) * cast<fp32>(Vm.bf16[2*i]) + cast<fp32>(Vn.bf16[2*i+1]) * cast<fp32>(Vm.bf16[2*i+1])`), where all multiplications and additions are unfused and use the non-IEEE Round-to-Odd rounding mode. Additionally, denormalized numbers in inputs are treated as zeroes and denormalized numbers in outputs are replaced with zeroes.
- ARMv9.2-A introduced an additional "Extended BFloat16" (EBF16) mode, which modifies the behavior of `BFDOT` so that `Vd.fp32[i] = Vd.fp32[i] + temp.fp32[i]`, where `temp.fp32[i]` is calculated as a fused sum-of-products operation with only a single (standard, Round-to-Nearest-Even) rounding at the end.
- Furthermore, ARMv8.6-A introduced the `BFMLALB`/`BFMLALT` instructions, which perform a widening fused multiply-add of the even/odd BFloat16 elements to an IEEE FP32 accumulator. A pair of these instructions can similarly be used to implement the BFloat16 dot product primitive (see the intrinsics sketch after this list).
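For concreteness, a minimal sketch of these two ARM lowerings using the ACLE BF16 NEON intrinsics (`vbfdotq_f32`, `vbfmlalbq_f32`, `vbfmlaltq_f32`). This assumes a compiler and target with the BF16 extension enabled (e.g. `-march=armv8.6-a+bf16`); the wrapper function names are mine.

```c
#include <arm_neon.h>

/* Single-instruction form: BFDOT (unfused, Round-to-Odd outside EBF16 mode). */
float32x4_t bf16_dot_bfdot(float32x4_t acc, bfloat16x8_t a, bfloat16x8_t b) {
  return vbfdotq_f32(acc, a, b);
}

/* Two-instruction form: BFMLALB + BFMLALT, widening fused multiply-adds of
   the even and odd BF16 elements into the FP32 accumulator. */
float32x4_t bf16_dot_bfmlal(float32x4_t acc, bfloat16x8_t a, bfloat16x8_t b) {
  acc = vbfmlalbq_f32(acc, a, b);  /* even elements */
  acc = vbfmlaltq_f32(acc, a, b);  /* odd elements */
  return acc;
}
```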
What are the instructions being proposed?

I propose a 2-element dot product instruction with accumulation, with BFloat16 input elements and FP32 accumulator and output elements. The instruction has relaxed semantics to allow lowering to the native `VDPBF16PS` (x86) and `BFDOT` (ARM) instructions, as well as to FMA instructions where available.

I suggest `f32x4.relaxed_dot_bf16x8_add_f32x4` as a tentative name for the instruction. In a sense it is a floating-point equivalent of the `i32x4.dot_i16x8_add_s` instruction proposed in WebAssembly/simd#127.

What are the semantics of these instructions?
`y = f32x4.relaxed_dot_bf16x8_add_f32x4(a, b, c)` computes

`y.fp32[i] = y.fp32[i] + cast<fp32>(a.bf16[2*i]) * cast<fp32>(b.bf16[2*i]) + cast<fp32>(a.bf16[2*i+1]) * cast<fp32>(b.bf16[2*i+1])`
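Purely as an illustration (not normative text from the proposal), here is a scalar C sketch of one allowed interpretation of these semantics, using unfused FP32 multiplies and adds; the helper and function names are mine.

```c
#include <stdint.h>
#include <string.h>

/* Up-cast a BF16 value (stored in a uint16_t) to FP32: BF16 is the high
   16 bits of the FP32 bit pattern, so zero-pad the low 16 bits. */
static float bf16_to_f32(uint16_t x) {
  uint32_t bits = (uint32_t) x << 16;
  float f;
  memcpy(&f, &bits, sizeof f);
  return f;
}

/* One allowed interpretation of
   y = f32x4.relaxed_dot_bf16x8_add_f32x4(a, b, c):
   unfused FP32 multiplies and adds, with the accumulator input c providing
   the initial value of y. */
static void dot_bf16x8_add_f32x4_scalar(float y[4], const uint16_t a[8],
                                        const uint16_t b[8], const float c[4]) {
  for (int i = 0; i < 4; i++) {
    y[i] = c[i]
         + bf16_to_f32(a[2 * i])     * bf16_to_f32(b[2 * i])
         + bf16_to_f32(a[2 * i + 1]) * bf16_to_f32(b[2 * i + 1]);
  }
}
```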
The relaxed nature of the instruction manifests in several allowable options:
Evaluation ordering options

We permit two evaluation orders for the computation:

1. `cast<fp32>(a.bf16[2*i]) * cast<fp32>(b.bf16[2*i]) + cast<fp32>(a.bf16[2*i+1]) * cast<fp32>(b.bf16[2*i+1])` in the first step, then compute the sum with `y.fp32[i]` in the second step.
2. `y.fp32[i] += cast<fp32>(a.bf16[2*i]) * cast<fp32>(b.bf16[2*i])` in the first step, then compute `y.fp32[i] += cast<fp32>(a.bf16[2*i+1]) * cast<fp32>(b.bf16[2*i+1])` in the second step.

Fusion options

Rounding options
How will these instructions be implemented?
x86/x86-64 processors with the AVX512-BF16 instruction set

- `c = f32x4.relaxed_dot_bf16x8_add_f32x4(a, b, c)` is lowered to `VDPBF16PS xmm_c, xmm_a, xmm_b`
- `y = f32x4.relaxed_dot_bf16x8_add_f32x4(a, b, c)` is lowered to `VMOVDQA xmm_y, xmm_c` + `VDPBF16PS xmm_y, xmm_a, xmm_b`

ARM64 processors with the BF16 extension (using `BFDOT` instructions)

- `y = f32x4.relaxed_dot_bf16x8_add_f32x4(a, b, c)` is lowered to `MOV Vy.16B, Vc.16B` + `BFDOT Vy.4S, Va.8H, Vb.8H`
- `c = f32x4.relaxed_dot_bf16x8_add_f32x4(a, b, c)` is lowered to `BFDOT Vc.4S, Va.8H, Vb.8H`

ARM64 processors with the BF16 extension (using `BFMLALB`/`BFMLALT` instructions)

- `y = f32x4.relaxed_dot_bf16x8_add_f32x4(a, b, c)` is lowered to `MOV Vy.16B, Vc.16B` + `BFMLALB Vy.4S, Va.8H, Vb.8H` + `BFMLALT Vy.4S, Va.8H, Vb.8H`
- `c = f32x4.relaxed_dot_bf16x8_add_f32x4(a, b, c)` is lowered to `BFMLALB Vc.4S, Va.8H, Vb.8H` + `BFMLALT Vc.4S, Va.8H, Vb.8H`
Reference lowering through the WAsm Relaxed SIMD instruction set
`y = f32x4.relaxed_dot_bf16x8_add_f32x4(a, b, c)` is lowered to:

```
const v128_t a_lo = wasm_i32x4_shl(a, 16)
const v128_t b_lo = wasm_i32x4_shl(b, 16)
const v128_t a_hi = wasm_v128_and(a, wasm_i32x4_const_splat(0xFFFF0000))
const v128_t b_hi = wasm_v128_and(b, wasm_i32x4_const_splat(0xFFFF0000))
y = f32x4.relaxed_fma(a_lo, b_lo, c)
y = f32x4.relaxed_fma(a_hi, b_hi, y)
```
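For completeness, here is roughly what that reference lowering could look like as a compilable function with `wasm_simd128.h`. This assumes clang's relaxed-SIMD intrinsics, where `wasm_f32x4_relaxed_madd(a, b, c)` computes `a*b + c` (treat the exact intrinsic name and availability as assumptions to verify against your toolchain); the function name is mine.

```c
#include <stdint.h>
#include <wasm_simd128.h>

/* Sketch of the reference lowering (build with clang and -mrelaxed-simd). */
static inline v128_t dot_bf16x8_add_f32x4(v128_t a, v128_t b, v128_t c) {
  /* Even BF16 elements of each 32-bit lane, widened to FP32 by shifting. */
  const v128_t a_lo = wasm_i32x4_shl(a, 16);
  const v128_t b_lo = wasm_i32x4_shl(b, 16);
  /* Odd BF16 elements, widened to FP32 by masking off the low 16 bits. */
  const v128_t a_hi = wasm_v128_and(a, wasm_i32x4_const_splat((int32_t) 0xFFFF0000));
  const v128_t b_hi = wasm_v128_and(b, wasm_i32x4_const_splat((int32_t) 0xFFFF0000));
  /* Two relaxed FMAs accumulate both products onto c. */
  const v128_t y = wasm_f32x4_relaxed_madd(a_lo, b_lo, c);
  return wasm_f32x4_relaxed_madd(a_hi, b_hi, y);
}
```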
How does behavior differ across processors? What new fingerprinting surfaces will be exposed?
The use of the AVX512-BF16 `VDPBF16PS` instruction can be detected by testing its behavior on denormal inputs and outputs.

The use of the ARM `BFDOT` instruction can be detected by testing its behavior on denormal inputs and outputs, or by testing for Round-to-Odd rounding. However, an ARM implementation using a pair of `BFMLALB`/`BFMLALT` instructions would be indistinguishable from a generic lowering using FMA instructions (but would probably be slower than the `BFDOT` lowering).
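To illustrate the kind of probe this implies, here is a hedged sketch of a denormal-behavior test. It assumes the clang builtin `__builtin_wasm_relaxed_dot_bf16x8_add_f32x4` and a `(u16x8, u16x8, f32x4) -> f32x4` signature based on the LLVM change referenced above; both the name and the signature are assumptions to verify against an actual toolchain.

```c
#include <stdio.h>
#include <stdint.h>

typedef uint16_t u16x8 __attribute__((vector_size(16)));
typedef float f32x4 __attribute__((vector_size(16)));

int main(void) {
  /* Lane 0 of `a` is the smallest BF16 denormal (2^-133); lane 0 of `b` is
     BF16 2^126 (bits 0x7E80). Their exact product is 2^-7. */
  u16x8 a = {0x0001, 0, 0, 0, 0, 0, 0, 0};
  u16x8 b = {0x7E80, 0, 0, 0, 0, 0, 0, 0};
  f32x4 c = {0.0f, 0.0f, 0.0f, 0.0f};
  /* Builtin name/signature assumed from the LLVM patch mentioned above. */
  f32x4 y = __builtin_wasm_relaxed_dot_bf16x8_add_f32x4(a, b, c);
  /* ~0.0078125 => denormal inputs honored (FMA-style lowering);
     0.0        => denormal inputs flushed (VDPBF16PS / non-EBF16 BFDOT). */
  printf("lane 0 = %a\n", (double) y[0]);
  return 0;
}
```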
What use cases are there?