
Poll on relaxed mode bf16 dot product #106

Closed
ngzhian opened this issue Nov 8, 2022 · 9 comments

Comments

@ngzhian
Member

ngzhian commented Nov 8, 2022

This is a poll on #88 in the context of relaxed mode (slides).

In the 2022-11-04 meeting (notes), we discussed this; the main points:

  • not widely available: requires Intel AVX512, but newer hardware supports it (AMD Zen 4; ARMv9 SoCs such as Exynos, Graviton 3)
  • no BF16 standard
  • has 3 non-determinism paths; in the future we might need more, and new hardware might not conform to what we spec
  • close to 2x speedup in benchmarks (measured by comparing different lowerings in native code)
  • initially, engines will likely lower the BF16 dot product to the equivalent of existing Wasm instructions, and thus will not realize the performance gains

👍 for inclusion of BF16 dot product (i.e. BF16 dot product stays in this proposal)
👎 against inclusion of BF16 dot product (i.e. remove BF16 dot product from this proposal)
Update: 👀 for the neutral option

@tlively
Member

tlively commented Nov 8, 2022

Where did we settle on what the best deterministic, portable semantics of this instruction would be? How do we expect that deterministic semantics to perform relative to the deterministic semantics of other instructions? On what timescale do we expect to realize that 2x potential speedup in browsers?

@Maratyszcza
Collaborator

I see two options to define deterministic semantics for y = f32x4.relaxed_dot_bf16x8_add_f32x4(a, b, c). In both options we compute y.fp32[n] = FMA(cast<fp32>(a.bf16[2n+1]), cast<fp32>(b.bf16[2n+1]), FMA(cast<fp32>(a.bf16[2n]), cast<fp32>(b.bf16[2n]), c.fp32[n])). A scalar sketch of both options follows below.

Option 1: denormal inputs in a and b are processed exactly.
Option 2: denormal inputs in a and b are treated as zero.
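
For concreteness, here is a minimal scalar sketch of these semantics in C. It is only an illustration, not text from the proposal; the helper names are hypothetical, and the denormals_as_zero flag selects between the two options.

```c
#include <math.h>    /* fmaf */
#include <stdint.h>
#include <string.h>  /* memcpy */

/* Hypothetical helper: a bf16 value is the high 16 bits of an IEEE FP32
   value, so widening is a 16-bit left shift into an FP32 bit pattern. */
static float bf16_to_f32(uint16_t bits, int denormals_as_zero) {
    if (denormals_as_zero && (bits & 0x7F80) == 0) {
        bits &= 0x8000;  /* Option 2: flush subnormal inputs to (signed) zero */
    }
    uint32_t u = (uint32_t)bits << 16;
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* y.fp32[n] = FMA(a.bf16[2n+1], b.bf16[2n+1], FMA(a.bf16[2n], b.bf16[2n], c.fp32[n])) */
static void relaxed_dot_bf16x8_add_f32x4(const uint16_t a[8], const uint16_t b[8],
                                         const float c[4], float y[4],
                                         int denormals_as_zero) {
    for (int n = 0; n < 4; n++) {
        float inner = fmaf(bf16_to_f32(a[2 * n], denormals_as_zero),
                           bf16_to_f32(b[2 * n], denormals_as_zero),
                           c[n]);
        y[n] = fmaf(bf16_to_f32(a[2 * n + 1], denormals_as_zero),
                    bf16_to_f32(b[2 * n + 1], denormals_as_zero),
                    inner);
    }
}
```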

Option 1 is better for the majority of existing systems: it is easy to implement in software by extracting the even/odd elements, extending them to IEEE FP32, and doing FMA operations. It also matches the semantics of the two-instruction (BFMLALB + BFMLALT) lowering on the ARM BF16 instruction set. Benchmarks on ARM suggest a 1.4-1.8X speedup with this implementation option.
Option 2 is more forward-looking: it natively matches the behavior of the VDPBF16PS AVX512-BF16 instruction and of the single-instruction (BFDOT) lowering on the ARMv9.2 BF16 instruction set. It can be emulated on existing devices with FMA by enabling the denormals-as-zero mode. Typical microkernels that use BFloat16 Dot Product have many of these instructions in a chain without any other floating-point instructions in between, so enabling/disabling the denormals-as-zero mode can be done once for a sequence of f32x4.relaxed_dot_bf16x8_add_f32x4 instructions. Benchmarks on ARM suggest a 2.3-3.4X speedup from this option.

I would recommend Option 2 (treat denormals as zero) as the deterministic behavior, as both x86 and ARM are converging on this option going forward and it can be efficiently implemented on any hardware with Fused Multiply-Add.
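
For illustration, here is a rough sketch of the two AArch64 lowerings described above, using ACLE intrinsics. The function names are hypothetical, and this assumes a compiler targeting AArch64 with the BF16 extension enabled (e.g. -march=armv8.6-a+bf16); the rounding and denormal behaviour of BFDOT versus the chained-FMA formula is exactly what the two options differ on.

```c
#include <arm_neon.h>

/* Option 1 style lowering: two widening multiply-accumulate instructions,
   first the even-indexed ("bottom") bf16 pairs, then the odd-indexed ("top") ones. */
float32x4_t dot_option1(bfloat16x8_t a, bfloat16x8_t b, float32x4_t c) {
    float32x4_t acc = vbfmlalbq_f32(c, a, b);   /* BFMLALB */
    return vbfmlaltq_f32(acc, a, b);            /* BFMLALT */
}

/* Option 2 style lowering: a single 2-way bf16 dot-product instruction per lane. */
float32x4_t dot_option2(bfloat16x8_t a, bfloat16x8_t b, float32x4_t c) {
    return vbfdotq_f32(c, a, b);                /* BFDOT */
}
```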

@Maratyszcza
Collaborator

Regarding the "No BF16 standard": BFloat16 represents the high 16 bits of an IEEE FP32 number, and thus can be losslessly converted to FP32. All math in the proposed BF16 Dot Product instruction is done on IEEE FP32 representation, so it is standardized.

Regarding hardware support: the following CPUs support ARM BF16 or AVX512-BF16:

  • Intel Cooper Lake
  • Intel Alder Lake (unofficially, but can be enabled on some motherboards with some BIOS versions)
  • AMD Zen 4
  • Qualcomm Snapdragon 8 Gen 1 and 8+ Gen 1
  • Samsung Exynos 2200
  • MediaTek Dimensity 9000, Dimensity 9000+, Dimensity 9200
  • Apple A15, A16, M2
  • Other mobile processors with ARM Cortex-A510, Cortex-A710, Cortex-A715, Cortex-X2, Cortex-X3
  • Server processors with ARM Neoverse E2, Neoverse N2, Neoverse V2 cores

@akirilov-arm

Neoverse V1 (which Graviton3 is based on) also supports BFloat16 (the FEAT_BF16 ISA extension, to be precise). Also, I presume that the Armv9.2 behaviour you are referring to is the FEAT_EBF16 extension, in which case note that it is technically optional. Furthermore, processors that support it would be able to switch between the previous and the 9.2 behaviour at will, though in theory privileged software might prevent unprivileged applications, such as the majority of Wasm runtimes, from doing so.

@Maratyszcza
Collaborator

@conrad-watt @sunfishcode @titzer @penzn Since you voted against BFloat16 Dot Product, could you comment on what the deal-breaker about this instruction is for you?

@penzn
Contributor

penzn commented Nov 11, 2022

Well, I might have voted before there was a 'neutral' option, not sure. I think it is OK to drop it (as opposed to "we need to drop it") if that is the compromise needed to move the proposal forward, especially since for now most engines would emulate it anyway.

@sunfishcode
Member

@Maratyszcza Between your messages here and @akirilov-arm's, it's not clear to me which CPUs have which semantics. And it's not clear that the ARMv9.2 optional FEAT_EBF16 semantics will even be deterministic in hardware in practice. And the reports here are that browsers wouldn't expect to realize the performance gains initially, and it's not clear on what timescale that's expected to change.

@conrad-watt
Contributor

Given the results of the poll, do we feel able to make a decision here?

@Maratyszcza
Collaborator

@conrad-watt The decision is to remove BF16 Dot Product.
