Add BFloat16 dot product #88

Merged 1 commit on Sep 15, 2022
76 changes: 51 additions & 25 deletions proposals/relaxed-simd/Overview.md
@@ -180,7 +180,7 @@ All the instructions take 3 operands, `a`, `b`, `c`, perform `a * b + c` or `-(a
where:

- the intermediate `a * b` is rounded first, and the final result rounded again (for a total of 2 roundings), or
- the the entire expression evaluated with higher precision and then only rounded once (if supported by hardware).
- the entire expression evaluated with higher precision and then only rounded once (if supported by hardware).

### Relaxed laneselect

@@ -279,6 +279,32 @@ i16x8_dot_i8x16_i7x16_s(a, b) = dot_product(signed=True, elements=2, a, b
i32x4.dot_i8x16_i7x16_add_s(a, b, c) = dot_product(signed=False, elements=2, a, b, c)
```

### Relaxed BFloat16 dot product

- `f32x4.relaxed_dot_bf16x8_add_f32x4(a: v128, b: v128, c: v128) -> v128`

BFloat16 is a 16-bit floating-point format that represents IEEE FP32 numbers
truncated to their high 16 bits. For each of the 4 FP32 lanes, this instruction
computes the FP32 dot product of 2 pairs of BFloat16 values from `a` and `b`
and accumulates it with the corresponding FP32 lane of `c`.
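
As an illustration of the format only, the following minimal sketch converts between FP32 values and BFloat16 bit patterns by keeping or restoring the high 16 bits. The helper names `f32_to_bf16_bits` and `bf16_bits_to_f32` are hypothetical and not part of the proposal, and the narrowing direction follows the truncation described above rather than any rounding conversion a real toolchain might use.

```python
import struct


def f32_to_bf16_bits(x: float) -> int:
    """Return the BFloat16 bit pattern of x by truncating its FP32 encoding."""
    # Pack as FP32, reinterpret the 4 bytes as an unsigned 32-bit integer.
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # Keep the sign, the 8 exponent bits and the top 7 mantissa bits.
    return bits >> 16


def bf16_bits_to_f32(bits: int) -> float:
    """Widen a BFloat16 bit pattern to FP32 by zero-filling the low 16 bits."""
    (value,) = struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))
    return value


print(hex(f32_to_bf16_bits(1.0)))
print(bf16_bits_to_f32(f32_to_bf16_bits(3.14159)))
```

Widening is always exact, because every BFloat16 value is also an FP32 value. Narrowing by truncation drops the low 16 mantissa bits, so `3.14159` comes back as `3.140625`.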

```python
def bfloat16_dot_product(a, b, c):
    # a, b: 8 BFloat16 lanes each; c and the result y: 4 FP32 lanes each.
    for i in range(4):
        y.fp32[i] = (
            c.fp32[i] +
            cast<fp32>(a.bf16[2*i]) * cast<fp32>(b.bf16[2*i]) +
            cast<fp32>(a.bf16[2*i+1]) * cast<fp32>(b.bf16[2*i+1]))
    return y
```

This instruction is implementation defined in the following ways:

- evaluation order
  - can compute dot product in one step, then accumulation in another, or
  - accumulate first product in one step, then accumulate second product in
    another step
- fusion: the steps described above can be either both fused or both unfused
- rounding: the intermediate results can be rounded with Round-to-Nearest-Even
  or Round-to-Odd.
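
To make the evaluation-order choice concrete, here is a minimal sketch of the two orders in their unfused, Round-to-Nearest-Even form. It is an illustration under stated assumptions, not the proposal's reference semantics: the function names are hypothetical, inputs are plain Python floats assumed to be finite and already representable in BFloat16, FP32 rounding is emulated by round-tripping through `struct`, and the fused and Round-to-Odd variants are not modeled.

```python
import struct


def round_f32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value, ties to even."""
    (y,) = struct.unpack("<f", struct.pack("<f", x))
    return y


def dot_then_accumulate(a, b, c):
    """Order 1: form the 2-element dot product, then add the accumulator."""
    out = []
    for i in range(4):
        p0 = round_f32(a[2 * i] * b[2 * i])
        p1 = round_f32(a[2 * i + 1] * b[2 * i + 1])
        out.append(round_f32(c[i] + round_f32(p0 + p1)))
    return out


def accumulate_each_product(a, b, c):
    """Order 2: add the first product to the accumulator, then the second."""
    out = []
    for i in range(4):
        p0 = round_f32(a[2 * i] * b[2 * i])
        p1 = round_f32(a[2 * i + 1] * b[2 * i + 1])
        out.append(round_f32(round_f32(c[i] + p0) + p1))
    return out


ones = [1.0] * 8          # 1.0 is exactly representable in BFloat16
acc = [16777216.0] * 4    # 2**24, where adjacent FP32 values are 2 apart
print(dot_then_accumulate(ones, ones, acc))
print(accumulate_each_product(ones, ones, acc))
```

With these inputs the first order yields `16777218.0` in every lane, while the second yields `16777216.0`, because each `+ 1.0` lands exactly halfway between two FP32 values and rounds back to the even one. Lane values that differ like this are permitted by the implementation-defined evaluation order.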


## Binary format

@@ -290,30 +316,30 @@ where chosen to fit into the holes in the opcode space of SIMD proposal. Going
forward, the opcodes for relaxed-simd specification will be the ones in the
"opcode" column, and it will take some time for tools and engines to update.

| instruction | opcode | prototype opcode |
| ---------------------------------- | -------------- | ---------------- |
| `i8x16.relaxed_swizzle` | 0x100 | 0xa2 |
| `i32x4.relaxed_trunc_f32x4_s` | 0x101 | 0xa5 |
| `i32x4.relaxed_trunc_f32x4_u` | 0x102 | 0xa6 |
| `i32x4.relaxed_trunc_f64x2_s_zero` | 0x103 | 0xc5 |
| `i32x4.relaxed_trunc_f64x2_u_zero` | 0x104 | 0xc6 |
| `f32x4.relaxed_fma` | 0x105 | 0xaf |
| `f32x4.relaxed_fms` | 0x106 | 0xb0 |
| `f64x2.relaxed_fma` | 0x107 | 0xcf |
| `f64x2.relaxed_fms` | 0x108 | 0xd0 |
| `i8x16.relaxed_laneselect` | 0x109 | 0xb2 |
| `i16x8.relaxed_laneselect` | 0x10a | 0xb3 |
| `i32x4.relaxed_laneselect` | 0x10b | 0xd2 |
| `i64x2.relaxed_laneselect` | 0x10c | 0xd3 |
| `f32x4.relaxed_min` | 0x10d | 0xb4 |
| `f32x4.relaxed_max` | 0x10e | 0xe2 |
| `f64x2.relaxed_min` | 0x10f | 0xd4 |
| `f64x2.relaxed_max` | 0x110 | 0xee |
| `i16x8.relaxed_q15mulr_s` | 0x111 | unimplemented |
| `i16x8.dot_i8x16_i7x16_s` | 0x112 | unimplemented |
| `i32x4.dot_i8x16_i7x16_add_s` | 0x113 | unimplemented |
| Reserved for bfloat16 | 0x114 | unimplemented |
| Reserved | 0x115 - 0x12F | |
| instruction | opcode | prototype opcode |
| ------------------------------------ | -------------- | ---------------- |
| `i8x16.relaxed_swizzle` | 0x100 | 0xa2 |
| `i32x4.relaxed_trunc_f32x4_s` | 0x101 | 0xa5 |
| `i32x4.relaxed_trunc_f32x4_u` | 0x102 | 0xa6 |
| `i32x4.relaxed_trunc_f64x2_s_zero` | 0x103 | 0xc5 |
| `i32x4.relaxed_trunc_f64x2_u_zero` | 0x104 | 0xc6 |
| `f32x4.relaxed_fma` | 0x105 | 0xaf |
| `f32x4.relaxed_fms` | 0x106 | 0xb0 |
| `f64x2.relaxed_fma` | 0x107 | 0xcf |
| `f64x2.relaxed_fms` | 0x108 | 0xd0 |
| `i8x16.relaxed_laneselect` | 0x109 | 0xb2 |
| `i16x8.relaxed_laneselect` | 0x10a | 0xb3 |
| `i32x4.relaxed_laneselect` | 0x10b | 0xd2 |
| `i64x2.relaxed_laneselect` | 0x10c | 0xd3 |
| `f32x4.relaxed_min` | 0x10d | 0xb4 |
| `f32x4.relaxed_max` | 0x10e | 0xe2 |
| `f64x2.relaxed_min` | 0x10f | 0xd4 |
| `f64x2.relaxed_max` | 0x110 | 0xee |
| `i16x8.relaxed_q15mulr_s` | 0x111 | unimplemented |
| `i16x8.dot_i8x16_i7x16_s` | 0x112 | unimplemented |
| `i32x4.dot_i8x16_i7x16_add_s` | 0x113 | unimplemented |
| `f32x4.relaxed_dot_bf16x8_add_f32x4` | 0x114 | unimplemented |
| Reserved | 0x115 - 0x12F | |

## References
