
Adapt to upstream changes wrt. native support for BFloat16 #51

Merged: 7 commits into JuliaMath:master from codegen, Nov 2, 2023

Conversation

maleadt (Member) commented Oct 4, 2023

This PR adapts BFloat16s.jl to JuliaLang/julia#51470, where I'm adding native support for BFloat16s to Julia (using the bfloat type in LLVM). I decided to keep as much functionality as possible in this package, so Base only defines Core.BFloat16 and the necessary codegen support.
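As a rough illustration of that split (a sketch only, not necessarily the exact code in this PR), the package can simply alias the native type when it exists and keep the old emulated definition as a fallback:

@static if isdefined(Core, :BFloat16)
    # Julia 1.11+: reuse the native type, which has codegen support
    const BFloat16 = Core.BFloat16
else
    # older Julia: keep the emulated 16-bit primitive type
    primitive type BFloat16 <: AbstractFloat 16 end
end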

The main benefit of this change is that we now emit drastically simpler IR, and rely on LLVM to lower it to something that the hardware supports. For example:

julia> test(x::T) where T = T(2) * x + one(T)

julia> test(BFloat16(1))
BFloat16(3.0)

Before this PR:

julia> @code_llvm debuginfo=:none test(BFloat16(1))
define i16 @julia_test_556(i16 zeroext %0) #0 {
top:
  %1 = zext i16 %0 to i32
  %2 = shl nuw i32 %1, 16
  %bitcast_coercion = bitcast i32 %2 to float
  %3 = fmul float %bitcast_coercion, 2.000000e+00
  %4 = fcmp ord float %3, 0.000000e+00
  br i1 %4, label %L38, label %L75

L38:                                              ; preds = %top
  %bitcast_coercion4 = bitcast float %3 to i32
  %5 = lshr i32 %bitcast_coercion4, 16
  %.op5 = and i32 %5, 1
  %6 = add i32 %bitcast_coercion4, 32767
  %7 = add i32 %6, %.op5
  %8 = and i32 %7, -65536
  %phi.cast = bitcast i32 %8 to float
  %phi.bo = fadd float %phi.cast, 1.000000e+00
  %9 = fcmp ord float %phi.bo, 0.000000e+00
  br i1 %9, label %L54, label %L75

L54:                                              ; preds = %L38
  %bitcast_coercion3 = bitcast float %phi.bo to i32
  %10 = lshr i32 %bitcast_coercion3, 16
  %11 = and i32 %10, 1
  %narrow = add nuw nsw i32 %11, 32767
  %12 = zext i32 %narrow to i64
  %13 = zext i32 %bitcast_coercion3 to i64
  %14 = add nuw nsw i64 %12, %13
  %15 = lshr i64 %14, 16
  %16 = trunc i64 %15 to i16
  br label %L75

L75:                                              ; preds = %L54, %L38, %top
  %value_phi2 = phi i16 [ %16, %L54 ], [ 32704, %L38 ], [ 32704, %top ]
  ret i16 %value_phi2
}

julia> @code_native debuginfo=:none test(BFloat16(1))
julia_test_560:                         # @julia_test_560
# %bb.0:                                # %top
	shl	edi, 16
	mov	ax, 32704
	vmovd	xmm0, edi
	vaddss	xmm0, xmm0, xmm0
	vucomiss	xmm0, xmm0
	jp	.LBB0_3
# %bb.1:                                # %L38
	push	rbp
	vmovd	ecx, xmm0
	movabs	rdx, offset .LCPI0_0
	mov	rbp, rsp
	bt	ecx, 16
	adc	ecx, 32767
	and	ecx, -65536
	vmovd	xmm0, ecx
	vaddss	xmm0, xmm0, dword ptr [rdx]
	vucomiss	xmm0, xmm0
	pop	rbp
	jp	.LBB0_3
# %bb.2:                                # %L54
	vmovd	eax, xmm0
	bt	eax, 16
	adc	eax, 32767
	shr	eax, 16
.LBB0_3:                                # %L75
                                        # kill: def $ax killed $ax killed $eax
	ret

Using this PR, on JuliaLang/julia#51470:

julia> @code_llvm debuginfo=:none test(BFloat16(1))
; Function Signature: test(Core.BFloat16)
define bfloat @julia_test_2911(bfloat %"x::BFloat16") #0 {
top:
  %0 = fpext bfloat %"x::BFloat16" to float
  %1 = fmul float %0, 2.000000e+00
  %2 = fptrunc float %1 to bfloat
  %3 = fpext bfloat %2 to float
  %4 = fadd float %3, 1.000000e+00
  %5 = fptrunc float %4 to bfloat
  ret bfloat %5
}

julia> @code_native debuginfo=:none test(BFloat16(1))
julia_test_3001:                        # @julia_test_3001
; Function Signature: test(Core.BFloat16)
# %bb.0:                                # %top
	#DEBUG_VALUE: test:x <- $xmm0
	push	rbp
	mov	rbp, rsp
	push	rbx
	sub	rsp, 8
	vmovd	eax, xmm0
	movabs	rbx, offset __truncsfbf2
	shl	eax, 16
	vmovd	xmm0, eax
	vaddss	xmm0, xmm0, xmm0
	call	rbx
	vmovd	eax, xmm0
	movabs	rcx, offset .LCPI0_0
	shl	eax, 16
	vmovd	xmm0, eax
	vaddss	xmm0, xmm0, dword ptr [rcx]
	call	rbx
	add	rsp, 8
	pop	rbx
	pop	rbp
	ret

So the LLVM IR is much simpler, while the native code is (as expected) similar in complexity.

Performance is hard to compare for such simple operations, but representing BFloat16s natively should make it possible for LLVM to optimize them, and also select better instructions when possible. For example, with a CPU supporting AVX512BF16 and LLVM 17, we compile:

define <16 x bfloat> @trunc(<16 x float>) {
    %2 = fptrunc <16 x float> %0 to <16 x bfloat>
    ret <16 x bfloat> %2
}

to:

trunc:                                  # @trunc
        vcvtneps2bf16   ymm0, zmm0
        ret

So this will make it possible to use BFloat16s.jl with our vectorization packages (by using NTuple{16,Core.VecElement{BFloat16}}, which now lowers to <16 x bfloat>).
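For illustration, here is a hedged sketch of how such a vector truncation could be written from Julia with Base.llvmcall, mirroring the IR above. The names F32x16, BF16x16, and vtrunc are made up for this example, and it assumes Julia 1.11+ with Core.BFloat16 plus an LLVM build that can lower bfloat vectors for the target:

using BFloat16s

const F32x16  = NTuple{16, Core.VecElement{Float32}}
const BF16x16 = NTuple{16, Core.VecElement{BFloat16}}   # lowers to <16 x bfloat>

# 16-wide Float32 -> BFloat16 truncation; on AVX512BF16 hardware LLVM can
# select this to a single vcvtneps2bf16, as shown above
@inline vtrunc(x::F32x16) = Base.llvmcall(
    """
    %r = fptrunc <16 x float> %0 to <16 x bfloat>
    ret <16 x bfloat> %r
    """,
    BF16x16, Tuple{F32x16}, x)

In practice a vectorization package would generate such calls itself; the point here is only that the bfloat vector type is now expressible from Julia.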


This PR also switches the significand implementation, as the old one contained undefined behavior (for one(BFloat16), isig was Int16(0)). The new implementation is copied from Base.
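For reference, here is a minimal sketch of what the Base-style significand looks like when adapted by hand to the BFloat16 layout (1 sign, 8 exponent, 7 fraction bits); the masks and the helper name are spelled out for illustration and need not match the merged code exactly:

using BFloat16s

const SIGN_MASK = 0x8000
const EXP_MASK  = 0x7f80
const EXP_ONE   = 0x3f80   # bit pattern of one(BFloat16): the exponent of 1.0

function significand_sketch(x::BFloat16)
    xu = reinterpret(UInt16, x)
    xs = xu & ~SIGN_MASK
    xs >= EXP_MASK && return x                 # NaN or Inf: return as-is
    if xs <= 0x007f                            # subnormal or zero
        xs == 0x0000 && return x               # ±0
        m = leading_zeros(xs) - 8              # shift needed to normalize the fraction
        xu = (xs << m) | (xu & SIGN_MASK)
    end
    return reinterpret(BFloat16, (xu & ~EXP_MASK) | EXP_ONE)
end

Here one(BFloat16) simply goes through the normal branch and returns 1.0.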

Closes #51

codecov bot commented Oct 4, 2023

Codecov Report

Attention: 54 lines in your changes are missing coverage. Please review.

Comparison is base (a42c4fa) 65.41% compared to head (75a6d23) 22.22%.

❗ Current head 75a6d23 differs from pull request most recent head ca97442. Consider uploading reports for the commit ca97442 to get more accurate results.

Additional details and impacted files
@@             Coverage Diff             @@
##           master      #51       +/-   ##
===========================================
- Coverage   65.41%   22.22%   -43.20%     
===========================================
  Files           3        3               
  Lines         133      171       +38     
===========================================
- Hits           87       38       -49     
- Misses         46      133       +87     
Files Coverage Δ
src/bfloat16.jl 23.56% <23.94%> (-48.71%) ⬇️


maleadt (Member, Author) commented Oct 6, 2023

Interestingly, even though bfloat was added to LLVM 11 specifically to support ARM intrinsics, storage-level support on AArch64 is really limited. For example, merely synthesizing a constant results in a selection error:

target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"
target triple = "aarch64-linux-none"

define bfloat @julia_BFloat16_2304() {
top:
  ret bfloat 0xR0000
}
LLVM ERROR: Cannot select: 0x562986b35f10: bf16 = ConstantFP<APFloat(0)>

This is fixed on LLVM 17, but aarch64 still lacks arithmetic-level support there:

; ModuleID = 'f'
source_filename = "f"
target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"
target triple = "aarch64-none-eabi"

define float @julia_f_572(bfloat %"x::BFloat16") {
top:
  %0 = fpext bfloat %"x::BFloat16" to float
  ret float %0
}
LLVM ERROR: Cannot select: 0x558b0b8a7ac0: f32 = fp_extend 0x558b0b8a7a50
  0x558b0b8a7a50: bf16,ch = CopyFromReg 0x558b0b82d970, Register:bf16 %0
    0x558b0b8a79e0: bf16 = Register %0

On x86, both of these work on LLVM 15+, which is the lower bound for this feature (as we only added Core.BFloat16 to Julia 1.11).

cc @vchuravy

maleadt force-pushed the codegen branch 4 times, most recently from afb5922 to 35c4799, on October 10, 2023 at 13:26
maleadt (Member, Author) commented Oct 10, 2023

Well, this is weird. I cannot reproduce the CI failure on any system of mine. I thought it was ABI-related, but it looks like LLVM somehow materializes a wrong constant here. BFloat16(1f0) is 0x3f80, and when we ask LLVM to emit IR that truncates 1f0 to BFloat16 (which gets constant-propagated by the IRBuilder) we do get that value, but not when Julia does the truncation:

bfloat 0xR3F80

vs

ret bfloat 0xR3C00

maleadt force-pushed the codegen branch 2 times, most recently from 07a1dec to d2f8bbd, on October 10, 2023 at 13:59
maleadt (Member, Author) commented Oct 10, 2023

Alright, found something that reproduces locally:

julia> f() = Core.Intrinsics.fptrunc(Core.BFloat16, 1f0)
f (generic function with 1 method)

julia> f()
Core.BFloat16(0x3c00)

julia> fptrunc(x) = Core.Intrinsics.fptrunc(Core.BFloat16, x)
fptrunc (generic function with 1 method)

julia> h() = fptrunc(1f0)
h (generic function with 1 method)

julia> h()
Core.BFloat16(0x3f80)

maleadt marked this pull request as ready for review on October 26, 2023 at 14:34
maleadt force-pushed the codegen branch 3 times, most recently from 75a6d23 to ca97442, on November 2, 2023 at 07:29
maleadt merged commit 730511b into JuliaMath:master on Nov 2, 2023
48 of 54 checks passed
maleadt deleted the codegen branch on November 2, 2023 at 08:39