Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create a Vector Math library to allow SimdF32::sin and similar to work in core #109

Open
programmerjake opened this issue Apr 25, 2021 · 44 comments

Comments

@programmerjake
Copy link
Member

programmerjake commented Apr 25, 2021

WIP implementation: https://salsa.debian.org/Kazan-team/vector-math

Problem Statement:
Vector Math functions are not widely available and can’t be used from core due to needing to link to the external vector math library, therefore we are building a math library for Rust that can be used everywhere and doesn’t require external dependencies allowing it to be used in core.

This allows SimdF32::sin() and similar to then work everywhere SimdF32 or similar works (basically everywhere).
This includes WebAssembly (with or without the simd128 extension), Microcontrollers, etc.
Also, where actual SIMD instructions are available, it will be faster than just calling f32::sin a bunch of times.

We will want to extend LLVM to use our Vector Math library as the fallback implementation for LLVM's vector math intrinsics (e.g. llvm.sin.v8f32).
The reason we want to go through all the hassle of getting LLVM to generate calls to our library is then because LLVM can then generate the native sin instruction(s) where supported and call our library otherwise. Leaving it up to LLVM to decide is by far the best option since it can do const-propagation and other optimizations based on its knowledge of how sin ought to behave, and LLVM is the best spot to make target-specific decisions.
This means we can’t just use LLVM’s existing features for a vector math library but require it to learn how to generate calls to our Vector Math library when native instructions aren't available.
Example of adding libcall support to LLVM: https://reviews.llvm.org/D53927

Implementation:
https://salsa.debian.org/Kazan-team/vector-math

The Vector Math library is being written in #![no_std] Rust, along with an abstraction layer allowing the implemented functions to also be used in Kazan (a Vulkan GPU driver being developed for Libre-SOC, who is helping funding development, see bug on Libre-SOC's bugtracker).

The abstraction layer will have four implementations, the first three of which are currently implemented:

  1. A scalar math implementation for testing purposes.
  2. A implementation wrapping over core::simd.
  3. A demo implementation of generating compiler IR.
  4. An implemententation that generates Kazan's compiler IR, allowing integration with Kazan.

Kazan needs all the functions that Vulkan requires (basically all the sin, cos, tan, atan, sinpi, cospi, exp, log2, pow, etc. functions), if we work together on the same library we could also use it for Rust's std::simd, potentially saving a bit of work. Both Kazan and rustc share backends (LLVM and cranelift) and would like to support having vector math functions inlined into calling code, giving a potentially substantial performance boost.

Implemented functions so far:
f16/f32/f64:

  • abs, copy_sign, trunc, round_to_nearest_ties_to_even, floor, ceil
  • ilogb
  • sin_pi, cos_pi, sin_cos_pi, tan_pi
  • sqrt_fast

u8/u16/u32/u64/i8/i16/i32/i64:

  • count_leading_zeros
  • count_trailing_zeros
  • count_ones
@programmerjake programmerjake changed the title Potentially share Vector Math library with the Kazan GPU driver (related to Libre-SOC and SimpleV) Potentially share Vector Math library with the Kazan GPU driver (related to Libre-SOC and SimpleV; Funding available) Apr 26, 2021
@programmerjake
Copy link
Member Author

Funding available! Provisionally assigning EUR 2000, to be adjusted as needed.

@programmerjake
Copy link
Member Author

From the Libre-SOC bugtracker:

We'd need this library because a SimpleV-capable math library doesn't
exist, we need to build one. We can share the implementation costs with Rust's
std::simd (which also wants a vector math library without libc, libm, or OS
dependencies for use in Rust's core library). In particular, this means the
math library will have generic vector-length-independent implementations
written in Rust.

@calebzulawski
Copy link
Member

I'm unfamiliar with Kazan, but if both projects use LLVM perhaps most of the implementation work here should be handled in LLVM?

It seems to me like the most universal approach would be to provide a library as part of the LLVM project that implements these functions, similar to (or maybe even part of?) compiler-rt.

@programmerjake
Copy link
Member Author

We'd also want it to work with cranelift, so having it be part of LLVM may not be best.

@thomcc
Copy link
Member

thomcc commented Apr 27, 2021

For clarity: It would still be written in Rust?

@programmerjake
Copy link
Member Author

@thomcc That's the plan...Rust seems like the best language for core::simd.

@thomcc
Copy link
Member

thomcc commented Apr 27, 2021

Right, I just wasn't sure it met your other requirements.

@programmerjake
Copy link
Member Author

Kazan is also written in Rust.

there can either be a vector abstraction layer that compiles out to nothing in std::simd and compiles to code for generating IR in Kazan, or Kazan can just use the spir-v backend for rustc, or save the generated llvm & cranelift ir.

@programmerjake
Copy link
Member Author

A rough sketch of how I think we could have the vector math library's API that the vector math functions would be implemented using:

pub struct F16(u16); // to be replaced by built-in type when Rust gains support

/// reference used to build IR for Kazan; an empty type for `core::simd`
pub trait Context: Copy {
    // todo: add missing type conversions such as i32 -> f64 and i64 -> u8
    // scalar types
    type Bool: Bool + Make<Self, Prim = bool> + Select<Self::Bool>;
    type U8: UInt<Self::U32> + Compare<Bool = Self::Bool> + Make<Self, Prim = u8>;
    type I8: SInt<Self::U32> + Compare<Bool = Self::Bool> + Make<Self, Prim = i8>;
    type U16: UInt<Self::U32> + Compare<Bool = Self::Bool> + Make<Self, Prim = u16>;
    type I16: SInt<Self::U32> + Compare<Bool = Self::Bool> + Make<Self, Prim = i16>;
    type F16: Float + Compare<Bool = Self::Bool> + Make<Self, Prim = F16>;
    type U32: UInt<Self::U32> + Compare<Bool = Self::Bool> + Make<Self, Prim = u32>;
    type I32: SInt<Self::U32> + Compare<Bool = Self::Bool> + Make<Self, Prim = i32>;
    type F32: Float + Compare<Bool = Self::Bool> + Make<Self, Prim = f32>;
    type U64: UInt<Self::U32> + Compare<Bool = Self::Bool> + Make<Self, Prim = u64>;
    type I64: SInt<Self::U32> + Compare<Bool = Self::Bool> + Make<Self, Prim = i64>;
    type F64: Float + Compare<Bool = Self::Bool> + Make<Self, Prim = f64>;
    // Vector types
    type VecBool: From<Self::Bool> + Bool + Make<Self, Prim = bool> + Select<Self::VecBool>;
    type VecU8: From<Self::U8> + UInt<Self::VecU32> + Compare<Bool = Self::VecBool> + Make<Self, Prim = u8>;
    type VecI8: From<Self::I8> + SInt<Self::VecU32> + Compare<Bool = Self::VecBool> + Make<Self, Prim = i8>;
    type VecU16: From<Self::U16> + UInt<Self::VecU32> + Compare<Bool = Self::VecBool> + Make<Self, Prim = u16>;
    type VecI16: From<Self::I16> + SInt<Self::VecU32> + Compare<Bool = Self::VecBool> + Make<Self, Prim = i16>;
    type VecF16: From<Self::F16> + Float + Compare<Bool = Self::VecBool> + Make<Self, Prim = F16>;
    type VecU32: From<Self::U32> + UInt<Self::VecU32> + Compare<Bool = Self::VecBool> + Make<Self, Prim = u32>;
    type VecI32: From<Self::I32> + SInt<Self::VecU32> + Compare<Bool = Self::VecBool> + Make<Self, Prim = i32>;
    type VecF32: From<Self::F32> + Float + Compare<Bool = Self::VecBool> + Make<Self, Prim = f32>;
    type VecU64: From<Self::U64> + UInt<Self::VecU32> + Compare<Bool = Self::VecBool> + Make<Self, Prim = u64>;
    type VecI64: From<Self::I64> + SInt<Self::VecU32> + Compare<Bool = Self::VecBool> + Make<Self, Prim = i64>;
    type VecF64: From<Self::F64> + Float + Compare<Bool = Self::VecBool> + Make<Self, Prim = f64>;
    fn make<T: Make<Self>>(self, v: T::Prim) -> T {
        T::make(self, v)
    }
}

pub trait Make<Context>: Sized {
    type Prim;
    fn make(ctx: Context, v: Self::Prim) -> Self;
}

pub trait Number: Compare + Add<Output = Self> + Sub<Output = Self> + Mul<Output = Self> + Div<Output = Self> + Rem<Output = Self> + AddAssign + SubAssign + MulAssign + DivAssign + RemAssign {}

pub trait BitOps: Copy + And<Output = Self> + Or<Output = Self> + Xor<Output = Self> + Not<Output = Self> + AndAssign + OrAssign + XorAssign {}

pub trait Int<ShiftRhs>: Number + BitOps + Shl<ShiftRhs, Output = Self> + Shr<ShiftRhs, Output = Self> + ShlAssign<ShiftRhs> + ShrAssign<ShiftRhs> {}

pub trait UInt<ShiftRhs>: Int<ShiftRhs> {}

pub trait SInt<ShiftRhs>: Int<ShiftRhs> + Neg<Output = Self> {}

pub trait Float: Number + Neg<Output = Self> {
    fn abs(self) -> Self;
    fn trunc(self) -> Self;
    fn ceil(self) -> Self;
    fn floor(self) -> Self;
    fn round(self) -> Self;
    fn fma(self, a: Self, b: Self) -> Self;
    fn is_nan(self) -> Self::Bool;
    fn is_infinity(self) -> Self::Bool;
    fn is_finite(self) -> Self::Bool;
}

pub trait Bool: BitOps {}

pub trait Select<T>: Bool {
    fn select(self, true_v: T, false_v: T) -> T;
}

pub trait Compare: Copy {
    type Bool: Bool + Select<Self>;
    fn eq(self, rhs: Self) -> Self::Bool;
    fn ne(self, rhs: Self) -> Self::Bool;
    fn lt(self, rhs: Self) -> Self::Bool;
    fn gt(self, rhs: Self) -> Self::Bool;
    fn le(self, rhs: Self) -> Self::Bool;
    fn ge(self, rhs: Self) -> Self::Bool;
}

A math function would be implemented like so:

pub fn sincospi<C: Context>(ctx: C, mut v: C::VecF64) -> (C::VecF64, C::VecF64) {
    // todo handle non-finite
    v *= ctx.make(0.5);
    v -= v.floor();
    v *= ctx.make(4.0);
    let quadrant = v.floor().as_i64();
    v -= v.floor();
    // v now in range of 0 to 90 deg
    // use first few terms of taylor series of sin(x*pi/2) and cos(x*pi/2) -- needs adjusting for accuracy; numbers likely incorrect
    let v_sq = v * v;
    let s = ctx.make(-0.004681754135318687);
    let s = s * v_sq + ctx.make(0.07969262624616703);
    let s = s * v_sq + ctx.make(-0.6459640975062462);
    let s = s * v_sq + ctx.make(1.570796326794897);
    let s = s * v;
    // s is now sin(v * pi / 2)
    let c = ctx.make(-0.02086348076335296);
    let c = c * v_sq + ctx.make(0.253669507901048);
    let c = c * v_sq + ctx.make(-1.23370055013617);
    let c = c * v_sq + ctx.make(1.0);
    // c is now cos(v * pi / 2)
    let bit0 = (quadrant & ctx.make(1)).eq(ctx.make(1));
    let bit1 = (quadrant & ctx.make(2)).eq(ctx.make(2));
    let c_neg = bit0 ^ bit1;
    let s_neg = bit1;
    let swap = bit0;
    let abs_sin = swap.select(s, c);
    let abs_cos = swap.select(c, s);
    let cos = c_neg.select(-abs_cos, abs_cos);
    let sin = c_neg.select(-abs_sin, abs_sin);
    (sin, cos)
}

@programmerjake
Copy link
Member Author

Started an implementation at https://salsa.debian.org/Kazan-team/vector-math

@dvc94ch
Copy link

dvc94ch commented May 1, 2021

not sure about the exact requirements you have, but rust-gpu settled on glam.

@programmerjake
Copy link
Member Author

requirements are to implement all scalar functions required by Vulkan, vectorized (e.g. let x: f32 = f32::sin(y) would be vectorized into something like let x: f32x16 = f32x16::sin_with_mask(y, mask);), for f16, f32, and f64 in terms of other vector functions so they ultimately get translated to fadd, fsub, fmul, fdiv, fma, fsqrt, fp compares, and integer ops. The full list is given here (only the scalar fp ops need to be implemented):
https://www.khronos.org/registry/spir-v/specs/1.0/GLSL.std.450.html

additional requirements are to implement all fp functions desired by std::simd.

nice to have:
also implement (needed for OpenCL support):

  • *pi variants of trigonometric functions (e.g. sinpi), they will be much easier to implement correctly than the non-*pi functions since the non-*pi functions require complicated high-precision argument reduction (x mod 2*pi) to give the correct answer for large inputs.
  • log2, log10, log1p, exp2, exp10, expm1
  • cbrt

All functions must at least meet Vulkan accuracy requirements:
https://www.khronos.org/registry/vulkan/specs/1.2/html/chap36.html#spirvenv-precision-operation

@programmerjake
Copy link
Member Author

programmerjake commented May 2, 2021

not sure about the exact requirements you have, but rust-gpu settled on glam.

from what I can tell, the only functions glam has implemented that would meet the above requirements is exp and pow, except that those (or at least exp, didn't check pow's implementation) are implemented by just doing a series of scalar exp operations, defeating the point of having a faster-than-scalar math library. Also, glam only supports up to vec4, whereas Kazan needs generic vector lengths of up to 64.

Thanks anyway!

@programmerjake
Copy link
Member Author

Got all the vector traits wired up for use with scalars (where all vectors are length 1 for testing purposes or otherwise) and with generating demo compiler IR. Still need to wire up std::simd support.

https://salsa.debian.org/Kazan-team/vector-math/-/blob/7975aa9639f3a5a702b130a7cf992ffe71c86e2a/src/ir.rs#L1547
The following Rust:

fn f<Ctx: Context>(ctx: Ctx, a: Ctx::VecU8, b: Ctx::VecF32) -> Ctx::VecF64 {
    let a: Ctx::VecF32 = a.into();
    (a - (a + b - ctx.make(5f32)).floor()).to()
}

(the to() function is a replacement for the as keyword)

Generates the following demo IR:

function(in<arg_0>: vec<U8>, in<arg_1>: vec<F32>) -> vec<F64> {
    op_0: vec<F32> = Cast in<arg_0>
    op_1: vec<F32> = Add op_0, in<arg_1>
    op_2: vec<F32> = Sub op_1, splat(0x40A00000_f32)
    op_3: vec<F32> = Floor op_2
    op_4: vec<F32> = Sub op_0, op_3
    op_5: vec<F64> = Cast op_4
    Return op_5
}

Opinions on ease of use for writing functions like f?

@programmerjake
Copy link
Member Author

Implemented ilogb. Started implementing the std::simd bindings, only to discover that I forgot that boolean vectors need different types for different element sizes.

@programmerjake
Copy link
Member Author

I added bindings for std::simd:
https://salsa.debian.org/Kazan-team/vector-math/-/commit/d77893b257c86d7ddc81f3772b5e143ce768d291
Looking at the assembly generated for stdsimd::tests::do_ilogb_f32x4, it looks pretty reasonable, everything's inlined, however there is still some branching that llvm didn't optimize out:

assembly for stdsimd::tests::do_ilogb_f32x4
_ZN11vector_math7stdsimd5tests14do_ilogb_f32x417h16d8f5e11aac2a22E:
	.cfi_startproc
	subq	$24, %rsp
	.cfi_def_cfa_offset 32
	vbroadcastss	.LCPI82_0(%rip), %xmm1
	xorl	%eax, %eax
	vmovaps	%xmm1, (%rsp)
	.p2align	4, 0x90
.LBB82_1:
	cmpq	$16, %rax
	je	.LBB82_4
	cmpl	$32, (%rsp,%rax)
	leaq	4(%rax), %rax
	jb	.LBB82_1
	jmp	.LBB82_3
.LBB82_4:
	xorl	%eax, %eax
	vmovaps	%xmm1, (%rsp)
	.p2align	4, 0x90
.LBB82_5:
	cmpq	$16, %rax
	je	.LBB82_7
	cmpl	$32, (%rsp,%rax)
	leaq	4(%rax), %rax
	jb	.LBB82_5
	jmp	.LBB82_3
.LBB82_7:
	xorl	%eax, %eax
	vmovaps	%xmm1, (%rsp)
	.p2align	4, 0x90
.LBB82_8:
	cmpq	$16, %rax
	je	.LBB82_10
	cmpl	$32, (%rsp,%rax)
	leaq	4(%rax), %rax
	jb	.LBB82_8
.LBB82_3:
	leaq	.L__unnamed_116(%rip), %rdi
	leaq	.L__unnamed_117(%rip), %rdx
	movl	$30, %esi
	callq	*_ZN4core9panicking5panic17h3de4db67bd397eb3E@GOTPCREL(%rip)
	ud2
.LBB82_10:
	vbroadcastss	.LCPI82_1(%rip), %xmm1
	vbroadcastss	.LCPI82_3(%rip), %xmm5
	vbroadcastss	.LCPI82_4(%rip), %xmm6
	vbroadcastss	.LCPI82_5(%rip), %xmm7
	vpbroadcastd	.LCPI82_2(%rip), %xmm2
	vxorps	%xmm3, %xmm3, %xmm3
	vcmpunordps	%xmm3, %xmm0, %xmm4
	vmulps	%xmm1, %xmm0, %xmm1
	vblendvps	%xmm4, %xmm5, %xmm6, %xmm4
	vandps	%xmm6, %xmm0, %xmm6
	vpbroadcastd	.LCPI82_5(%rip), %xmm5
	vcmpltps	%xmm7, %xmm6, %xmm6
	vpsrld	$23, %xmm0, %xmm7
	vpsrld	$23, %xmm1, %xmm1
	vpand	%xmm2, %xmm1, %xmm1
	vpand	%xmm2, %xmm7, %xmm2
	vpbroadcastd	.LCPI82_6(%rip), %xmm7
	vpand	%xmm5, %xmm0, %xmm5
	vcmpeqps	%xmm3, %xmm0, %xmm0
	vpcmpeqd	%xmm3, %xmm5, %xmm5
	vpaddd	%xmm7, %xmm2, %xmm2
	vblendvps	%xmm6, %xmm2, %xmm4, %xmm2
	vpbroadcastd	.LCPI82_7(%rip), %xmm6
	vbroadcastss	.LCPI82_8(%rip), %xmm4
	vpaddd	%xmm6, %xmm1, %xmm1
	vblendvps	%xmm0, %xmm4, %xmm1, %xmm0
	vblendvps	%xmm5, %xmm0, %xmm2, %xmm0
	vmovaps	%xmm0, (%rdi)
	addq	$24, %rsp
	.cfi_def_cfa_offset 8
	retq

Compiled using:

cargo rustc --release --tests --features=ir,fma,stdsimd -- --emit=asm -C target-cpu=native

on a Ryzen 3900X on Ubuntu

@programmerjake
Copy link
Member Author

programmerjake commented May 10, 2021

I added implementations of sin_pi and cos_pi for f16 and f32, I tested all possible inputs, the functions are accurate to 2ULP though aren't guaranteed to give the correct sign for zeros.

Testing the f32 functions required writing a custom reference implementation of sin_cos_pi since (x * f64::consts::PI).sin_cos() isn't accurate enough when the exact mathematical value of either sin or cos is close to zero (e.g. sin_pi(1.0) == 0.0 but f64::consts::PI.sin() != 0.0 due to round-off error).

https://salsa.debian.org/Kazan-team/vector-math/-/blob/d79f43bed2398cbc4f6b75b8e55ee317289599a1/src/algorithms/trig_pi.rs#L180

@programmerjake programmerjake changed the title Potentially share Vector Math library with the Kazan GPU driver (related to Libre-SOC and SimpleV; Funding available) Create a Vector Math library to allow SimdF32::sin to work in core May 11, 2021
@programmerjake programmerjake changed the title Create a Vector Math library to allow SimdF32::sin to work in core Create a Vector Math library to allow SimdF32::sin and similar to work in core May 11, 2021
@programmerjake
Copy link
Member Author

I added sin_pi and cos_pi for f64, as well as adding abs, copy_sign, and trunc for all of f16/f32/f64.

@programmerjake
Copy link
Member Author

I added round_to_nearest_ties_to_even, ceil, floor, and tan_pi

@programmerjake
Copy link
Member Author

Added count_leading_zeros, count_trailing_zeros, and count_ones for all of u8/16/32/64 and i8/16/32/64

@calebzulawski
Copy link
Member

What do the counting functions have to do with the transcendental float functions? LLVM has ctlz etc that we should be using. Do we have any reason to believe that goes to libc?

@programmerjake
Copy link
Member Author

What do the counting functions have to do with the transcendental float functions? LLVM has ctlz etc that we should be using. Do we have any reason to believe that goes to libc?

Well, I'm hoping to implement more than just transcendental float functions, I'm aiming more for all the non-trivial library functions (more or less). Also, idk if cranelift has built-in support for bit counting functions. The bit counting functions are in the list in #14...

@bjorn3
Copy link
Member

bjorn3 commented May 18, 2021

Also, idk if cranelift has built-in support for bit counting functions.

Cranelift has the clz and ctz instructions. They don't seem to support vectors yet though. Wasm doesn't have SIMD bit counting functions yet: WebAssembly/simd#6

@andy-thomason
Copy link

Apologies for popping in as the new guy.

If you are interested. I'm writing a crate called doctor_syn as part of the extendr R language extension project.

This is a small computer algebra system that operates on Rust expressions and amongst other things generates
polynomial approximations for arbitrary expressions, such as sin or cos, or anything you can express as a series or integral.

The intention is to generate sets of trancendental and stats functions to variable precision.

The nice thing is that the compiler can generate functions "to order" in procedural macros. But you
could just run it to a fixed number of terms for this limited use case.

In games we often had "high throughput" and "low latency" variants.

@calebzulawski
Copy link
Member

calebzulawski commented May 18, 2021

I think we have two choices here--we can implement bit-counting functions directly in rust if we don't think any architectures have optimized SIMD implementations--but then that should be implemented directly in stdsimd. If there are specific instructions for them we should use the compiler codegen, which means cranelift would need to implement it for SIMD eventually. Either way, I don't think it belongs in a separate library, which just acts as a libc alternative.

@Artoria2e5
Copy link

I seem to recall from before that a vector math library called SLEEF was also seeking to be integrated into LLVM. How is vector-math going to be different from it?

@thomcc
Copy link
Member

thomcc commented May 20, 2021

Our requirements (primarily the part where we want rustc to be able to inline the functions) lead to it needing to be implemented in Rust, which is a big difference.

@programmerjake
Copy link
Member Author

Also, there's an abstraction layer allowing the vector math functions to generate compiler IR instead of doing the operations directly (not required for core::simd, but very useful for Kazan and other projects that implement compilers)

@andy-thomason
Copy link

Sorry to pop up again.

I'm trying to refine my best attempt at the trig functions which are very easy as they are periodic
and well behaved, ie. converge quickly as McLaurin series.

Here is a version for f32, but SIMD versions should be the same.

fn sin(x: f32) -> f32 {
    let x = x * (1.0 / (std::f32::consts::PI * 2.0));
    let x = x - x.floor() - 0.5;
    12.268859941019306_f32
        .mul_add(x * x, -41.216241051002875_f32)
        .mul_add(x * x, 76.58672703334098_f32)
        .mul_add(x * x, -81.59746095374902_f32)
        .mul_add(x * x, 41.34151143437585_f32)
        .mul_add(x * x, -6.283184525811273_f32)
        * x
}

This gives a short instruction sequence with moderate latency (All 4 cycles per instruction on Skylake).

example::sin:
        vmulss  xmm0, xmm0, dword ptr [rip + .LCPI0_0]
        vroundss        xmm1, xmm0, xmm0, 9
        vsubss  xmm0, xmm0, xmm1
        vaddss  xmm0, xmm0, dword ptr [rip + .LCPI0_1]
        vmulss  xmm1, xmm0, xmm0
        vmovss  xmm2, dword ptr [rip + .LCPI0_2]
        vfmadd213ss     xmm2, xmm1, dword ptr [rip + .LCPI0_3]
        vfmadd213ss     xmm2, xmm1, dword ptr [rip + .LCPI0_4]
        vfmadd213ss     xmm2, xmm1, dword ptr [rip + .LCPI0_5]
        vfmadd213ss     xmm2, xmm1, dword ptr [rip + .LCPI0_6]
        vfmadd213ss     xmm2, xmm1, dword ptr [rip + .LCPI0_7]
        vmulss  xmm0, xmm0, xmm2
        ret

ie. About 48 cycle latency with a 6 cycle throughput.

The method uses the Newton polynomial method I mentioned with special attention to
the endpoints.

The table-makers dilemma limits us to 2-3 ulp (measured from the maximum of +/- 1.0)
and adding more terms does not improve this - and in fact may make it worse.

The same kernel executed in f64, convered to f32 gives < 0.5 ulp for the full 0..PI*2 range.
Adding more terms to the f64 expansion will improve f32 ULP, but not f64 ulp which requires
some nasty tricks to solve the TMD.

I've been planning to write a paper on this and may get around to it in some future life.

Note that we can approximate the whole range in a single polynomial. Using sin-cos and
select shortens the range but reduces throughput.

It is expected that in a loop the compiler will put together 4 or more similar operations:

pub fn f(x: &mut [f32]) {
    for v in x {
        *v = sin(*v);
    }
}

Does in fact do this.

Let me know what you think. I can make some similar kernels for the other "standard" functions
if you are interested.

@programmerjake
Copy link
Member Author

Implemented sqrt_fast_f16/f32/f64 which are accurate to 2/3/2 ULPs respectively. No libm needed!

@andy-thomason
Copy link

I am working on sin, cos, exp, log.

I'm hoping we can do tan, sec, csc without divides.

For inverse trig, a good atan2 is a good place to start as all the rest can be derived
from this.

I can go through the doctor_syn code if you want to do some more.

@andy-thomason
Copy link

Could someone help me to add these implementations.
I am also very happy to support other doing this,

Where should they go? How do we test them formally etc.

Any suggestions welcome.

@andy-thomason
Copy link

I've added a PR #126 @programmerjake @calebzulawski

I've only added the sin(f32) function as a placeholder, but I can generate others using libmgen
in the doctor_syn crate (contributors welcome).

@programmerjake
Copy link
Member Author

Could someone help me to add these implementations.
I am also very happy to support other doing this,

Where should they go?

I'd advocate for them going in https://salsa.debian.org/Kazan-team/vector-math which I'm planning on merging into core kinda like compiler-builtins is. They should be built to use the abstraction layer in crate::traits::Context (best viewed using rustdocs), example:
https://salsa.debian.org/Kazan-team/vector-math/-/blob/a43a29ccd5824b9b6dcb45baa3ccb3f8c5ea72e1/src/algorithms/base.rs#L57

How do we test them formally etc.

For f16/f32 functions, it's feasible to test all possible inputs for unary functions; for f64, I'm just making-do with testing every Nth input bit-pattern.
https://salsa.debian.org/Kazan-team/vector-math/-/blob/a43a29ccd5824b9b6dcb45baa3ccb3f8c5ea72e1/src/algorithms/trig_pi.rs#L884

I have a dedicated CI runner for vector-math (shared between all Kazan and Libre-SOC projects), so I just run the tests for an hour.
https://salsa.debian.org/Kazan-team/vector-math/-/pipelines/257995/builds

@programmerjake
Copy link
Member Author

Started thread on llvm-dev: https://lists.llvm.org/pipermail/llvm-dev/2021-June/150965.html

@andy-thomason
Copy link

We now have most of the basic libm functions in scalar form.

https://github.com/extendr/doctor-syn/blob/main/tests/libm.rs

I need to add a scalar to simd transform, more accurate versions,
and some of the "pedantry" of the various standards as optional
extras.

Examples are preserving the input of sin(x) for low magnitude of x.
ie. sin(x) = x for small x including -0 and +0.

Different standards disagree on where the errors should occur
and others leave it quite open. The game industry, for example,
does not need small value corrections.

I have also considered generating LLVM code to generate the IR.

@LowLevelMahn
Copy link

is there a testsuite that compares the Rust Vector Math library against glibc/musl/Microsoft runtime versions?
for example in speed and accuracy, on multiple platforms?

@programmerjake
Copy link
Member Author

We now have most of the basic libm functions in scalar form.

https://github.com/extendr/doctor-syn/blob/main/tests/libm.rs

Sorry, some of those functions are horribly broken:
exp2(384.0) returns -Inf! it should never return a negative value, much less negative infinity! you can also get it to produce NaN and most any f32 value.

@programmerjake
Copy link
Member Author

is there a testsuite that compares the Rust Vector Math library against glibc/musl/Microsoft runtime versions?
for example in speed and accuracy, on multiple platforms?

not yet...there is a test suite that compares it's accuracy against the correct mathematical results (or as close as is easyish to get -- it tests f16/f32 against the f64 function from std rounded to f16/f32, the f64 versions are tested against the rug bindings to MPFR, which is known to produce correct results). The functions are documented how accurate they are -- if they don't have documentation they should be 100% accurate (such as copy_sign), assuming I didn't miss anything.

@mike-barber
Copy link

On the accuracy front, do we have any plans to support different levels of accuracy? I'm aware that users may have different requirements depending on their domain -- e.g. scientific/engineering vs machine learning.

Having different implementations available could be quite useful, e.g.

  • accurate
  • fast
  • really fast and dirty

For context: I'm relatively inexperienced with floating point approximations, but I have previously done some hacking here and here on a fast exp implementation useful for imprecise things like calculating a softmax quickly. This example is probably somewhere in the fast or really fast and dirty region, and I wouldn't suggest re-using it as corners have been cut :)

I believe the approach, like Andy's one, is scalable in terms of time/accuracy by including more coefficients.

@andy-thomason
Copy link

andy-thomason commented Jun 13, 2021 via email

@programmerjake
Copy link
Member Author

programmerjake commented Jun 13, 2021

An example is cos and sin which use a lookup table in POSIX impls to handle four quadrants. This carries a very high cost to Get one more bit of accuracy, possibly four times slower.

The sin_pi and related functions I have implemented so far use two polynomial evaluations, then use select to get the right sign and to switch between sin and cos. On CPUs with a pipelined FMA or pipelined mul and add units (basically everything modern other than microcontrollers), the polynomials can run in just 2-3 more cycles (not tested, but I am a cpu designer for Libre-SOC) than a single polynomial evaluation, since the FMAs run interleaved with each other.

@andy-thomason
Copy link

You are right, this is a good way to get an accurate cos and sin and indeed has lower latency.
If you are calling a function occasionally, you want to optimise for latency over throughput.

If you are in a loop that has been unrolled, each of the operations will be done four or more times
by the compiler to fill in the gaps between the instructions (latency/cpi) increasing throughput.
For this case, having fewer instructions is best and we may be willing to sacrifice the last two bits.

I wouldn't want to call what would be preferred. Even the modulus and scaling steps would be considered
optional if you stored Euler angles as small integers, for example.

In practice, making a super fast function will be rendered useless by the memory access speed which
will always slow you down in the loop case and the I-cache performance in the "single" use case. Big
game engines are nearly always I-cache bound, I discovered at Sony. We spent a lot of time making
the instructions shorter, but LLVM does not do this because all their tests are small loops.

@andy-thomason
Copy link

I was inspired by @programmerjake to try the quadrant version of sin/cos

The conditional negation of the cos and sin results is a bit hacky and could
be replaced by val ^ ((x & 1) << 31)

As a result of the shorter range, we can also reduce the number of terms
and so it may be as fast as the "fast" version anyway!

The godbolt output looks not too bad, although LLVM is not scheduling the result optimally.

I also need to add a simdifying transform to the functions.

fn sin(x: f32) -> f32 {
    let x = x * (1.0 / (std::f32::consts::PI));
    let xh = x + 0.5;
    let xr = x.round();
    let xhr = xh.round();
    let s = x - xr;
    let c = xh - xhr;
    let sr = (-0.5878112252588481_f32)
        .mul_add(s * s, 2.5496484487855455_f32)
        .mul_add(s * s, -5.167705042516404_f32)
        .mul_add(s * s, 3.1415926342503133_f32)
        * s;
    let cr = (0.23132925050778008_f32)
        .mul_add(c * c, -1.335050718961723_f32)
        .mul_add(c * c, 4.0587078433143_f32)
        .mul_add(c * c, -4.934802176219238_f32)
        .mul_add(c * c, 0.9999999999999999_f32);
    let ss = if (xr as i32) & 1 == 0 { sr } else { -sr };
    let cs = if (xhr as i32 & 1) == 0 { -cr } else { cr };
    if s.abs() <= 0.25 {
        ss
    } else {
        cs
    }
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants