Create a Vector Math library to allow SimdF32::sin
and similar to work in core
#109
Comments
Funding available! Provisionally assigning EUR 2000, to be adjusted as needed. |
From the Libre-SOC bugtracker:
|
I'm unfamiliar with Kazan, but if both projects use LLVM perhaps most of the implementation work here should be handled in LLVM? It seems to me like the most universal approach would be to provide a library as part of the LLVM project that implements these functions, similar to (or maybe even part of?) compiler-rt. |
We'd also want it to work with cranelift, so having it be part of LLVM may not be best. |
For clarity: It would still be written in Rust? |
@thomcc That's the plan...Rust seems like the best language for |
Right, I just wasn't sure it met your other requirements. |
Kazan is also written in Rust. there can either be a vector abstraction layer that compiles out to nothing in |
A rough sketch of the API I think the vector math library could expose, on top of which the vector math functions would be implemented:
pub struct F16(u16); // to be replaced by built-in type when Rust gains support
/// reference used to build IR for Kazan; an empty type for `core::simd`
pub trait Context: Copy {
    // todo: add missing type conversions such as i32 -> f64 and i64 -> u8
    // scalar types
    type Bool: Bool + Make<Self, Prim = bool> + Select<Self::Bool>;
    type U8: UInt<Self::U32> + Compare<Bool = Self::Bool> + Make<Self, Prim = u8>;
    type I8: SInt<Self::U32> + Compare<Bool = Self::Bool> + Make<Self, Prim = i8>;
    type U16: UInt<Self::U32> + Compare<Bool = Self::Bool> + Make<Self, Prim = u16>;
    type I16: SInt<Self::U32> + Compare<Bool = Self::Bool> + Make<Self, Prim = i16>;
    type F16: Float + Compare<Bool = Self::Bool> + Make<Self, Prim = F16>;
    type U32: UInt<Self::U32> + Compare<Bool = Self::Bool> + Make<Self, Prim = u32>;
    type I32: SInt<Self::U32> + Compare<Bool = Self::Bool> + Make<Self, Prim = i32>;
    type F32: Float + Compare<Bool = Self::Bool> + Make<Self, Prim = f32>;
    type U64: UInt<Self::U32> + Compare<Bool = Self::Bool> + Make<Self, Prim = u64>;
    type I64: SInt<Self::U32> + Compare<Bool = Self::Bool> + Make<Self, Prim = i64>;
    type F64: Float + Compare<Bool = Self::Bool> + Make<Self, Prim = f64>;
    // Vector types
    type VecBool: From<Self::Bool> + Bool + Make<Self, Prim = bool> + Select<Self::VecBool>;
    type VecU8: From<Self::U8> + UInt<Self::VecU32> + Compare<Bool = Self::VecBool> + Make<Self, Prim = u8>;
    type VecI8: From<Self::I8> + SInt<Self::VecU32> + Compare<Bool = Self::VecBool> + Make<Self, Prim = i8>;
    type VecU16: From<Self::U16> + UInt<Self::VecU32> + Compare<Bool = Self::VecBool> + Make<Self, Prim = u16>;
    type VecI16: From<Self::I16> + SInt<Self::VecU32> + Compare<Bool = Self::VecBool> + Make<Self, Prim = i16>;
    type VecF16: From<Self::F16> + Float + Compare<Bool = Self::VecBool> + Make<Self, Prim = F16>;
    type VecU32: From<Self::U32> + UInt<Self::VecU32> + Compare<Bool = Self::VecBool> + Make<Self, Prim = u32>;
    type VecI32: From<Self::I32> + SInt<Self::VecU32> + Compare<Bool = Self::VecBool> + Make<Self, Prim = i32>;
    type VecF32: From<Self::F32> + Float + Compare<Bool = Self::VecBool> + Make<Self, Prim = f32>;
    type VecU64: From<Self::U64> + UInt<Self::VecU32> + Compare<Bool = Self::VecBool> + Make<Self, Prim = u64>;
    type VecI64: From<Self::I64> + SInt<Self::VecU32> + Compare<Bool = Self::VecBool> + Make<Self, Prim = i64>;
    type VecF64: From<Self::F64> + Float + Compare<Bool = Self::VecBool> + Make<Self, Prim = f64>;
    fn make<T: Make<Self>>(self, v: T::Prim) -> T {
        T::make(self, v)
    }
}
pub trait Make<Context>: Sized {
    type Prim;
    fn make(ctx: Context, v: Self::Prim) -> Self;
}
pub trait Number: Compare + Add<Output = Self> + Sub<Output = Self> + Mul<Output = Self> + Div<Output = Self> + Rem<Output = Self> + AddAssign + SubAssign + MulAssign + DivAssign + RemAssign {}
pub trait BitOps: Copy + And<Output = Self> + Or<Output = Self> + Xor<Output = Self> + Not<Output = Self> + AndAssign + OrAssign + XorAssign {}
pub trait Int<ShiftRhs>: Number + BitOps + Shl<ShiftRhs, Output = Self> + Shr<ShiftRhs, Output = Self> + ShlAssign<ShiftRhs> + ShrAssign<ShiftRhs> {}
pub trait UInt<ShiftRhs>: Int<ShiftRhs> {}
pub trait SInt<ShiftRhs>: Int<ShiftRhs> + Neg<Output = Self> {}
pub trait Float: Number + Neg<Output = Self> {
    fn abs(self) -> Self;
    fn trunc(self) -> Self;
    fn ceil(self) -> Self;
    fn floor(self) -> Self;
    fn round(self) -> Self;
    fn fma(self, a: Self, b: Self) -> Self;
    fn is_nan(self) -> Self::Bool;
    fn is_infinity(self) -> Self::Bool;
    fn is_finite(self) -> Self::Bool;
}
pub trait Bool: BitOps {}
pub trait Select<T>: Bool {
    fn select(self, true_v: T, false_v: T) -> T;
}
pub trait Compare: Copy {
    type Bool: Bool + Select<Self>;
    fn eq(self, rhs: Self) -> Self::Bool;
    fn ne(self, rhs: Self) -> Self::Bool;
    fn lt(self, rhs: Self) -> Self::Bool;
    fn gt(self, rhs: Self) -> Self::Bool;
    fn le(self, rhs: Self) -> Self::Bool;
    fn ge(self, rhs: Self) -> Self::Bool;
}
A math function would be implemented like so:
pub fn sincospi<C: Context>(ctx: C, mut v: C::VecF64) -> (C::VecF64, C::VecF64) {
    // todo handle non-finite
    v *= ctx.make(0.5);
    v -= v.floor();
    v *= ctx.make(4.0);
    let quadrant = v.floor().as_i64();
    v -= v.floor();
    // v now in range of 0 to 90 deg
    // use first few terms of taylor series of sin(x*pi/2) and cos(x*pi/2) -- needs adjusting for accuracy; numbers likely incorrect
    let v_sq = v * v;
    let s = ctx.make(-0.004681754135318687);
    let s = s * v_sq + ctx.make(0.07969262624616703);
    let s = s * v_sq + ctx.make(-0.6459640975062462);
    let s = s * v_sq + ctx.make(1.570796326794897);
    let s = s * v;
    // s is now sin(v * pi / 2)
    let c = ctx.make(-0.02086348076335296);
    let c = c * v_sq + ctx.make(0.253669507901048);
    let c = c * v_sq + ctx.make(-1.23370055013617);
    let c = c * v_sq + ctx.make(1.0);
    // c is now cos(v * pi / 2)
    let bit0 = (quadrant & ctx.make(1)).eq(ctx.make(1));
    let bit1 = (quadrant & ctx.make(2)).eq(ctx.make(2));
    let c_neg = bit0 ^ bit1;
    let s_neg = bit1;
    let swap = bit0;
    // in odd quadrants sin and cos trade places: sin(x + pi/2) = cos(x), cos(x + pi/2) = -sin(x)
    let abs_sin = swap.select(c, s);
    let abs_cos = swap.select(s, c);
    let cos = c_neg.select(-abs_cos, abs_cos);
    // sin is negative in quadrants 2 and 3, so it takes s_neg rather than c_neg
    let sin = s_neg.select(-abs_sin, abs_sin);
    (sin, cos)
} |
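For readers who want to try the quadrant reduction above without the Context abstraction, here is a minimal scalar sketch on plain f64. It reuses the same truncated Taylor coefficients, so it is only good to roughly 1e-3 absolute error and does not handle non-finite inputs. The function name sin_cos_pi and the choice of plain f64 are assumptions for illustration; this is not the vector-math implementation.

```rust
/// Sketch: returns (sin(pi * x), cos(pi * x)) for finite `x`.
/// Low accuracy (about 1e-3 absolute error): illustration only.
pub fn sin_cos_pi(x: f64) -> (f64, f64) {
    // Reduce to one period, then count quarter turns: t ends up in [0, 4].
    let mut t = x * 0.5;
    t -= t.floor();
    t *= 4.0;
    let quadrant = t.floor() as i64;
    let v = t - t.floor(); // v in [0, 1), i.e. 0 to 90 degrees

    let v_sq = v * v;
    // Truncated Taylor series of sin(v * pi / 2).
    let s = -0.004681754135318687;
    let s = s * v_sq + 0.07969262624616703;
    let s = s * v_sq - 0.6459640975062462;
    let s = s * v_sq + 1.570796326794897;
    let s = s * v;
    // Truncated Taylor series of cos(v * pi / 2).
    let c = -0.02086348076335296;
    let c = c * v_sq + 0.253669507901048;
    let c = c * v_sq - 1.23370055013617;
    let c = c * v_sq + 1.0;

    let bit0 = quadrant & 1 == 1;
    let bit1 = quadrant & 2 == 2;
    // Odd quadrants swap the roles of the two polynomials.
    let (abs_sin, abs_cos) = if bit0 { (c, s) } else { (s, c) };
    // cos is negative in quadrants 1 and 2, sin in quadrants 2 and 3.
    let cos = if bit0 ^ bit1 { -abs_cos } else { abs_cos };
    let sin = if bit1 { -abs_sin } else { abs_sin };
    (sin, cos)
}
```

Only the low two bits of the quadrant are inspected, so a reduced value that rounds up to exactly 4.0 is treated as quadrant 0.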
Started an implementation at https://salsa.debian.org/Kazan-team/vector-math |
not sure about the exact requirements you have, but rust-gpu settled on glam. |
requirements are to implement all scalar functions required by Vulkan, vectorized (e.g. additional requirements are to implement all fp functions desired by nice to have:
All functions must at least meet Vulkan accuracy requirements: |
from what I can tell, the only functions glam has implemented that would meet the above requirements are exp and pow, except that those (or at least exp, didn't check pow's implementation) are implemented by just doing a series of scalar exp operations, defeating the point of having a faster-than-scalar math library. Also, glam only supports up to vec4, whereas Kazan needs generic vector lengths of up to 64. Thanks anyway! |
Got all the vector traits wired up for use with scalars (where all vectors are length 1 for testing purposes or otherwise) and with generating demo compiler IR. Still need to wire up https://salsa.debian.org/Kazan-team/vector-math/-/blob/7975aa9639f3a5a702b130a7cf992ffe71c86e2a/src/ir.rs#L1547 fn f<Ctx: Context>(ctx: Ctx, a: Ctx::VecU8, b: Ctx::VecF32) -> Ctx::VecF64 {
let a: Ctx::VecF32 = a.into();
(a - (a + b - ctx.make(5f32)).floor()).to()
} (the Generates the following demo IR:
Opinions on ease of use for writing functions like |
Implemented |
I added bindings for assembly for
|
I added implementations of Testing the |
I added sin_pi and cos_pi for f64, as well as adding abs, copy_sign, and trunc for all of f16/f32/f64. |
I added round_to_nearest_ties_to_even, ceil, floor, and tan_pi |
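As an illustration of what round_to_nearest_ties_to_even computes, one common branch-light technique is to add and then subtract 2^23, which forces the hardware's default rounding mode to do the work. The sketch below is a scalar f32 version of that trick; it is an assumption for illustration, not necessarily how vector-math implements the function, and the name round_ties_even_f32 is hypothetical. It relies on strict IEEE 754 single-precision arithmetic with the default rounding mode.

```rust
/// Sketch: rounds to the nearest integer, with ties going to the even integer.
/// Assumes strict IEEE 754 f32 arithmetic in the default rounding mode.
pub fn round_ties_even_f32(x: f32) -> f32 {
    const MAGIC: f32 = 8_388_608.0; // 2^23: from here on every f32 is an integer
    let a = x.abs();
    // Large values are already integral; NaN and infinity pass through.
    if !(a < MAGIC) {
        return x;
    }
    // a + 2^23 has no fraction bits left, so the addition itself rounds
    // to nearest-even; subtracting 2^23 again is exact.
    let r = (a + MAGIC) - MAGIC;
    // Restore the sign, which also keeps -0.0 for small negative inputs.
    r.copysign(x)
}
```

In a vectorized version the early return would become a lane-wise select on the same comparison.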
Added count_leading_zeros, count_trailing_zeros, and count_ones for all of u8/16/32/64 and i8/16/32/64 |
What do the counting functions have to do with the transcendental float functions? LLVM has ctlz etc that we should be using. Do we have any reason to believe that goes to libc? |
Well, I'm hoping to implement more than just transcendental float functions, I'm aiming more for all the non-trivial library functions (more or less). Also, idk if cranelift has built-in support for bit counting functions. The bit counting functions are in the list in #14... |
Cranelift has the clz and ctz instructions. They don't seem to support vectors yet though. Wasm doesn't have SIMD bit counting functions yet: WebAssembly/simd#6 |
Apologies for popping in as the new guy. If you are interested. I'm writing a crate called This is a small computer algebra system that operates on Rust expressions and amongst other things generates The intention is to generate sets of transcendental and stats functions to variable precision. The nice thing is that the compiler can generate functions "to order" in procedural macros. But you In games we often had "high throughput" and "low latency" variants. |
I think we have two choices here--we can implement bit-counting functions directly in rust if we don't think any architectures have optimized SIMD implementations--but then that should be implemented directly in stdsimd. If there are specific instructions for them we should use the compiler codegen, which means cranelift would need to implement it for SIMD eventually. Either way, I don't think it belongs in a separate library, which just acts as a libc alternative. |
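To make the "implement bit-counting directly in Rust" option concrete, here is a sketch of the classic branch-free (SWAR) formulations for u32. Because they use only shifts, masks, adds, and one multiply, the same sequence could be applied lane-wise to a vector. The function names are hypothetical, and fixed-size arrays stand in for SIMD vectors; none of this is taken from stdsimd or vector-math.

```rust
/// Sketch: population count using only shifts, masks, adds and one multiply.
pub fn count_ones_u32(mut v: u32) -> u32 {
    v = v - ((v >> 1) & 0x5555_5555); // 2-bit partial sums
    v = (v & 0x3333_3333) + ((v >> 2) & 0x3333_3333); // 4-bit partial sums
    v = (v + (v >> 4)) & 0x0F0F_0F0F; // 8-bit partial sums
    v.wrapping_mul(0x0101_0101) >> 24 // add the four bytes into the top byte
}

/// Sketch: smear the highest set bit downwards, then count the zeros left above it.
pub fn count_leading_zeros_u32(mut v: u32) -> u32 {
    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;
    count_ones_u32(!v)
}

/// Sketch: isolate the lowest set bit; the bits below it are the trailing zeros.
pub fn count_trailing_zeros_u32(v: u32) -> u32 {
    count_ones_u32((v & v.wrapping_neg()).wrapping_sub(1))
}

/// Lane-wise application, with an array standing in for a SIMD vector.
pub fn count_ones_lanes<const N: usize>(v: [u32; N]) -> [u32; N] {
    v.map(count_ones_u32)
}
```

Where a target has native vector bit-counting instructions, compiler codegen would still be the better route, as the comment above argues.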
I seem to recall from before that a vector math library called SLEEF was also seeking to be integrated into LLVM. How is |
Our requirements (primarily the part where we want rustc to be able to inline the functions) lead to it needing to be implemented in Rust, which is a big difference. |
Also, there's an abstraction layer allowing the vector math functions to generate compiler IR instead of doing the operations directly (not required for |
Sorry to pop up again. I'm trying to refine my best attempt at the trig functions which are very easy as they are periodic. Here is a version for f32, but SIMD versions should be the same.
This gives a short instruction sequence with moderate latency (All 4 cycles per instruction on Skylake).
ie. About 48 cycle latency with a 6 cycle throughput. The method uses the Newton polynomial method I mentioned with special attention to The table-maker's dilemma limits us to 2-3 ulp (measured from the maximum of +/- 1.0). The same kernel executed in f64, converted to f32 gives < 0.5 ulp for the full 0..PI*2 range. I've been planning to write a paper on this and may get around to it in some future life. Note that we can approximate the whole range in a single polynomial. Using sin-cos and It is expected that in a loop the compiler will put together 4 or more similar operations:
Does in fact do this. Let me know what you think. I can make some similar kernels for the other "standard" functions |
Implemented |
I am working on sin, cos, exp, log. I'm hoping we can do tan, sec, csc without divides. For inverse trig, a good atan2 is a good place to start as all the rest can be derived I can go through the doctor_syn code if you want to do some more. |
Could someone help me to add these implementations? Where should they go? How do we test them formally etc. Any suggestions welcome. |
I've added a PR #126 @programmerjake @calebzulawski I've only added the sin(f32) function as a placeholder, but I can generate others using |
I'd advocate for them going in https://salsa.debian.org/Kazan-team/vector-math which I'm planning on merging into
For I have a dedicated CI runner for vector-math (shared between all Kazan and Libre-SOC projects), so I just run the tests for an hour. |
Started thread on llvm-dev: https://lists.llvm.org/pipermail/llvm-dev/2021-June/150965.html |
We now have most of the basic libm functions in scalar form. https://github.com/extendr/doctor-syn/blob/main/tests/libm.rs I need to add a scalar to simd transform, more accurate versions, Examples are preserving the input of sin(x) for low magnitude of x. Different standards disagree on where the errors should occur I have also considered generating LLVM code to generate the IR. |
is there a testsuite that compares the Rust Vector Math library against glibc/musl/Microsoft runtime versions? |
Sorry, some of those functions are horribly broken: |
not yet...there is a test suite that compares its accuracy against the correct mathematical results (or as close as is easyish to get -- it tests |
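As a rough idea of what an accuracy test of this kind can look like, the sketch below measures the error of an f32 function in ULPs against a reference evaluated in f64 and rounded to f32. The helper names ulp_distance and max_ulp_error are hypothetical, and using f64 as the "correct" result is an assumption; the actual vector-math test suite may work differently.

```rust
/// Sketch: distance between two f32 values in units in the last place (ULPs).
pub fn ulp_distance(a: f32, b: f32) -> u64 {
    // Map the sign-magnitude bit pattern to an integer that increases
    // monotonically with the float value (+0.0 and -0.0 both map to 0).
    fn ordered(x: f32) -> i64 {
        let bits = x.to_bits() as i32;
        if bits < 0 {
            i32::MIN as i64 - bits as i64
        } else {
            bits as i64
        }
    }
    (ordered(a) - ordered(b)).unsigned_abs()
}

/// Sketch: worst ULP error of `f` over `inputs`, using an f64 reference
/// rounded to f32 as the expected result.
pub fn max_ulp_error(
    f: impl Fn(f32) -> f32,
    reference: impl Fn(f64) -> f64,
    inputs: &[f32],
) -> u64 {
    inputs
        .iter()
        .map(|&x| ulp_distance(f(x), reference(x as f64) as f32))
        .max()
        .unwrap_or(0)
}
```

A comparison against glibc or musl, as asked above, could reuse the same distance function with the libc result as the reference.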
On the accuracy front, do we have any plans to support different levels of accuracy? I'm aware that users may have different requirements depending on their domain -- e.g. scientific/engineering vs machine learning. Having different implementations available could be quite useful, e.g.
For context: I'm relatively inexperienced with floating point approximations, but I have previously done some hacking here and here on a fast I believe the approach, like Andy's one, is scalable in terms of time/accuracy by including more coefficients. |
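The idea of trading time for accuracy by including more coefficients can be sketched with a const generic parameter that sets the number of polynomial terms. The example below uses a plain Taylor series for exp after range reduction, which is an assumption chosen for simplicity; it is neither Michael's nor Andy's implementation, and production code would more likely use minimax coefficients. The names exp_approx and max_rel_error are hypothetical.

```rust
/// Sketch: approximates e^x using `TERMS` Taylor terms after range reduction.
/// More terms cost more multiply-adds and give a smaller error.
pub fn exp_approx<const TERMS: usize>(x: f32) -> f32 {
    // Range reduction: x = k * ln(2) + r with |r| close to at most ln(2) / 2.
    let k = (x * std::f32::consts::LOG2_E).round();
    let r = x - k * std::f32::consts::LN_2;
    // Horner-style evaluation of the sum of r^n / n! for n < TERMS.
    let mut acc = 1.0f32;
    for n in (1..TERMS).rev() {
        acc = 1.0 + acc * r / n as f32;
    }
    // Reconstruction: e^x = 2^k * e^r.
    acc * 2.0f32.powi(k as i32)
}

/// Sketch: worst relative error over an evenly spaced grid, measured against f64.
pub fn max_rel_error<const TERMS: usize>(lo: f32, hi: f32, steps: u32) -> f64 {
    let mut worst = 0.0f64;
    for i in 0..=steps {
        let x = lo + (hi - lo) * (i as f32 / steps as f32);
        let exact = (x as f64).exp();
        let err = ((exp_approx::<TERMS>(x) as f64 - exact) / exact).abs();
        worst = worst.max(err);
    }
    worst
}
```

With this shape, "accurate", "fast", and "really fast and dirty" variants are just different instantiations, for example exp_approx::<8>, exp_approx::<5>, and exp_approx::<3>.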
We should support different accuracies and levels of pedantry if we want game devs to use the functions.
My current libm is pedantry free and good to a fixed max error, which is expected game dev behaviour.
We can add more cycles to make the library POSIX compliant, but this is not desirable behaviour for the expert user and so should be configurable.
An example is cos and sin, which use a lookup table in POSIX impls to handle four quadrants. This carries a very high cost to get one more bit of accuracy, possibly four times slower.
|
The |
You are right, this is a good way to get an accurate cos and sin and indeed has lower latency. If you are in a loop that has been unrolled, each of the operations will be done four or more times I wouldn't want to call what would be preferred. Even the modulus and scaling steps would be considered In practice, making a super fast function will be rendered useless by the memory access speed which |
I was inspired by @programmerjake to try the quadrant version of sin/cos The conditional negation of the cos and sin results is a bit hacky and could As a result of the shorter range, we can also reduce the number of terms The godbolt output looks not too bad, although LLVM is not scheduling the result optimally. I also need to add a simdifying transform to the functions.
|
WIP implementation: https://salsa.debian.org/Kazan-team/vector-math
Problem Statement:
Vector Math functions are not widely available and can't be used from core due to needing to link to an external vector math library. We are therefore building a math library for Rust that can be used everywhere and doesn't require external dependencies, allowing it to be used in core.
This allows SimdF32::sin() and similar to work everywhere SimdF32 or similar works (basically everywhere). This includes WebAssembly (with or without the simd128 extension), microcontrollers, etc. Also, where actual SIMD instructions are available, it will be faster than just calling f32::sin a bunch of times.
We will want to extend LLVM to use our Vector Math library as the fallback implementation for LLVM's vector math intrinsics (e.g. llvm.sin.v8f32).
The reason we want to go through all the hassle of getting LLVM to generate calls to our library is that LLVM can then generate the native sin instruction(s) where supported and call our library otherwise. Leaving it up to LLVM to decide is by far the best option since it can do const-propagation and other optimizations based on its knowledge of how sin ought to behave, and LLVM is the best spot to make target-specific decisions.
This means we can't just use LLVM's existing features for a vector math library but require it to learn how to generate calls to our Vector Math library when native instructions aren't available.
Example of adding libcall support to LLVM: https://reviews.llvm.org/D53927
Implementation:
https://salsa.debian.org/Kazan-team/vector-math
The Vector Math library is being written in #![no_std] Rust, along with an abstraction layer allowing the implemented functions to also be used in Kazan (a Vulkan GPU driver being developed for Libre-SOC, which is helping fund development; see bug on Libre-SOC's bugtracker).
The abstraction layer will have four implementations, the first three of which are currently implemented:
core::simd.
Kazan needs all the functions that Vulkan requires (basically all the sin, cos, tan, atan, sinpi, cospi, exp, log2, pow, etc. functions). If we work together on the same library we could also use it for Rust's std::simd, potentially saving a bit of work. Both Kazan and rustc share backends (LLVM and cranelift) and would like to support having vector math functions inlined into calling code, giving a potentially substantial performance boost.
Implemented functions so far:
- f16/f32/f64:
  - abs, copy_sign, trunc, round_to_nearest_ties_to_even, floor, ceil
  - ilogb
  - sin_pi, cos_pi, sin_cos_pi, tan_pi
  - sqrt_fast
- u8/u16/u32/u64/i8/i16/i32/i64:
  - count_leading_zeros
  - count_trailing_zeros
  - count_ones