
Use simd_* intrinsics instead of llvm intrinsics where possible #788

Closed
3 tasks
bjorn3 opened this issue Jul 30, 2019 · 19 comments

Comments

@bjorn3
Member

bjorn3 commented Jul 30, 2019

  • _mm_sqrt_ps: llvm.x86.sse.sqrt.ps -> simd_fsqrt
  • _mm_min_ps: llvm.x86.sse.min.ps -> simd_fmin
  • ...

This makes it easier for other codegen backends to support these operations, as all variants of the same operation are bundled together (e.g. single- and double-precision sqrt).
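
In concrete terms, the change for a single intrinsic would look roughly like this (a sketch of the idea, not the exact core_arch source; test attributes are omitted):

extern "platform-intrinsic" {
    fn simd_fsqrt<T>(a: T) -> T;
}

#[inline]
#[target_feature(enable = "sse")]
pub unsafe fn _mm_sqrt_ps(a: __m128) -> __m128 {
    // Call the generic platform-intrinsic instead of the shim declared with
    // #[link_name = "llvm.x86.sse.sqrt.ps"]; a codegen backend then only
    // needs to lower simd_fsqrt once for all element types and widths.
    simd_fsqrt(a)
}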

@gnzlbg
Contributor

gnzlbg commented Jul 30, 2019

We used to do that, but for arm and x86 alone we would have needed to add ~10,000 intrinsics. That meant modifying rustc every time a new intrinsic had to be added, and running CI for the intrinsics inside rustc, which would add a dozen build jobs to test the less common targets and increase the CI time of the rustc build jobs considerably (I think the largest test suites here take ~20 min).

It would be simpler if other backends could just either hijack the llvm intrinsics, or add their own #[link_foo_intrinsic] attributes that can be used here to link against other intrinsics. Particularly for the more complex algorithms exposed that require calling different intrinsics.
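
For context, "hijacking the llvm intrinsics" means intercepting the kind of extern declarations this crate already uses, roughly like the following (a simplified sketch; the exact declarations in core_arch differ in details such as feature gates and function names):

#[allow(improper_ctypes)]
extern "C" {
    // The llvm.* link_name is what a non-LLVM backend has to recognize
    // and lower itself, one name per operation/type/width combination.
    #[link_name = "llvm.x86.sse.sqrt.ps"]
    fn sqrtps(a: __m128) -> __m128;
    #[link_name = "llvm.x86.sse.min.ps"]
    fn minps(a: __m128, b: __m128) -> __m128;
}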

@bjorn3
Member Author

bjorn3 commented Jul 30, 2019

Particularly for the more complex algorithms exposed that require calling different intrinsics.

I meant the simple ones which already have a simd_* platform-intrinsic.

@gnzlbg
Contributor

gnzlbg commented Jul 30, 2019

This PR was the last one removing support for that: rust-lang/rust#57416, in case you are considering adding it again. If that happens, we could incrementally remove the #[link_llvm_intrinsics] declarations from this crate and start calling those instead.

@gnzlbg
Contributor

gnzlbg commented Jul 30, 2019

I meant the simple ones which already have a simd_* platform-intrinsic.

Your list appeared unbounded; it would help if you could write out the complete list of operations you want to propose. The packed_simd crate already implements all the basic operations portably, and for that the currently available intrinsics mostly suffice.

@gnzlbg
Contributor

gnzlbg commented Jul 30, 2019

The packed_simd crate already implements all the basic operations portably, and for that the currently available intrinsics mostly suffice

Expanding on that, it implements the two operations you mention, and they currently produce optimal code generation with LLVM. For min we just use < and simd_select; for sqrt we just use the vector llvm.fsqrt intrinsics. We probably need to add an LLVM pass to lower those to libmvec calls at some point; packed_simd just optionally links a vector math library and lowers those to FFI calls, but that is not the best place to do it (it should happen somewhere in the code generation backend).
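
As a standalone sketch of that approach (the vector type definitions here are illustrative stand-ins, not packed_simd's actual types):

extern "platform-intrinsic" {
    fn simd_lt<T, U>(x: T, y: T) -> U;
    fn simd_select<M, T>(m: M, a: T, b: T) -> T;
}

#[repr(simd)]
#[derive(Copy, Clone, Debug)]
struct f32x4(f32, f32, f32, f32);

#[repr(simd)]
#[derive(Copy, Clone)]
struct m32x4(i32, i32, i32, i32); // lane-wise boolean mask

unsafe fn min(a: f32x4, b: f32x4) -> f32x4 {
    // Compare lane-wise, then select `a` where a < b and `b` elsewhere;
    // LLVM can pattern-match this into a single hardware min instruction.
    let lt: m32x4 = simd_lt(a, b);
    simd_select(lt, a, b)
}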

@bjorn3
Member Author

bjorn3 commented Jul 30, 2019

Your list appeared unbounded; it would help if you could write out the complete list of operations you want to propose.

I understand that it is practically impossible to do this for all intrinsics, but at least the following platform-intrinsics are already supported by the llvm codegen backend:

extern "platform-intrinsic" {
pub fn simd_eq<T, U>(x: T, y: T) -> U;
pub fn simd_ne<T, U>(x: T, y: T) -> U;
pub fn simd_lt<T, U>(x: T, y: T) -> U;
pub fn simd_le<T, U>(x: T, y: T) -> U;
pub fn simd_gt<T, U>(x: T, y: T) -> U;
pub fn simd_ge<T, U>(x: T, y: T) -> U;
pub fn simd_shuffle2<T, U>(x: T, y: T, idx: [u32; 2]) -> U;
pub fn simd_shuffle4<T, U>(x: T, y: T, idx: [u32; 4]) -> U;
pub fn simd_shuffle8<T, U>(x: T, y: T, idx: [u32; 8]) -> U;
pub fn simd_shuffle16<T, U>(x: T, y: T, idx: [u32; 16]) -> U;
pub fn simd_shuffle32<T, U>(x: T, y: T, idx: [u32; 32]) -> U;
pub fn simd_shuffle64<T, U>(x: T, y: T, idx: [u32; 64]) -> U;
pub fn simd_shuffle128<T, U>(x: T, y: T, idx: [u32; 128]) -> U;
pub fn simd_insert<T, U>(x: T, idx: u32, val: U) -> T;
pub fn simd_extract<T, U>(x: T, idx: u32) -> U;
pub fn simd_cast<T, U>(x: T) -> U;
pub fn simd_add<T>(x: T, y: T) -> T;
pub fn simd_sub<T>(x: T, y: T) -> T;
pub fn simd_mul<T>(x: T, y: T) -> T;
pub fn simd_div<T>(x: T, y: T) -> T;
pub fn simd_rem<T>(x: T, y: T) -> T;
pub fn simd_shl<T>(x: T, y: T) -> T;
pub fn simd_shr<T>(x: T, y: T) -> T;
pub fn simd_and<T>(x: T, y: T) -> T;
pub fn simd_or<T>(x: T, y: T) -> T;
pub fn simd_xor<T>(x: T, y: T) -> T;
pub fn simd_reduce_add_unordered<T, U>(x: T) -> U;
pub fn simd_reduce_mul_unordered<T, U>(x: T) -> U;
pub fn simd_reduce_add_ordered<T, U>(x: T, acc: U) -> U;
pub fn simd_reduce_mul_ordered<T, U>(x: T, acc: U) -> U;
pub fn simd_reduce_min<T, U>(x: T) -> U;
pub fn simd_reduce_max<T, U>(x: T) -> U;
pub fn simd_reduce_min_nanless<T, U>(x: T) -> U;
pub fn simd_reduce_max_nanless<T, U>(x: T) -> U;
pub fn simd_reduce_and<T, U>(x: T) -> U;
pub fn simd_reduce_or<T, U>(x: T) -> U;
pub fn simd_reduce_xor<T, U>(x: T) -> U;
pub fn simd_reduce_all<T>(x: T) -> bool;
pub fn simd_reduce_any<T>(x: T) -> bool;
pub fn simd_select<M, T>(m: M, a: T, b: T) -> T;
pub fn simd_select_bitmask<M, T>(m: M, a: T, b: T) -> T;
pub fn simd_fmin<T>(a: T, b: T) -> T;
pub fn simd_fmax<T>(a: T, b: T) -> T;
pub fn simd_fsqrt<T>(a: T) -> T;
pub fn simd_fma<T>(a: T, b: T, c: T) -> T;

@bjorn3
Member Author

bjorn3 commented Jul 30, 2019

It would be simpler if other backends could just either hijack the llvm intrinsics, or add their own #[link_foo_intrinsic] attributes that can be used here to link against other intrinsics. Particularly for the more complex algorithms exposed that require calling different intrinsics.

Hijacking the llvm intrinsics is what I currently do, but every combination of op, lane_type and lane_count has its own llvm intrinsic, leading to a combinatorial explosion of intrinsics that have to be implemented.
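
To illustrate (the llvm.* names below are examples of the per-width x86 intrinsics involved, and the exact set depends on the LLVM version; the two helper functions are hypothetical backend code, not an actual backend API):

fn is_known_sqrt_llvm_intrinsic(name: &str) -> bool {
    // When hijacking the llvm.* names, a backend needs one case per
    // operation x element type x vector width:
    match name {
        "llvm.x86.sse.sqrt.ps"        // f32x4
        | "llvm.x86.sse2.sqrt.pd"     // f64x2
        | "llvm.x86.avx.sqrt.ps.256"  // f32x8
        | "llvm.x86.avx.sqrt.pd.256"  // f64x4
        => true,
        // ...and this is sqrt alone; every other operation repeats the pattern.
        _ => false,
    }
}

fn is_generic_sqrt_intrinsic(name: &str) -> bool {
    // With the generic platform-intrinsic there is exactly one name; the
    // element type and lane count are recovered from the argument type.
    name == "simd_fsqrt"
}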

@gnzlbg
Contributor

gnzlbg commented Jul 30, 2019

I understand that it is practically impossible to do this for all intrinsics,

This makes me wonder: why are you trying to support core::arch instead of core::simd? core::simd only exposes the portable operations, and does so on all arches, so it probably makes sense to support that first.

If there is a way to implement some of the core::arch intrinsics in terms of core::simd, we could do that.

@bjorn3
Member Author

bjorn3 commented Jul 30, 2019

why are you trying to support core::arch

Because everybody uses core::arch, as it's the only stable way to use SIMD. That includes crates like memchr, which many other crates depend on.

@bjorn3
Member Author

bjorn3 commented Jul 30, 2019

If there is a way to implement some of the core::arch intrinsics in terms of core::simd, we could do that.

+1

@gnzlbg
Contributor

gnzlbg commented Jul 30, 2019

Also, at least for the two intrinsics mentioned:

_mm_sqrt_ps: llvm.x86.sse.sqrt.ps -> simd_fsqrt

The specific intrinsics only work for a particular vector length, type, and one specific #[target_feature].

A generic intrinsic would need to support both f32 and f64 and all supported vector lengths for those types (4, 8, and 16 lanes for the 32-bit case), for all combinations of available target features.

For the LLVM backend, exposing a generic simd_fsqrt would require:

  • generate an llvm.fsqrt.xy from the front-end
  • writing and registering a custom LLVM pass that replaces llvm.fsqrt.xy in LLVM-IR before the backend, e.g., doing something like what is manually done here: https://github.com/rust-lang-nursery/packed_simd/blob/accf702d478010adb72bcba980bb59b9cf218d7d/src/codegen/math/float/sqrt.rs#L56

I think that work is worth doing, and it would improve the current situation a bit, where we handle all of this in a library (e.g. by enabling some optimizations that are not possible today).

_mm_min_ps: llvm.x86.sse.min.ps -> simd_fmin

The main reason I point you to packed_simd is that it has already solved most of the problems you are trying to solve (https://github.com/rust-lang-nursery/packed_simd/blob/master/src/api/ops/vector_float_min_max.rs#L13). The intrinsic you are requesting already exists, and it is precisely called simd_fmin.

@bjorn3
Member Author

bjorn3 commented Jul 30, 2019

The intrinsic you are requesting already exists, and it is precisely called simd_fmin.

I was actually requesting it to be used in _mm_min_ps, not to add it :)

For the LLVM backend, exposing a generic simd_fsqrt would require:

  • generate an llvm.fsqrt.xy from the front-end

This is already done by cg_llvm for simd_fsqrt and friends:

https://github.com/rust-lang/rust/blob/9a239ef4ded03d155c72b68b5a2dd7aff013e141/src/librustc_codegen_llvm/intrinsic.rs#L1330

@bjorn3
Member Author

bjorn3 commented Jul 30, 2019

writing and registering a custom LLVM pass that replaces llvm.fsqrt.xy in LLVM-IR before the backend, e.g., doing something like what is manually done here: https://github.com/rust-lang-nursery/packed_simd/blob/accf702d478010adb72bcba980bb59b9cf218d7d/src/codegen/math/float/sqrt.rs#L56

Doesn't llvm do that itself?

@bjorn3
Member Author

bjorn3 commented Jul 30, 2019

The main reason I point you to packed_simd is that it has already solved most of the problems you are trying to solve

I understand, but until core::arch is implemented in terms of packed_simd it doesn't help much, as many crates use core::arch instead.

@gnzlbg
Contributor

gnzlbg commented Jul 30, 2019

I was actually requesting it to be used in _mm_min_ps, not to add it :)

Ahhh! Duh, I thought you were proposing adding new generic intrinsics for all the exposed operations.

If you just want to use the generic ones where they are appropriate and they do work, you can do just that, and our test suite should make sure that the exact same instruction is generated. Just keep in mind that the target-specific intrinsics sometimes mean something different from the generic ones (e.g. for floating-point, things like the precision might differ).
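
A concrete example of that caveat, as a standalone sketch (assumes an x86_64 target; this is about the semantics of the underlying minps instruction, not about how core::arch is currently implemented):

#[cfg(target_arch = "x86_64")]
fn main() {
    use std::arch::x86_64::*;
    unsafe {
        // Lane 0: a = NaN, b = 1.0; lane 1: a = 1.0, b = NaN.
        let a = _mm_set_ps(0.0, 0.0, 1.0, f32::NAN);
        let b = _mm_set_ps(0.0, 0.0, f32::NAN, 1.0);
        let mut out = [0.0f32; 4];
        _mm_storeu_ps(out.as_mut_ptr(), _mm_min_ps(a, b));
        // minps returns its *second* operand whenever either input is NaN,
        // so out is [1.0, NaN, 0.0, 0.0]. An fmin-style simd_fmin would
        // instead return the non-NaN operand in both of those lanes.
        println!("{:?}", out);
    }
}

#[cfg(not(target_arch = "x86_64"))]
fn main() {}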


Doesn't llvm do that itself?

No, not at all. The simd_fsqrt<T> intrinsic isn't generic, in that it is not intended to work for all Ts that are SIMD types. It only works for some particular T and particular target-feature combinations, e.g., for f32x4 if the appropriate target-feature is available, for f32x8 when avx2 is available, and that's pretty much it.

core::arch only exposes these cases IIRC (except for some of the mathematical intrinsics that we do not implement yet like _mm_invsqrt_pd), so for these cases using simd_fsqrt should "just work".

@bjorn3
Member Author

bjorn3 commented Jul 30, 2019

By the way, it turns out that cg_llvm implements more simd_f* intrinsics than seem to be defined in either core_arch or packed_simd:

https://github.com/rust-lang/rust/blob/master/src/librustc_codegen_llvm/intrinsic.rs#L1340-L1381

@gnzlbg
Contributor

gnzlbg commented Jul 30, 2019

By the way, it turns out that cg_llvm implements more simd_f* intrinsics than seem to be defined in either core_arch or packed_simd:

Yes, most (or all?) of those are used by packed_simd. The ones that aren't are left out because they don't make sense without workarounds, and adding workarounds takes time (e.g. there are PRs to packed_simd to implement some of those).

@bjorn3
Member Author

bjorn3 commented Jul 30, 2019

so for these cases using simd_fsqrt should "just work".

🎉

@gnzlbg
Contributor

gnzlbg commented Jul 30, 2019

I'm closing this, as the issue has been "resolved". Feel free to submit PRs converting whichever parts of core::arch can use the already existing generic intrinsics. I hope we can get CI back up soon, so that we can start merging PRs again.
