Implement AVX-512 intrinsics #310
Dissecting one of the interesting intrinsics here:
/// Compute the absolute value of packed 8-bit integers in a,
/// and store the unsigned results in dst using writemask k (elements
/// are copied from src when the corresponding mask bit is not set).
__m512i _mm512_mask_abs_epi8 (__m512i src, __mmask64 k, __m512i a);
the So it would be nice to know exactly how all of this works in LLVM because Another difference with AVX2 is that if we want to use a mask in AVX2 to select values from two This affects boolean vectors / masks, because |
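To make the writemask wording in the doc comment above concrete, here is a minimal scalar sketch of the described semantics. The function name `mask_abs_epi8_model` and the array-based signature are hypothetical stand-ins for illustration. They are not the stdsimd API and say nothing about how LLVM lowers the real intrinsic.

```rust
/// Scalar model of the writemask behaviour described for
/// `_mm512_mask_abs_epi8`. Hypothetical name and signature:
/// plain arrays stand in for `__m512i`, and `u64` for `__mmask64`.
pub fn mask_abs_epi8_model(src: [i8; 64], k: u64, a: [i8; 64]) -> [i8; 64] {
    // Lanes whose mask bit is clear keep the value copied from `src`.
    let mut dst = src;
    for i in 0..64 {
        if (k >> i) & 1 == 1 {
            // `wrapping_abs` maps i8::MIN to itself (bit pattern 0x80),
            // which reads as 128 when interpreted as an unsigned result.
            dst[i] = a[i].wrapping_abs();
        }
    }
    dst
}
```

Bit `i` of `k` controls lane `i`, so a 64-lane vector of 8-bit elements needs all 64 bits of the mask.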
I don't understand how it works, but there are possibly relevant slides from the 2017 LLVM meeting, maybe they are useful: https://llvm.org/devmtg/2017-03//assets/slides/avx512_mask_registers_code_generation_challenges_in_llvm.pdf Another possibly relevant point is that AVX512VL extends the mask registers (and the corresponding intrinsics) to 128- and 256-bit vectors. But, at least when using these from C, LLVM will currently just use blend instructions instead of masks: https://godbolt.org/g/FjU1Xn |
That's a really nice test. Do you know if there is an LLVM bug open for it? I haven't been able to find any. |
Hi, have there been any new developments since this was last active? I would like to contribute AVX-512 intrinsics, but I'm not sure what (if anything) is blocking it, so if anyone has any pointers I'd be happy to help! |
You can add any intrinsic that does not use If you want to add an intrinsic that uses |
Clang defines masks like
typedef unsigned char __mmask8;
typedef unsigned short __mmask16;
typedef unsigned int __mmask32;
typedef unsigned long long __mmask64;
So maybe just a wrapper struct without
pub struct __mmask8(u8);
pub struct __mmask16(u16);
pub struct __mmask32(u32);
pub struct __mmask64(u64); |
Cool! I'll give it a try some time this week, RustConf permitting. |
Should AVX-512 intrinsics be split into modules corresponding to their feature flag? This seems sensible except that I'm not sure how it should interact with the AVX512VL extension, since it seems weird to have the 512/256/128-bit versions of the same intrinsic in different places. |
@hdevalence we currently split the functionality into modules corresponding to their target-feature flag and/or cpuid flag. I expect avx512f, avx512vl, etc. to be their own modules like they are in clang. This stuff is decided on a case-by-case basis though; whoever sends the PR can get the conversation started. Are there any technical reasons to split it in any other way? |
Hmm, but the VL flag is orthogonal to the other flags, so for instance the |
@hdevalence in clang they live in an EDIT: typically the ones that require |
I don't think we should worry about these. Some of these did not support SSE4.2 and IIRC AVX2 either (only AVX-512), and we can't target them with LLVM IIRC.
Sure. If once we start this way we discover that putting these into their own modules makes things clearer, we can always do that later. |
I started working on this in this branch: https://github.com/hdevalence/stdsimd/tree/avx512 (very rough work). I'm not sure how to encode the masks. Considering In the allintrinsics gist these appear around L20966-20970 as Does anyone know if there's an equivalent of |
Took a quick look, some notes:
|
We have a
We could add a select intrinsic that takes an integer instead of a vector and does this (or extend the current |
Update, I found the It seems like the general pattern is to remove masked versions of builtins and to use selects instead, so maybe it would be good to write a macro that generates the masked versions, and also maybe a macro that generates the VL versions. |
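The idea of generating masked variants from an unmasked operation plus a select can be sketched as follows. This is only a scalar illustration under stated assumptions: `V4`, `select_bitmask_model`, `add_model`, and the `masked_variants!` macro are all hypothetical names, and a 4-lane array stands in for a real SIMD vector.

```rust
/// Hypothetical 4-lane stand-in for a SIMD vector type.
pub type V4 = [i32; 4];

/// Scalar stand-in for a bitmask select: lane `i` comes from `a`
/// if bit `i` of `k` is set, otherwise from `b`.
pub fn select_bitmask_model(k: u8, a: V4, b: V4) -> V4 {
    let mut out = b;
    for i in 0..4 {
        if (k >> i) & 1 == 1 {
            out[i] = a[i];
        }
    }
    out
}

/// Generates merge-masked and zero-masked variants of a binary operation.
macro_rules! masked_variants {
    ($base:ident, $mask:ident, $maskz:ident) => {
        /// Merge-masking: unselected lanes are copied from `src`.
        pub fn $mask(src: V4, k: u8, a: V4, b: V4) -> V4 {
            select_bitmask_model(k, $base(a, b), src)
        }
        /// Zero-masking: unselected lanes are set to zero.
        pub fn $maskz(k: u8, a: V4, b: V4) -> V4 {
            select_bitmask_model(k, $base(a, b), [0; 4])
        }
    };
}

/// Unmasked lane-wise wrapping addition (illustrative base operation).
pub fn add_model(a: V4, b: V4) -> V4 {
    let mut out = [0; 4];
    for i in 0..4 {
        out[i] = a[i].wrapping_add(b[i]);
    }
    out
}

masked_variants!(add_model, mask_add_model, maskz_add_model);
```

With this shape, only the unmasked operation is written by hand. The masked forms fall out of the select, and it is then left to the backend to fold the pair into a single masked instruction where it can.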
Oops, I didn't refresh the page before I posted that, sorry for the confusion. |
I updated that branch with definitions of |
Would this have to be implemented lower in rustc, or is it something that could be done in stdsimd? |
IIRC the There is no support in Rust for vectors of The easiest thing would be to add a new intrinsic, e.g., |
Sounds good to me. |
If anyone has time to implement such an intrinsic (I don't know how to do it myself), I'd like to start adding some AVX512 intrinsics. |
Sorry, it is on my backlog. I tried to get started with it a couple of times, but always ran out of HDD space trying to compile rustc. I'll try to make some space and get it done today. |
@hdevalence update: I can't compile rustc anymore. It takes too long (it freezes my PC at some point), needs too much RAM (8 GB is not enough) and too much disk space (50 GB of free space apparently isn't enough), so I can't implement anything there properly anymore. It was always a pain to modify rustc due to the high requirements, but I've now tried for almost two days to get a full stage1 build done of |
After additional work, I realized that the problem was that I had the constification wrong. I had tried constifying both const args reaching out but had apparently gotten it wrong. The PR isn't finished, but the comparisons are linking properly now. |
Hi, I am trying to implement _mm512_and_epi32 in crates/core_arch/src/x86/avx512f.rs pub unsafe fn _mm512_and_epi32(a: __m512i, b: __m512i) -> __m512i { #[link_name = "llvm.x86.avx512.mask.pand.d.512"] The test is
When I run cargo test, it shows "(signal: 11, SIGSEGV: invalid memory reference)" I tried to compile _mm512_and_epi32 with clang, and it works. |
I suggest using a debugger to look at the disassembly of the crashing code. |
Update:
rustc generates the vpandd instruction. |
I am trying to implement _mm512_cvt_roundps_ph (__m512 a, int sae). Should we follow Clang, or only accept 4 and 8? |
I checked both Clang and GCC and they both pass the full 8 bits on to the underlying instruction: https://www.felixcloutier.com/x86/vcvtps2ph |
Ok. Thanks. The document I checked is |
I am trying to implement _mm512_mask_extractf32x4_ps (__m128 src, __mmask8 k, __m512 a, int imm8). The call simd_select_bitmask(mask, extract, src) reports mismatched lengths: mask length My question is: should I implement a u4 type, or something else? |
You can just mask to keep only the bottom 4 bits and use |
|
Is there any plan to support 4-bit or 2-bit integer types in the future? AVX512F uses a lot of 4-bit (32x4) or 2-bit (64x2) masks in the _mm_mask_xxxxx intrinsics whose inputs and outputs are 128 bits wide. |
I had a look in the compiler and it seems that this is a bug in the implementation of @minybot @bjorn3 Would one of you be willing to make a PR to fix this in rustc? The relevant code is here: https://github.com/rust-lang/rust/blob/f3c923a13a458c35ee26b3513533fce8a15c9c05/compiler/rustc_codegen_llvm/src/intrinsic.rs#L1272 |
I tried to modify simd_select_bitmask to use a 4-bit mask if the output is f32x"4"
However, it shows "error: failed to parse bitcode for LTO module: Bitwidth for integer type out of range (Producer: 'LLVM11.0.0-rust-dev' Reader: 'LLVM 11.0.0-rust-dev')" So does bx.select only accept masks of 8 bits or more? |
You can't truncate |
There is another solution without touching simd_select_bitmask. |
I just went ahead and fixed the issue in rust-lang/rust#77504. |
I tested it, and it works when the mask size is 4. |
For mask operations in AVX-512 such as _kadd_mask32, it adds two masks. |
No, but it's fine since we don't guarantee a particular instruction is used for an intrinsic: we leave it to LLVM to decide whether it is better to use a |
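Since no particular instruction is guaranteed for an intrinsic, a mask addition can in principle be written as ordinary integer arithmetic, leaving instruction selection to the backend. A minimal sketch, assuming the mask is a plain `u32` as in the Clang typedef quoted earlier in the thread (the names `Mask32` and `kadd_mask32_model` are hypothetical):

```rust
/// Hypothetical alias mirroring `typedef unsigned int __mmask32`.
pub type Mask32 = u32;

/// Scalar model of adding two 32-bit masks. The result is truncated
/// to 32 bits, so wrapping addition is used to avoid overflow panics.
pub fn kadd_mask32_model(a: Mask32, b: Mask32) -> Mask32 {
    a.wrapping_add(b)
}
```

Whether the compiler emits a mask-register add or a general-purpose add for code like this is its own decision. The observable result is the same either way.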
While working on a private project, I needed masked loading, so I wanted to prepare a PR with implementations for
What is the "need i1" part? I have not found any explanation there. Currently, I am tempted to implement masked loading like in (as an example)
/// Load packed 32-bit integers from memory into dst using writemask k (elements are copied from src when the corresponding mask bit is not set). mem_addr must be aligned on a 64-byte boundary or a general-protection exception may be generated.
///
/// [Intel's documentation](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm512_mask_load_epi32&expand=3305)
#[inline]
#[target_feature(enable = "avx512f")]
#[cfg_attr(test, assert_instr(vmovdqa32))]
pub unsafe fn _mm512_mask_load_epi32(src: __m512i, k: __mmask16, mem_addr: *const i32) -> __m512i {
    let loaded = ptr::read(mem_addr as *const __m512i).as_i32x16();
    let src = src.as_i32x16();
    transmute(simd_select_bitmask(k, loaded, src))
}
which follows how |
This is incorrect since To support this properly we need to call an LLVM intrinsic directly. However this intrinsic uses a vector of |
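The explanation in the comment above is cut off, so the following is an assumption about the likely concern rather than a quote: a full-width `ptr::read` touches all 64 bytes regardless of the mask, whereas a hardware masked load does not fault on elements whose mask bit is clear. A scalar sketch of the masked-load behaviour, with the hypothetical name `mask_load_epi32_model` and arrays standing in for `__m512i`:

```rust
/// Scalar model of a masked load of sixteen 32-bit lanes.
/// Hypothetical name and signature, not the stdarch API.
///
/// # Safety
/// For every set bit `i` in `k`, `mem_addr.add(i)` must be valid for reads.
/// Lanes whose bit is clear are never dereferenced.
pub unsafe fn mask_load_epi32_model(
    src: [i32; 16],
    k: u16,
    mem_addr: *const i32,
) -> [i32; 16] {
    let mut dst = src;
    for i in 0..16 {
        if (k >> i) & 1 == 1 {
            // Only selected lanes touch memory.
            dst[i] = unsafe { mem_addr.add(i).read_unaligned() };
        }
    }
    dst
}
```

Under this model a caller may pass a pointer to a buffer shorter than 16 elements, as long as the mask only selects lanes inside it. A read-everything-then-select implementation would read past the end of that buffer.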
Makes sense. Many thanks for the explanation. |
Another possible implementation for
#[inline]
pub unsafe fn _mm512_mask_loadu_epi32(src: __m512i, mask: __mmask16, ptr: *const i32) -> __m512i {
    let mut result: __m512i = src;
    asm!(
        "vmovdqu32 {io}{{{k}}}, [{p}]",
        p = in(reg) ptr,
        k = in(kreg) mask,
        io = inout(zmm_reg) result,
        options(nostack), options(pure), options(readonly)
    );
    result
}
If such an implementation would be ok maintenance wise I could try preparing a PR that adds the missing avx512f this way. |
Sounds good! |
Just coming from the discussion: rust-lang/portable-simd#28. Regarding the separation of avx512f intrinsics and target_feature=avx512f: I now have enough interest and time to investigate it. |
I expect that we will be stabilizing AVX-512 soon, thanks to the hard work of many people in implementing the full set of AVX-512 intrinsics in stdarch. |
General instructions for this can be found at #40, but the list of AVX-512 intrinsics is quite large! This is intended to help track progress but you'll likely want to talk to us out of band to ensure that everything is coordinated.
Intrinsic lists: https://gist.github.com/alexcrichton/3281adb58af7f465cebee49759ae3164