Minimal path to stabilization #159

Closed · 3 tasks

alexcrichton opened this issue Oct 28, 2017 · 81 comments
@alexcrichton
Member

I wanted to open a sort of tracking issue for getting stdsimd into the standard library and on the path to stable Rust. That's sort of what we've wanted all along! What I propose here may seem a bit radical, but hear me out!

First and foremost I see the purpose of the stdsimd crate to be providing definitions of the vendor intrinsics. These functions are all unsafe as previously discussed and use #[target_feature] in their implementations. Vendors (like Intel) define which functions exist and what their names are. The important thing about these intrinsics is that how they're defined is compiler dependent. This, in turn, forces them to be implemented with unstable functionality, meaning we must move this to the standard library in one form or another for them to be stable.

Next up if we have vendor intrinsics we also have the types! Each vendor intrinsic has types associated with it, although we are thinking of differing from the upstream vendors with more "principled" types like u8x16 instead of __m128i. As a result these types are also defined in the standard library.

After this, however, I'd propose we stop. This issue is all about the minimal path to stabilization (while still being useful) and I believe that type definitions plus intrinsics is sufficient. I say this namely because that's all C/C++ have, type definitions and intrinsics. I personally very much want to stabilize more convenient names like f32x4::splat or u16x8::extract where I don't have to remember janky names from Intel, but this is always backwards compatible to add at a later date. I think we have a lot more to gain from a little bit of SIMD now rather than "super portable SIMD right out the gate" much later. I hope to soon after the first pass of stabilization start looking to stabilizing the more convenient and powerful APIs (like Add for u32x8).

Once we have this then I think we also need to stabilize the #[target_feature] attribute and its associated features. This namely involves finishing implementing RFC 2045, a slight tweak to #[target_feature] syntax.

Up to this point I believe the last remaining issue is the classical ABI mismatch problem. Namely, what does rustc do with this function:

#[target_feature = "+avx"]
fn foo(a: u64x8) -> u64x8 {
    a
}

unsafe fn bar() {
    foo(mem::zeroed());
}

(or something like that)

Notably here the bar function does not have the avx feature enabled, so the ABI it will invoke foo with differs from the ABI that foo is expecting, hence causing a mismatch. I would propose a different strategy from RFC 2045, discussed at Rust Belt Rust with some others: inserting shims. This I believe solves the function pointer problem as well! Namely:

  • If you invoke a function with a different set of target features and you have an argument that's ABI-relevant, the compiler inserts a shim which "fixes" the invocation. Namely it would fix the above to:
#[target_feature = "+avx"]
fn foo(a: u64x8) -> u64x8 {
    a
}

#[target_feature = "+avx"]
fn foo_shim(a: &u64x8, out: &mut u64x8) {
    *out = foo(*a);
}

unsafe fn bar() {
    let mut out = mem::zeroed();
    foo_shim(&mem::zeroed(), &mut out);
}
  • All function pointers have the "default" ABI in that they have no target features enabled. This means that if you coerced foo to a function pointer above you would actually get a foo_shim function pointer, regardless of where the function pointer is created.

When this is all assembled I believe we've got a very short path to stable SIMD on Rust today. This path should be sound so long as you have the right CPU feature checks in place, extensible in that we can add more portable extensions in the future, and complete in that we have accommodated the existence of all vendor intrinsics.

So in summary, I believe we have three blockers for stabilization:

  • Agree on the minimal API consisting purely of unsafe vendor intrinsics and type definitions, no trait impls or other methods on the type definitions stabilized yet (they can and will be added later!)
  • Implement the minor tweaks to the #[target_feature] attribute as specified in RFC 2045
  • Implement generation of "shim" functions in trans to work around the "ABI mismatch" problem

Once this is all done and agreed on I think we can also look to moving this crate into the standard library. I believe that would entail basically adding a src/stdsimd submodule pointing to this repository, adding a pub mod simd to src/libcore/lib.rs which points to this crate's root (probably setting some sort of #[cfg] along the way so this crate knows what's happening), and then finally adding a bunch of stability annotations all throughout the crate.
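As a sketch, the libcore side of that wiring might look something like this (the path and module layout here are hypothetical, not an actual patch):

// src/libcore/lib.rs (hypothetical): pull in the stdsimd submodule's
// crate root as a module; some marker cfg would let stdsimd know it is
// being built as part of core rather than as a standalone crate.
#[path = "../stdsimd/src/lib.rs"]
pub mod simd;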


Ok so that may be a bit much, but I'm curious what others think!

cc @gnzlbg
cc @BurntSushi
cc @nikomatsakis
cc @withouboats
cc @hdevalence

@aturon
Member

aturon commented Oct 28, 2017

cc @withoutboats (typo in issue)

@gnzlbg
Contributor

gnzlbg commented Oct 28, 2017

I agree 100% with everything you said.

I'd like to add one thing:

I think we have a lot more to gain from a little bit of SIMD now rather than "super portable SIMD right out the gate" much later.

I think we should prototype a "super portable SIMD library" that uses stdsimd for the intrinsics before we stabilize anything. Nothing fancy, just some vector arithmetic, sqrt, and not much more, just to check whether it can be built on top of stdsimd or not, and if not, what the problems are.

I've already started to do this and will commit it as an example/ (or, if it gets too big, as its own crate as soon as we do the next release of stdsimd), but I think doing this in parallel will either give us confidence about the design or warning signs about things we might have to tune.

Just to be clear: I think a portable SIMD library does not belong in std (there are many ways to do that, with different trade-offs), but I think we should make sure that whatever we stabilize is at least on the right path to allow building those libraries on stable Rust some day.
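To make that concrete, here is a minimal sketch of such a prototype built purely on vendor types and intrinsics; std::arch names are used for readability (at the time these lived in stdsimd::vendor), and the F32x4 wrapper type is hypothetical:

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

#[cfg(target_arch = "x86_64")]
#[derive(Clone, Copy)]
struct F32x4(__m128);

#[cfg(target_arch = "x86_64")]
impl F32x4 {
    fn splat(x: f32) -> Self {
        // SSE is baseline on x86_64, so these intrinsics are always available.
        unsafe { F32x4(_mm_set1_ps(x)) }
    }
    fn add(self, rhs: Self) -> Self {
        unsafe { F32x4(_mm_add_ps(self.0, rhs.0)) }
    }
    fn sqrt(self) -> Self {
        unsafe { F32x4(_mm_sqrt_ps(self.0)) }
    }
}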

And I have one nitpick:

I would propose a different strategy from RFC 2045 talked about at Rust Belt Rust with some others, namely inserting shims.

I think that inserting shims is a great idea! The shims are a zero-cost abstraction, that is, if you want to do this there is no more efficient way of doing it. However, if I ever want to avoid the cost of the shims, I'd like to have enough control to do that.

So I would prefer if the compiler would hard error when this happens (so that I know where these cases are), and we would provide some functionality to manually tell the compiler "insert shims here". We can always make it automatic in the future without breaking backward compatibility.

Anyhow, I don't want to derail this thread with the nitpick, we can discuss that somewhere else.

@withoutboats

@alexcrichton This plan sounds solid to me. 👍

In terms of ordering work, does anything block putting the intrinsics into libcore behind a feature flag right now? While all of these items need to be done before stabilizing anything, I think they could be done in any order.

@gnzlbg

I think we should make sure that whatever we stabilize is at least on the right path to allow building those libraries on stable Rust some day.

Since we're talking about just exposing the lowest-level platform intrinsics, I don't see a lot of choices between what paths we're on. This proposal seems like stabilizing just the things for which there are no alternatives.

Do you have concrete examples of things you think are uncertain now and that you don't want to lock down through stabilization?

However, if I ever want to avoid the cost of the shims, I'd like to have enough control to do that.

Isn't this supported by using the cfg based system (which I think is fully implemented and stable today) instead of the runtime system? That is, you cfg out the functions unless the target supports these features without any shimming.
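For reference, a sketch of that cfg-based pattern (function name hypothetical):

// Compiled only when the whole target statically has AVX2 (e.g. via
// RUSTFLAGS="-C target-feature=+avx2"); no shims and no runtime checks.
#[cfg(target_feature = "avx2")]
fn double_all(xs: &mut [i32]) {
    for x in xs.iter_mut() { *x *= 2; } // body free to use AVX2 intrinsics
}

#[cfg(not(target_feature = "avx2"))]
fn double_all(xs: &mut [i32]) {
    for x in xs.iter_mut() { *x *= 2; } // scalar fallback
}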

@gnzlbg
Contributor

gnzlbg commented Oct 28, 2017

@withoutboats check out examples/nbody.rs, which currently uses static feature detection. We would like to improve that to select the best instruction at run-time. We can already do that today, but it is a bit unclear whether what that looks like is what we want.

For example, a safe SIMD library can do the runtime detection on each intrinsic call, but that's basically adding an atomic load to each simd operation in the library. Ideally, the dispatch would happen at a higher level (maybe even at the level of fn main()). It is unclear how to write a portable SIMD library that would support doing this at this point.
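A sketch of what hoisting that dispatch might look like; is_x86_feature_detected! is today's std macro (x86-only), the thread-era equivalent was cfg_feature_enabled!, and the kernel bodies are stand-ins:

fn kernel_scalar(data: &mut [f32]) { for x in data.iter_mut() { *x *= 2.0; } }
fn kernel_avx(data: &mut [f32]) { for x in data.iter_mut() { *x *= 2.0; } }

fn main() {
    // Detect once, near fn main(), instead of once per SIMD operation.
    let kernel: fn(&mut [f32]) = if is_x86_feature_detected!("avx") {
        kernel_avx
    } else {
        kernel_scalar
    };
    let mut data = vec![1.0f32; 1024];
    kernel(&mut data);
}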

Isn't this supported by using the cfg based system (which I think is fully implemented and stable today) instead of the runtime system? That is, you cfg out the functions unless the target supports these features without any shimming.

By cost, I meant the cost of converting e.g. 2x 128-bit wide registers into a 256-bit one and vice-versa. The shims do this conversion, which is not free. This conversion can be done manually today, e.g., by manually writing the shims. The question is whether we want to induce this cost "silently" or whether we want to instead offer a way to automatically generate these shims, but such that the user has to opt in to them. I'd rather start conservatively, but either way we are moving in the right direction. Once the shim generation is implemented we can revisit whether we want to generate them automatically, or error and let the user decide (ideally in such a way that a library that does it automatically can be built on top of it).
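The conversion in question, written out by hand as a sketch with std::arch names:

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn join_halves(lo: __m128, hi: __m128) -> __m256 {
    // Two 128-bit xmm values widened into one 256-bit ymm value:
    // real work, not a no-op.
    _mm256_set_m128(hi, lo)
}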


FWIW, this means that we can stabilize the intrinsics, the vector types, and #[target_feature], but we should probably explore writing high-level SIMD wrappers a bit before stabilizing cfg(target_feature) and cfg_feature_enabled.

@alexcrichton
Member Author

@withoutboats correct, yeah: I think we could make inroads immediately on including this library in libcore.

@gnzlbg

I'd agree that we should play around with this interface some more! That's sort of what we expect to naturally happen as this becomes more broadly available on nightly (e.g. included in libcore). Like @withoutboats, however, I'm a little skeptical that this will turn up a whole lot, and I wouldn't want to necessarily block on a proof-of-concept library (depending on the circumstances).

In my mind the shims are basically necessary for this to be a "reasonable" feature in terms of usability and expectations of how Rust code works "in the large". I'd imagine that the impact of such shims would show up very rarely in practice.

@gnzlbg
Contributor

gnzlbg commented Oct 29, 2017

Yes, I don't think we should block moving this into core on a prototype of a higher level library. From core to Rust stable there is more than enough time to do that.

I'd imagine that the impact of such shims would show up very rarely in practice

Yes, I also expect that these shims will be inserted very rarely in practice, and when they are, what most people will want is for the shims to be generated anyway.

@BurntSushi
Member

BurntSushi commented Oct 30, 2017

@alexcrichton Thanks for this proposal! Broadly speaking, I'm OK with this. I would however like to note some costs so that we're all on the same page.

In particular, if we omit everything except the type definitions themselves, then I think the following is true:

  • Building a portable SIMD library (outside of std, without LLVM's help) will be hard. @stoklund talks about this here: https://internals.rust-lang.org/t/getting-explicit-simd-on-stable-rust/4380/209. That in turn means that the current simd library won't have an obvious path to stabilization. (This last point isn't terribly concerning, since it's not clear what role the simd crate will play anyway. But still, if someone wanted to go out and write their own Add impls, for example, it's not trivial.)
  • Using many of the intrinsics will be even more painful, since casting between vector types feels like a pretty common thing to do. Without the various as methods or From impls that we have today, I think folks will be forced to transmute or pointer cast. I think @sfackler had some thoughts on this.
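For illustration, what being "forced to transmute" looks like, assuming stdsimd's portable u8x16/u32x4 types are in scope (as they were on nightly at the time):

use std::mem;

fn bytes_as_words(bytes: u8x16) -> u32x4 {
    // Same 128 bits, different lane interpretation; no data movement.
    unsafe { mem::transmute(bytes) }
}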

I do think this is a fine first step though!

@sfackler
Member

Yeah - I was porting StreamVByte from 32 to 64 bit values. The encode kernel interprets its vectors as all of the u8x32, u32x8, i32x8, and u64x4 types. I have to convert at least one of the vector arguments to every single intrinsic call.

I don't know how common that style of SIMD workload is, but the type safety was a net negative. If the Into impls went away it'd be even worse.

@alexcrichton
Member Author

@BurntSushi excellent points! I'm just trying to thread the needle as "quickly as possible" in the sense of building up a conservative base that can be experimented with while ensuring we can continue to add features. I'd hope enabling the simd crate as-is today would be the next stop of stabilization for SIMD in Rust!

@sfackler an interesting case study! Did the original code rely on implicit conversions done by the C compiler? I think we're slightly more principled than __m256i for example (having a bunch of widths other than that and typing the intrinsics more strictly than the Intel documentation says). In that case we're definitely going to be more painful than C, but if the C intrinsics have different types themselves then it's sort of like porting line-for-line arithmetic in C to Rust; it just likely needs more casts.

@gnzlbg
Contributor

gnzlbg commented Oct 30, 2017 via email

@sfackler
Member

@alexcrichton From what I can tell, the C code is able to work entirely with __m128i values, with a union to type pun bytes and ints out: https://github.com/lemire/streamvbyte/blob/master/src/streamvbyte.c#L183.

@hdevalence
Contributor

My understanding is that many of the From casts currently in the stdsimd crate are already effectively transmutes/bitcasts -- for instance, i32x8 vs i64x4. Is this correct? Does this mean that the question isn't whether people will transmute or not, but just whether they'll do it using stdsimd-provided impls?

@gnzlbg
Contributor

gnzlbg commented Oct 30, 2017 via email

@alexcrichton
Member Author

In general I just wanted to make a proposal which had stabilization as an option in the near future. We can of course add more requirements to that, for example auditing the current API of u32x8, casts and trait impls and all. I fear that will delay stabilization further, however, and at this point I feel we have more to benefit from getting something out there rather than being perma-unstable.

@hdevalence
Contributor

@gnzlbg I'm not sure I understand, what do you mean by more principled?

@alexcrichton
Member Author

@BurntSushi @gnzlbg do y'all have thoughts specifically on @sfackler's concern above about how the "type safety" on the Intel intrinsics is making them less ergonomic? The alternative being that we switch to precisely the signatures that Intel publishes and then somehow providing constructors via a different mechanism.

@sfackler
Member

The other (much more minor) concern I'd have with type-safe Intel intrinsics is around figuring out what those types actually are - for things like shuffles we use unsigned types, whereas I think Clang uses signed types, for example.

@gnzlbg
Contributor

gnzlbg commented Nov 1, 2017

do y'all have thoughts specifically on @sfackler's concern above about how the "type safety" on the Intel intrinsics is making them less ergonomic?

My thought is that @sfackler is correct: the type safety makes the intrinsics less ergonomic to use, since now you have to write as_.../mem::transmute/into to use them. If you are porting C code to Rust, this is going to result in compilation errors, for which the compiler will tell you exactly how to perform the conversion. Note also that @sfackler's code exclusively uses .into instead of .as or mem::transmute. Having to do this really does look annoying: the code is not better because of it.

My experience is, however, completely different from @sfackler's, but that is because my code never uses .into to perform the conversions, using the .as_... casts instead. This makes the code more explicit and reminds me (the reader) of what "units" each x86 intrinsic operates on. I found this better than the C model, but I agree that if you are going to end up writing into all over the place, then yes, that's pointless.

The alternatives I see are to either kill .into and have only .as_ (which I am in favor of), or to use the __m128i, __m256i, ... types on x86. We could do the latter, by adding those types and switching the intrinsics to work on those, but we need to keep all other types because f32x4 is what ARM NEON uses.

I'll be fine either way (we can build higher-level wrappers either way) but I prefer the ARM types (e.g. f32x4) with explicit conversions for the lowest level of the stack.


@sfackler

The other (much more minor) concern I'd have with type-safe Intel intrinsics is around figuring out what those types actually are

Could you elaborate? What is it about these types that you found hard to figure out?


@hdevalence

By more principled I meant that we can use .as_ casts for element-wise as conversions, and mem::transmute for same-vector-width conversions between vectors of different types. This is basically how the rest of Rust works, so I don't really think we should be adding any as/.into magic here.
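That convention, sketched in code (the .as_ method name follows the pattern described here; the exact stdsimd spellings may differ):

fn conversions(ints: i32x4) -> (f32x4, u8x16) {
    // Element-wise conversion: each i32 lane becomes an f32 lane,
    // exactly like `as` on the scalar, lane by lane.
    let floats = ints.as_f32x4();
    // Same-vector-width reinterpretation: 128 bits viewed as u8 lanes.
    let bytes = unsafe { std::mem::transmute(ints) };
    (floats, bytes)
}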

@BurntSushi
Member

I think my preference is to initially try using more granular types, even for the Intel intrinsics. If more reports like @sfackler's come up before stabilization, then perhaps we can consider an alternative approach. Additionally, I am sympathetic to @gnzlbg's reasoning of keeping the units explicit. That feels like a good thing to me.

@BurntSushi
Member

The other (much more minor) concern I'd have with type-safe Intel intrinsics is around figuring out what those types actually are - for things like shuffles we use unsigned types, where I think Clang uses signed types for example.

Yeah, in most cases the types are obvious, but you're right that in some cases it's less clear, especially signedness. Some intrinsics also operate on "128 bit vectors" without any specific regard to lane size, and we currently use i8x16 for those.

@hsivonen
Member

hsivonen commented Nov 1, 2017

These functions are all unsafe as previously discussed

What's the rationale for this? Why wouldn't functions that are well-behaved with all inputs be safe?

After [defining types and functions for vendor intrinsics], however, I'd propose we stop.

As noted in the thread on the internals forum, basic bitwise ops, basic lane-wise arithmetic and basic lane-wise comparisons are already provided by LLVM in a cross-ISA way, so pretending that they, too, are vendor-dependent makes things more difficult on every layer.

Since it appears that stdsimd already provides these in the cross-ISA simd module and LLVM covers this functionality in a cross-ISA way, it is simpler not to put this functionality behind vendor functions. Since plausible non-LLVM back ends are likely to provide a cross-ISA abstraction for these, there's no realistic risk of Rust getting severely painted into an LLVM-dependent corner because of these. Therefore, I think there'd be negative value in excluding these from the initial stabilization.

Additionally, it would be very practical for the initial stable API to provide functions for checking that a vector whose lanes logically represent booleans (i.e. the vector is the output of a basic comparison operation) has all lanes true or at least one lane true. These functions have trivial but ISA-specific implementations on SSE2 and Aarch64 but non-trivial implementations on ARMv7+NEON. The standard library should shield Rust programmers from having to implement these for ARMv7+NEON on their own.
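For example, the SSE2 versions are each a couple of instructions; a sketch with std::arch names, assuming mask is the output of a comparison (each lane all-ones or all-zeros):

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

#[cfg(target_arch = "x86_64")]
unsafe fn all(mask: __m128i) -> bool {
    // pmovmskb gathers the top bit of each of the 16 bytes.
    _mm_movemask_epi8(mask) == 0xffff
}

#[cfg(target_arch = "x86_64")]
unsafe fn any(mask: __m128i) -> bool {
    _mm_movemask_epi8(mask) != 0
}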

There's a problem though in the context of the current stdsimd types: If there aren't types for "boolean vectors" as outputs of operations that logically return lane-wise booleans, in order to be efficient, operations that assume lane-wise booleans have to be defined as having memory-safe but otherwise arbitrary results if the input doesn't conform to the form where false is all bits of the lane zero and true is all bits of the lane one.

The simd crate already has a solution that captures the requirement using the Rust type system: boolean vectors.

From the point of view of wishing to use SIMD functionality and currently using the simd crate, it feels very frustrating that problems that already have solutions in the simd crate are treated as if they weren't solved problems in order to minimize commitment to API surface.

I'd much prefer the cut-off be set according to what have been solved problems for a couple of years in the simd crate instead of setting it according to a theoretical notion of "minimal".

So, concretely, I propose the following to be the items in the initial stabilization:

  1. Lane-aware types, such as u8x16 and f32x4. (Already in stdsimd).
  2. Implementations of basic bitwise ops and basic lane-wise arithmetic via the appropriate Rust traits. (Already in stdsimd).
  3. Implementation of basic comparison methods named according to Rust conventions. (Already in stdsimd).
  4. Have the return types of the comparison methods be boolean vector types. (Already in the simd crate but not in stdsimd.)
  5. Provide all() and any() checks that take the output of a comparison result from point 3 above and return an ALU boolean. For the inputs to these to be constrained in a Rustic way by the type system, point 4 above is needed. (Already in the simd crate but not in stdsimd.)
  6. Methods on the lane-aware types for reinterpreting them as each other without changing their bits. (Already in stdsimd).
  7. Functions corresponding to vendor intrinsics for functionality not covered above.

(P.S. If encoding_rs was reimplemented on top of the base described in the above list, the code would end up with more ISA-specific conditionals than now. To keep the level of cross-ISA generality that encoding_rs currently enjoys via unstable features, the stable features would also need to include cross-ISA unpack of u8x16 into two u16x8s by zero extension and pack two u16x8s into u8x16 by discarding the high halves of each u16 lane.)
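A sketch of that unpack/pack pair on SSE2 (std::arch names; note that _mm_packus_epi16 saturates, so the "discard the high halves" semantics needs a mask first):

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

#[cfg(target_arch = "x86_64")]
unsafe fn zero_extend_lo(v: __m128i) -> __m128i {
    // Low eight u8 lanes -> eight u16 lanes, by interleaving with zeros.
    _mm_unpacklo_epi8(v, _mm_setzero_si128())
}

#[cfg(target_arch = "x86_64")]
unsafe fn pack_low_bytes(a: __m128i, b: __m128i) -> __m128i {
    // Mask to the low byte of each u16 lane first, since the pack
    // instruction saturates rather than truncating.
    let m = _mm_set1_epi16(0x00ff);
    _mm_packus_epi16(_mm_and_si128(a, m), _mm_and_si128(b, m))
}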

@sfackler
Member

sfackler commented Nov 1, 2017

@gnzlbg

Could you elaborate? What is it about these types that you found hard to figure out?

For example, we use u8x16 for _mm_shuffle_epi8, but clang works with i8x16: https://rust-lang-nursery.github.io/stdsimd/x86_64/stdsimd/vendor/fn._mm_shuffle_epi8.html https://github.com/llvm-mirror/clang/blob/410d429c629018d3d6b48983af1a6f6678c21edf/test/CodeGen/ssse3-builtins.c#L94

It doesn't actually matter since the instruction doesn't depend on the signedness of the values but it's a bit inconsistent. Having to look in clang codegen tests for what the vector types should be is a bit weird.

@hsivonen

Why wouldn't functions that are well-behaved with all inputs be safe?

The functions are only well-defined if they're running on hardware that implements them. There's a pretty extensive discussion of this on the internals thread: https://internals.rust-lang.org/t/getting-explicit-simd-on-stable-rust/4380.

@hsivonen
Member

hsivonen commented Nov 1, 2017

The functions are only well-defined if they're running on hardware that implements them.

Any more specific reference to the internals thread? It seems to me that e.g. ARM vendor functions should be unavailable when compiling for an x86 target instead of being unsafe when compiling for an ARM target.

@BurntSushi
Member

@hsivonen Compile time isn't the issue. Runtime is. Comparing ARM with x86 isn't the interesting case. The target_feature RFC has details: https://github.com/gnzlbg/rfcs/blob/target_feature/text/0000-target-feature.md#unconditional-code-generation-target_feature The TL;DR is that we explicitly need the ability to potentially execute vendor intrinsics on platforms that don't support them, and if the user messes the runtime detection up, then you get UB.

And yes, we all think it's unfortunate.
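In code, that contract looks roughly like this (a sketch with today's std names; at the time the runtime check was the cfg_feature_enabled! macro):

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(xs: &[i32]) -> i32 {
    xs.iter().sum() // stand-in for a real AVX2 body
}

#[cfg(target_arch = "x86_64")]
fn sum(xs: &[i32]) -> i32 {
    if is_x86_feature_detected!("avx2") {
        unsafe { sum_avx2(xs) } // sound: the feature was verified at runtime
    } else {
        xs.iter().sum() // mess up this check and the call above is UB
    }
}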

@withoutboats

@hsivonen to be a bit more concrete, when it comes to simd, x86 is a lowest common denominator; most x86 simd instructions are not available on all x86 CPUs because they've been added in more recent iterations. For authors distributing binaries (e.g. ripgrep) it would be infeasible to distribute a different binary for every set of simd instructions a machine could have (and to ask their users to determine which simd instructions their own machine has).

@gnzlbg
Contributor

gnzlbg commented Nov 1, 2017

@sfackler

For example, we use u8x16 for _mm_shuffle_epi8, but clang works with i8x16

That is LLVM, this is clang:

static __m128i _mm_shuffle_epi8(__m128i __a, __m128i __b);

Note that __m128i is an "integer vector". It doesn't care whether the integers are signed or unsigned; the behavior depends on the intrinsic. It is actually a bit more "insane" than that, because intrinsics like _mm_loadl_epi64 take an __m128i* but this pointer does not need to be aligned (while a __m128i should be) and it doesn't need to point to an __m128i either (it can point to any 64-bit integer).

The ARM NEON API is (IMO) a bit better and requires us to provide the strongly-typed vector types. Since we need these anyways, the cost of providing them for X86 as well is low. Whether we should do it is another matter, but if we decide to do it we are necessarily going to introduce some friction with the C API of the x86 intrinsics because it is very close to being completely untyped.


@hsivonen

These functions have trivial but ISA-specific implementations on SSE2 and Aarch64 but non-trivial implementations on ARMv7+NEON. The standard library should shield Rust programmers from having to implement these for ARMv7+NEON on their own.

it feels very frustrating that problems that already have solutions in the simd crate are treated as if they weren't solved problems in order to minimize commitment to API surface.

There are many ways to implement a higher-level SIMD library. Instead of trying to push a particular controversial solution through the RFC process, we are trying to find the minimal solution that allows building zero-cost, efficient, portable, and easy to use higher-level SIMD libraries on stable Rust.

@hsivonen
Member

hsivonen commented Nov 2, 2017

https://github.com/gnzlbg/rfcs/blob/target_feature/text/0000-target-feature.md#unconditional-code-generation-target_feature The TL;DR is that we explicitly need the ability to potentially execute vendor intrinsics on platforms that don't support them, and if the user messes the runtime detection up, then you get UB.

The RFC says "(see the "On the unsafety of #[target_feature]" section).", but there's no such section, so it's still unclear to me why executing an illegal instruction needs to be UB and can't be trusted to cause program termination that's as safe as panic=abort.

(If this is answered in the internals forum thread, it's really hard to find, because Discourse 1) breaks ctrl-F and 2) even when using the browser Find feature via the menu, it doesn't work, because Discourse doesn't put the whole discussion in the DOM at the same time.)

There are many ways to implement a higher-level SIMD library.

I think putting the types in the standard library is the right call, but it also means that trait implementations for things like Add and BitOr belong there and can't be provided by a separate higher-level SIMD library.

Instead of trying to push a particular controversial solution through the RFC process, we are trying to find the minimal solution that allows building zero-cost, efficient, portable, and easy to use higher-level SIMD libraries on stable Rust.

I think it's very unfortunate if controversy avoidance is prioritized ahead of making good use of what LLVM provides (cross-ISA basic arithmetic, bitwise ops and comparisons) or what has been shown to be good use of the Rust type system for a couple of years by the simd crate (boolean vectors as a way to signal that a vector promises to have all bits of each lane either set or unset). (In terms of type system usage, I think the design should optimize for the future of writing new cross-ISA portable Rust code over ease of porting pre-existing ISA-specific C code.)

I think it's a bad outcome for Rust if solutions to solved problems are rejected. Controversy avoidance isn't a great way to arrive at good design.

@gnzlbg
Contributor

gnzlbg commented Nov 2, 2017 via email

@gnzlbg
Contributor

gnzlbg commented Nov 2, 2017 via email

@gnzlbg
Contributor

gnzlbg commented Nov 6, 2017

(I figured out how to search the original thread. As far as I can tell, the rationale for unsafety was the risk of SIGILL. Per the previous paragraph, I disagree with the notion that reliable process termination is UB or not safe for Rust purposes. And even if SIGILL was conceded to be UB, the right place for unsafe would be the point of transition to a function with more target_features and not each operation that might raise SIGILL.)

This is incorrect. The risk is not that a SIGILL can be raised, but rather, that the CPU doesn't raise it.

Quoting the RFC (which I recommend you read):

Calling a function annotated with #[target_feature] on a host that does not support the feature invokes undefined behavior in LLVM, the assembler, and possibly the hardware (see this comment).

That is, calling a function on a target that does not support its feature set is undefined behavior, and this RFC cannot specify otherwise. The main reason is that target_feature is a promise from the user to the toolchain and the hardware that the code will not be reached on a CPU that does not support the feature. LLVM, the assembler, and the hardware all assume that the user will not violate this contract, and there is little that the Rust compiler can do to make this safer:

  • The Rust compiler cannot emit a compile-time diagnostic because it cannot know whether the user is going to run the binary in a CPU that supports the features or not.
  • A run-time diagnostic always incurs a run-time cost, and is only possible iff the absence of a feature can be detected at run-time (the "Future Extensions" section of this RFC discusses how to implement "Run-time diagnostics" to detect this, when possible).

However, the --target-feature/--target-cpu compiler options allow one to implicitly generate binaries that reliably run into undefined behavior without needing any unsafe annotations at all, so the answer to the question "Should #[target_feature] be safe/unsafe?" is indeed a hard one.

The main differences between #[target_feature] and --target-feature/--enable-feature are the following:

  • --target-feature/--enable-feature are "backend options" while #[target_feature] is part of the language
  • --target-feature/--enable-feature is specified by whoever compiles the code, while #[target_feature] is specified by whoever writes the code
  • compiling safe Rust code for a particular target, and then running the binary on that target, can only produce undefined behavior iff #[target_feature] is safe.

This RFC chooses that the #[target_feature] attribute only applies to unsafe fns, so that if one compiles safe Rust source code for a particular target, and then runs the binary on that particular target, no unsafety can result.

Note that we can always make #[target_feature] safe in the future without breaking backwards compatibility, but the opposite is not true. That is, if somebody figures out a way of making #[target_feature] safe such that the above holds, we can always make that change.

@gnzlbg
Contributor

gnzlbg commented Nov 6, 2017

@hsivonen which particular portable SIMD intrinsics would you like the stdsimd crate to expose?

@hsivonen
Member

hsivonen commented Nov 6, 2017

The risk is not that a SIGILL can be raised, but rather, that the CPU doesn't raise it.

OK. I covered that scenario in my previous comment and my conclusion stands: The unsafe bit should be the case where the callee has target_features that the caller doesn't have. Other than that, there should be no need for unsafe arising from the instructions used being ISA extensions. Therefore, the functions representing vendor-defined operations shouldn't be unsafe.

which particular portable SIMD intrinsics would you like the stdsimd crate to expose?

The simd_foo stuff from platform-intrinsics (possibly with the exception of simd_cast; I'm not familiar enough with it to say).

For the non-shuffle parts, it would be preferable for them to be exposed as methods (trait implementations where possible) on the SIMD types (as seen in the simd crate and, except for the unideal return type of the comparisons, in the simd module of stdsimd).

The shuffles (and simd_cast if included) would probably need to be framed as a distinct feature compared to the rest, to highlight that the performance characteristics of the shuffles are very dependent on the quality of implementation of the back end. (I'm aware that this makes shuffles more controversial than the bitops, basic lane-wise arithmetic and lane-wise comparisons, but I'd still like to have portable shuffles on the stable channel, because my code uses portable shuffles. I'm also aware that the index array being required to have only constant values is type-wise a rather odd thing for Rust, but it seems that rustc already has code to enforce that requirement.)

To be clear: While I'd prefer there not to be vendor functions for the operations that are covered by the non-shuffle, non-cast operations in the simd_foo set in platform-intrinsics, I'm not in any way objecting to providing vendor functions for operations that happen to overlap with LLVM's capability of optimizing shuffles.

Additionally, methods all() and any() (as seen on the boolean vectors in the simd crate) should be part of the design even if they don't map to intrinsics in LLVM at the moment. I think the output of the comparison operations should have a type system-level indication that they fulfill the preconditions of all() and any(). That is, I think there should be types for boolean vectors as seen in the simd crate (unlike in the simd module of the stdsimd crate).

(Considering that all() and any() are on track to becoming operations on their own right in WebAssembly SIMD, I wouldn't be surprised if future LLVM provided intrinsics for these in order to have an efficient WASM mapping. Until such time, these methods would need ISA-specific implementations in stdsimd.)

@gnzlbg
Contributor

gnzlbg commented Nov 6, 2017

The simd_foo stuff from platform-intrinsics (possibly with the exception of simd_cast; I'm not familiar enough with it to say).

But those are already exposed... or which one is not exposed?

For the non-shuffle parts, it would be preferable for them to be exposed as methods (trait implementations where possible)

Isn't this already the case? (modulo bugs)

Additionally, methods all() and any() (as seen on the boolean vectors in the simd crate) should be part of the design even if they don't map to intrinsics in LLVM at the moment.

Do you want these methods for other types beyond boolean vectors?

@hsivonen
Member

hsivonen commented Nov 6, 2017

The simd_foo stuff from platform-intrinsics (possibly with the exception of simd_cast; I'm not familiar enough with it to say).

But those are already exposed... or which one is not exposed?

For the non-shuffle parts, it would be preferable for them to be exposed as methods (trait implementations where possible)

Isn't this already the case? (modulo bugs)

They are exposed in the simd module of the stdsimd crate, so they are exposed insofar as stdsimd exists on nightly. I meant that I want them in the standard library, for use from stable Rust, as stdsimd graduates there.

Additionally, methods all() and any() (as seen on the boolean vectors in the simd crate) should be part of the design even if they don't map to intrinsics in LLVM at the moment.

Do you want these methods for other types beyond boolean vectors?

No. Just boolean vectors. AFAICT, the reason for boolean vector types to exist is to indicate conformance to the precondition of these two methods.

Trying to formulate meaning for these methods without the precondition would make them less efficient, at least in the SSE2 case.

@gnzlbg
Contributor

gnzlbg commented Nov 6, 2017

@hsivonen are there any reasons, beyond convenience, for boolean vectors to be part of the stdsimd crate?

@hsivonen
Member

hsivonen commented Nov 6, 2017

@hsivonen are there any reasons, beyond convenience, for boolean vectors to be part of the stdsimd crate?

The comparison methods on the types that should be part of stdsimd should return boolean vectors, so those methods need to be able to declare boolean vectors as their return types.

@hsivonen
Member

hsivonen commented Nov 7, 2017

The unsafe bit should be the case where the callee has target_features that the caller doesn't have. Other than that, there should be no need for unsafe arising from the instructions used being ISA extensions. Therefore, the functions representing vendor-defined operations shouldn't be unsafe.

I wrote a draft for this in RFC form.

@alexcrichton
Member Author

Ok so for an update on the x86 side of things (not the portable side of things yet) I believe we're currently at:

  • Tons of intrinsics (all?) implemented.
  • All intrinsics take the vendor types (__m128i and such) rather than the portable types
  • The vendor types (on x86) have a quite minimal API, basically just giving you Debug as a "nice to have"

In that sense the major missing piece is still "this all needs to be sound" with the shim insertion by rustc.

Some notes I'd like to take down about the state of x86 as well and its future prospects:

  • All implemented intrinsics are verified against Intel's documentation to have the same signature. There are exceptions, however, such as the interpretation of void* as *mut u8.
  • Tons of intrinsics have #[assert_instr] to prevent them from regressing over time. Unfortunately some are missing #[assert_instr] even when Intel lists an instruction, and others are asserting a different instruction than what Intel specifies. Note that we think, at least, that all intrinsics are behaviorally the same.
  • Intrinsics are currently grouped largely into i586/i686/x86_64 groups. This division is somewhat arbitrary though. For example x86_64 intrinsics often refer to a 64-bit integer argument, and the corresponding instruction can't work on x86 because there are no 64-bit integer registers. There are intrinsics, though, that are in i586 which reference i64 arguments, for example. Additionally the i586/i686 distinction has to do with the two targets, but in theory I believe this distinction shouldn't exist because that's what stdsimd is all about: layering target features...

I'd be relatively confident about stabilizing these intrinsics in that it's (a) easy to fix bugs, (b) easy to resolve disputes by looking at Intel's docs, and (c) runs a pretty low risk of us making a mistake we can't correct (API wrong, etc). The most worrisome part to me is the availability of intrinsics across targets. For example, why are some intrinsics using i64 available on i686 while some aren't? All in all, the most worrisome part to me is pretty low risk in terms of stabilization.

@gnzlbg
Contributor

gnzlbg commented Jan 21, 2018 via email

@alexcrichton
Member Author

@gnzlbg oh no worries I'm not trying to place blame or anything, I'm still quite a fan of the current organization! It definitely cuts down on #[cfg] traffic. My main concerns about it are:

  • We've got intrinsics in x86_64 I think primarily because Intel documents them as using 64-bit registers. If Intel doesn't document an instruction, though, such as _mm256_insert_epi64, then we don't have a way to automatically sort it into the x86_64 bucket. For example with that intrinsic, would we be comfortable stabilizing availability on x86 platforms? Unsure!
  • Right now there's a difference between the availability of intrinsics on i686-unknown-linux-gnu and i586-unknown-linux-gnu. In theory though there shouldn't be any distinction here! As you've noted i686 should just equal i586 + sse2, so in theory all of the intrinsics should be available for the i586 target, just requiring the sse2 feature. For whatever reason though we're hitting a lot of LLVM errors and we're unable to make this transition.

So for me there's sort of two action items for my "main concern" about the platform availability. One would be to discuss the distinction between x86/x86_64. Do we want to forbid intrinsics on x86 if Intel documents them as using 64-bit registers? I'm personally leaning towards "no" here because it seems that C typically has them on both 32 and 64-bit platforms (presumably with emulation on 32?). That, coupled with the fact that Intel doesn't seem to consistently document intrinsics as needing 64-bit registers or not, makes me think that it's probably worthwhile to not draw the distinction.

The next item would be to investigate allowing the i686 module, as-is, to exist on the i586-unknown-linux-gnu target. I'm not actually sure what this would entail, but we'd probably need to dig around in LLVM or something like that.

Overall that would actually remove the distinction between i586/i686/x86_64, and that'd mean at least that all the Intel/AMD vendor intrinsics would be available unconditionally on all x86 platforms (naming is hard). That's nice at least from a stabilization point of view, but I'm curious what others think!

@hdevalence
Contributor

Maybe this would be better in another issue rather than in this mega-thread: is there already discussion about the best Rust type signatures for the Intel intrinsics? I looked in the stdsimd issues list and didn't see one.

@hsivonen
Member

hsivonen commented Jan 23, 2018

All intrinsics take the vendor types (__m128i and such) rather than the portable types

Why? Not going with lane-aware types (and boolean vector types even) will make things harder in the future (yet another layer of wrappers). (See my previous comments here and my #Rust2018 post.) This is a case where taking incremental steps makes things harder overall than going directly to what the simd crate has already demonstrated we should be aiming for.

(To be clear, I want to have the intrinsics, too, but I'm really worried about them not using the lane-aware types and about having them before having the portable part of the simd crate on non-nightly Rust.)

@alexcrichton
Member Author

@hdevalence we had a short-lived issue at #263 for moving to the Intel types, although I'm not aware of another location for ongoing discussion about the best types otherwise.

@hsivonen I think #263 should have some more rationale, but does that help?

@hdevalence
Contributor

The discussion I was looking for is in #251 (linked from #263), and after reading it I think that the decision to match the upstream definitions exactly is the right one.

I don't think that it's a good idea to block stabilizing vendor intrinsics until the development of a "portable" SIMD library happens. In fact I don't think they're connected at all and that it's a mistake to try to couple them together.

@hsivonen
Member

I think #263 should have some more rationale, but does that help?

It helps me to understand the decision, yes. Thank you.

I'm still unhappy about it, though, because even if lane-aware wrappers are developed eventually, having the non-lane-aware versions around makes it more likely that there's going to be vendor-specific code even for stuff that could be written portably using lane-aware types and the features the simd crate exposes.

That is, I believe lane-aware types nudge towards portability for the parts that can be portable, so you end up with ISA-specific divergence for small building blocks instead of whole algorithms having totally separate code for different ISAs.

I don't think that it's a good idea to block stabilizing vendor intrinsics until the development of a "portable" SIMD library happens. In fact I don't think they're connected at all and that it's a mistake to try to couple them together.

I might agree if the simd crate didn't exist, but when it exists and it's working so well for me, it's so frustrating that it doesn't get rubber-stamped for stable (with the vendor-specific bits that overlap with the systematic vendor intrinsic support effort removed or hidden).

Having non-portable SIMD on stable and portable SIMD on nightly puts pressure on crates to go with non-portable code in order to have SIMD on stable. It's not good for the ecosystem to steer crates to be Intel-specific when they could be supporting other ISAs, most notably the ARM ones.

The portable stuff has real uses. Today, in the Firefox context on x86_64, I measured a 97.54% execution time reduction for our "Does this UTF-16 string contain right-to-left characters?" function with Thai input from replacing C++ that didn't use explicit SIMD with portable simd crate-using Rust code. The difference between explicit SIMD and code written as ALU code is less impressive on aarch64, but it's a great thing that aarch64 support came as an essentially free by-product. (And ARMv7+NEON will get automatically fixed once any() and all() for ARMv7+NEON in the simd crate are fixed. Once we get Ion support for aarch64 in SpiderMonkey, we might actually ship the aarch64 code.)

@petrochenkov

Sorry for a bikesheddy question (and it probably was discussed before), but why is the crate named "simd" and not "vendor_intrinsics" or something, if it's supposed to give access to intrinsics in general, not just SIMD-related ones, especially given that the interfaces are not even uNxM-based now?
(This is kinda similar to the gcc crate that provided access to all C/C++ compilers.)

@BurntSushi
Member

BurntSushi commented Jan 24, 2018

@petrochenkov Because the purpose of the crate has evolved independently of the name. When I first started this, the intent was to provide both vendor intrinsics and a minimal platform independent SIMD API. Given the evolution of purpose, a name change is likely a good idea.

@gnzlbg
Contributor

gnzlbg commented Jan 24, 2018 via email

@hsivonen
Member

Henri, there are over 1000 x86 SIMD intrinsics without counting AVX-512 (and many more if you count PowerPC, ARM, MIPS, SPARC, etc.).

Like I said, I now understand the rationale for the approach even though I'm not happy about it.

Obviously, researching the lane configuration for everything is a big task, so there's the matter of prioritization: making maximal intrinsic exposure the priority, or making it the priority to do the commonality (i.e. 128-bit vectors) really well typing-wise up front. At this point, I don't think it productive for me to oppose the prioritization that has been chosen.

It would make me very happy (and many other people as well) if we ever have a strongly typed SIMD solution available in std. The only thing preventing that from happening is somebody writing an RFC for it. Nothing more, nothing less.

So we could have a subset (vendor-specific methods removed) of the simd crate on stable if I just wrote it up in RFC form?

@scottlamb

There's something I'm missing in this discussion. If the vendor intrinsics are stabilized (using the vendor-specific types), can't the simd crate be rebuilt on top of those with its current abstractions? Why would it still require nightly?

@hdevalence
Contributor

@scottlamb The reason is because the vendor intrinsics are themselves already implemented in terms of LLVM's vector types. For instance, the _mm256_shuffle_epi32 intrinsic is implemented in terms of an architecture-independent LLVM shuffle, which eventually lowers to the desired vpshufd instruction.

@alexcrichton
Member Author

@hdevalence

The discussion I was looking for is in #251 (linked from #263), and after reading it I think that the decision to match the upstream definitions exactly is the right one.

Ok cool, thanks for looking over the discussion!


@hsivonen

Thanks for the thoughts! I'd at least personally still like to have some discussion of the portable types before we stabilize anything to see how they fit in the bigger picture and all, but for me at least all the vendor intrinsics are currently in the state that they "should" be in (vendor types, defined by Intel, automatically verified, etc).

I think I may be less certain than @gnzlbg about an RFC here (but it probably can't hurt!). Would y'all perhaps be up for discussion on an issue solely focused around "What would a first pass of stabilization look like for portable types?" The motivation you outlined, @hsivonen, with the risk of code commonly being "accidentally less portable", is pretty convincing to me in terms of considering at least a minimal-ish subset of APIs for stabilization. (That being said, I'm also still not certain how we'd want to schedule that with stabilizing the vendor intrinsics, but there's time yet still!)

Although this may also be left to in-person discussion amongst stakeholders rather than an issue. I hope to do that when we're a little closer with the technical hurdles taken care of. So maybe not an issue just yet but a discussion soon!


@scottlamb

If the vendor intrinsics are stabilized (using the vendor-specific types), can't the simd crate be rebuilt on top of those with its current abstractions? Why would it still require nightly?

The simd crate uses different underlying infrastructure in LLVM to auto-magically get the best implementation based on the compilation target. While it's possible that we could replicate such logic in a user-defined crate I believe the point has been brought up historically that it's a huge amount of work (and not necessarily possible). The real crucial feature of simd is that it's built with compiler-aware knowledge of what's going on so optimizations can happen under the hood and whatnot.

@alexcrichton
Member Author

I've submitted rust-lang/rust#47743 to solve the shim/ABI problem we've been experiencing. We're... edging ever closer to removing blockers for stabilization!

@gnzlbg
Contributor

gnzlbg commented Jan 28, 2018

So we could have a subset (vendor-specific methods removed) of the simd crate on stable if I just wrote it up in RFC form?

@hsivonen As I see it, the path that has the most chances of success is:

  • stabilize intrinsics following the vendors spec type-wise (we are close to being here for x86 at least).
  • stabilize portable vector types in parallel, maybe after we stabilize the x86 intrinsics while working on the ARM, PowerPC, MIPS, ... intrinsics (because these vendor intrinsics are strongly-typed).

Once we have these two things stable, re-implementing the mid-level simd crate and higher-level libraries like faster should be possible in stable Rust (or we should at least be very close to that).

In particular, the author of faster has expressed interest in helping pushing for a strongly-typed SIMD layer around the weakly-typed vendor intrinsics, that higher-level SIMD libraries like faster can build upon. This mid-level layer would probably look a lot like the simd crate and/or the stdsimd crate version 0.0.4 and maybe such a layer makes it into std someday through the RFC process.

If someone is interested in working on this, I think that writing a single monolithic RFC for it right now is not the best idea, because that RFC would be huge and mix way too many controversial and "in progress" things, so progress and discussion would be very slow.

OTOH, writing an RFC / crate proposing portable vector types and their API (e.g. boolean types, reductions, conversions, etc.) is something that I think we want to stabilize soonish anyway, can be done right now in parallel to all other efforts, and is something that any future SIMD mid-layer can build upon. The stdsimd crate already implements some of this, and so does the simd crate.

@hsivonen
Member

Once we have these two things stable, re-implementing the mid-level simd crate

Why re-implementing? Would it not be OK to put the existing simd crate code--without the x86, arm and aarch64 modules exposed publicly--under core without re-implementing it (i.e. letting it use platform_intrinsics instead of the vendor intrinsics for stuff like simd_add)?

@gnzlbg
Contributor

gnzlbg commented Jan 28, 2018

Why re-implementing? Would it not be OK to put the existing simd crate code--without the x86, arm and aarch64 modules exposed publicly--under core without re-implementing it (i.e. letting it use platform_intrinsics instead of the vendor intrinsics for stuff like simd_add)?

You can try to write an RFC for putting the simd crate in core without providing an implementation of it that works in stable Rust. However, under the assumption that by that point such a stable implementation of the simd crate will be possible, I'd expect reasonable people to want to see such an implementation during the RFC process. At that point, why write an RFC at all? It's a lot of work (more work than just making the simd crate work on stable) and it does not really add any value.

@alexcrichton
Member Author

Ok! I'm going to close this now in favor of rust-lang/rfcs#2325, but thanks for the discussion everyone!
