Impl special functions for SIMD #14

Open
27 of 42 tasks
programmerjake opened this issue Sep 30, 2020 · 23 comments
Labels
A-floating-point Area: Floating point numbers and arithmetic C-feature-request Category: a feature request, i.e. not implemented / a PR

Comments

@programmerjake
Member

programmerjake commented Sep 30, 2020

Need all of:

  • div_euclid/rem_euclid
  • clamp
  • max/min
  • rotate_left/rotate_right
  • swap_bytes/reverse_bits
  • saturating_add/saturating_sub
  • saturating_neg/saturating_abs
  • saturating_mul
  • wrapping_add/wrapping_sub/wrapping_mul/wrapping_pow
  • wrapping_div/wrapping_rem/wrapping_div_euclid/wrapping_rem_euclid
  • wrapping_neg/wrapping_abs
  • overflowing_add/overflowing_sub
  • overflowing_mul
  • overflowing_div/overflowing_div_euclid
  • overflowing_rem/overflowing_rem_euclid
  • overflowing_neg/overflowing_abs
  • overflowing_shl/overflowing_shr
  • from_be/from_le/to_be/to_le
  • to_be_bytes/to_le_bytes/from_be_bytes/from_le_bytes
  • {to,from}_ne_bytes
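Since div_euclid/rem_euclid head the list and are easy to confuse with plain `/` and `%`, here is a scalar sketch of the std behavior the SIMD versions would presumably mirror element-wise:

```rust
// How Euclidean division differs from truncating division for
// negative operands (scalar std behavior; sketch only).
fn main() {
    assert_eq!(-7i32 / 2, -3);             // `/` truncates toward zero
    assert_eq!(-7i32 % 2, -1);             // `%` keeps the dividend's sign
    assert_eq!((-7i32).div_euclid(2), -4); // floors so the remainder is >= 0
    assert_eq!((-7i32).rem_euclid(2), 1);
    println!("ok");
}
```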

for integers:

  • leading_zeros/trailing_zeros
  • leading_ones/trailing_ones
  • count_ones/count_zeros
  • pow
  • overflowing_pow
  • saturating_pow
  • wrapping_shl/wrapping_shr
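For the bit-counting items, a quick scalar sanity check of the semantics the SIMD versions would apply per element (sketch only):

```rust
// Scalar semantics of leading/trailing zeros and popcount.
fn main() {
    let x = 0b0001_0110u8;
    assert_eq!(x.leading_zeros(), 3);
    assert_eq!(x.trailing_zeros(), 1);
    assert_eq!(x.count_ones(), 3);
    assert_eq!(x.count_zeros(), 5);
    println!("ok");
}
```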

for floats:

  • trig./hyperbolic functions: impl trig for core::simd #6
  • recip
  • mul_add
  • powi/powf
  • to_int_unchecked
  • to_degrees/to_radians
  • sqrt
  • cbrt
  • hypot
  • exp/exp2/ln/log/log2/log10
  • exp_m1/ln_1p
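exp_m1/ln_1p are on the list because the naive forms exp(x) - 1 and ln(1 + x) lose precision for tiny x; a scalar sketch of the semantics the SIMD versions would mirror element-wise:

```rust
// For tiny x, exp_m1(x) and ln_1p(x) stay accurate where the naive
// forms would cancel catastrophically (scalar sketch).
fn main() {
    let x = 1e-10f64;
    assert!((x.exp_m1() - x).abs() < 1e-18); // ≈ x + x²/2, fully precise
    assert!((x.ln_1p() - x).abs() < 1e-18);  // ≈ x - x²/2, fully precise
    println!("ok");
}
```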

for signed integers and floats:

  • abs
  • signum
  • copysign
  • is_positive/is_negative

See also #109
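Scalar behavior of a few items from the signed-integer/float list, which the SIMD versions would apply per element (sketch only):

```rust
// abs/signum/is_negative on signed integers, copysign on floats.
fn main() {
    assert_eq!((-3i32).abs(), 3);
    assert_eq!((-3i32).signum(), -1);
    assert!((-3i32).is_negative());
    assert_eq!(3.5f32.copysign(-1.0), -3.5);
    println!("ok");
}
```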

@Lokathor
Contributor

Lokathor commented Sep 30, 2020

I don't believe we have non-overflowing/non-wrapping ops actually.

That is, we only have the wrapping version.

@programmerjake
Member Author

Having ops that panic on overflow (like Rust's standard integer ops in debug mode) seems like something that would be useful for debugging, even if it has a runtime penalty. It could be disabled by Release mode, like usual.
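The per-element overflow information such a debug check would branch on already exists on the scalar types; a sketch using `overflowing_add` (the SIMD analogue would produce a mask instead of a single bool):

```rust
// Scalar overflowing_add reports whether the result wrapped, which is
// the kind of flag a debug-mode overflow check could panic on.
fn main() {
    assert_eq!(100u8.overflowing_add(100), (200, false));
    assert_eq!(200u8.overflowing_add(100), (44, true)); // wrapped
    println!("ok");
}
```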

@workingjubilee workingjubilee added A-floating-point Area: Floating point numbers and arithmetic C-feature-request Category: a feature request, i.e. not implemented / a PR labels Sep 30, 2020
@calebzulawski
Member

I would like to add as_slice and as_array functions to this list.

I think we need to be careful with rotate_left and rotate_right: it's unfortunate that std uses the same name for rotating slice elements and rotating bits (both of these cases apply to SIMD vectors)

@programmerjake
Member Author

How about naming them rotate_lanes_left/right and rotate_bits_left/right?
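The distinction motivating the two names can already be seen on scalars and arrays today (a sketch; rotate_lanes_* / rotate_bits_* are the proposed names, not an existing API):

```rust
// Two different "rotate" operations that share a name in std.
fn main() {
    // Bitwise rotation, applied per element (u32::rotate_left):
    let v = [0x8000_0001u32, 0x0000_00FFu32];
    let bits: Vec<u32> = v.iter().map(|x| x.rotate_left(1)).collect();
    assert_eq!(bits, vec![0x0000_0003, 0x0000_01FE]);

    // Lane rotation, moving whole elements (slice::rotate_left):
    let mut lanes = [1u32, 2, 3, 4];
    lanes.rotate_left(1);
    assert_eq!(lanes, [2, 3, 4, 1]);
    println!("ok");
}
```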

@Lokathor
Contributor

Lokathor commented Oct 5, 2020

Update: removing floor/ceil/round/trunc/fract from the list, opened #23 instead.

@thomcc
Member

thomcc commented Oct 6, 2020

for floats:

* [ ]  trig./hyperbolic functions: #6

* ...

* [ ]  cbrt

* ...

* [ ]  exp/exp2/ln/log/log2/log10

* [ ]  exp_m1/ln_1p

So... Is there a reason that these are considered required rather than nice to have? Are there architectures that offer this?

I'm not opposed to it (I was working on an SSE cbrt yesterday, so I agree these aren't useless), but it's also a lot of work, and users quite reasonably might want to make different performance/accuracy tradeoffs here. Also, properly supporting rounding modes in these is a whole can of worms, but hopefully we'll just continue with the good ol' Rust standby of pretending the rounding mode can never change.

Anyway, if we're going for ieee754 recommended operations, there are some missing from the recommended set as of 754-2019. I've attached a screenshot of the relevant table.

[Screenshot of IEEE 754-2019 Table 9.1, "Additional mathematical operations" (the additional recommended operations), plus its continuation.]

Note: rSqrt there is the accurately-rounded version of inverse sqrt. Specifically, it is not equivalent to _mmN_rsqrt_ps (it is equivalent to the _mmN_invsqrt_ps you can get in some places), which is approximated. But we should still expose an approximate rsqrt, since e.g. Intel supports it and inverse sqrt is a very common operation in some areas.
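The usual way to bridge the gap between an approximate hardware rsqrt and an accurate one is a Newton-Raphson refinement step, y' = y * (1.5 - 0.5 * x * y * y). A scalar sketch (the deliberately rough starting guess stands in for a hardware estimate like the ~12-bit result of _mm_rsqrt_ps):

```rust
// One Newton-Raphson step refining an approximate 1/sqrt(x).
fn refine_rsqrt(x: f32, approx: f32) -> f32 {
    approx * (1.5 - 0.5 * x * approx * approx)
}

fn main() {
    let x = 2.0f32;
    let rough = 0.7f32; // pretend hardware estimate of 1/sqrt(2) ≈ 0.7071
    let better = refine_rsqrt(x, rough);
    let exact = 1.0 / x.sqrt();
    // One step shrinks the error from ~7e-3 to ~1e-4:
    assert!((better - exact).abs() < (rough - exact).abs());
    assert!((better - exact).abs() < 1e-3);
    println!("ok");
}
```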

@programmerjake
Member Author

for floats:

* [ ]  trig./hyperbolic functions: #6

* ...

* [ ]  cbrt

* ...

* [ ]  exp/exp2/ln/log/log2/log10

* [ ]  exp_m1/ln_1p

So... Is there a reason that these are considered required rather than nice to have? Are there architectures that offer this?

I just went down the list of functions on f32.

Libre-SOC may potentially provide vector instructions for all of the functions you mentioned, we are almost certainly providing instructions for exp, exp2, ln, log2. IIRC AMDGPU provides some exponential and logarithm functions.

@thomcc
Member

thomcc commented Oct 6, 2020

Hm, okay. Some concerns I'd have, mostly since you mentioned GPUs (which tend to answer these questions by picking whatever is fastest, and honestly somewhat fairly: a lot of these are super expensive to handle correctly in SIMD code):

  1. Are non-finite inputs handled properly? If not, how improper?

    • -ffast-math-style UB?
    • consistent-but-garbage results?
    • consistent-but-fixable results? (e.g. wrong sign when returning nan or whatever)
  2. Ditto, but for other out-of-domain inputs, like negative inputs to sqrt.

  3. Are denormals (other than zero) handled properly?

    • Here proper just means "correct result".
    • I'm only excluding zero because it's unfathomable that it would be broken on 0.0 (assuming 0.0 is part of the function's domain).
  4. Is the current rounding mode respected?

    • If applicable, are other relevant aspects of the fp env respected?
    • Note: This is probably not relevant on GPUs, but it is for us (I think? *).
  5. Does the function produce a precise (max error within 1ulp) result, or is it approximated?

And if not, what do we do?

Also relevant to our fallback: I don't think I've ever seen SIMD implementations of this stuff that actually get all of these right. The vectorclass code linked elsewhere appears not to handle all of this (though I didn't look too closely, and perhaps it's structured so it's handled automatically), and IIRC sleef didn't used to, but maybe it does now.

And to be clear, I'm not saying our fallback implementation has to handle these issues (although certainly we would in an ideal world), but if it doesn't that should be intentional.

Also, I guess the fallback could just be extracting each lane and calling libm on it (although this would either require rust libm, which is pretty slow, or force this stuff into libstd).

* Regarding 4, I vaguely remember hearing it was UB in rust to change the float env? Possibly because LLVM can't fully handle it, or constant propagation, or who knows. Perhaps we don't really need to handle this if that's the case. I also don't know if this is actually true.

@Lokathor
Contributor

Lokathor commented Oct 6, 2020

Yeah, LLVM currently ignores the floating-point environment during optimization, so if we do anything other than the same, we get code whose behavior changes with optimization level, which is classic UB.

They're developing alternative LLVM IR that would let you respect the fp environment, but it's not ready yet (last I heard, around the start of the year).

@thomcc
Member

thomcc commented Oct 6, 2020

Personally, IME changing fpenv is a huge headache and you're better off structuring your code so that it's not needed, even if that means you have to do some computations negated or whatever.

Of all of these, this is the one I'm least willing to go to bat for as something we should support at all (in truth, I'd be happy for someone to tell me it's totally unsupported and code can assume the default rounding mode). That would certainly make the impl of these functions simpler and easier to test.

That said IDK, the Rust libm seems to handle it... I assume we need to also. (And I mean, it might be a part of floating point I don't like, but it is a part of it)

... Also, I just realized I forgot to mention fp status registers and triggering the right fp exceptions, if relevant. Anyway, just assume that list of concerns is #[non_exhaustive]

@Lokathor
Contributor

Lokathor commented Oct 6, 2020

Oh, libm is just wrong in that area. Most of our libm code is just blindly copied from C. The thing is that libm gets too little attention for anyone to care, so oh well.

@programmerjake
Member Author

AMDGPU supports infinities, NaNs (though I don't know which values it produces), signed zeros, and different rounding modes. It has 1 ULP accuracy for exp2 and log2. Other exp/log instructions are implemented in terms of those.

Libre-SOC will have at least 2 modes, one which is only as accurate as Vulkan requires (though if we can provide more than that without much more hardware, we probably will), and one which is supposed to provide the correctly-rounded results specified by IEEE 754 for all supported rounding modes. The second mode may just trap to a software implementation for some of the more complex instructions though, so could be very slow. We haven't decided yet.
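The "implemented in terms of exp2/log2" reduction above rests on standard identities, exp(x) = exp2(x · log2(e)) and ln(x) = log2(x) / log2(e); a scalar sketch checking them (illustration only, not the AMDGPU codegen):

```rust
// Deriving exp/ln from exp2/log2 via change of base.
fn main() {
    let x = 1.7f64;
    let log2_e = std::f64::consts::LOG2_E;
    assert!((x.exp() - (x * log2_e).exp2()).abs() < 1e-12);
    assert!((x.ln() - x.log2() / log2_e).abs() < 1e-12);
    println!("ok");
}
```

(The same change-of-base trick covers log10 via log2(x) / log2(10).)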

@workingjubilee
Member

I think it makes sense right now to say that exposing special float ops on SIMD types should be a relatively strong statement of "you probably can't beat this speed/accuracy tradeoff", and that implementing the rest (and weighing different speed/accuracy tradeoffs) can be its own ongoing/extended discussion.

So if all the relevant vector processors reasonably consistently provide fast and accurate exp/log functions, then we want to expose those right away, and start to set aside other things we know will require more thought.

@workingjubilee
Member

I was not able to find integral pow functions on Intel or Arm intrinsic lists, and so have struck them from the lists.
There are hardware-accelerated floating-point operations for this, of course.

@Lokathor
Contributor

Lokathor commented May 1, 2021

I think we should have Pow on the extended list, wherever that is, even if it is always "library provided" and never actually hardware.

@workingjubilee
Member

workingjubilee commented May 2, 2021

It would be useful to carve things up between what we can expect to have efficient/fast hardware acceleration for and what is reasonable but software-only, yes, for the sake of prioritization.

workingjubilee added a commit that referenced this issue Jun 23, 2021
Add various fns
- Sum/Product traits
- recip/to_degrees/to_radians/min/max/clamp/signum/copysign; #14
- mul_add: #14, fixes #102
@TennyZhuang

Why were the wrap_* ops removed here? In my opinion, wrap_* should always do an overflow check and return an Option<Simd<_>>, which is different from the behavior of the primitive ops (no check in release; check and possibly panic in debug).

@workingjubilee
Member

Simd<T, N> is implicitly Simd<Wrapping<T>, N>. What you describe is the behavior of the checked_* ops.
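The distinction being drawn can be shown with the scalar counterparts on stable Rust (a sketch; `Wrapping<T>` stands in for the always-wrapping Simd element ops):

```rust
// Wrapping<T> always wraps and never panics; checked_* is the
// Option-returning variant the previous comment was describing.
use std::num::Wrapping;

fn main() {
    let a = Wrapping(u8::MAX);
    assert_eq!((a + Wrapping(1)).0, 0); // wraps silently

    assert_eq!(u8::MAX.checked_add(1), None);      // overflow -> None
    assert_eq!(250u8.checked_add(5), Some(255));   // in range -> Some
    println!("ok");
}
```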

@ghost

ghost commented Jul 14, 2022

(I didn't find a tracking issue for checked_*, which is where I would have commented; should you open one?)

It is quite reasonably expected that checked_* operations would be slower than the wrapping equivalents, but I'm not sure what implementations you all have in mind for most checked_* operations?

E.g. for addition, checked_add(x, y) should only cost an estimated ~3-5 extra operations, given that the overflow check is essentially if SIMD::saturating_add(x, y) == x + y?
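The trick being proposed, sketched on scalars (an add overflows exactly when the saturating result differs from the wrapping one; a SIMD version would compare whole vectors and reduce the mismatch mask):

```rust
// checked_add built from saturating_add + wrapping_add, as the
// comment suggests (scalar sketch, not the std implementation).
fn checked_add_via_saturating(x: i32, y: i32) -> Option<i32> {
    let wrapped = x.wrapping_add(y);
    if x.saturating_add(y) == wrapped {
        Some(wrapped)
    } else {
        None
    }
}

fn main() {
    assert_eq!(checked_add_via_saturating(1, 2), Some(3));
    assert_eq!(checked_add_via_saturating(i32::MAX, 1), None);
    assert_eq!(checked_add_via_saturating(i32::MIN, -1), None);
    println!("ok");
}
```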

@programmerjake
Member Author

bitwise rotate left/right came up in #328 (comment) (actually most of that issue was discussing rotations rather than chacha20)

@avhz

avhz commented Feb 25, 2024

Any updates on this issue? I would offer to help, but I suspect it's above my skill level.
But I'm particularly interested in using special functions for SIMD floats (exp, log, etc).

@calebzulawski
Member

No updates. The place to start will be adding more intrinsics to the compiler and then using them in the StdFloat trait.

@avhz

avhz commented Feb 26, 2024

Can't promise anything, but I'll take a look at what's currently done, and if it seems achievable I'll have a go.
