It's unfortunately not as clear a win as it might seem on the surface. From the docs:
Using mul_add may be more performant than an unfused multiply-add if the target architecture has a dedicated fma CPU instruction. However, this is not always true, and will be heavily dependant on designing algorithms with specific target hardware in mind.
FMA is not enabled by default (on x86_64 at least); users need to enable it themselves (e.g. with -C target-feature=+fma). If FMA is not enabled, using mul_add will be considerably slower. You can see the difference in the generated asm here: https://rust.godbolt.org/z/1WEd9hqTP. When the FMA instruction is not available, the generated code jumps to an IEEE-754 conformant software implementation that probably looks something like this: https://github.com/rust-lang/libm/blob/master/src/math/fmaf.rs.
Possibly glam could add an internal conditional mul_add that uses FMA if it is enabled but falls back to an unfused (a * b) + c if it is not. However, this would mean glam gives different results depending on whether it was compiled with or without FMA. That would be a problem for users who care about determinism (for example, if some of their supported architectures had FMA and some didn't), so it would be necessary to add a feature that disables glam's conditional FMA in the name of determinism. It's possible, but it gets a little complicated.
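To make the idea concrete, here is a minimal sketch of what such a conditional helper might look like. The name conditional_mul_add is hypothetical and not part of glam's API, and the check only covers the x86-style "fma" target feature; other architectures would need their own conditions.

```rust
/// Hypothetical helper (not glam's actual API): fused multiply-add when the
/// build enables the `fma` target feature, unfused `a * b + c` otherwise.
///
/// `cfg!` expands to a compile-time `true`/`false`, so the branch that is not
/// taken is removed by the optimizer and no runtime check remains.
#[inline]
pub fn conditional_mul_add(a: f32, b: f32, c: f32) -> f32 {
    if cfg!(target_feature = "fma") {
        // Single rounding; maps to a hardware FMA instruction.
        a.mul_add(b, c)
    } else {
        // Two roundings; avoids the slow software fma fallback.
        (a * b) + c
    }
}
```

Building with RUSTFLAGS="-C target-feature=+fma" would select the fused branch. Because the two branches round differently, the same inputs can produce slightly different outputs across builds, which is exactly the determinism concern described above.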
Is there a reason why mul_add is not used more often in glam? An example is glam-rs/src/f64/dquat.rs, lines 659 to 662 in 1ea8163.
Using mul_add would improve the accuracy and performance (performance on most platforms, see docs).
Are you open to PRs that utilize mul_add?
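As a small illustration of the accuracy point, the sketch below compares an unfused x * y - 1.0 with the fused equivalent for inputs chosen so that the product's low bits are lost when it is rounded before the subtraction. The function name rounding_demo is made up for this example.

```rust
/// Returns `(unfused, fused)` results of `x * y - 1.0`
/// for `x = 1 + eps` and `y = 1 - eps`, where `eps = f64::EPSILON`.
pub fn rounding_demo() -> (f64, f64) {
    let eps = f64::EPSILON; // 2^-52
    let x = 1.0 + eps;
    let y = 1.0 - eps;

    // The exact product is 1 - eps^2, which rounds to 1.0 in f64,
    // so the subtraction then yields exactly 0.0.
    let unfused = x * y - 1.0;

    // mul_add rounds only once, after the addition, so the tiny
    // -eps^2 term survives.
    let fused = x.mul_add(y, -1.0);

    (unfused, fused)
}
```

The fused result keeps the -eps^2 term that the unfused version rounds away. mul_add returns this value whether or not a hardware FMA instruction is used; only the speed differs.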