Turn back on allowing immediately assigned-to decls to not be NaN-initialized #1029

Merged · 8 commits · Jan 14, 2022

Conversation

@SteveBronder (Contributor) commented Nov 9, 2021

Summary

This turns the allow_uninitialized_decls optimization back on as a default. The previous version tried to be too clever in figuring out whether a decl was initialized inside of if/while/block statements.

This also adds the integration Stan program unenforce-initialize.stan to the compiler optimization test set, which shows the generated C++ for some odd initialization examples we need to catch and keep NaN-initialized, such as using a variable on both the lhs and rhs of a decl-and-assignment like real x = x; (a sketch of why that case must keep the fill is below).
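A minimal sketch (hand-written, not actual stanc3 output) of why a case like real x = x; has to keep the NaN fill: the assignment that immediately follows the decl reads the declared variable on its RHS, so dropping the initialization would read an indeterminate value.

    #include <limits>

    int main() {
      // Decl: the NaN fill cannot be skipped here...
      double x = std::numeric_limits<double>::quiet_NaN();
      // ...because the very next assignment reads x on its RHS.
      x = x;
      return 0;
    }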

Submission Checklist

  • Run unit tests
  • Documentation
    • If a user-facing change was made, the documentation PR is here:
    • OR, no user-facing changes were made

Release notes

Turn back on allowing immediately assigned-to decls to not be NaN-initialized

Copyright and Licensing

By submitting this pull request, the copyright holder is agreeing to
license the submitted work under the BSD 3-clause license (https://opensource.org/licenses/BSD-3-Clause)

@WardBrian (Member)

This might be a bit of a big ask, but I think we should try to check that this optimization is actually doing anything meaningful at the assembly code level after the C++ is compiled. I’m just spitballing, but my first instinct is that gcc probably does this kind of thing already and it would be good to look into it

@SteveBronder (Contributor, Author)

So the result is kind of interesting! When I read your comment I thought "Oh, that's probably correct", but both compilers seem to miss that the NaN-filled memory is thrown away and doesn't actually need to be allocated. Below is a godbolt link comparing (at least for a toy example) how we currently do this vs. the new version.

https://godbolt.org/z/3GW1ET461

On the LHS is our current codegen, the middle pane is this PR, and the RHS shows a diff of the ASM. Starting at main: and following the paths of each, I'm seeing malloc called twice in the current version (in main and on branch .L31, which is reached from .L7 -> .L5 -> Eigen::PlainObjectBase<Eigen::Matrix<double, -1, -1, 0, -1, -1> >::resize(long, long)). The current code also has to do two traversals via Eigen::internal::call_dense_assignment_loop(...). You can click the compiler dropdown for each version in the bottom left and see that clang also misses this in pretty much the same way.
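A rough sketch (not the exact code from the godbolt link; the function names current_lowering and pr_lowering are just illustrative) of the two lowerings being compared. The NaN-filled temporary plus the follow-up assignment is the extra work this PR removes.

    #include <Eigen/Dense>
    #include <limits>

    // Current codegen: allocate a NaN-filled matrix, then assign over it.
    Eigen::MatrixXd current_lowering(const Eigen::MatrixXd& y) {
      Eigen::MatrixXd x = Eigen::MatrixXd::Constant(
          y.rows(), y.cols(), std::numeric_limits<double>::quiet_NaN());
      x = y.array().exp().matrix();  // second pass over the storage
      return x;
    }

    // This PR: construct the result directly, no NaN fill.
    Eigen::MatrixXd pr_lowering(const Eigen::MatrixXd& y) {
      Eigen::MatrixXd x = y.array().exp().matrix();
      return x;
    }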

@WardBrian (Member) commented Nov 9, 2021

Interesting! It definitely makes a difference for Eigen types. For scalars the asm seems identical, and for std::vector the difference is negligible, but the Eigen differences seem to definitely make this worthwhile. I wonder if this is something the Eigen folks themselves have looked at, because it seems like something the compiler definitely tries to do for other types.

We should really revisit #549 soon. I'm not sure if the --O convention makes sense for us or if we should do it case by case, something like --O=+uninit_decls,-partial_eval,+lazy_motion. At least at first that would let us expose these in a really modular way without needing to decide what the default is or what belongs at each level. Not sure how it would work with the command line parser, but we've had an open issue to refactor that since forever (I think it's waiting on #1019 / a Core_kernel update).

@WardBrian (Member)

Oh also, do you know if this kind of analysis has been applied to any of the other optimizations? I know stuff like lazy code motion exists in gcc/clang, though obviously some of our partial eval wishes (like automatically creating log1m calls) are Stan-specific.

@SteveBronder (Contributor, Author)

> Interesting! It definitely makes a difference for Eigen types. For scalars the asm seems identical, and for std::vector the difference is negligible, but the Eigen differences seem to definitely make this worthwhile. I wonder if this is something the Eigen folks themselves have looked at, because it seems like something the compiler definitely tries to do for other types.

Oh neat! Can you share the godbolt link for that?

> We should really revisit #549 soon. I'm not sure if the --O convention makes sense for us or if we should do it case by case, something like --O=+uninit_decls,-partial_eval,+lazy_motion.

Yeah, I'd really like that! Tbh I wouldn't hate #955 being at O1 for a release cycle. I've looked through the optimizations in that PR and I think all of those should pretty much be on by default, in that they are always good to do and gcc will not be able to figure those things out. I think #549 was really just waiting for a thumbs up from @rybern, or at least to figure out next steps.

> Oh also, do you know if this kind of analysis has been applied to any of the other optimizations? I know stuff like lazy code motion exists in gcc/clang, though obviously some of our partial eval wishes (like automatically creating log1m calls) are Stan-specific.

To my knowledge, no, that never happened. When I read the C++ for some of these I'm like, hmmm, maybe? The lazy code motion one always looks a bit odd to me. But yeah, it would be nice to have things like partial evaluation safely on as well.

@WardBrian (Member)

Scalars: https://godbolt.org/z/WefEajnox
The ASM is identical, since the compiler detects that x is never used with the original value.

Vector: https://godbolt.org/z/r57xncqGd

The ASM is different, but as far as I can tell both versions only do one allocation.

@WardBrian (Member)

I'm a strong believer in optimization only after profiling (or, in a weaker sense, at least looking at the assembly with a rough 'less assembly is probably faster than more assembly' metric). Especially with how clever modern compilers can be, it's not difficult to end up in a situation where the code you've written feels like it should be better, but the compiler isn't able to optimize it as much so it's actually worse.

I think this and #955 are both good examples of things we know we can do better than gcc on, because of what we know about how we're using the matrices and the results of compiling our existing code

@SteveBronder (Contributor, Author)

> I'm a strong believer in optimization only after profiling (or, in a weaker sense, at least looking at the assembly with a rough 'less assembly is probably faster than more assembly' metric).

I agree with the former, but with the latter I would be careful of "less assembly == better"! For instance, one version can have a generic scalar loop doing multiplies, while another branches on the packet size to dispatch to a loop of scalar x86, SSE, or AVX multiply instructions. The second one will be longer but much faster, since it's using vectorized instructions. Another example is when the compiler replaces an expensive mul instruction with a few add and bitshift instructions: a mul is relatively expensive, while a few adds and a bitshift are cheap.
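A tiny, compiler-dependent illustration of the mul-vs-shift point (the function names are just for the example): many compilers will lower a multiply by a small constant into lea/shift/add sequences, which is more instructions but typically cheaper than an integer mul.

    // May be compiled to something like lea/shl/add rather than an imul.
    long times_ten(long x) { return x * 10; }

    // Roughly what that lowering looks like written by hand: 10*x == 8*x + 2*x.
    long times_ten_by_hand(long x) { return (x << 3) + (x << 1); }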

@WardBrian (Member)

Yes exactly, you can't just count lines. But if you see fewer muls, that's probably better in a side-by-side comparison.

At any rate, I don’t think we’ve currently done much of either

@SteveBronder (Contributor, Author)

Oh also, slightly modifying the std::vector<> version to assign a std::vector<double> to v instead of an initializer list {} yields the same behavior as Eigen: it calls new twice for the NaN-filling version but only once for the non-NaN-filling version.

https://godbolt.org/z/h3WGb6ovb

The scalar version won't double-allocate because it has a static size of 1.
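A sketch along the lines of the std::vector comparison above (not the exact godbolt code; the function names are illustrative). In the linked example the NaN-filling variant ends up calling operator new twice, while direct construction only allocates once.

    #include <limits>
    #include <vector>

    // NaN-filling variant: pays for the fill, then immediately overwrites it.
    std::vector<double> filled_then_assigned(const std::vector<double>& src) {
      std::vector<double> v(src.size(), std::numeric_limits<double>::quiet_NaN());
      v = src;
      return v;
    }

    // Direct construction: a single copy, no wasted fill.
    std::vector<double> direct(const std::vector<double>& src) {
      std::vector<double> v = src;
      return v;
    }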

@SteveBronder (Contributor, Author)

@serban-nicusor-toptal do you know where the little button went on Jenkins/Blue Ocean that lets you restart the tests?

@rok-cesnovar (Member)

I am still seeing it:

[screenshot]

@SteveBronder (Contributor, Author)

Not there for me :(

[screenshot]

@rok-cesnovar (Member)

That is weird. I restarted them; Nic will know more about why you are not seeing it.

@serban-nicusor-toptal (Contributor)

We've recently made some changes to the Jenkins ACL. I've fixed your permissions and you should see it now.

@SteveBronder (Contributor, Author)

Thank you!

@SteveBronder (Contributor, Author)

@WardBrian would you mind taking a look at this?

@WardBrian (Member)

I really don't know if I'm familiar enough with the optimization code to review - I would be relying pretty heavily on the test output rather than reviewing the actual changes

@SteveBronder (Contributor, Author)

IMO that's fine, if you don't mind. The optimization here really doesn't use any of the optimization library machinery. We just traverse the MIR looking for a Decl followed by an Assignment. If that assignment is to the full object in the Decl, we know we do not need to fill the matrix with NaN values, since the next line fills the whole object, so we just change the Decl's tag to tell the rest of the compiler that it does not need a NaN fill.
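A hypothetical, C++-flavored sketch of that rule (the real pass lives in the OCaml compiler; the MIR types here are made up for illustration). The check also has to make sure the assignment's RHS does not read the declared variable, which is the real x = x; case from the summary:

    #include <cstddef>
    #include <string>
    #include <vector>

    struct Stmt {
      enum class Kind { Decl, Assign, Other } kind;
      std::string target;           // variable declared or assigned
      bool assigns_whole_object;    // full-object assignment, no indexing
      bool rhs_reads_target;        // RHS mentions the target (e.g. real x = x;)
      bool needs_nan_fill = true;   // tag read by the rest of the compiler
    };

    void mark_uninitialized_decls(std::vector<Stmt>& stmts) {
      for (std::size_t i = 0; i + 1 < stmts.size(); ++i) {
        Stmt& decl = stmts[i];
        const Stmt& next = stmts[i + 1];
        if (decl.kind == Stmt::Kind::Decl && next.kind == Stmt::Kind::Assign
            && next.target == decl.target && next.assigns_whole_object
            && !next.rhs_reads_target) {
          // Safe to skip the NaN fill: the very next statement overwrites
          // the whole object without reading it first.
          decl.needs_nan_fill = false;
        }
      }
    }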

@WardBrian (Member)

This might again be something more for #549, but let's say hypothetically this introduces a bug: can the user disable it easily if it's enabled by default / at O0?

@SteveBronder (Contributor, Author)

Oh, you're right. Actually, I do like the idea of doing #549 first and then this PR.

@WardBrian (Member)

I think the primary thing that PR needs is doc? Unless we want to mirror what I did in #1058 and allow individual configuration on top of the O levels

@SteveBronder (Contributor, Author)

Umm, I think we just want the O levels. I can make a PR in the docs repo with Ryan's docs tomorrow and fix up anything #549 needs to get merged in.

@SteveBronder (Contributor, Author)

@WardBrian would you mind giving this a look? It would be a very nice optimization to have at O1!

Comment on lines +944 to +950
Eigen::Matrix<double, -1, -1> X_tp2 =
Eigen::Matrix<double, -1, -1>::Constant(10, 10,
std::numeric_limits<double>::quiet_NaN());
stan::model::assign(lcm_sym9__, stan::math::exp(X_d),
"assigning variable lcm_sym9__");
stan::model::assign(X_tp2, lcm_sym9__, "assigning variable X_tp2");
Eigen::Matrix<double, -1, -1> X_tp3;
Eigen::Matrix<double, -1, -1> X_tp3 =
(Member)

These also seem like they're wrong re-introductions of the initializer. Do we have no way around this temporary here? The original model doesn't feature control flow here

(Contributor, Author)

I thought so at first as well, but if you look at the next lines:

      stan::model::assign(lcm_sym10__, stan::math::exp(lcm_sym9__),
        "assigning variable lcm_sym10__");
      stan::model::assign(X_tp3, lcm_sym10__, "assigning variable X_tp3");

The lazy code motion assigns lcm_sym10__ right below this and then assigns X_tp3. That fails the check, so NaNs get filled here. We could run this before lazy code motion, but I'm not sure whether things can go from needing to be initialized to not initialized (and vice versa) during the different propagations and LCM.

(Member)

So previously it was running earlier?

@SteveBronder (Contributor, Author) commented Jan 13, 2022

No, before it had different logic. The previous scheme was, once we hit a decl, to iterate through the remaining statements, and as long as we saw a full assignment before the variable was used, we would allow it to be uninitialized. The logic is simpler now: we just look at the statement right below the decl, and if it is not a full assignment we force the NaN fill. It's lazier, but it's much safer. IMO I like this simpler approach because it hits the bigger misses right now. For instance, it stops the parameters from being filled with NaN values (since they are always filled through the deserializer).
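A hand-written illustration (not stanc3 output; f and unrelated are placeholder functions) of the difference between the two schemes:

    #include <Eigen/Dense>

    Eigen::MatrixXd f(const Eigen::MatrixXd& y) { return y * 2.0; }
    void unrelated() {}

    // The old scheme scanned forward past statements that don't touch x, so it
    // would have allowed skipping the NaN fill here. The new scheme sees that
    // the statement right after the decl is not a full assignment to x, so the
    // NaN fill stays.
    void intervening_statement(const Eigen::MatrixXd& y) {
      Eigen::MatrixXd x;  // decl
      unrelated();        // not an assignment to x
      x = f(y);           // full assignment, but not immediately after the decl
    }

    // Both schemes skip the NaN fill here: the assignment to the whole object
    // is the very next statement.
    void immediate_assignment(const Eigen::MatrixXd& y) {
      Eigen::MatrixXd x;  // decl
      x = f(y);           // immediate full assignment
    }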
