Turn back on allowing immediately assigned-to decls to not be NaN-initialized #1029

Merged · 8 commits · Jan 14, 2022

Conversation

@SteveBronder (Contributor) commented Nov 9, 2021

Summary

This turns the allow_uninitialized_decls optimization back on as a default. The previous version tried to be too clever in figuring out whether a decl was initialized inside of if/while/block statements.

This also adds the integration Stan program unenforce-initialize.stan to the compiler optimization test set, which shows the generated C++ for some odd initialization examples we need to catch and keep NaN-initialized, such as using a variable on both the lhs and rhs of a decl-and-assignment like real x = x; (a sketch of why that case must keep the fill is below).
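A minimal sketch (hand-written, not actual stanc3 output) of why a case like real x = x; has to keep the NaN fill: the assignment that immediately follows the decl reads the declared variable on its RHS, so dropping the initialization would read an indeterminate value.

    #include <limits>

    int main() {
      // Decl: the NaN fill cannot be skipped here...
      double x = std::numeric_limits<double>::quiet_NaN();
      // ...because the very next assignment reads x on its RHS.
      x = x;
      return 0;
    }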

Submission Checklist

  • Run unit tests
  • Documentation
    • If a user-facing change was made, the documentation PR is here:
    • OR, no user-facing changes were made

Release notes

Turn back on allowing immediately assigned-to decls to not be NaN-initialized

Copyright and Licensing

By submitting this pull request, the copyright holder is agreeing to
license the submitted work under the BSD 3-clause license (https://opensource.org/licenses/BSD-3-Clause)

@WardBrian (Member)

This might be a bit of a big ask, but I think we should try to check that this optimization is actually doing anything meaningful at the assembly code level after the C++ is compiled. I’m just spitballing, but my first instinct is that gcc probably does this kind of thing already and it would be good to look into it

@SteveBronder (Contributor, Author)

So the result is kind of interesting! When I read your comment I thought "Oh, that's probably correct", but both compilers seem to miss that the NaN-filled memory is thrown away and doesn't actually need to be allocated. Below is a godbolt link comparing (at least for a toy example) how we currently do this vs. the new version.

https://godbolt.org/z/3GW1ET461

On the LHS is our current codegen, the middle pane is this PR, and the RHS shows a diff of the ASM. Starting at main: and following the paths of each, I'm seeing malloc called twice in the current version (in main and on branch .L31, which is reached from .L7 -> .L5 -> Eigen::PlainObjectBase<Eigen::Matrix<double, -1, -1, 0, -1, -1> >::resize(long, long)). The current code also has to do two traversals via Eigen::internal::call_dense_assignment_loop(...). You can click the compiler dropdown for each version in the bottom left and see that clang also misses this in pretty much the same way.
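A rough sketch (not the exact code from the godbolt link; the function names current_lowering and pr_lowering are just illustrative) of the two lowerings being compared. The NaN-filled temporary plus the follow-up assignment is the extra work this PR removes.

    #include <Eigen/Dense>
    #include <limits>

    // Current codegen: allocate a NaN-filled matrix, then assign over it.
    Eigen::MatrixXd current_lowering(const Eigen::MatrixXd& y) {
      Eigen::MatrixXd x = Eigen::MatrixXd::Constant(
          y.rows(), y.cols(), std::numeric_limits<double>::quiet_NaN());
      x = y.array().exp().matrix();  // second pass over the storage
      return x;
    }

    // This PR: construct the result directly, no NaN fill.
    Eigen::MatrixXd pr_lowering(const Eigen::MatrixXd& y) {
      Eigen::MatrixXd x = y.array().exp().matrix();
      return x;
    }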

@WardBrian (Member) commented Nov 9, 2021

Interesting! It definitely makes a difference for Eigen types. For scalars the asm seems identical, and for std::vector the difference is negligible, but the Eigen differences seem to definitely make this worthwhile. I wonder if this is something the Eigen folks themselves have looked at, because it seems like something the compiler definitely tries to do for other types.

We should really revisit #549 soon. I'm not sure if the --O convention makes sense for us or if we should do it case by case, something like --O=+uninit_decls,-partial_eval,+lazy_motion. At least at first that would let us expose these in a really modular way without needing to decide what the default is or what belongs at each level. Not sure how it would work with the command line parser, but we've had an open issue to refactor that since forever (I think it's waiting on #1019 / a Core_kernel update).

@WardBrian (Member)

Oh also, do you know if this kind of analysis has been applied to any of the other optimizations? I know stuff like lazy code motion exists in gcc/clang, though obviously some of our partial eval wishes (like automatically creating log1m calls) are Stan-specific.

@SteveBronder (Contributor, Author)

> Interesting! It definitely makes a difference for Eigen types. For scalars the asm seems identical, and for std::vector the difference is negligible, but the Eigen differences seem to definitely make this worthwhile. I wonder if this is something the Eigen folks themselves have looked at, because it seems like something the compiler definitely tries to do for other types.

Oh neat! Can you share the godbolt link for that?

> We should really revisit #549 soon. I'm not sure if the --O convention makes sense for us or if we should do it case by case, something like --O=+uninit_decls,-partial_eval,+lazy_motion.

Yeah, I'd really like that! Tbh I wouldn't hate #955 being at O1 for a release cycle. I've looked through the optimizations in that PR and I think all of those should pretty much be on by default, in that they are always good to do and gcc will not be able to figure those things out. I think #549 was really just waiting for a thumbs up from @rybern, or at least to figure out next steps.

> Oh also, do you know if this kind of analysis has been applied to any of the other optimizations? I know stuff like lazy code motion exists in gcc/clang, though obviously some of our partial eval wishes (like automatically creating log1m calls) are Stan-specific.

To my knowledge, no, that never happened. When I read the C++ for some of these I'm like, hmmm, maybe? The lazy code motion one always looks a bit odd to me. But yeah, it would be nice to have things like partial evaluation safely on as well.

@WardBrian (Member)

Scalars: https://godbolt.org/z/WefEajnox
The ASM is identical, since the compiler detects that x is never used with the original value.

Vector: https://godbolt.org/z/r57xncqGd

The ASM is different, but as far as I can tell both versions only do one allocation.

@WardBrian (Member)

I'm a strong believer in optimization only after profiling (or, in a weaker sense, at least looking at the assembly with a rough 'less assembly is probably faster than more assembly' metric). Especially with how clever modern compilers can be, it's not difficult to end up in a situation where the code you've written feels like it should be better, but the compiler isn't able to optimize it as much so it's actually worse.

I think this and #955 are both good examples of things we know we can do better than gcc on, because of what we know about how we're using the matrices and the results of compiling our existing code

@SteveBronder (Contributor, Author)

> I'm a strong believer in optimization only after profiling (or, in a weaker sense, at least looking at the assembly with a rough 'less assembly is probably faster than more assembly' metric).

I agree with the former, but with the latter I would be careful of "less assembly == better"! For instance, one version can have a generic scalar loop doing multiplies, while another branches on the packet size to dispatch to a loop of scalar x86, SSE, or AVX multiply instructions. The second one will be longer but much faster, since it's using vectorized instructions. Another example is when the compiler replaces an expensive mul instruction with a few add and bitshift instructions: a mul is relatively expensive, while a few adds and a bitshift are cheap.
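A tiny, compiler-dependent illustration of the mul-vs-shift point (the function names are just for the example): many compilers will lower a multiply by a small constant into lea/shift/add sequences, which is more instructions but typically cheaper than an integer mul.

    // May be compiled to something like lea/shl/add rather than an imul.
    long times_ten(long x) { return x * 10; }

    // Roughly what that lowering looks like written by hand: 10*x == 8*x + 2*x.
    long times_ten_by_hand(long x) { return (x << 3) + (x << 1); }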

@WardBrian (Member)

Yes exactly, you can't just count lines. But if you see fewer muls, that's probably better in a side-by-side comparison.

At any rate, I don’t think we’ve currently done much of either

@SteveBronder (Contributor, Author)

Oh also, slightly modifying the std::vector<> version to assign a std::vector<double> to v instead of an initializer list {} yields the same behavior as Eigen: it calls new twice for the NaN-filling version but only once for the non-NaN-filling version.

https://godbolt.org/z/h3WGb6ovb

The scalar version won't double-allocate because it has a static size of 1.
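A sketch along the lines of the std::vector comparison above (not the exact godbolt code; the function names are illustrative). In the linked example the NaN-filling variant ends up calling operator new twice, while direct construction only allocates once.

    #include <limits>
    #include <vector>

    // NaN-filling variant: pays for the fill, then immediately overwrites it.
    std::vector<double> filled_then_assigned(const std::vector<double>& src) {
      std::vector<double> v(src.size(), std::numeric_limits<double>::quiet_NaN());
      v = src;
      return v;
    }

    // Direct construction: a single copy, no wasted fill.
    std::vector<double> direct(const std::vector<double>& src) {
      std::vector<double> v = src;
      return v;
    }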

@SteveBronder (Contributor, Author)

@serban-nicusor-toptal do you know where the little button went on Jenkins/Blue Ocean that lets you restart the tests?

@rok-cesnovar (Member)

I am still seeing it:

[screenshot]

@SteveBronder (Contributor, Author)

Not there for me :(

[screenshot]

@rok-cesnovar (Member)

That is weird. I restarted them; Nic will know more about why you are not seeing it.

@serban-nicusor-toptal (Contributor)

We've recently made some changes to the Jenkins ACL. I've fixed your permissions and you should see it now.

@SteveBronder (Contributor, Author)

Thank you!

@SteveBronder (Contributor, Author)

@WardBrian would you mind taking a look at this?

@WardBrian (Member)

I really don't know if I'm familiar enough with the optimization code to review - I would be relying pretty heavily on the test output rather than reviewing the actual changes

@SteveBronder (Contributor, Author)

IMO that's fine, if you don't mind. The optimization here really doesn't use any of the optimization library machinery. We just traverse the MIR looking for a Decl followed by an Assignment. If that assignment is to the full object in the Decl, we know we do not need to fill the matrix with NaN values, since the next line fills the whole object, so we just change the Decl's tag to tell the rest of the compiler that it does not need a NaN fill.
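A hypothetical, C++-flavored sketch of that rule (the real pass lives in the OCaml compiler; the MIR types here are made up for illustration). The check also has to make sure the assignment's RHS does not read the declared variable, which is the real x = x; case from the summary:

    #include <cstddef>
    #include <string>
    #include <vector>

    struct Stmt {
      enum class Kind { Decl, Assign, Other } kind;
      std::string target;           // variable declared or assigned
      bool assigns_whole_object;    // full-object assignment, no indexing
      bool rhs_reads_target;        // RHS mentions the target (e.g. real x = x;)
      bool needs_nan_fill = true;   // tag read by the rest of the compiler
    };

    void mark_uninitialized_decls(std::vector<Stmt>& stmts) {
      for (std::size_t i = 0; i + 1 < stmts.size(); ++i) {
        Stmt& decl = stmts[i];
        const Stmt& next = stmts[i + 1];
        if (decl.kind == Stmt::Kind::Decl && next.kind == Stmt::Kind::Assign
            && next.target == decl.target && next.assigns_whole_object
            && !next.rhs_reads_target) {
          // Safe to skip the NaN fill: the very next statement overwrites
          // the whole object without reading it first.
          decl.needs_nan_fill = false;
        }
      }
    }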

@WardBrian (Member)

This might again be something more for #549, but let's say hypothetically this introduces a bug: can the user disable it easily if it's enabled by default / at O0?

@SteveBronder (Contributor, Author)

Oh, you're right. Actually, I do like the idea of doing #549 first and then this PR.

@WardBrian (Member)

I think the primary thing that PR needs is doc? Unless we want to mirror what I did in #1058 and allow individual configuration on top of the O levels

@SteveBronder (Contributor, Author)

Umm, I think we just want the O levels. I can make a PR in the docs repo with Ryan's docs tomorrow and fix up anything #549 needs to get merged in.

@SteveBronder (Contributor, Author)

@WardBrian would you mind giving this a look? It would be a very nice optimization to have at O1!

Comment on lines +944 to +950
Eigen::Matrix<double, -1, -1> X_tp2 =
Eigen::Matrix<double, -1, -1>::Constant(10, 10,
std::numeric_limits<double>::quiet_NaN());
stan::model::assign(lcm_sym9__, stan::math::exp(X_d),
"assigning variable lcm_sym9__");
stan::model::assign(X_tp2, lcm_sym9__, "assigning variable X_tp2");
Eigen::Matrix<double, -1, -1> X_tp3;
Eigen::Matrix<double, -1, -1> X_tp3 =
(Member)

These also seem like they're wrong re-introductions of the initializer. Do we have no way around this temporary here? The original model doesn't feature control flow here

(Contributor, Author)

I thought so at first as well, but if you look at the next lines:

      stan::model::assign(lcm_sym10__, stan::math::exp(lcm_sym9__),
        "assigning variable lcm_sym10__");
      stan::model::assign(X_tp3, lcm_sym10__, "assigning variable X_tp3");

The lazy code motion assigns lcm_sym10__ right below this and then assigns X_tp3. That fails the check, so NaNs get filled here. We could run this before lazy code motion, but I'm not sure whether things can go from needing to be initialized to not initialized (and vice versa) during the different propagations and LCM.

(Member)

So previously it was running earlier?

@SteveBronder (Contributor, Author) commented Jan 13, 2022

No, before it had different logic. The previous scheme was, once we hit a decl, to iterate through the remaining statements, and as long as we saw a full assignment before the variable was used, we would allow it to be uninitialized. The logic is simpler now: we just look at the statement right below the decl, and if it is not a full assignment we force the NaN fill. It's lazier, but it's much safer. IMO I like this simpler approach because it hits the bigger misses right now. For instance, it stops the parameters from being filled with NaN values (since they are always filled through the deserializer).
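A hand-written illustration (not stanc3 output; f and unrelated are placeholder functions) of the difference between the two schemes:

    #include <Eigen/Dense>

    Eigen::MatrixXd f(const Eigen::MatrixXd& y) { return y * 2.0; }
    void unrelated() {}

    // The old scheme scanned forward past statements that don't touch x, so it
    // would have allowed skipping the NaN fill here. The new scheme sees that
    // the statement right after the decl is not a full assignment to x, so the
    // NaN fill stays.
    void intervening_statement(const Eigen::MatrixXd& y) {
      Eigen::MatrixXd x;  // decl
      unrelated();        // not an assignment to x
      x = f(y);           // full assignment, but not immediately after the decl
    }

    // Both schemes skip the NaN fill here: the assignment to the whole object
    // is the very next statement.
    void immediate_assignment(const Eigen::MatrixXd& y) {
      Eigen::MatrixXd x;  // decl
      x = f(y);           // immediate full assignment
    }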
