Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/issue 1062 ode speedup #1066

Merged
merged 36 commits into from
May 10, 2019
Merged

Feature/issue 1062 ode speedup #1066

merged 36 commits into from
May 10, 2019

Conversation

wds15
Copy link
Contributor

@wds15 wds15 commented Nov 17, 2018

Ok, I found a way which I think is totally fine with our AD logic. We are still getting a nice 18% or so speedup on the sir example.

Summary

The intent of the PR is to speedup the ODE integration. The approach is to avoid repeated allocation of the parameters vector theta_ on the nested AD stack. The nested AD is carried out when calling the coupled_ode_system functor. Since the parameters will never change during ODE integration we do not need to have these on the nested AD stack. The speedup of the SIR performance benchmark when running in 5 repetitions is 19%:

develop: stat_comp_benchmarks/benchmarks/sir/sir.stan,98.8624837399
this PR: stat_comp_benchmarks/benchmarks/sir/sir.stan,80.3885923386

[10:11:37][sebi@sebastians-macbook-pro-1:~/work/performance-tests-cmdstan]$ ./comparePerformance.py performance.csv develop_performance.csv
('stat_comp_benchmarks/benchmarks/sir/sir.stan', 0.81)

Tests

test/unit/math/rev/mat/functor/integrate_ode_adams_prim_test.cpp
test/unit/math/rev/mat/functor/integrate_ode_adams_rev_test.cpp
test/unit/math/rev/mat/functor/integrate_ode_bdf_rev_test.cpp
test/unit/math/rev/mat/functor/integrate_ode_cvodes_grad_rev_test.cpp
test/unit/math/rev/mat/functor/integrate_ode_bdf_prim_test.cpp
test/unit/math/rev/arr/functor/integrate_ode_rk45_grad_test.cpp
test/unit/math/rev/arr/functor/integrate_ode_rk45_tooMuchWork_test.cpp
test/unit/math/rev/arr/functor/integrate_ode_rk45_test.cpp
test/unit/math/prim/arr/functor/integrate_ode_rk45_test.cpp
test/unit/math/rev/arr/functor/coupled_ode_system_test.cpp
test/unit/math/prim/arr/functor/coupled_ode_system_test.cpp

Side Effects

Faster ODE integration.

Checklist

  • Math issue make coupled_ode_system more efficient #1062

  • Copyright holder: Sebastian Weber

    The copyright holder is typically you or your assignee, such as a university or company. By submitting this pull request, the copyright holder is agreeing to the license the submitted work under the following licenses:
    - Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
    - Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)

  • the basic tests are passing

    • unit tests pass (to run, use: ./runTests.py test/unit)
    • header checks pass, (make test-headers)
    • docs build, (make doxygen)
    • code passes the built in C++ standards checks (make cpplint)
  • the code is written in idiomatic C++ and changes are documented in the doxygen

  • the new changes are tested (no behavior as tested so far is changed)

@wds15 wds15 changed the title Feature/issue 1062 ode speedup WIP Feature/issue 1062 ode speedup Jan 7, 2019
@wds15 wds15 changed the title WIP Feature/issue 1062 ode speedup Feature/issue 1062 ode speedup Jan 27, 2019
@wds15
Copy link
Contributor Author

wds15 commented Jan 27, 2019

Bump. Once tests finish this can be reviewed and merged I think.

@@ -72,7 +72,7 @@ struct coupled_ode_system<F, double, var> {
: f_(f),
y0_dbl_(y0),
theta_(theta),
theta_dbl_(value_of(theta)),
theta_copy_(theta),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I discussed with @wds15. This is not the right copy; we need to create a copy that is disconnected from theta and is on our no_chain stack.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please test that there's nothing added to the stack on construction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I am using the nonstack thing now (really ugly to get things there!). However, I am not sure how to test and why to test that nothing gets on construction onto the stack. Ok, I could use ChainableStack::instance().var_stack.size()... but that is a not an exported API and is used nowhere in any test which I could find such that I am not sure that this can be tested given our API. If you think it is important to test, can you suggest a scheme to do so?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never mind... I figured that we can test the stack size changes if we nest things.

... this is one of the annoyances of our nesting system - things are slightly inconsistent wrt to the non-nested stack; at least to me.

So the coupled_ode_system is now tested to not put stuff on the AD stack upon construction.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're in the guts of the autodiff stack and you're playing with things that live globally and within the nested stack. You should be testing that what you're doing is consistent with what you think should be happening. Don't just assume the code you've written works... that will get us into a lot of trouble.

Also, since we're now dealing with threading and mpi and gpu and all sorts of other complications that could be unsafe, it makes a lot of sense to make sure these unit tests actually trigger failures if the basic assumptions of validity of autodiff stack through nested operations falls apart.

Copy link
Member

@syclik syclik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool! I think it's the doc that needs to be updated. But I think the changes are pretty good.

@@ -45,7 +45,7 @@ struct coupled_ode_system<F, double, var> {
const F& f_;
const std::vector<double>& y0_dbl_;
const std::vector<var>& theta_;
const std::vector<double> theta_dbl_;
std::vector<var> theta_copy_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we please rename this to something descriptive? This is meant to be a copy of the theta, but placed on the global stack, but essentially treated as temporaries where we can reset the adjoints. I don't know what a good name is, but just calling it theta_copy_ really isn't that great.

The old name was ok because it implies we've lost the connections between theta_ and theta_dbl_ by demoting it to a double.

@@ -45,7 +45,7 @@ struct coupled_ode_system<F, double, var> {
const F& f_;
const std::vector<double>& y0_dbl_;
const std::vector<var>& theta_;
const std::vector<double> theta_dbl_;
std::vector<var> theta_copy_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(For the top-level comment... I can't comment through the PR interface on lines of code that aren't really affected.)

Can you document what this implementation does? I think it'd help anyone else trying to maintain the code.

  • that the theta is copied and put on the stack, but are not meant to be autodiffed
  • that there's use of nested autodiff to compute the sensitivities

msgs_(msgs) {}
msgs_(msgs) {
for (auto& p : theta)
theta_copy_.emplace_back(var(new vari(value_of(p), false)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we know p is a stan::math::var, why not use v.val() directly? I think it'd be a little clearer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me the value_of is actually clearer as I don't need to recal the details of the var class. Calling value_of always gives me a double - no matter what. The val method feels like an internal function whereas the value_of is a clear API.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try to explain why I find this so much more confusing than if this was:

for (var&& p : theta) 
  theta_copy_.emplace_back(var(new vari(p.val(), false));

And this is coming from me having to read this code to figure it out.

  1. value_of() is a function that's designed to work on double, var, fvar<double>, fvar<var>, and fvar<fvar<var>>. So I'm looking at value_of(p) and trying to figure out... "what is p?"
  2. auto& p instead of var&& p doesn't help me at all. I know that p is one of theta, so theta is a collection, but what is it? Since I'm flipping between base template classes that can handle all autodiff types, it's not always easy to keep track that this is a rev only implementation. It'd really just help to have this be var&& p so the type is obvious.
  3. theta... now I need to go and find where this is defined. C++ is a structured language, but it actually allows the declaration of struct and class member variables either at the top, below, or anywhere in between. And now I've got to search. I was looking at this using the GitHub interface, so the actual declaration was off of the interface; in order for me to go look at this, I had to go and check out the branch, open the file, all to find that theta is declared as std::vector<var>.

I think the thing that we're not seeing eye-to-eye on is the use of value_of. I find it much easier to reason about something that's only valid for var instead of having to work to figure out whether the object passed in is one of many different types. (It's just a little more mental work the reader has to do when it can all be made explicit as an alternative.)

(By the way, your statement is incorrect -- value_of does not always give a double. It'll give the type of the value, which for fvar<var> will be var.)

Hopefully you've gotten some context on why this is harder to read for a reviewer or anyone else coming to the code later on. This is minor, but the idea still stands -- make it easier for the next person to read. It takes effort now, but it's so worth it later on.

Copy link
Contributor Author

@wds15 wds15 Feb 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh... value_of is defined differently in my head... I always forget about higher order things. With the correct definitions the val call is better, yeah. Sorry, but I am not used to meaning of value_of.

Honestly, I do expect the reviewer to look at the file and not just the diff from git. It is really hard to expect our changes to be self-explanatory on a single diff line. The auto vs actual type ... I read the opposite elsewhere. Your third point I do not quite understand. I am looping over theta which is the argument to the constructor.

EDIT: The compiler only accepts

for (const var& p : theta)

So I went with that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! The third point wasn't actually there... you're right. I should have looked at the constructor signature, not the actual type of the variable.

stan/math/rev/core/var.hpp Show resolved Hide resolved
@charlesm93 charlesm93 self-requested a review February 20, 2019 22:39
@charlesm93
Copy link
Contributor

I can hop in and complete the review. I'll do a review of the code and make sure all of @syclik 's requests have been addressed.

@charlesm93
Copy link
Contributor

I discussed with @wds15. This is not the right copy; we need to create a copy that is disconnected from theta and is on our no_chain stack.

I get the basic idea based on the PR description but I'm not sure what the glitch was here. This makes me worry I'm missing important subtleties. @wds15 maybe we can discuss over a quick video call.

Copy link
Contributor

@charlesm93 charlesm93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the queries from the previous review have been addressed, with a minor exception, see line 133 of coupled_ode_system.hpp. The unit tests look fine, in particular they check that the size of the stack does not change. See my additional comments below.

@wds15
Copy link
Contributor Author

wds15 commented Feb 21, 2019

@charlesm93 The idea is to have the parameters on a stack which is disconnected to overall AD tree. Such a place is the nonstack AD tape which I never knew before talking to @syclik about it (this explains the comment you quote).

I added some comments as you requested in the code. Maybe we shortly discuss this at the meeting later in case you have more questions?

@charlesm93
Copy link
Contributor

charlesm93 commented Feb 22, 2019

Ok, this all looks good to me. The only minor point missing is an explanatory comment on line 133 of coupled_ode_system.hpp. Once this is done, we should be good to go.

Copy link
Member

@syclik syclik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@charlesm93, thanks for following up on this!

It just dawned on me that we could have nest a couple times. I believe this PR will add a copy of the variables on the stack when we could easily just remove them when we leave the function. Your call on whether that makes sense. Thanks again. (and thanks, @wds15, for putting this together!)

@wds15
Copy link
Contributor Author

wds15 commented Apr 1, 2019

Ok. So here is an approach which can work and has the best of both approaches. The idea is:

  1. in the constructor of coupled_ode_system we call start_nested and then put the theta onto the nested no chain AD stack.
  2. you would expect to put the recover_memory_nested into the destructor of coupled_ode_system, but that will not work as then the decoupling operation places things on the nested stack which gets cleared away.

Right now I address the 2. thing by doing the recover_memory_nested in the decoupled states operation.... which should not stay like this. So if this approach has merits to you, then we should turn the decouple_states operation into a static function of coupled_ode_system and call it after getting rid of the coupled_ode_system instance. This could work and give us

  • full independence of the outer AD tree
  • nothing left behind

If that makes sense to you, I would refactor accordingly.

(I still think that the approach to leave things behind on the no chain outer AD stack is cleanest and the best compromise... but you feel obvious strongly against about that while I am strongly against modifying the outer AD tree given we head into a parallel world)

@syclik
Copy link
Member

syclik commented Apr 1, 2019

(I still think that the approach to leave things behind on the no chain outer AD stack is cleanest and the best compromise... but you feel obvious strongly against about that while I am strongly against modifying the outer AD tree given we head into a parallel world)

I'm not strongly opposed to the approach of leaving things behind on the AD stack. It is a design decision and I think that's probably the cleanest and we should go with that. I think that should be the safest thing to do given the circumstances.

In general, we shouldn't just leave things behind. If we do, we should clearly state what's going on and why we've done it this way. That's what I feel strongly about, not the fact that we do it. By questioning whether we can remove it, we've actually come a long way in the design discussion that we wouldn't have.

Are you ok with going back to putting things on the global stack on construction? We can't remove them on destruction. That means the constructor and the functor are the functions that need to be locked, but the AD stack can be modified outside of those times without a problem.

@wds15
Copy link
Contributor Author

wds15 commented Apr 1, 2019

Great. I much prefer to go back to the design which uses the no chain approach (but leave things behind).

I will revert the code and add some doc around that as to what is safe / what not / what the side-effect is.

At least in my head having the guarantee that starting a nested AD tree gives me an independent sub-tree is a very good one (I think @bob-carpenter said this once if I am not wrong). I would expect that this holds right now in Stan-math - and we should keep it like this.

@bob-carpenter
Copy link
Contributor

bob-carpenter commented Apr 1, 2019 via email

@wds15
Copy link
Contributor Author

wds15 commented May 5, 2019

I reverted to the make-nochain-copy approach described above which we agreed on. I feel much safer with this!

The only side-effect of this approach is that instances of the parameter vector theta will be left on the nochain AD stack (even after the coupled_ode_system instance is gone). Other than that I don't see how anything can break this approach in a parallel world where we are heading.

I added a note to the respective specialization which do this trick which explains what is done and what side-effects are. So this is ready for final review.

Tagging @syclik and @charlesm93 who are well familiar with the context here.

@wds15
Copy link
Contributor Author

wds15 commented May 8, 2019

Here we seem to have the same as issue as in #1072 such that I suggest the same here:

Either we can commit to a timeline, look for other reviewers or we drops this PR.

Tagging again @seantalts and @bob-carpenter for suggestions on this one.

syclik
syclik previously approved these changes May 8, 2019
Copy link
Member

@syclik syclik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just have a couple of questions. It'd be great to address them just to have some answers, but the PR is great!

@@ -44,6 +44,13 @@ namespace math {
* method</a> as implemented in Boost's <code>
* boost::numeric::odeint::runge_kutta_dopri5</code> integrator.
*
* During ODE integration the global autodiff tape is continuously
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is meant by "the adjoints of the parameter vector are used for Jacobian calculations"? And how does that interact with concurrency?

test/unit/math/rev/arr/functor/coupled_ode_system_test.cpp Outdated Show resolved Hide resolved
@wds15
Copy link
Contributor Author

wds15 commented May 9, 2019

@syclik : The confusing doc you found leading to your question

What is meant by "the adjoints of the parameter vector are used for Jacobian calculations"? And how does that interact with concurrency?

Is still a left-over from the old approach which used the outer AD tape. The approach which we implement now using the no chain stack is completely safe wrt to concurrency in the usual sense. I removed that comment there from the doc (which was also put into the cvodes_integrator file).

Once passes tests this should then be fine to merge.

Looks like there is once more the need for your approval. Sorry for that (I haven't actively dismissed your approval; probably that was automatic due to the change set).

Thanks!

(BTW, with the forthcoming independent AD in the parallel design docs we will be able to write this optimisation in a way such that nothing will be left behind, I think ... but that will still take some time until it lands, of course)

@syclik
Copy link
Member

syclik commented May 9, 2019

Thanks for removing that comment -- I was scratching my head at that for a little bit, but I'm glad it was cleaned up.

Yes, the way the approvals work now, it needs to be re-approved once a commit has been made. It's a pessimistic view of the world, but it's to prevent someone maliciously committing whatever they want after getting an approval and merging that. I think it's the right approach to take for an open-source project like ours.

Copy link
Member

@syclik syclik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@wds15
Copy link
Contributor Author

wds15 commented May 9, 2019

It makes sense to re-approve if there are changes...no objection to that. I just wasn't aware of it and always feel guilty to use peoples time if not actually needed.

@wds15 wds15 merged commit 0784a82 into develop May 10, 2019
@wds15
Copy link
Contributor Author

wds15 commented May 10, 2019

Whow! 6 month PR being around / 4 month under review or so / finally merged. Cool.

I know this PR bends our usual conventions... but it is worth it and at least I learned a lot about our AD reverse mode things.

Thanks @syclik for bearing with me on this one!

@wds15 wds15 deleted the feature/issue-1062-ode-speedup branch June 30, 2019 17:57
@mattfidler
Copy link

@wds15 doesn't this imply that time-varying covariates (ie. thetas that change) won't work correctly?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants