avoid vari on chain-stack if var is constructed from an arithmetic type #1675

Merged (11 commits) on Feb 19, 2020

@wds15
Contributor

wds15 commented Feb 4, 2020

Summary

This PR augments the var constructor for primitive types. Whenever a var is constructed from a primitive, we can put the respective vari on the nochain stack, which avoids calls to its chain method when we propagate the chain rule.

Tests

Tests that check the size of the AD tape have been adjusted to treat the AD tape size as the combined size of the chain and nochain stacks (rather than wrongly defining it as the chain stack size only).

Side Effects

Should speed up programs, because the varis of vars constructed from base types no longer land on the chain stack.

Checklist

  • Math issue: vars created from primitives must not be put on chain stack (#1694)

  • Copyright holder: Sebastian Weber

    The copyright holder is typically you or your assignee, such as a university or company. By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:
    - Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
    - Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)

  • the basic tests are passing

    • unit tests pass (to run, use: ./runTests.py test/unit)
    • header checks pass, (make test-headers)
    • dependencies checks pass, (make test-math-dependencies)
    • docs build, (make doxygen)
    • code passes the built in C++ standards checks (make cpplint)
  • the code is written in idiomatic C++ and changes are documented in the doxygen

  • the new changes are tested

@wds15
Contributor Author

wds15 commented Feb 4, 2020

@bob-carpenter is this totally off what I am trying here?

This would basically move all instances of vars which are created from arithmetic base values (double, float, ...) to a vari implementation which is put on the nochain stack. This should make the recent patches for the gradient functional obsolete (and do the optimization automatically everywhere).
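To make the idea concrete, here is a minimal toy sketch of a tape with a chain stack and a nochain stack. It is not the Stan Math implementation: `ToyVari`, `ToyMultiply`, `ToyTape`, `SweepResult`, and `run_product` are hypothetical names made up for illustration, and ownership is handled with `std::unique_ptr` rather than an arena. It only shows that leaves built from a double contribute nothing in the reverse sweep, so keeping them off the chain stack saves calls without changing the gradient.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Toy node: holds a value and an adjoint; leaves have an empty chain().
struct ToyVari {
  double val;
  double adj = 0.0;
  explicit ToyVari(double v) : val(v) {}
  virtual ~ToyVari() = default;
  virtual void chain() {}
};

// Toy product node: chain() pushes its adjoint to both operands.
struct ToyMultiply : ToyVari {
  ToyVari* a;
  ToyVari* b;
  ToyMultiply(ToyVari* a_, ToyVari* b_)
      : ToyVari(a_->val * b_->val), a(a_), b(b_) {}
  void chain() override {
    a->adj += adj * b->val;
    b->adj += adj * a->val;
  }
};

struct ToyTape {
  std::vector<std::unique_ptr<ToyVari>> arena;  // owns all nodes
  std::vector<ToyVari*> chain_stack;            // swept in reverse order
  std::vector<ToyVari*> nochain_stack;          // never swept
  std::size_t chain_calls = 0;

  // Leaf built from a double; `stacked` picks the stack it lands on.
  ToyVari* leaf(double v, bool stacked) {
    arena.push_back(std::unique_ptr<ToyVari>(new ToyVari(v)));
    ToyVari* p = arena.back().get();
    (stacked ? chain_stack : nochain_stack).push_back(p);
    return p;
  }

  // Operation nodes always go on the chain stack.
  ToyVari* multiply(ToyVari* a, ToyVari* b) {
    arena.push_back(std::unique_ptr<ToyVari>(new ToyMultiply(a, b)));
    ToyVari* p = arena.back().get();
    chain_stack.push_back(p);
    return p;
  }

  // Reverse sweep: seed the root and call chain() on the chain stack only.
  void grad(ToyVari* root) {
    root->adj = 1.0;
    for (auto it = chain_stack.rbegin(); it != chain_stack.rend(); ++it) {
      (*it)->chain();
      ++chain_calls;
    }
  }
};

struct SweepResult {
  std::size_t chain_calls;
  double d_a;
  double d_b;
};

// Differentiates y = a * b with a = 2 and b = 3.
SweepResult run_product(bool leaves_on_chain_stack) {
  ToyTape tape;
  ToyVari* a = tape.leaf(2.0, leaves_on_chain_stack);
  ToyVari* b = tape.leaf(3.0, leaves_on_chain_stack);
  ToyVari* y = tape.multiply(a, b);
  tape.grad(y);
  return {tape.chain_calls, a->adj, b->adj};
}
```

In this toy, `run_product(true)` makes three `chain()` calls and `run_product(false)` makes one, while both return the same adjoints for `a` and `b`.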

@bob-carpenter
Contributor

is this totally off what I am trying here?

Nope. I think it's exactly what we should do. I was about to open exactly the same feature request after finding another issue in the generated model code today.

@bob-carpenter
Contributor

I created a somewhat related issue #1676 --- we'll need that even after this one as we only want one copy in the whole thing.

And can we get rid of the no-chain stack? Does it get used for anything?

@bbbales2
Member

bbbales2 commented Feb 5, 2020

And can we get rid of the no-chain stack? Does it get used for anything?

Isn't that where these things get allocated?

This was the weird allocation stuff as I recall: https://github.com/stan-dev/math/blob/develop/stan/math/rev/fun/LDLT_alloc.hpp

@wds15
Contributor Author

wds15 commented Feb 5, 2020

Yes, if we use the stacked = false vari constructor, then we land on the nochain stack. So we cannot remove this.

Some tests seem to make the assumption that the ad stack size is equal to the chain stack. I am going to change that to consider the ad stack size as the sum of the chain and nochain stack.

So far this looks good... and this may speed up our programs... let’s see.
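The test adjustment can be pictured with a small sketch. The struct and function names below (`ToyStacks`, `chain_stack_size`, `ad_tape_size`, `make_leaves`) are hypothetical stand-ins and not the accessors used in the Stan Math tests; the point is only that the sum of both stack sizes stays the same when leaves move from one stack to the other, whereas the chain stack size alone does not.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for the two autodiff stacks; the element type is irrelevant here.
struct ToyStacks {
  std::vector<double> chain_stack;
  std::vector<double> nochain_stack;
};

// What some tests effectively assumed to be the tape size.
std::size_t chain_stack_size(const ToyStacks& s) {
  return s.chain_stack.size();
}

// Tape size counted as both stacks taken together.
std::size_t ad_tape_size(const ToyStacks& s) {
  return s.chain_stack.size() + s.nochain_stack.size();
}

// Builds the stacks produced by constructing `n` vars from doubles.
ToyStacks make_leaves(std::size_t n, bool leaves_on_nochain_stack) {
  ToyStacks s;
  for (std::size_t i = 0; i < n; ++i) {
    (leaves_on_nochain_stack ? s.nochain_stack : s.chain_stack)
        .push_back(static_cast<double>(i));
  }
  return s;
}
```

A test that counts with `ad_tape_size` gives the same answer for both placements, which is why the combined size is the more robust thing to assert on.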

@stan-buildbot
Contributor


| Name | Old Result | New Result | Ratio | Performance change (1 - new / old) |
|---|---|---|---|---|
| gp_pois_regr/gp_pois_regr.stan | 4.9 | 4.93 | 0.99 | -0.68% slower |
| low_dim_corr_gauss/low_dim_corr_gauss.stan | 0.02 | 0.02 | 0.96 | -4.6% slower |
| eight_schools/eight_schools.stan | 0.09 | 0.09 | 0.99 | -1.27% slower |
| gp_regr/gp_regr.stan | 0.22 | 0.22 | 1.02 | 1.64% faster |
| irt_2pl/irt_2pl.stan | 6.07 | 6.07 | 1.0 | -0.12% slower |
| performance.compilation | 88.14 | 87.17 | 1.01 | 1.1% faster |
| low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan | 7.32 | 7.31 | 1.0 | 0.04% faster |
| pkpd/one_comp_mm_elim_abs.stan | 20.95 | 20.55 | 1.02 | 1.91% faster |
| sir/sir.stan | 90.52 | 89.65 | 1.01 | 0.96% faster |
| gp_regr/gen_gp_data.stan | 0.04 | 0.05 | 0.99 | -1.07% slower |
| low_dim_gauss_mix/low_dim_gauss_mix.stan | 3.0 | 2.95 | 1.02 | 1.63% faster |
| pkpd/sim_one_comp_mm_elim_abs.stan | 0.3 | 0.33 | 0.93 | -7.59% slower |
| arK/arK.stan | 1.75 | 1.73 | 1.01 | 1.3% faster |
| arma/arma.stan | 0.8 | 0.66 | 1.22 | 18.15% faster |
| garch/garch.stan | 0.63 | 0.59 | 1.07 | 6.49% faster |
Mean result: 1.01550962845

Commit hash: b09d1a0


Machine information
ProductName: Mac OS X
ProductVersion: 10.11.6
BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix
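For reading the benchmark columns, the following sketch assumes that "Ratio" is old / new and uses the "1 - new / old" definition stated in the table header. The function names are hypothetical. The assumption is consistent with the pkpd/one_comp_mm_elim_abs.stan row (20.95 to 20.55), which gives a ratio of about 1.02 and a change of about 1.91%. The bot presumably computes from unrounded timings, so rows with very small displayed times will not reproduce exactly from the rounded values.

```cpp
#include <cmath>

// Assumed definition of the "Ratio" column: old time divided by new time.
double speed_ratio(double old_time, double new_time) {
  return old_time / new_time;
}

// "Performance change" as stated in the table header: 1 - new / old.
// Positive means faster, negative means slower.
double performance_change(double old_time, double new_time) {
  return 1.0 - new_time / old_time;
}

// Converts a fraction to a percentage rounded to two decimals.
double percent_2dp(double fraction) {
  return std::round(fraction * 10000.0) / 100.0;
}
```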

@bob-carpenter
Contributor

We can't get rid of it, but for a different reason.

All of the allocation is done in a single arena defined in memory/stack_alloc.hpp. All vari go there because we've overridden operator new for vari to use our arena.

The code for managing the stacks is in rev/core/autodiffstackstorage.hpp.

The autodiff stack keeps pointers into this raw memory so that we can traverse it in reverse order and call the chain() methods on each of the vari on the autodiff stack. But the actual object whose chain() method is being called resides in our memory arena.

The only time the no-chain stack is touched is to reset adjoint values to zero. We may need some of the adjoints for vari defined through primitives. The gradient() functional will set var based on double values passed in. Those adjoints we need to have reset to zero appropriately, because we'll read the final gradient out of them.

Other uses of var(primitive) would not have to put their vari on the no-chain stack. All that'll happen is that their adjoints will get incremented from the chain() method of the vari resulting from any expressions they participate in. So we could theoretically make a three-way distinction in the constructor: chain/no-chain no-chain-stack/no-no-chain-stack. That'd even be easy to retrofit. Only the dependent variables would need to be carefully constructed. But lots of our doc and what-not would need to change, because we couldn't just set var in programs and later fish out their adjoints unless we're careful to put them on the no-chain stack.
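A toy sketch of that distinction, under the assumption that adjoints are zeroed by walking both stacks before each reverse sweep. `ToyNode`, `Placement`, `ToyStacks`, and `adjoint_after_two_sweeps` are hypothetical names, not Stan Math code. It shows why a var whose adjoint is read out later has to sit on one of the stacks: a node on neither stack is never reset, so its adjoint accumulates across evaluations.

```cpp
#include <vector>

struct ToyNode {
  double val;
  double adj;
};

// Hypothetical three-way choice for where a node is registered.
enum class Placement { kChainStack, kNoChainStack, kUntracked };

struct ToyStacks {
  std::vector<ToyNode*> chain_stack;
  std::vector<ToyNode*> nochain_stack;

  void track(ToyNode* n, Placement p) {
    if (p == Placement::kChainStack) {
      chain_stack.push_back(n);
    } else if (p == Placement::kNoChainStack) {
      nochain_stack.push_back(n);
    }
    // kUntracked: the node is on neither stack.
  }

  // Resets adjoints of every node reachable through either stack.
  void set_zero_all_adjoints() {
    for (ToyNode* n : chain_stack) n->adj = 0.0;
    for (ToyNode* n : nochain_stack) n->adj = 0.0;
  }
};

// Simulates two gradient evaluations of y = 5 * x. Each evaluation zeroes
// the tracked adjoints and then adds dy/dx = 5 to x.adj, which is what the
// chain() of the product node would do.
double adjoint_after_two_sweeps(Placement p) {
  ToyNode x{1.0, 0.0};
  ToyStacks stacks;
  stacks.track(&x, p);
  for (int sweep = 0; sweep < 2; ++sweep) {
    stacks.set_zero_all_adjoints();
    x.adj += 5.0;
  }
  return x.adj;
}
```

With `Placement::kNoChainStack` the result is 5, the correct derivative; with `Placement::kUntracked` it is 10, because the first sweep's contribution was never cleared.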

@wds15
Contributor Author

wds15 commented Feb 5, 2020

@bob-carpenter I think I am starting to understand your "three way constructor"... but as you point out, that would be a lot of work.

Do you think we should move forward with what I am doing in this PR?

In case you agree, I suggest also reverting the recent changes to the gradient functional and the ODE stuff. The change I am proposing here will ensure that all vars created from primitives avoid chain calls, since the vari will land on the nochain stack. At least that's the idea. Do you agree?

(sorry for asking that many times, but this is operating at the "heart" of the AD rev core)

@bob-carpenter
Contributor

bob-carpenter commented Feb 10, 2020 via email

@wds15
Contributor Author

wds15 commented Feb 10, 2020

Yeah, I was a bit surprised to find such an optimization in our codebase which I looked over many times...which is why I repeatedly asked here.

Let me clean up the gradient and ODE stuff, then I can file this PR for review.

wds15 changed the title from "[WIP] avoid vari on chain-stack if var is constructed from an arithmetic type" to "avoid vari on chain-stack if var is constructed from an arithmetic type" on Feb 10, 2020
@wds15
Contributor Author

wds15 commented Feb 10, 2020

@bob-carpenter ok, once tests are passing you are welcome to review. Thanks.

This PR should speed up all those models where we cast doubles to var (in functions, for example); I am curious whether this change makes a noticeable difference for some of my models, as I have many functions and quite a few of these casts.

(The other optimization you seem to have in mind is not yet fully clear to me; maybe you can explain it to me next Thursday.)

@bob-carpenter
Contributor

bob-carpenter commented Feb 11, 2020 via email

@wds15
Contributor Author

wds15 commented Feb 11, 2020

Ah, I see. So you want to save on the cost of set_adjoints_zero when we calculate Jacobians whenever constants are involved. That makes sense... but I wonder how many there are. Maybe we can get this in here and then check how many vanilla constants we have, given that the no chain stack size is a measure of that.

EDIT: Actually, there are possibly even more savings here, as the adjoint is not even needed, but I doubt you can shave it off the vari implementation.

Also, this would be of use for Jacobian calculations which is not so relevant for the big task of Stan which is the gradient of the log-lik (and there is only ever one log-lik output).

@stan-buildbot
Contributor


| Name | Old Result | New Result | Ratio | Performance change (1 - new / old) |
|---|---|---|---|---|
| gp_pois_regr/gp_pois_regr.stan | 5.0 | 4.77 | 1.05 | 4.7% faster |
| low_dim_corr_gauss/low_dim_corr_gauss.stan | 0.02 | 0.02 | 0.98 | -2.45% slower |
| eight_schools/eight_schools.stan | 0.09 | 0.09 | 1.05 | 5.03% faster |
| gp_regr/gp_regr.stan | 0.22 | 0.22 | 1.01 | 1.18% faster |
| irt_2pl/irt_2pl.stan | 6.07 | 6.05 | 1.0 | 0.35% faster |
| performance.compilation | 88.5 | 87.11 | 1.02 | 1.57% faster |
| low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan | 7.49 | 7.29 | 1.03 | 2.74% faster |
| pkpd/one_comp_mm_elim_abs.stan | 21.63 | 20.25 | 1.07 | 6.36% faster |
| sir/sir.stan | 91.85 | 88.93 | 1.03 | 3.18% faster |
| gp_regr/gen_gp_data.stan | 0.04 | 0.05 | 0.97 | -3.06% slower |
| low_dim_gauss_mix/low_dim_gauss_mix.stan | 3.0 | 2.99 | 1.01 | 0.54% faster |
| pkpd/sim_one_comp_mm_elim_abs.stan | 0.33 | 0.32 | 1.04 | 3.39% faster |
| arK/arK.stan | 1.75 | 1.72 | 1.01 | 1.39% faster |
| arma/arma.stan | 0.8 | 0.79 | 1.02 | 1.68% faster |
| garch/garch.stan | 0.58 | 0.53 | 1.11 | 9.51% faster |
Mean result: 1.02573594295

Commit hash: c4d39e4


@wds15
Contributor Author

wds15 commented Feb 11, 2020

Would be amazing if this buys 2.5% performance! Not sure if we can trust this though.

@SteveBronder
Collaborator

You can run the test for multiple iters using the custom job here. That usually gives a better ballpark since some of these are v fast already

https://jenkins.mc-stan.org/job/CmdStan%20Performance%20Tests/job/Custom/

@bob-carpenter
Contributor

bob-carpenter commented Feb 12, 2020 via email

@wds15
Contributor Author

wds15 commented Feb 12, 2020

I see. So in the ideal case we do not even put those "constant-constants" onto any stack at all, right? That sounds interesting, indeed. So you think it is worthwhile to work on that? Looking at vari, we can probably come up with something where literal constants avoid any interaction with the stacks; that's probably right - and this is indeed a potential speedup. I don't think that resizing is such an issue, since only the first log-lik evaluation has to grow in memory while subsequent ones should just recycle memory (that would be my expectation).
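The expectation about recycled memory can be illustrated by analogy with `std::vector`, whose `clear()` drops the elements but keeps the allocated capacity. This is only an analogy, and `ReuseReport` and `simulate_two_evaluations` are hypothetical names; Stan Math's arena and stacks have their own reset logic, which this sketch does not model.

```cpp
#include <cstddef>
#include <vector>

struct ReuseReport {
  std::size_t capacity_after_first;
  std::size_t capacity_after_second;
};

// Fills a vector twice with the same number of entries, clearing it in
// between, and reports the capacity after each fill.
ReuseReport simulate_two_evaluations(std::size_t n_nodes) {
  std::vector<double> tape;

  // First evaluation: the vector has to grow as nodes are pushed.
  for (std::size_t i = 0; i < n_nodes; ++i) {
    tape.push_back(static_cast<double>(i));
  }
  const std::size_t first = tape.capacity();

  // Reset between evaluations: size goes to zero, capacity is kept.
  tape.clear();

  // Second evaluation: the same number of nodes fits without regrowing.
  for (std::size_t i = 0; i < n_nodes; ++i) {
    tape.push_back(static_cast<double>(i));
  }
  const std::size_t second = tape.capacity();

  return {first, second};
}
```

In this analogy only the first fill pays for growth; whether the same holds for the real autodiff stacks would need a benchmark.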

The tests for Stan-math are all fine by now. The failing test is some upstream Windows cmdstan test which I just kicked to repeat itself.

@stan-buildbot
Contributor


| Name | Old Result | New Result | Ratio | Performance change (1 - new / old) |
|---|---|---|---|---|
| gp_pois_regr/gp_pois_regr.stan | 4.95 | 4.74 | 1.05 | 4.38% faster |
| low_dim_corr_gauss/low_dim_corr_gauss.stan | 0.02 | 0.02 | 0.96 | -4.22% slower |
| eight_schools/eight_schools.stan | 0.09 | 0.09 | 0.98 | -2.38% slower |
| gp_regr/gp_regr.stan | 0.22 | 0.21 | 1.03 | 2.58% faster |
| irt_2pl/irt_2pl.stan | 6.1 | 6.07 | 1.0 | 0.47% faster |
| performance.compilation | 88.19 | 86.81 | 1.02 | 1.56% faster |
| low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan | 7.66 | 7.3 | 1.05 | 4.67% faster |
| pkpd/one_comp_mm_elim_abs.stan | 21.65 | 20.13 | 1.08 | 7.0% faster |
| sir/sir.stan | 89.77 | 88.88 | 1.01 | 0.99% faster |
| gp_regr/gen_gp_data.stan | 0.05 | 0.05 | 1.01 | 0.96% faster |
| low_dim_gauss_mix/low_dim_gauss_mix.stan | 3.0 | 2.98 | 1.01 | 0.85% faster |
| pkpd/sim_one_comp_mm_elim_abs.stan | 0.34 | 0.33 | 1.03 | 3.07% faster |
| arK/arK.stan | 1.74 | 1.74 | 1.0 | -0.17% slower |
| arma/arma.stan | 0.8 | 0.78 | 1.02 | 1.81% faster |
| garch/garch.stan | 0.58 | 0.52 | 1.11 | 9.95% faster |
Mean result: 1.02270163168

Commit hash: c4d39e4


@rok-cesnovar
Member

I ran the Custom job with a few more iterations https://jenkins.mc-stan.org/blue/organizations/jenkins/CmdStan%20Performance%20Tests/detail/Custom/150/pipeline
and the speedup is more towards 1.5%. Still good tho.

@stan-buildbot
Contributor


| Name | Old Result | New Result | Ratio | Performance change (1 - new / old) |
|---|---|---|---|---|
| gp_pois_regr/gp_pois_regr.stan | 4.94 | 4.74 | 1.04 | 3.93% faster |
| low_dim_corr_gauss/low_dim_corr_gauss.stan | 0.02 | 0.02 | 0.99 | -1.18% slower |
| eight_schools/eight_schools.stan | 0.09 | 0.09 | 0.97 | -3.6% slower |
| gp_regr/gp_regr.stan | 0.23 | 0.21 | 1.07 | 6.14% faster |
| irt_2pl/irt_2pl.stan | 6.05 | 6.15 | 0.98 | -1.64% slower |
| performance.compilation | 89.53 | 86.79 | 1.03 | 3.06% faster |
| low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan | 7.51 | 7.29 | 1.03 | 2.92% faster |
| pkpd/one_comp_mm_elim_abs.stan | 20.94 | 22.14 | 0.95 | -5.73% slower |
| sir/sir.stan | 96.96 | 91.53 | 1.06 | 5.6% faster |
| gp_regr/gen_gp_data.stan | 0.05 | 0.05 | 0.99 | -1.23% slower |
| low_dim_gauss_mix/low_dim_gauss_mix.stan | 3.0 | 2.99 | 1.0 | 0.48% faster |
| pkpd/sim_one_comp_mm_elim_abs.stan | 0.31 | 0.33 | 0.94 | -6.05% slower |
| arK/arK.stan | 1.74 | 1.73 | 1.01 | 0.52% faster |
| arma/arma.stan | 0.8 | 0.78 | 1.02 | 2.27% faster |
| garch/garch.stan | 0.59 | 0.53 | 1.12 | 10.45% faster |
Mean result: 1.01275340104

Commit hash: c4d39e4


@wds15
Contributor Author

wds15 commented Feb 12, 2020

@rok-cesnovar thanks... this can still be noise, but it looks as if we bias in a positive direction; anyway, we take it.

@bob-carpenter looks like Jenkins is now happy.

mcol linked an issue on Feb 12, 2020 that may be closed by this pull request
@bob-carpenter
Contributor

bob-carpenter commented Feb 12, 2020 via email

@wds15
Contributor Author

wds15 commented Feb 18, 2020

@bob-carpenter anything else you need for reviewing this?

@bob-carpenter
Contributor

bob-carpenter left a comment

Looks great. Thanks.

@wds15
Contributor Author

wds15 commented Apr 22, 2020

@bob-carpenter you mentioned that constant constants do not need to be on any AD stack... which is correct. Do we have an easy way to benchmark if that would buy us anything?

@bob-carpenter
Contributor

Other than benchmarking the two implementations, no.

Development

Successfully merging this pull request may close these issues.

vars created from primitives must not be put on chain stack
8 participants