avoid vari on chain-stack if var is constructed from an arithmetic type #1675

wds15 · 2020-02-04T21:05:55Z

Summary

This PR augments the var constructor for primitive types. Whenever a var is constructed from a primitive we can put the respective vari on the nochain stack which avoids calls to the chain method when we propagate the chain rule.
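To make the effect concrete, here is a minimal toy model of the two stacks. It is not Stan Math's code: `toy_vari`, `toy_add`, `toy_tape`, `from_double` and `chain_calls_for_sum` are hypothetical names, and ownership through `std::unique_ptr` stands in for the real arena allocator. The sketch only illustrates that a leaf on the nochain stack is skipped by the reverse sweep while its adjoint is still filled in by the nodes that consume it.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Toy leaf node: a value, an adjoint, and a chain() that does nothing.
struct toy_vari {
  double val;
  double adj = 0.0;
  explicit toy_vari(double v) : val(v) {}
  virtual ~toy_vari() = default;
  virtual void chain() {}
};

// Toy sum node: propagates its adjoint to both operands.
struct toy_add : toy_vari {
  toy_vari* a;
  toy_vari* b;
  toy_add(toy_vari* a_, toy_vari* b_)
      : toy_vari(a_->val + b_->val), a(a_), b(b_) {}
  void chain() override {
    a->adj += adj;
    b->adj += adj;
  }
};

struct toy_tape {
  std::vector<std::unique_ptr<toy_vari>> arena;  // owns all nodes
  std::vector<toy_vari*> chain_stack;    // visited in the reverse sweep
  std::vector<toy_vari*> nochain_stack;  // never visited in the sweep
  std::size_t chain_calls = 0;

  // Models var(double): `stacked` selects which stack the leaf lands on.
  toy_vari* from_double(double v, bool stacked) {
    arena.push_back(std::make_unique<toy_vari>(v));
    toy_vari* p = arena.back().get();
    (stacked ? chain_stack : nochain_stack).push_back(p);
    return p;
  }

  toy_vari* add(toy_vari* a, toy_vari* b) {
    arena.push_back(std::make_unique<toy_add>(a, b));
    chain_stack.push_back(arena.back().get());
    return chain_stack.back();
  }

  // Reverse sweep over the chain stack only.
  void grad(toy_vari* root) {
    root->adj = 1.0;
    for (auto it = chain_stack.rbegin(); it != chain_stack.rend(); ++it) {
      (*it)->chain();
      ++chain_calls;
    }
  }
};

// Number of chain() calls needed for f = x + c, both built from doubles.
std::size_t chain_calls_for_sum(bool primitives_on_chain_stack) {
  toy_tape tape;
  toy_vari* x = tape.from_double(2.0, primitives_on_chain_stack);
  toy_vari* c = tape.from_double(3.0, primitives_on_chain_stack);
  toy_vari* f = tape.add(x, c);
  tape.grad(f);
  assert(x->adj == 1.0 && c->adj == 1.0);  // same gradient either way
  return tape.chain_calls;
}
```

In this sketch, `chain_calls_for_sum(true)` performs three `chain()` calls and `chain_calls_for_sum(false)` performs one, with identical adjoints in both cases.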

Tests

Tests that check the size of the AD tape have been adjusted to treat the AD tape size as the combined size of the chain and nochain stacks (rather than wrongly defining it as the chain stack size alone).
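A sketch of the adjusted size measure, under the assumption that the tape is just two vectors. `toy_stacks`, `tape_size` and `tape_growth` are hypothetical names for illustration and are not the helpers used in the Stan Math test suite.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the autodiff stack storage; ints play the
// role of node pointers.
struct toy_stacks {
  std::vector<int> chain_stack;    // nodes whose chain() gets called
  std::vector<int> nochain_stack;  // nodes that are only ever zeroed
};

// Tape size as the adjusted tests define it: both stacks together.
std::size_t tape_size(const toy_stacks& s) {
  return s.chain_stack.size() + s.nochain_stack.size();
}

// How much the tape grows while `record` runs (record may only append).
template <typename F>
std::size_t tape_growth(toy_stacks& s, F&& record) {
  const std::size_t before = tape_size(s);
  record(s);
  assert(tape_size(s) >= before);
  return tape_size(s) - before;
}
```

Counting only `chain_stack.size()` would under-report growth as soon as a node lands on the nochain stack, which is what the var constructor change makes common.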

Side Effects

Should speed up programs, because the varis of vars constructed from base types no longer land on the chain stack.

Checklist

- Math issue: vars created from primitives must not be put on chain stack #1694
- Copyright holder: Sebastian Weber

  The copyright holder is typically you or your assignee, such as a university or company. By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:
  - Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
  - Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
- the basic tests are passing
  - unit tests pass (to run, use: ./runTests.py test/unit)
  - header checks pass (make test-headers)
  - dependencies checks pass (make test-math-dependencies)
  - docs build (make doxygen)
  - code passes the built-in C++ standards checks (make cpplint)
- the code is written in idiomatic C++ and changes are documented in the doxygen
- the new changes are tested

wds15 · 2020-02-04T21:22:03Z

@bob-carpenter is this totally off what I am trying here?

This would basically make all vars that are created from arithmetic base values (double, float, ...) use a vari implementation that is put on the nochain stack. This should make the recent patches for the gradient functional obsolete (and apply the optimization automatically everywhere).

bob-carpenter · 2020-02-04T23:30:41Z

> is this totally off what I am trying here?

Nope. I think it's exactly what we should do. I was about to open exactly the same feature request after finding another issue in the generated model code today.

bob-carpenter · 2020-02-05T01:27:37Z

I created a somewhat related issue #1676 --- we'll need that even after this one as we only want one copy in the whole thing.

And can we get rid of the no-chain stack? Does it get used for anything?

bbbales2 · 2020-02-05T01:47:31Z

> And can we get rid of the no-chain stack? Does it get used for anything?

Isn't that where these things get allocated?

This was the weird allocation stuff as I recall: https://github.com/stan-dev/math/blob/develop/stan/math/rev/fun/LDLT_alloc.hpp

wds15 · 2020-02-05T07:16:06Z

Yes, if we use the stacked = false vari constructor, then we land on the nochain stack. So we cannot remove this.

Some tests seem to make the assumption that the ad stack size is equal to the chain stack. I am going to change that to consider the ad stack size as the sum of the chain and nochain stack.

So far this looks good... and this may speed up our programs... let's see.

stan-buildbot · 2020-02-05T13:52:05Z

Name	Old Result	New Result	Ratio	Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan	4.9	4.93	0.99	-0.68% slower
low_dim_corr_gauss/low_dim_corr_gauss.stan	0.02	0.02	0.96	-4.6% slower
eight_schools/eight_schools.stan	0.09	0.09	0.99	-1.27% slower
gp_regr/gp_regr.stan	0.22	0.22	1.02	1.64% faster
irt_2pl/irt_2pl.stan	6.07	6.07	1.0	-0.12% slower
performance.compilation	88.14	87.17	1.01	1.1% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan	7.32	7.31	1.0	0.04% faster
pkpd/one_comp_mm_elim_abs.stan	20.95	20.55	1.02	1.91% faster
sir/sir.stan	90.52	89.65	1.01	0.96% faster
gp_regr/gen_gp_data.stan	0.04	0.05	0.99	-1.07% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan	3.0	2.95	1.02	1.63% faster
pkpd/sim_one_comp_mm_elim_abs.stan	0.3	0.33	0.93	-7.59% slower
arK/arK.stan	1.75	1.73	1.01	1.3% faster
arma/arma.stan	0.8	0.66	1.22	18.15% faster
garch/garch.stan	0.63	0.59	1.07	6.49% faster
Mean result: 1.01550962845

Jenkins Console Log
Blue Ocean
Commit hash: b09d1a0

Machine information

ProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

bob-carpenter · 2020-02-05T16:20:56Z

We can't get rid of it, but for a different reason.

All of the allocation is done in a single arena defined in memory/stack_alloc.hpp. All vari go there because we've overridden operator new for vari to use our arena.

The code for managing the stacks is in rev/core/autodiffstackstorage.hpp.

The autodiff stack keeps pointers into this raw memory so that we can traverse it in reverse order and call the chain() methods on each of the vari on the autodiff stack. But the actual object whose chain() method is being called resides in our memory arena.

The only time the no-chain stack is touched is to reset adjoint values to zero. We may need some of the adjoints for vari defined through primitives. The gradient() functional will set var based on double values passed in. Those adjoints we need to have reset to zero appropriately, because we'll read the final gradient out of them.

Other uses of var(primitive) would not have to put their vari on the no-chain stack. All that'll happen is that their adjoints will get incremented from the chain() method of the vari resulting from any expressions they participate in. So we could theoretically make a three-way distinction in the constructor: chain/no-chain no-chain-stack/no-no-chain-stack. That'd even be easy to retrofit. Only the dependent variables would need to be carefully constructed. But lots of our doc and what-not would need to change, because we couldn't just set var in programs and later fish out their adjoints unless we're careful to put them on the no-chain stack.

wds15 · 2020-02-05T22:44:47Z

@bob-carpenter I think I start to understand your "three way constructor"... but as you point out, that would be a lot of work.

Do you think we should move forward with what I am doing in this PR?

In case you agree, I suggest to also revert the recent changes to the gradient functional and the ODE stuff. This change which I am proposing here will ensure that all vars created from primitives will avoid chain calls, since the vari will land on the nochain stack. At least that's the idea. Do you agree?

(sorry for asking that many times, but this is operating at the "heart" of the AD rev core)

bob-carpenter · 2020-02-10T19:53:37Z

> On Feb 5, 2020, at 5:44 PM, wds15 wrote: @bob-carpenter I think I start to understand your "three way constructor"... but as you point out, that would be a lot of work. Do you think we should move forward with what I am doing in this PR?

Yes. We can do the more complicated thing later if you don't want to take it on now.

> In case you agree, I suggest to also revert the recent changes to the gradient functional and the ODE stuff. This change which I am proposing here will ensure that all vars created from primitives will avoid chain calls, since the vari will land on the nochain stack. At least that's the idea. Do you agree?

Yes.

> (sorry for asking that many times, but this is operating at the "heart" of the AD rev core)

No worries. I'd rather err on the side of us all being on the same page than the other way around. I'm still kicking myself for not seeing this 7 years ago!

wds15 · 2020-02-10T21:07:06Z

Yeah, I was a bit surprised to find such an optimization in our codebase which I looked over many times...which is why I repeatedly asked here.

Let me clean up the gradient and ODE stuff, then I can file this PR for review.

wds15 · 2020-02-10T21:45:43Z

@bob-carpenter ok, once tests are passing you are welcome to review. Thanks.

This PR should speed up all those models where we cast doubles to var (in functions, for example); I am curious whether this change makes a noticeable difference for some of my models, as I have many functions and quite a few of these casts.

(I think the other optimization you seem to have in mind is not yet fully clear to me; maybe you can explain it to me next Thursday).

bob-carpenter · 2020-02-11T14:25:45Z

> On Feb 10, 2020, at 4:45 PM, wds15 wrote: @bob-carpenter ok, once tests are passing you are welcome to review. Thanks.

Will do. ...

> (I think the other optimization you seem to have in mind is not yet fully clear to me; maybe you can explain it to me next Thursday).

Let me try one more time in writing. The reason we have the no-chain stack is to allow adjoints to be reset to 0 if we run multiple reverse passes (e.g., for Jacobian calculations using reverse mode). For true constants that aren't independent variables with respect to which we want derivatives, their adjoints are never used, so they don't need to go on the no-chain stack. That means fewer assignments and less memory allocation and copying in the no-chain stack container.
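The need to reset adjoints between reverse passes can be sketched with a toy tape. All names here (`node`, `mul_node`, `add_node`, `tape`, `toy_jacobian`) are hypothetical and the ownership model is an assumption; this is not Stan Math's implementation. The example computes the Jacobian of f(x1, x2) = (x1 * x2, x1 + x2) with one reverse pass per output and shows what goes wrong when the adjoints, including those of the leaves on the nochain stack, are not zeroed in between.

```cpp
#include <array>
#include <cassert>
#include <memory>
#include <vector>

struct node {
  double val;
  double adj = 0.0;
  explicit node(double v) : val(v) {}
  virtual ~node() = default;
  virtual void chain() {}  // leaves propagate nothing
};

struct mul_node : node {
  node* a;
  node* b;
  mul_node(node* a_, node* b_) : node(a_->val * b_->val), a(a_), b(b_) {}
  void chain() override {
    a->adj += adj * b->val;
    b->adj += adj * a->val;
  }
};

struct add_node : node {
  node* a;
  node* b;
  add_node(node* a_, node* b_) : node(a_->val + b_->val), a(a_), b(b_) {}
  void chain() override {
    a->adj += adj;
    b->adj += adj;
  }
};

struct tape {
  std::vector<std::unique_ptr<node>> arena;
  std::vector<node*> chain_stack;
  std::vector<node*> nochain_stack;

  // Independent variables go on the nochain stack so they can be zeroed.
  node* leaf(double v) {
    arena.push_back(std::make_unique<node>(v));
    nochain_stack.push_back(arena.back().get());
    return nochain_stack.back();
  }

  template <typename Op>
  node* op(node* a, node* b) {
    arena.push_back(std::make_unique<Op>(a, b));
    chain_stack.push_back(arena.back().get());
    return chain_stack.back();
  }

  // Zero every adjoint on both stacks.
  void zero_adjoints() {
    for (node* n : chain_stack) n->adj = 0.0;
    for (node* n : nochain_stack) n->adj = 0.0;
  }

  void reverse_pass(node* output) {
    assert(output != nullptr);
    output->adj = 1.0;
    for (auto it = chain_stack.rbegin(); it != chain_stack.rend(); ++it) {
      (*it)->chain();
    }
  }
};

// Row i holds the partial derivatives of output i with respect to (x1, x2).
std::array<std::array<double, 2>, 2> toy_jacobian(double x1v, double x2v,
                                                  bool reset_between_passes) {
  tape t;
  node* x1 = t.leaf(x1v);
  node* x2 = t.leaf(x2v);
  node* outputs[2] = {t.op<mul_node>(x1, x2), t.op<add_node>(x1, x2)};
  std::array<std::array<double, 2>, 2> J{};
  for (int i = 0; i < 2; ++i) {
    if (reset_between_passes) t.zero_adjoints();
    t.reverse_pass(outputs[i]);
    J[i] = {x1->adj, x2->adj};
  }
  return J;
}
```

With the reset, `toy_jacobian(2.0, 3.0, true)` yields the rows (3, 2) and (1, 1). Without it, the second row picks up stale contributions from the first pass. A true constant whose adjoint is never read would not need to be on either stack in this model, which is the further saving described above.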

wds15 · 2020-02-11T15:03:05Z

Ah, I see. So you want to save on the cost of set_adjoints_zero when we calculate Jacobians whenever constants are involved. That makes sense... but I wonder how many there are. Maybe we can get this in here and then check how many vanilla constants we have, given that the no-chain stack size is a measure of that.

EDIT: Actually, there are possibly even more savings here, as the adjoint is not even needed, though I doubt you can shave it off the vari implementation.

Also, this would be of use for Jacobian calculations which is not so relevant for the big task of Stan which is the gradient of the log-lik (and there is only ever one log-lik output).

stan-buildbot · 2020-02-11T21:16:39Z

Name	Old Result	New Result	Ratio	Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan	5.0	4.77	1.05	4.7% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan	0.02	0.02	0.98	-2.45% slower
eight_schools/eight_schools.stan	0.09	0.09	1.05	5.03% faster
gp_regr/gp_regr.stan	0.22	0.22	1.01	1.18% faster
irt_2pl/irt_2pl.stan	6.07	6.05	1.0	0.35% faster
performance.compilation	88.5	87.11	1.02	1.57% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan	7.49	7.29	1.03	2.74% faster
pkpd/one_comp_mm_elim_abs.stan	21.63	20.25	1.07	6.36% faster
sir/sir.stan	91.85	88.93	1.03	3.18% faster
gp_regr/gen_gp_data.stan	0.04	0.05	0.97	-3.06% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan	3.0	2.99	1.01	0.54% faster
pkpd/sim_one_comp_mm_elim_abs.stan	0.33	0.32	1.04	3.39% faster
arK/arK.stan	1.75	1.72	1.01	1.39% faster
arma/arma.stan	0.8	0.79	1.02	1.68% faster
garch/garch.stan	0.58	0.53	1.11	9.51% faster
Mean result: 1.02573594295

Jenkins Console Log
Blue Ocean
Commit hash: c4d39e4

wds15 · 2020-02-11T22:01:51Z

Would be amazing if this buys 2.5% performance! Not sure if we can trust this though.

SteveBronder · 2020-02-11T22:22:19Z

You can run the test for multiple iterations using the custom job here. That usually gives a better ballpark, since some of these are very fast already.

https://jenkins.mc-stan.org/job/CmdStan%20Performance%20Tests/job/Custom/

bob-carpenter · 2020-02-12T01:38:23Z

> On Feb 11, 2020, at 10:03 AM, wds15 wrote: Ah, I see. So you want to save on the cost of set_adjoints_zero when we calculate Jacobians whenever constants are involved.

As you mention, that's not going to happen often because we don't compute Jacobians for HMC or optimization. The bigger saving comes from not pushing the vari onto the stack at all. The pushback itself isn't free (actual copies plus cache pressure to swap in the stack), but the bigger cost is resizing. It'd be really cool if we could somehow get away with a base class without a virtual chain() function and hence no need for the extra vtable pointer.

> That makes sense... but I wonder how many there are. Maybe we can get this in here and then we check how many vanilla constants we have given that the no chain stack size is a measure of that.

I agree that it's fine to do it in stages. I don't want long-term plans to get in the way of short-term improvements as long as those short-term changes won't impede the long term goals.
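The vtable-pointer remark in this comment can be illustrated with two hypothetical node layouts. `plain_node`, `virtual_node` and `bytes_saved` are made-up names, not Stan Math types, and the C++ standard does not fix object layout, so the exact difference is implementation-defined; on common 64-bit ABIs it is typically one pointer per object.

```cpp
#include <cassert>
#include <cstddef>

// Value and adjoint only: no virtual functions, so no vtable pointer.
struct plain_node {
  double val;
  double adj;
};

// Same data plus a virtual chain(), as a chainable node would need.
struct virtual_node {
  double val;
  double adj;
  virtual ~virtual_node() = default;
  virtual void chain() {}
};

// Bytes that n constants would save if they could use the plain layout.
std::size_t bytes_saved(std::size_t n_constants) {
  assert(sizeof(virtual_node) >= sizeof(plain_node));
  return n_constants * (sizeof(virtual_node) - sizeof(plain_node));
}
```

Whether such a non-virtual base is practical for vari is exactly the open question raised above; the sketch only shows where the per-object overhead comes from.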

wds15 · 2020-02-12T08:19:01Z

I see. So in the ideal case we do not even put those "constant-constants" onto any stack at all, right? That sounds interesting, indeed. So you think it is worthwhile to work on that? Looking at vari, we can probably come up with something where literal constants avoid any interaction with the stacks; that's probably right - and this is indeed a potential speedup. I don't think that resizing is such an issue, since only the first log-lik evaluation has to grow in memory while subsequent ones should just recycle memory (that would be my expectation).

The tests for Stan-math are all fine by now. The failing test is some upstream Windows cmdstan test which I just kicked to repeat itself.

stan-buildbot · 2020-02-12T08:45:43Z

Name	Old Result	New Result	Ratio	Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan	4.95	4.74	1.05	4.38% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan	0.02	0.02	0.96	-4.22% slower
eight_schools/eight_schools.stan	0.09	0.09	0.98	-2.38% slower
gp_regr/gp_regr.stan	0.22	0.21	1.03	2.58% faster
irt_2pl/irt_2pl.stan	6.1	6.07	1.0	0.47% faster
performance.compilation	88.19	86.81	1.02	1.56% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan	7.66	7.3	1.05	4.67% faster
pkpd/one_comp_mm_elim_abs.stan	21.65	20.13	1.08	7.0% faster
sir/sir.stan	89.77	88.88	1.01	0.99% faster
gp_regr/gen_gp_data.stan	0.05	0.05	1.01	0.96% faster
low_dim_gauss_mix/low_dim_gauss_mix.stan	3.0	2.98	1.01	0.85% faster
pkpd/sim_one_comp_mm_elim_abs.stan	0.34	0.33	1.03	3.07% faster
arK/arK.stan	1.74	1.74	1.0	-0.17% slower
arma/arma.stan	0.8	0.78	1.02	1.81% faster
garch/garch.stan	0.58	0.52	1.11	9.95% faster
Mean result: 1.02270163168

Jenkins Console Log
Blue Ocean
Commit hash: c4d39e4

rok-cesnovar · 2020-02-12T08:47:52Z

I ran the Custom job with a few more iterations https://jenkins.mc-stan.org/blue/organizations/jenkins/CmdStan%20Performance%20Tests/detail/Custom/150/pipeline
and the speedup is more towards 1.5%. Still good tho.

stan-buildbot · 2020-02-12T08:57:25Z

Name	Old Result	New Result	Ratio	Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan	4.94	4.74	1.04	3.93% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan	0.02	0.02	0.99	-1.18% slower
eight_schools/eight_schools.stan	0.09	0.09	0.97	-3.6% slower
gp_regr/gp_regr.stan	0.23	0.21	1.07	6.14% faster
irt_2pl/irt_2pl.stan	6.05	6.15	0.98	-1.64% slower
performance.compilation	89.53	86.79	1.03	3.06% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan	7.51	7.29	1.03	2.92% faster
pkpd/one_comp_mm_elim_abs.stan	20.94	22.14	0.95	-5.73% slower
sir/sir.stan	96.96	91.53	1.06	5.6% faster
gp_regr/gen_gp_data.stan	0.05	0.05	0.99	-1.23% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan	3.0	2.99	1.0	0.48% faster
pkpd/sim_one_comp_mm_elim_abs.stan	0.31	0.33	0.94	-6.05% slower
arK/arK.stan	1.74	1.73	1.01	0.52% faster
arma/arma.stan	0.8	0.78	1.02	2.27% faster
garch/garch.stan	0.59	0.53	1.12	10.45% faster
Mean result: 1.01275340104

Jenkins Console Log
Blue Ocean
Commit hash: c4d39e4

wds15 · 2020-02-12T09:22:35Z

@rok-cesnovar thanks... this can still be noise, but it looks as if we bias in a positive direction; anyway, we take it.

@bob-carpenter looks like Jenkins is now happy.

bob-carpenter · 2020-02-12T20:18:43Z

> I don't think that resizing is such an issue, since only the first log-lik evaluation has to grow in memory while subsequent ones should just recycle memory (that would be my expectation).

Good point. recover_memory() just uses a .clear() on the stacks. recover_memory_nested() uses .resize() to return back to old state. The doc for resize() says:

> Vector capacity is never reduced when resizing to smaller size because that would invalidate all iterators

The doc for clear() says:

> Leaves the capacity() of the vector unchanged

So there's not even a chance it'll get accidentally recovered.
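The capacity behaviour quoted above can be checked with a plain `std::vector`. This is a standalone sketch: `capacity_report` and `stack_capacity_demo` are hypothetical names, and the vector of `int*` merely stands in for a stack of vari pointers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct capacity_report {
  std::size_t after_fill;
  std::size_t after_resize;
  std::size_t after_clear;
};

// Fill a pointer stack, shrink it with resize(), then empty it with clear(),
// recording capacity() at each step.
capacity_report stack_capacity_demo(std::size_t n) {
  std::vector<int*> stack(n, nullptr);
  capacity_report r;
  r.after_fill = stack.capacity();
  stack.resize(n / 2);  // as in a nested recovery back to an earlier size
  r.after_resize = stack.capacity();
  stack.clear();  // as in a full recovery
  r.after_clear = stack.capacity();
  assert(stack.empty());
  return r;
}
```

All three recorded capacities are equal, so after the first evaluation has grown the stacks, later evaluations of the same size can reuse the storage without reallocating.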

wds15 · 2020-02-18T23:19:24Z

@bob-carpenter anything else you need for reviewing this?

bob-carpenter

Looks great. Thanks.

wds15 · 2020-04-22T14:07:59Z

@bob-carpenter you mentioned that constant constants do not need to be on any AD stack... which is correct. Do we have an easy way to benchmark if that would buy us anything?

bob-carpenter · 2020-04-22T14:55:20Z

Other than benchmarking the two implementations, no.

wds15 added 2 commits February 4, 2020 22:03

avoid vari on chain-stack if var is constructed from an arithmetic type

ec6d8c5

cpplint

280ac0d

weberse2 and others added 3 commits February 5, 2020 10:12

make consistently the change that stack size is chain and nochain sta…

4aee44f

…ck together

Merge commit '5207970d8233d29f7437ea06b93793fb05e20ff7' into HEAD

e102e5f

[Jenkins] auto-formatting by clang-format version 6.0.0 (tags/google/…

b09d1a0

…stable/2017-11-14)

wds15 added 5 commits February 10, 2020 22:31

revert nochain optimzations of ODE code as this is now part of the va…

12024e0

…r constructor

fix

72967a3

revert nochain optimzation for gradient functional which is now in va…

d5f1817

…r constructor

Merge branch 'feature/issue-speedup-rev' of https://github.com/stan-d…

d73745c

…ev/math into feature/issue-speedup-rev

Merge remote-tracking branch 'origin/develop' into feature/issue-spee…

f541961

…dup-rev

wds15 changed the title ~~[WIP] avoid vari on chain-stack if var is constructed from an arithmetic type~~ avoid vari on chain-stack if var is constructed from an arithmetic type Feb 10, 2020

get rid of const qualifier

c4d39e4

mcol linked an issue Feb 12, 2020 that may be closed by this pull request

vars created from primitives must not be put on chain stack #1694

Closed

bob-carpenter approved these changes Feb 19, 2020

bob-carpenter merged commit 029351b into develop Feb 19, 2020

SteveBronder mentioned this pull request Apr 16, 2020

Stan Math 3.2 release #1826

Closed

bbbales2 mentioned this pull request Apr 20, 2020

Stanc3 release for Cmdstan 2.23 stan-dev/stanc3#498

Closed
