-
-
Notifications
You must be signed in to change notification settings - Fork 187
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
avoid vari on chain-stack if var is constructed from an arithmetic type #1675
Conversation
@bob-carpenter is this totally off what I am trying here? This would basically move all instances of vars which are created based on arithmetic base values (double, float, ...) to use vari implementation which are put on the nochain-stack. This should make the recent patches for the |
Nope. I think it's exactly what we should do. I was about to open exactly the same feature request after finding another issue in the generated model code today. |
I created a somehwat related issue #1676 --- we'll need that even after this one as we only want one copy in the whole thing. And can we get rid of the no-chain stack? Does it get used for anything? |
Isn't that where these things get allocated? This was the weird allocation stuff as I recall: https://github.com/stan-dev/math/blob/develop/stan/math/rev/fun/LDLT_alloc.hpp |
Yes, if we use the stacked = false vari constructor, then we land on the nochain stack. So we cannot remove this. Some tests seem to make the assumption that the ad stack size is equal to the chain stack. I am going to change that to consider the ad stack size as the sum of the chain and nochain stack. So far this looks good... and this may speedup out programs...let’s see. |
Jenkins Console Log Machine informationProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010CPU: G++: Clang: |
We can't get rid of it, but for a different reason. All of the allocation is done in a single arena defined in The code for managing the stacks is in The autodiff stack keeps pointers into this raw memory so that we can traverse it in reverse order and call the The only time the no-chain stack is touched is to reset adjoint values to zero. We may need some of the adjoints for Other uses of |
@bob-carpenter I think I start to understand your "three way constructor"... but as you point out, that would be a lot of work. Do you think we should move forward with what I am doing in this PR? In case you agree, I suggest to also revert the recent changes to the (sorry for asking that many times, but this is operating at the "heart" of the AD rev core) |
On Feb 5, 2020, at 5:44 PM, wds15 ***@***.***> wrote:
@bob-carpenter I think I start to understand your "three way constructor"... but as you point out, that would be a lot of work.
Do you think we should move forward with what I am doing in this PR?
Yes. We can do the more complicated thing later if you don't want to take it on now.
In case you agree, I suggest to also revert the recent changes to the gradient functional and the ODE stuff. This change which I am proposing here will ensure that all vars created from primitives will avoid chain calls, since the vari will land on the nochain stack. At least that's the idea. Do you agree?
Yes.
(sorry for asking that many times, but this is operating at the "heart" of the AD rev core)
No worries. I'd rather err on the side of us all being on
the same page than the other way around.
I'm still kicking myself for not seeing this 7 years ago!
|
Yeah, I was a bit surprised to find such an optimization in our codebase which I looked over many times...which is why I repeatedly asked here. Let me clean up the gradient and ode stuff, then i can file this pr for review. |
…ev/math into feature/issue-speedup-rev
@bob-carpenter ok, once tests are passing you are welcome to review. Thanks. This PR should speedup all those models where we cast (in functions, for example) doubles to (I think the other optimization you seem to have in mind is not yet fully clear to me; maybe you can explain me next Thursday). |
On Feb 10, 2020, at 4:45 PM, wds15 ***@***.***> wrote:
@bob-carpenter ok, once tests are passing you are welcome to review. Thanks.
Will do.
...
(I think the other optimization you seem to have in mind is not yet fully clear to me; maybe you can explain me next Thursday).
Let me try one more time in writing. The reason we have the no-chain stack is to allow adjoints to be reset to 0 if we run multiple reverse passes (e.g., for Jacobian calculations using reverse mode). For true constants that aren't independent variables with respect to which we want derivatives, their adjoints are never used, so they don't need to go on the no-chain stack. That means fewer assignments and less memory allocation and copying in the no-chain stack container.
|
Ah, I see. So you want to save on the cost of EDIT: Actually, there are possibly even more saves here as the adjoint is not even needed, but I wonder if you can shave it off the vari implementation, I doubt that. Also, this would be of use for Jacobian calculations which is not so relevant for the big task of Stan which is the gradient of the log-lik (and there is only ever one log-lik output). |
Jenkins Console Log Machine informationProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010CPU: G++: Clang: |
Would be amazing if this buys 2.5% performance! Not sure if we can trust this though. |
You can run the test for multiple iters using the custom job here. That usually gives a better ballpark since some of these are v fast already https://jenkins.mc-stan.org/job/CmdStan%20Performance%20Tests/job/Custom/ |
On Feb 11, 2020, at 10:03 AM, wds15 ***@***.***> wrote:
Ah, I see. So you want to save on the cost of set_adjoints_zero when we calculate Jacobians whenever constants are involved.
As you mention, that's not going to happen often because we don't compute Jacobians for HMC or optimization.
The bigger saving comes from not pushing the vari onto the stack at all. The pushback itself isn't free (actual copies plus cache pressure to swap in the stack), but the bigger cost is resizing.
It'd be really cool if we could somehow get away with a base class without a virtual chain() function and hence no need for the extra vtable pointer.
That makes sense... but I wonder how many there. Maybe we can get this in here and then we check how many vanilla constants we have given that the no chain stack size is a measure of that.
I agree that it's fine to do it in stages. I don't want long-term plans to get in the way of short-term improvements as long as those short-term changes won't impede the long term goals.
|
I see. So in the ideal case we do not even put those "constant-constants" onto any stack at all, right? That sounds interesting, indeed. So you think it is worthwhile to work on that? Looking at The tests for Stan-math are all fine by now. The failing test is some upstream Windows cmdstan test which I just kicked to repeat itself. |
Jenkins Console Log Machine informationProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010CPU: G++: Clang: |
I ran the Custom job with a few more iterations https://jenkins.mc-stan.org/blue/organizations/jenkins/CmdStan%20Performance%20Tests/detail/Custom/150/pipeline |
Jenkins Console Log Machine informationProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010CPU: G++: Clang: |
@rok-cesnovar thanks... this can still be noise, but it looks as if we bias in a positive direction; anyway, we take it. @bob-carpenter looks like Jenkins is now happy. |
I don't think that resizing is such an issue, since only the first log-lik evaluation has to grow in memory while subsequent ones should just recycle memory (that would be my expectation).
Good point.
recover_memory() just uses a .clear() on the stacks.
recover_memory_nested() uses .resize() to return back to old state.
The doc for resize() says:
Vector capacity is never reduced when resizing to smaller size because that would invalidate all iterators
The doc for clear() says:
Leaves the capacity() of the vector unchanged
So there's not even a chance it'll get accidentally recovered.
|
@bob-carpenter anything else you need for reviewing this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. Thanks.
@bob-carpenter you mentioned that constant constants do not need to be on any AD stack... which is correct. Do we have an easy way to benchmark if that would buy us anything? |
Other than benchmarking the two implementations, no. |
Summary
This PR augments the
var
constructor for primitive types. Whenever avar
is constructed from a primitive we can put the respectivevari
on the nochain stack which avoids calls to thechain
method when we propagate the chain rule.Tests
Tests which check for the size of the AD tape have been adjusted to consider as AD tape size the size of the chain and the nochain stack taken together (as opposed to wrongly define the AD tape size being equal to the chain stack size only).
Side Effects
Should speedup programs by avoiding vars constructed from base types resulting in varis to land on the chain stack.
Checklist
Math issue vars created from primitives must not be put on chain stack #1694
Copyright holder: Sebastian Weber
The copyright holder is typically you or your assignee, such as a university or company. By submitting this pull request, the copyright holder is agreeing to the license the submitted work under the following licenses:
- Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
- Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
the basic tests are passing
./runTests.py test/unit
)make test-headers
)make test-math-dependencies
)make doxygen
)make cpplint
)the code is written in idiomatic C++ and changes are documented in the doxygen
the new changes are tested