Revert "Merge pull request #1212 from stan-dev/feature/faster-ad-tls-v5" #1244
Conversation
This introduces BACK again a bug such that threading won't work anymore on Windows. In this regard, I am not sure if the slight performance regression is enough to trump that regression, which we found in a single model so far. Could we first have an overview of which models are affected? Moreover, there are two very hot candidates in terms of getting rid of the performance regression for the non-threaded case. If it's our standard procedure to revert things like this, then sure, go ahead.
Yeah, just talked about it at the Math meeting. SOP is to revert since we didn't know about this before, and then study it in more detail / try to fix it / analyze tradeoffs while develop is "clean." I hear what you're saying about the tradeoffs of the revert and breaking things again for Windows, but given that our performance testing coverage is woefully inadequate it seems like we should study this more to understand it better before we can actually decide to make that trade off.
Ok... I am getting a bit tired working on this, but I will look for some motivation on the street... Most likely it's the all-pointer design. Changing that will make things a bit inconsistent, but it will be manageable. With a bit of luck we can turn the instance() method into a raw pointer and that solves it (hopefully). EDIT: BTW, if this is our SOP, then we need by all means an easy way to generate an overview of which models are affected and by how much. EDIT2: And this means that the gold standard is non-threaded, right? We never spelled that out, but reverting introduces a major performance regression for any threaded application. So what if I submitted a model which uses threading... you would not be able to make the same argument again (other than saying non-threading is the gold reference).
I feel ya. @rok-cesnovar and @SteveBronder (and maybe others) had previously volunteered to help out with parallelism efforts, maybe you guys could work together on analyzing this and potentially fixing it or making the case that it's worth it?
I am working on pinpointing where exactly on the branch this happens. I will see if I can find the fix or seek some help here or on discourse. |
Reverting 49d3583 and ace8424 has no effect on the schools-4 model. As @wds15 pointed out on the Discourse thread, it could be due to the pointer access to the AD instance. And he is correct. I tried my best (or worst) with an improvised/handwave-y/just-make-this-compile way of ifdefing autodiffstackstorage.hpp (the result is here). I have never dealt with this autodiffstackstorage stuff and just wanted to see if this was it. With this we are back to
Is making a PR with these ifdefs an option, instead of reverting and then reapplying? Either way, I can make that happen tomorrow to avoid the regression for threading on Windows.
@rok-cesnovar Thanks a lot for pinpointing this. We should be able to wait for this, if you ask me, but this is probably a matter of SOP and easier git magic; so let's have @seantalts and/or @syclik have a say here. In any case, handling things with pointers in the threading case and without in the non-threading case is not too nice. Since this only affects a single file it is probably fine, given the constraints we are under. I hope that this is a common understanding here. The art will be to write it as cleanly as possible.
Yeah, with reverting and then applying we will also be able to double-check that the fix is fine for all the performance tests in the batch, so we are clear for good.
reverting (annoying to have to deal with this again... but looks like it is useful... yuck)
I did only try the revert with g++. That is my bad. Will double check with g++ and clang + Windows. But not before tomorrow. |
No worries. This stuff can drive you crazy and is almost not manageable without huge automation. It sounds as if we want to avoid the by-method access and we may even need to get rid of the pointer for the non-threaded case. Getting rid of the pointer in the non-threaded case (and using two different things for threading / non-threading) is not too nice. Windows should work with the pointer for threading. I have never tested speed on Windows though (I just don't like to spend too much time under Windows).
Shouldn’t we now expect to see speedups in the performance tests which are kicked off automatically? Can someone point me to these please?
Finally ran the tests with clang++
So it seems that Clang handles this better and doesn't need the ifdef solution. Running on Windows now.
Thanks for confirming that. The crux is that reverting those two commits is a solution which cannot easily be combined with the optimisation of having a non-pointer solution for the non-threaded case. And from the evidence we have so far, it seems that gcc runs faster with a non-pointer solution while clang does not care about that; but for clang the method access should be avoided. At least this is how I recall it. Let's see if we can get a good solution for most situations...
So the github commenting mechanism isn't working yet, but Jenkins did run the stat_comp_benchmarks for performance and correctness in this PR against develop in this build: http://d1m1s1b1.stat.columbia.edu:8080/blue/organizations/jenkins/CmdStan%20Performance%20Tests/detail/downstream_tests/62/pipeline/18 Unfortunately the model with the problem is not in the stat_comp_benchmarks suite, but those results are here:
I'd be prone to ignore the stuff that looks like plus or minus 4 percent or so. It looks like it sped up the arma model significantly, but that model takes under a second to run anyway. I'll run a test locally where I run it many times and see what it says about that model.
I ran the arma model locally,
and it looks like performance with the PR actually makes it slightly worse on my machine,
So I guess my conclusion here is not to worry about the arma model and focus on the eight schools models. I'll test those against this PR locally to make sure this PR addresses the issue we spotted.
It seems as if the compiler used matters for this PR... what is being used in the Jenkins tests, and would it be possible to run the performance tests on Jenkins with a specific compiler as an option, maybe?
I didn't see the compiler mattering at all in Rok's testing, did that pop up at some point? Jenkins is using clang++-6.0 on Mac and g++ 5 I believe on Linux; the relative performance tests for the PR can execute on either.
I am afraid that different compilers react differently to specific code patterns. Here is what I think right now, but I could be wrong:
:( |
Currently we have a performance regression under Linux with gcc 5 with the schools model. If we revert 49d3583 and ace8424 then we even get a speedup with clang under Linux and under macOS. Thus, this brings us almost all of what we want. I would therefore suggest that we test whether newer gcc versions manage to properly optimize things on the schools model with the revert mentioned. If newer gcc versions handle this (and do not show a performance regression), then we should revert those two commits and we end up with a version which only has a performance regression with older gcc versions. Does that sound like an option?
On Windows the regression is a bit less obvious (<3%):
So yeah, 8schools on the combination of Linux+gcc is the one affected.
I have gcc 7.3 on my Ubuntu system and it shows the same regression :/ |
I just finished a run on a branch where the two hashes quoted are reverted, under gcc 8. And then I get a fast run:
which is just as fast as the respective run with clang, which took 376s with this. So I think this would be sensible to do. In fact, the original TLSv4, which was also approved, did have these two hashes reverted. EDIT: gcc8 on macOS.
g++-8 on Linux suffers from the same regression. Just tried it. I guess we are left with either ifdefing the pointer for the non-threading case (which, as you said, is on the not-nice side) or reverting the 2 commits, knowing that we have a regression for g++ on Ubuntu for a few models.
@syclik already asked me to just push this revert through, I've just been waiting for some really long running tests to show that it does indeed fix the issue. I think if you have a code proposal you want to put forward the first step is to create a PR with it so we can reference it. Does it supersede this revert PR or build on top of it?
I just wanted to clarify; this wasn't a request to push this particular revert through. This is a general policy that the Math library has: if it breaks, we immediately fix it.
See: https://github.com/stan-dev/stan/wiki/Stan-Software-Lifecycle-and-Development-Process#source-code-management
"Taken together, the testing, code review, and merge process ensures that the development branch is always in a releasable state."
Sure, the question is now "what is releasable" - this revert PR breaks threading on Windows but restores performance on (at least) a few models, with no perf regression on the many others. I don't want to interpret the policy incorrectly, so maybe @syclik do you want to either hit the merge button or suggest a path for rolling forward? My thought for rolling forward would still be to submit a PR with the proposed changes and let people test it; it's just a question of whether that PR is built on top of the revert in the meantime or not. It shouldn't matter too much, though the best route doesn't seem immediately obvious, so doing the revert PR first is attractive to maintain a sort of monotonically improving math library.
Yup. That's exactly the right question to ask. For any PR, if there are side effects that are bad, we should have made the decision prior to the PR getting in. Here, we only approved the PR with the understanding that it does not negatively impact existing single-core performance. So anything deviating from that seems reasonable to revert first, then decide deliberately whether it's acceptable.
I'm not saying we shouldn't take a 15% performance hit. (I think we shouldn't, but this can and should be discussed.) If we do, we should be deliberate about it and understand what trade-off we're making.
Yes. And in general now that we're measuring performance automatically across a much wider range of tests (than we had in Daily Stan Performance) we're going to have to start making these difficult decisions. If you made a decision in your last response I didn't catch it - what do you want us to do?
Just follow policy. We revert. (This is not an exceptional case that any single person had to make a decision on.)
Okay. This revert PR is ready to merge whenever.
The statement "does not slow down end-to-end performance" always related to our performance benchmark set of models. On this set there is no slow-down. This was my understanding, and as such this is to me another instance of changing requirements, which are very hard (and frustrating) to work with. If others understood it differently, fine, then have another PR. Moreover, if we revert this PR it reintroduces a major bug under Windows with threading. It's unclear how that plays with our guidelines. We should probably have a discussion on Discourse about the place of threading. Either threading is deemed to be a side product (which makes Stan, to me, a useless toy for many of my applications) or Stan is a serious option to consider for real problems - but let's have that discussion on Discourse.
Oh man, my long-running tests finally finished (10 runs each of all the schools models) and on my macbook pro it looks like this revert PR didn't do anything vs develop...
Macbook with clang?
It seems like this is a Linux + gcc problem only. And what is really frightening to me in how we read our own SOPs is that single-thread performance trumps a bug with Windows threading. BTW, clang speeds up with the changes as in the v6 PR from @rok-cesnovar (thanks for making that one!).
Yes, macbook pro with clang.
No, the performance drop was first measured on a Mac with clang. Something else is going on - either this is the wrong PR, or there were multiple OS-dependent performance drops, or the Mac we run the performance tests on is weird in some way (this is probably true but not sure if it's the culprit here)...
If you think of it as maintaining 100% support and only offering improvements, then it kind of makes sense. But yeah, you might want to make that trade-off a different way sometime if you absolutely have to.
Oh, I forgot: clang does not like the indirect access via a method. That is reverted in the v6 branch from @rok-cesnovar which he filed. The SOPs are interpreted in ways I don't understand, and this is certainly telling me I should work a lot less on Stan! In this case:
I am just observing here. If this schools model can trigger a revert it should by all means be part of what we test regularly.
It was never released.
It's expensive to compile and run every model we have twice for the relative benchmarks. We don't do it for every PR commit, just on merge to develop.
That helps a little.
And we're still working these things out - maybe it would be worth running the full set on an EC2 node for every PR push. But they're also pretty noisy, so it's not like the automated test triggered the revert; it alerted us to look at it more carefully, and then it was verified by a couple of people independently for this model. The SOP you're referring to is just that
And in this case it's really annoying because we were making large changes to the performance suite at the same time that a few PRs of this transformative sort were merged in, haha. So we'd normally have much clearer records of which PR caused the performance regression.
Yeah... I should ideally just code up new functions where there is no reference at all and it is completely compartmentalised... but this stuff is needed very badly... I think. I have already developed a lot of patience with these things, but this one is by now almost funny if you look at how difficult it is to get this TLS stuff in.
Yeah, sorry about that. The work we've been doing on the performance regression suite is supposed to make this an easier, more transparent process, but it's also the first time on the project where we've systematically measured performance across a breadth of models. So we're for the first time having to make hard choices where we now realize something has a performance impact where a year ago we wouldn't have noticed until a user reported a bug (or possibly never). I'm definitely open to suggestions on how to make this better, too!
"I'm not saying we shouldn't take a 15% performance hit. (I think we shouldn't, but this can and should be discussed.) If we do, we should be deliberate about it and understand what trade-off we're making."
If you're not saying it, I will. We shouldn't take a 15% performance hit!
@seantalts, thanks for articulating that so clearly. That's exactly what I'm thinking.
Exactly. To put it slightly differently, 0% of Windows users are negatively affected since the bug was never released. (They can always grab the branch and work on that knowing it's in flux.) There is an unknown percentage of users, but definitely greater than 0%, that would be affected by keeping this in. It does not give me good confidence that the exact conditions for when this happens can be specified. This was exactly the same behavior I saw when evaluating the PR, but was presented with enough benchmarks to convince me otherwise. Now that we see it again, I think it's reasonable to assume it's a real thing.
@wds15, could I ask you to be involved in putting together performance testing? The things you want to touch are deep in the stack and it'd be good to get your input on what needs to be measured and perhaps we can also come to a common understanding on what needs to be covered. And this is maybe the 2nd change to the autodiff stack. So we're in new territory for changing something so fundamental. It makes sense that we didn't create all the benchmarking for these things prior to this point.
Thanks, @bob-carpenter.
First, the blanket statement that things get a 15% performance hit is way too bold, I think. We have only seen a 15% performance hit on the schools models when compiling using the TLSv5 changes under Linux with gcc. As it looks, this is a very special case. We don't have evidence for a broader impact. A 15% performance hit in general is unacceptable, of course. I am still puzzled as to why we let our Windows users down wrt threading (it would work if TLSv5 stayed in) - but OK, seeing things relative to what has ever been released, then I start to see the logic. @syclik I can certainly give my input on performance testing, but I am not calling myself an expert on this - I do have some intuition by now, obviously. However, when we get there I will strongly suggest that threading performance is taken into consideration. Right now threading suffers from a huge performance regression (relative to TLSv5 being in). Above and beyond all that, I thought that the issues are settled in the TLSv6 PR, which reverts two changes from TLSv5 and even leads to speedups when using clang (single-core).
There's a little miscommunication. No one is stating there's a 15% performance hit across the board. We've seen it for certain combinations of hardware + compiler, and this has been verified independently multiple times. (If this is limited to a buggy version of g++, great... that's something we can work with, but we haven't determined that.) What we do know is that it happens, and for some, this will be a performance degradation.
If you think we can fix Windows separately, we should do that.
Yes - that's more accurate. The Windows problems are solved by using a pointer as storage for the AD tape. Using a pointer appears to make it harder for the compiler to do good optimisations in the single-core case under some circumstances. I think/hope that the TLSv6 PR solves that issue. At least my benchmarks, which I have documented on the TLSv6 PR, seem to indicate that the code put up there does not show the 15% performance hit on Linux with gcc and the schools model (on our system)... I need to be very specific here...
This reverts commit 5e76d67, reversing
changes made to ec52b8b.
Seems like this caused a 14% performance regression in example-models/bugs_examples/vol2/schools/schools-{2-4}.stan.
Ref: https://discourse.mc-stan.org/t/possible-performance-regression/8835