Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revert "Merge pull request #1212 from stan-dev/feature/faster-ad-tls-v5" #1244

Merged
merged 1 commit into from
May 18, 2019

Conversation

seantalts
Copy link
Member

This reverts commit 5e76d67, reversing
changes made to ec52b8b.

Seems like this caused a 14% performance regression in example-models/bugs_examples/vol2/schools/schools-{2-4}.stan.
Ref: https://discourse.mc-stan.org/t/possible-performance-regression/8835

@wds15
Copy link
Contributor

wds15 commented May 16, 2019

This introduces BACK again a bug such that threading won't work anymore on windows.

In this regard, I am not sure of the slight performance regression is enough to trump that regression which we found in a single model so far.

Could we first have an overview which models are affected?

Moreover, there are two very hot candidates in terms of getting rid of the performance regression for the non-threaded case.

If it's our standard procedure to revert things like, then sure, go ahead.

@seantalts
Copy link
Member Author

seantalts commented May 16, 2019

Yeah, just talked about it at the Math meeting. SOP is to revert since we didn't know about this before, and then study it in more detail / try to fix it / analyze tradeoffs while develop is "clean."

I hear what you're saying about the tradeoffs of the revert and breaking things again for Windows, but given that our performance testing coverage is woefully inadequate it seems like we should study this more to understand it better before we can actually decide to make that trade off.

@wds15
Copy link
Contributor

wds15 commented May 16, 2019

Ok... I am getting a bit tired working on this, but I will look for some motivation on the street...

Most likely it's the all pointer design. Changing that will make things a bit inconsistent, but it will be manageable. With a bit of luck we can turn the instance() method into a raw pointer and that solves it (hopefully).

EDIT: BTW, if this is our SOP - then we need by all means an easy way to generate the overview over the models which are affected by how much.

EDIT2: And does this mean that the gold-standard is non-threaded, right? We never spelled that out, but reverting introduces a major performance regression for any threaded application. So what if I would submit a model which uses threading... you would not be able to make the same argument again (other than saying non-threading is the gold ref).

@seantalts
Copy link
Member Author

I feel ya. @rok-cesnovar and @SteveBronder (and maybe others) had previously volunteered to help out with parallelism efforts, maybe you guys could work together on analyzing this and potentially fixing it or making the case that it's worth it?

@rok-cesnovar
Copy link
Member

I am working on pinpointing where exactly on the branch this happens. I will see if I can find the fix or seek some help here or on discourse.

@rok-cesnovar
Copy link
Member

Reverting 49d3583 and ace8424 has no effect on the schools-4 model.

As @wds15 pointed out on the Discourse thread it could be due to the pointer access to the AD instance. And he is correct.

I tried my best (or worst) with an improvised/handwave-y/justmakethiscompile way of ifdefing the autodiffstackstorage.hpp ( the result is here). I never dealt with these autodiffstackstorage stuff and just wanted to see if this was it. With this we are back to

 Elapsed Time: 287.646 seconds (Warm-up)
               182.023 seconds (Sampling)
               469.669 seconds (Total)

@rok-cesnovar
Copy link
Member

Is making a PR with this ifdefs an option? Instead of reverting and then reapplying? Either way, I can make that happen tomorrow to avoid the regression for threading on Windows.

@wds15
Copy link
Contributor

wds15 commented May 16, 2019

@rok-cesnovar Thanks a lot for pinpointing this. We should be able to wait for this, if you ask me, but this is probably a matter of SOP and easier git magic; so let's have @seantalts and/or @syclik have a say here.

In any case... handling things with pointers in the threading case and without in the non-threading case is not too nice. Since this only affects a single file it is probably fine - given the constraints we are under. I hope that this would be common understanding here. The art will be to write it as cleanly as possible.

@rok-cesnovar
Copy link
Member

Yeah, with reverting and than applying we will also be able to double check that the fix is fine for all the performance tests in the batch so we are clear for good.

@wds15
Copy link
Contributor

wds15 commented May 16, 2019

reverting 49d3583 and ace8424 does have an effect on my system! See the forum.

(annoying to have to deal with this again... but looks like it is useful...yack)

@rok-cesnovar
Copy link
Member

I did only try the revert with g++. That is my bad. Will double check with g++ and clang + Windows. But not before tomorrow.

@wds15
Copy link
Contributor

wds15 commented May 16, 2019

No worries. This stuff can drive you crazy and is almost not manageable without huge automation. It sounds as if we want to avoid the by-method access and we may event need to get rid of the pointer for the non-threaded case. Getting rid of the pointer (and use two different things for threading / non-threading) in the non-threaded case is not too nice.

Windows should work with the pointer for threading. I have never tested speed on Windows though (I just don't like to spend too much time under Windows).

@wds15
Copy link
Contributor

wds15 commented May 17, 2019

Shouldn’t we now expect to see speedups in ther performance tests which are kicked off automatically? Can someone point me to these please?

@rok-cesnovar
Copy link
Member

Finally ran the tests with clang++

clang++ on develop
 Elapsed Time: 324.398 seconds (Warm-up)
               207.511 seconds (Sampling)
               531.909 seconds (Total)

clang++ with `49d3583` and `ace8424` reverted (made 2 runs)
 Elapsed Time: 296.09 seconds (Warm-up)
               187.73 seconds (Sampling)
               483.82 seconds (Total)
 Elapsed Time: 297.669 seconds (Warm-up)
               188.23 seconds (Sampling)
               485.9 seconds (Total)

So it seems that Clang handles this better and doesnt need the ifdef solution. Running on Windows now.

@wds15
Copy link
Contributor

wds15 commented May 17, 2019

Thanks for confirming that. The crux is that reverting those two commits is a solution which is cannot be combined easily with the optimisation to have for the non-threaded case a non-pointer solution.

And from the evidence we have so far it seems that gcc runs faster with a non-pointer solution while clang does not care about that, but for clang the method access should be avoided. At least this is how I recall this. Let's see if we can get a good solution under most situations...

@seantalts
Copy link
Member Author

So the github commenting mechanism isn't working yet, but Jenkins did run the stat_comp_benchmarks for performance and correctness in this PR against develop in this build: http://d1m1s1b1.stat.columbia.edu:8080/blue/organizations/jenkins/CmdStan%20Performance%20Tests/detail/downstream_tests/62/pipeline/18

Unfortunately the model with the problem is not in the stat_comp_benchmarks suite, but those results are here:

+ ./comparePerformance.py develop_performance.csv performance.csv
('stat_comp_benchmarks/benchmarks/gp_pois_regr/gp_pois_regr.stan', 1.0)
('stat_comp_benchmarks/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan', 0.94)
('stat_comp_benchmarks/benchmarks/irt_2pl/irt_2pl.stan', 1.0)
('stat_comp_benchmarks/benchmarks/pkpd/one_comp_mm_elim_abs.stan', 0.97)
('stat_comp_benchmarks/benchmarks/eight_schools/eight_schools.stan', 1.01)
('stat_comp_benchmarks/benchmarks/gp_regr/gp_regr.stan', 0.99)
('stat_comp_benchmarks/benchmarks/arK/arK.stan', 1.04)
('performance.compilation', 1.0)
('stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan', 1.0)
('stat_comp_benchmarks/benchmarks/low_dim_gauss_mix/low_dim_gauss_mix.stan', 1.01)
('stat_comp_benchmarks/benchmarks/sir/sir.stan', 1.06)
('stat_comp_benchmarks/benchmarks/pkpd/sim_one_comp_mm_elim_abs.stan', 1.02)
('stat_comp_benchmarks/benchmarks/garch/garch.stan', 1.04)
('stat_comp_benchmarks/benchmarks/gp_regr/gen_gp_data.stan', 0.98)
('stat_comp_benchmarks/benchmarks/arma/arma.stan', 1.21)
1.01635012423

I'd be prone to ignore the stuff that looks like plus or minus 4 percent or so. It looks like it sped up the arma model significantly, but that model takes under a second to run anyway. I'll run a test locally where I run it many times and see what it says about that model.

@seantalts
Copy link
Member Author

I ran the arma model locally,

Seans-MBP ~/scm/performance-tests-cmdstan (master) $ ./compare-git-hashes.sh "stat_comp_benchmarks/benchmarks/arma --runs 30" develop develop develop PR-1244

and it looks like performance with the PR actually makes it slightly worse on my machine,

+ ./comparePerformance.py develop_performance.csv performance.csv
('stat_comp_benchmarks/benchmarks/arma/arma.stan', 0.92)
('performance.compilation', 1.03)
0.972083995854

So I guess my conclusion here is not to worry about the arma model and focus on the eight schools models. I'll test those against this PR locally to make sure this PR addresses the issue we spotted.

@wds15
Copy link
Contributor

wds15 commented May 17, 2019

It seems as if the compiler used matters for this PR... what is being used in the Jenkins tests and would it be possible to run the performance tests on Jenkins with some specific compiler as an option maybe?

@seantalts
Copy link
Member Author

I didn't see the compiler mattering at all in Rok's testing, did that pop up at some point? Jenkins is using clang++-6.0 on Mac and g++ 5 I believe on Linux; the relative performance tests for the PR can execute on either.

@wds15
Copy link
Contributor

wds15 commented May 17, 2019

I am afraid that different compilers react differently to specific code patterns. Here is what I think right now, but I could be wrong:

And from the evidence we have so far it seems that gcc runs faster with a non-pointer solution while clang does not care about that, but for clang the method access should be avoided. At least this is how I recall this. Let's see if we can get a good solution under most situations...

@seantalts
Copy link
Member Author

:(

@wds15
Copy link
Contributor

wds15 commented May 17, 2019

Currently we have a performance regression under Linux with gcc 5 with the schools model. If we revert 49d3583 and ace8424 then we even get a speedup with clang under Linux and under macOS. Thus, this brings us almost what we want. I would therefore suggest that we test if newer gcc versions manage to properly optimize things on the schools model with the revert mentioned. If newer gcc versions handle this (and do not show a performance regression), then we should revert those two commits and we end up with a version which only has a performance regression with older gcc versions.

Does that sound like an option?

@rok-cesnovar
Copy link
Member

rok-cesnovar commented May 17, 2019

On Windows the regression is a bit less obvious (<3%):

g++ Develop:
 Elapsed Time 320.069 seconds (Warm-up)
               165.872 seconds (Sampling)
               485.941 seconds (Total)
g++ Reverted `49d3583` and `ace8424`
 Elapsed Time: 322.308 seconds (Warm-up)
               162.52 seconds (Sampling)
               484.828 seconds (Total)
g++ reverted tlsv5
 Elapsed Time: 316.418 seconds (Warm-up)
               157.7 seconds (Sampling)
               474.118 seconds (Total)

clang++ develop
 Elapsed Time: 319.805 seconds (Warm-up)
               161.913 seconds (Sampling)
               481.718 seconds (Total)
clang++ reverted `49d3583` and `ace8424`
 Elapsed Time: 324.069 seconds (Warm-up)
               163.576 seconds (Sampling)
               487.645 seconds (Total)
clang++ reverted tlsv5
 Elapsed Time: 317.134 seconds (Warm-up)
               157.861 seconds (Sampling)
               474.995 seconds (Total)

So yeah, 8schools on the combination of Linux+gcc is the one affected.

@rok-cesnovar
Copy link
Member

Currently we have a performance regression under Linux with gcc 5 with the schools model.

I have gcc 7.3 on my Ubuntu system and it shows the same regression :/

@wds15
Copy link
Contributor

wds15 commented May 17, 2019

I just finished a run on a branch where then two hashes quoted are revert under gcc 8. And then I get a fast run:

 Elapsed Time: 228.18 seconds (Warm-up)
               143.031 seconds (Sampling)
               371.211 seconds (Total)

which is just as fast as the respective run with clang which took 376s with this.

So I think this would be sensible to do.

In fact, the original TLSv4 which was also approved did have these two hashes reverted.

EDIT: gcc8 on macOS.

@rok-cesnovar
Copy link
Member

g++-8 on Linux suffers from the same regression. Just tried it.

I guess we are left with either ifdefing the pointer for the non-threading case (which, as you said, is on the not-nice side) or reverting the 2 commits with knowing that we have regression for g++ on Ubuntu for a few models.

@seantalts
Copy link
Member Author

@syclik already asked me to just push this revert through, I've just been waiting for some really long running tests to show that it does indeed fix the issue.

I think if you have a code proposal you want to put forward the first step is to create a PR with it so we can reference it. Does it supersede this revert PR or build on top of it?

@syclik
Copy link
Member

syclik commented May 17, 2019 via email

@seantalts
Copy link
Member Author

Sure, the question is now "what is releasable" - this revert PR breaks threading on Windows but restores performance on (at least) a few models but no perf regression on many more.

I don't want to interpret the policy incorrectly, so maybe @syclik do you want to either hit the merge button or suggest a path for rolling forward? My thought for rolling forward would still be to submit a PR with the proposed changes and let people test it, it's just whether that PR is built on top of the revert in the meantime or not. Shouldn't matter too much, though the best route doesn't seem immediately obvious so the revert PR first is attractive to maintain a sort of monotonically improving math library.

@syclik
Copy link
Member

syclik commented May 17, 2019 via email

@seantalts
Copy link
Member Author

Yes. And in general now that we're measuring performance automatically across a much wider range of tests (than we had in Daily Stan Performance) we're going to have to start making these difficult decisions.

If you made a decision in your last response I didn't catch it - what do you want us to do?

@syclik
Copy link
Member

syclik commented May 18, 2019 via email

@seantalts
Copy link
Member Author

Okay. This revert PR is ready to merge whenever.

@stan-dev stan-dev deleted a comment from stan-buildbot May 18, 2019
@rok-cesnovar rok-cesnovar merged commit f49f224 into develop May 18, 2019
@wds15
Copy link
Contributor

wds15 commented May 18, 2019

The statement "does not slow down end-to-end" performance always related to our performance benchmark set of models. On this set there is no slow-down. This was my understanding and as such it is to me another instance of changing requirements which are very hard to (and frustrating) work with. If others understood it differently, fine, then have another PR.

Moreover, we revert this PR and it reintroduces a major bug under Windows with threading. It's unclear how that plays with our guidelines. We should probably have a discussion on discourse about the place of threading. Either threading is deemed to be a side product (which makes Stan to me a useless toy for many of my applications) or Stan is a serious option to consider in real problems - but let's have that discussion on discourse.

@seantalts
Copy link
Member Author

Oh man, my long-running tests finally finished (10 runs each of all the schools models) and on my macbook pro it looks like this revert PR didn't do anything vs develop...

@rok-cesnovar
Copy link
Member

Macbook with clang?

@wds15
Copy link
Contributor

wds15 commented May 18, 2019

It seems like this is a linux + gcc problem only. And what is really frighting to me in how we read our own SOPs is that single thread performance trumps bug with windows threading.

BTW, clang speeds up with the changes as in the v6 PR from @rok-cesnovar (thanks for making that one!).

@seantalts
Copy link
Member Author

seantalts commented May 18, 2019

Yes, macbook pro with clang.

It seems like this is a linux + gcc problem only.

No, the performance drop was first measured on a Mac with clang. Something else is going on - either this is the wrong PR, or there were multiple OS-dependent performance drops, or the Mac we run the performance tests on is weird in some way (this is probably true but not sure if it's the culprit here)...

And what is really frighting to me in how we read our own SOPs is that single thread performance trumps bug with windows threading.

If you think of it as maintaining 100% support and only offering improvements then it kind of makes sense. But yeah, you might want to make that trade off a different way sometime if you absolutely have to.
[edit] Also Daniel's paradigm that you revert something that you found had unintended consequences until you can better analyze them I think seems right. It's not that we don't want to make that trade-off, it's that we want to know we're making that trade-off.

@seantalts seantalts deleted the perf-regression branch May 18, 2019 13:00
@wds15
Copy link
Contributor

wds15 commented May 18, 2019

Oh, I forgot: The clang does not like the indirect access via a method. That is reverted in the v6 branch from @rok-cesnovar which he filed.

The SOPs are interpreted in ways I don't understand and this is certainly telling me I should work a lot less on Stan!

In this case:

  1. Why can we let Windows threading slide? I don't get it.
  2. If performance is put into stone with our reference set (for which everything seems ok) of the stat_comp_performance - then we can this schools model trigger a revert?

I am just observing here. If this schools model can trigger a revert it should by all means be part of what we test regularly.

@seantalts
Copy link
Member Author

Why can we let Windows threading slide? I don't get it.

It was never released.

If performance is put into stone with our reference set (for which everything seems ok) of the stat_comp_performance - then we can this schools model trigger a revert?

It's expensive to compile and run every model we have twice for the relative benchmarks. We don't do it for every PR commit, just on merge to develop.

@wds15
Copy link
Contributor

wds15 commented May 18, 2019

That helps a little.

@seantalts
Copy link
Member Author

And we're still working these things out - maybe it would be worth doing the full set on an EC2 node for every PR push. But they're also pretty noisy, so it's not like the automated test triggered the revert, but it did alert us to look at it more carefully and then it was verified by a couple of people independently for this model.

The SOP you're referring to is just that develop be releasable, which I think @syclik is reasonably interpreting to mean, among other things, "in a state where we understand the performance tradeoffs." So if we discover something new on merge to develop on a specific PR, we revert and test that PR more thoroughly to try to outline the performance tradeoffs we're making. Only a very small percentage of PRs will need this - you happen to be doing lots of transformative work to the very core of the autodiff system, which is both exciting and needs lots of analysis.

@seantalts
Copy link
Member Author

And in this case it's really annoying because we were making large changes to the performance suite at the same time that a few PRs of this transformative sort were merged in, haha. So we'd normally have much more clear records of which PR caused the performance regression.

@wds15
Copy link
Contributor

wds15 commented May 18, 2019

Yeah... I should ideally just code up new functions where there is no reference at all and it is completely compartmentalised... but this stuff is needed very badly... I think. I already have developed a very long breath with things, but this one is as of now almost funny if you look at it how difficult it is to get this TLS stuff in.

@seantalts
Copy link
Member Author

Yeah, sorry about that. The work we've been doing on the performance regression suite is supposed to make this an easier, more transparent process, but it's also the first time on the project where we've systemically measured performance across a breadth of models. So we're for the first time having to make hard choices where we now realize something has a performance impact where a year ago we wouldn't have noticed until a user reported a bug (or possibly never).

I'm definitely open to suggestions on how to make this better, too!

@bob-carpenter
Copy link
Contributor

bob-carpenter commented May 20, 2019 via email

@syclik
Copy link
Member

syclik commented May 21, 2019

The SOP you're referring to is just that develop be releasable, which I think @syclik is reasonably interpreting to mean, among other things, "in a state where we understand the performance tradeoffs." So if we discover something new on merge to develop on a specific PR, we revert and test that PR more thoroughly to try to outline the performance tradeoffs we're making. Only a very small percentage of PRs will need this - you happen to be doing lots of transformative work to the very core of the autodiff system, which is both exciting and needs lots of analysis.

@seantalts, thanks for articulating that so clearly. That's exactly what I'm thinking.

Why can we let Windows threading slide? I don't get it.

It was never released.

Exactly. To put it slightly differently, 0% of Windows users are negatively affected since the bug was never released. (They can always grab the branch and work on that knowing it's in flux.) There is an unknown percentage of users, but definitely greater than 0%, that would be affected by keeping this in.

It does not give me good confidence that the exact conditions for when this happens can be specified. This was exactly the same behavior I saw when evaluating the PR, but was presented with enough benchmarks to convince me otherwise. Now that we see it again, I think it's reasonable to assume it's a real thing.

Yeah, sorry about that. The work we've been doing on the performance regression suite is supposed to make this an easier, more transparent process, but it's also the first time on the project where we've systemically measured performance across a breadth of models. So we're for the first time having to make hard choices where we now realize something has a performance impact where a year ago we wouldn't have noticed until a user reported a bug (or possibly never).

@wds15, could I ask you to be involved in putting together performance testing? The things you want to touch are deep in the stack and it'd be good to get your input on what needs to be measured and perhaps we can also come to a common understanding on what needs to be covered.

And this is maybe the 2nd change to the autodiff stack. So we're in new territory for changing something so fundamental. It makes sense that we didn't create all the benchmarking for these things prior to this point.

If you're not saying it, I will. We shouldn't take a 15% performance hit!

Thanks, @bob-carpenter.

@wds15
Copy link
Contributor

wds15 commented May 21, 2019

First, the blanket statement that things get a 15% performance is way to bold, I think. We have seen only a 15% performance hit on the schools models when compiling using the TLSv5 changes under Linux with gcc. This is a very specialty case as it looks. We don't have evidence for a broader impact.

A 15% performance hit in general is unacceptable, of course.

I am still puzzled as to why we still let our Windows users down wrt to threading (it would work if TLSv5 stayed in) - but OK, seeing things relative to what has ever been release; then I start to see the logic.

@syclik I can certainly give my inputs on performance testing, but I am not calling myself an expert on this - I do have some intuition by now, obviously. However, when we get there I will strongly suggest that threading performance is being taking into consideration. Right now threading suffers from a huge performance regression (relative to the TLSv5 being in).

Above and beyond all that I thought that the issues are settled in the TLSv6 PR which reverts two changes from TLSv5 and this leads to even speedups when using clang (single-core).

@syclik
Copy link
Member

syclik commented May 21, 2019 via email

@wds15
Copy link
Contributor

wds15 commented May 21, 2019

Yes - that's more accurate.

The Windows problems are solved with using a pointer as storage for the AD tape. Using a pointer appears to make it harder in the single-core case under some circumstance to do good optimisations for the compiler. I thought/hope that the TLSv6 PR solves that issue. At least my benchmarks, which I have documented on the TLSv6 PR, seem to indicate that the code put up there is not showing the 15% performance hit on Linux with gcc and the schools model (on our system)... I need to be very specific here...

@serban-nicusor-toptal serban-nicusor-toptal added this to the 2.20.0++ milestone Jul 18, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants