Revert "Merge pull request #1212 from stan-dev/feature/faster-ad-tls-v5" #1244
Conversation
This introduces BACK again a bug such that threading won't work anymore on Windows. In this regard, I am not sure if the slight performance regression is enough to trump that regression, which we found in a single model so far. Could we first have an overview of which models are affected? Moreover, there are two very hot candidates in terms of getting rid of the performance regression for the non-threaded case. If it's our standard procedure to revert things like this, then sure, go ahead.
Yeah, just talked about it at the Math meeting. SOP is to revert since we didn't know about this before, and then study it in more detail / try to fix it / analyze tradeoffs while develop is "clean." I hear what you're saying about the tradeoffs of the revert and breaking things again for Windows, but given that our performance testing coverage is woefully inadequate it seems like we should study this more to understand it better before we can actually decide to make that trade off.
Ok... I am getting a bit tired working on this, but I will look for some motivation on the street... Most likely it's the all-pointer design. Changing that will make things a bit inconsistent, but it will be manageable. With a bit of luck we can turn the instance() method into a raw pointer and that solves it (hopefully). EDIT: BTW, if this is our SOP, then we need by all means an easy way to generate an overview of which models are affected and by how much. EDIT2: And this means that the gold standard is non-threaded, right? We never spelled that out, but reverting introduces a major performance regression for any threaded application. So what if I submitted a model which uses threading... you would not be able to make the same argument again (other than saying non-threading is the gold reference).
I feel ya. @rok-cesnovar and @SteveBronder (and maybe others) had previously volunteered to help out with parallelism efforts, maybe you guys could work together on analyzing this and potentially fixing it or making the case that it's worth it?
I am working on pinpointing where exactly on the branch this happens. I will see if I can find the fix or seek some help here or on discourse. |
Reverting 49d3583 and ace8424 has no effect on the schools-4 model. As @wds15 pointed out on the Discourse thread, it could be due to the pointer access to the AD instance. And he is correct. I tried my best (or worst) with an improvised/handwave-y/just-make-this-compile way of ifdefing autodiffstackstorage.hpp (the result is here). I have never dealt with this autodiffstackstorage stuff and just wanted to see if this was it. With this we are back to
Is making a PR with these ifdefs an option, instead of reverting and then reapplying? Either way, I can make that happen tomorrow to avoid the regression for threading on Windows.
@rok-cesnovar Thanks a lot for pinpointing this. We should be able to wait for this, if you ask me, but this is probably a matter of SOP and easier git magic; so let's have @seantalts and/or @syclik have a say here. In any case, handling things with pointers in the threading case and without in the non-threading case is not too nice. Since this only affects a single file it is probably fine, given the constraints we are under. I hope that this is a common understanding here. The art will be to write it as cleanly as possible.
Yeah, with reverting and then applying we will also be able to double-check that the fix is fine for all the performance tests in the batch, so we are clear for good.
reverting (annoying to have to deal with this again... but looks like it is useful... yuck)
I did only try the revert with g++. That is my bad. Will double check with g++ and clang + Windows. But not before tomorrow. |
No worries. This stuff can drive you crazy and is almost not manageable without huge automation. It sounds as if we want to avoid the by-method access and we may even need to get rid of the pointer for the non-threaded case. Getting rid of the pointer in the non-threaded case (and using two different things for threading / non-threading) is not too nice. Windows should work with the pointer for threading. I have never tested speed on Windows though (I just don't like to spend too much time under Windows).
Shouldn’t we now expect to see speedups in the performance tests which are kicked off automatically? Can someone point me to these please?
Finally ran the tests with clang++
So it seems that Clang handles this better and doesn't need the ifdef solution. Running on Windows now.
Thanks for confirming that. The crux is that reverting those two commits is a solution which cannot easily be combined with the optimisation of having a non-pointer solution for the non-threaded case. And from the evidence we have so far, it seems that gcc runs faster with a non-pointer solution while clang does not care about that; but for clang the method access should be avoided. At least this is how I recall it. Let's see if we can get a good solution for most situations...
So the github commenting mechanism isn't working yet, but Jenkins did run the stat_comp_benchmarks for performance and correctness in this PR against develop in this build: http://d1m1s1b1.stat.columbia.edu:8080/blue/organizations/jenkins/CmdStan%20Performance%20Tests/detail/downstream_tests/62/pipeline/18 Unfortunately the model with the problem is not in the stat_comp_benchmarks suite, but those results are here:
I'd be prone to ignore the stuff that looks like plus or minus 4 percent or so. It looks like it sped up the arma model significantly, but that model takes under a second to run anyway. I'll run a test locally where I run it many times and see what it says about that model.
I ran the arma model locally,
and it looks like performance with the PR actually makes it slightly worse on my machine,
So I guess my conclusion here is not to worry about the arma model and focus on the eight schools models. I'll test those against this PR locally to make sure this PR addresses the issue we spotted.
It seems as if the compiler used matters for this PR... what is being used in the Jenkins tests, and would it be possible to run the performance tests on Jenkins with a specific compiler as an option, maybe?
I didn't see the compiler mattering at all in Rok's testing, did that pop up at some point? Jenkins is using clang++-6.0 on Mac and g++ 5 I believe on Linux; the relative performance tests for the PR can execute on either.
I am afraid that different compilers react differently to specific code patterns. Here is what I think right now, but I could be wrong:
:( |
Currently we have a performance regression under Linux with gcc 5 with the schools model. If we revert 49d3583 and ace8424 then we even get a speedup with clang under Linux and under macOS. Thus, this brings us almost all of what we want. I would therefore suggest that we test whether newer gcc versions manage to properly optimize things on the schools model with the revert mentioned. If newer gcc versions handle this (and do not show a performance regression), then we should revert those two commits and we end up with a version which only has a performance regression with older gcc versions. Does that sound like an option?
On Windows the regression is a bit less obvious (<3%):
So yeah, 8schools on the combination of Linux+gcc is the one affected.
I have gcc 7.3 on my Ubuntu system and it shows the same regression :/ |
I just finished a run on a branch where the two hashes quoted are reverted, under gcc 8. And then I get a fast run:
which is just as fast as the respective run with clang, which took 376s with this. So I think this would be sensible to do. In fact, the original TLSv4, which was also approved, did have these two hashes reverted. EDIT: gcc8 on macOS.
g++-8 on Linux suffers from the same regression. Just tried it. I guess we are left with either ifdefing the pointer for the non-threading case (which, as you said, is on the not-nice side) or reverting the 2 commits, knowing that we have a regression for g++ on Ubuntu for a few models.
@syclik already asked me to just push this revert through, I've just been waiting for some really long running tests to show that it does indeed fix the issue. I think if you have a code proposal you want to put forward the first step is to create a PR with it so we can reference it. Does it supersede this revert PR or build on top of it?
I just wanted to clarify; this wasn't a request to push this particular revert through. This is a general policy that the Math library has: if it breaks, we immediately fix it.
See: https://github.com/stan-dev/stan/wiki/Stan-Software-Lifecycle-and-Development-Process#source-code-management
"Taken together, the testing, code review, and merge process ensures that the development branch is always in a releasable state."
Sure, the question is now "what is releasable" - this revert PR breaks threading on Windows but restores performance on (at least) a few models, with no perf regression on the many others. I don't want to interpret the policy incorrectly, so maybe @syclik do you want to either hit the merge button or suggest a path for rolling forward? My thought for rolling forward would still be to submit a PR with the proposed changes and let people test it; it's just a question of whether that PR is built on top of the revert in the meantime or not. It shouldn't matter too much, though the best route doesn't seem immediately obvious, so doing the revert PR first is attractive to maintain a sort of monotonically improving math library.
Yup. That's exactly the right question to ask. For any PR, if there are side effects that are bad, we should have made the decision prior to the PR getting in. Here, we only approved the PR with the understanding that it does not negatively impact existing single-core performance. So anything deviating from that seems reasonable to revert first, then decide deliberately whether it's acceptable.
I'm not saying we shouldn't take a 15% performance hit. (I think we shouldn't, but this can and should be discussed.) If we do, we should be deliberate about it and understand what trade-off we're making.
Yes. And in general now that we're measuring performance automatically across a much wider range of tests (than we had in Daily Stan Performance) we're going to have to start making these difficult decisions. If you made a decision in your last response I didn't catch it - what do you want us to do?
Just follow policy. We revert. (This is not an exceptional case that any single person had to make a decision on.)
Okay. This revert PR is ready to merge whenever.
The statement "does not slow down end-to-end performance" always related to our performance benchmark set of models. On this set there is no slow-down. This was my understanding, and as such this is to me another instance of changing requirements, which are very hard (and frustrating) to work with. If others understood it differently, fine, then have another PR. Moreover, if we revert this PR it reintroduces a major bug under Windows with threading. It's unclear how that plays with our guidelines. We should probably have a discussion on Discourse about the place of threading. Either threading is deemed to be a side product (which makes Stan, to me, a useless toy for many of my applications) or Stan is a serious option to consider for real problems - but let's have that discussion on Discourse.
Oh man, my long-running tests finally finished (10 runs each of all the schools models) and on my macbook pro it looks like this revert PR didn't do anything vs develop...
Macbook with clang?
It seems like this is a Linux + gcc problem only. And what is really frightening to me in how we read our own SOPs is that single-thread performance trumps a bug with Windows threading. BTW, clang speeds up with the changes as in the v6 PR from @rok-cesnovar (thanks for making that one!).
Yes, macbook pro with clang.
No, the performance drop was first measured on a Mac with clang. Something else is going on - either this is the wrong PR, or there were multiple OS-dependent performance drops, or the Mac we run the performance tests on is weird in some way (this is probably true but not sure if it's the culprit here)...
If you think of it as maintaining 100% support and only offering improvements, then it kind of makes sense. But yeah, you might want to make that trade-off a different way sometime if you absolutely have to.
Oh, I forgot: clang does not like the indirect access via a method. That is reverted in the v6 branch from @rok-cesnovar which he filed. The SOPs are interpreted in ways I don't understand, and this is certainly telling me I should work a lot less on Stan! In this case:
I am just observing here. If this schools model can trigger a revert it should by all means be part of what we test regularly.
It was never released.
It's expensive to compile and run every model we have twice for the relative benchmarks. We don't do it for every PR commit, just on merge to develop.
That helps a little.
And we're still working these things out - maybe it would be worth running the full set on an EC2 node for every PR push. But they're also pretty noisy, so it's not like the automated test triggered the revert; it alerted us to look at it more carefully, and then it was verified by a couple of people independently for this model. The SOP you're referring to is just that
And in this case it's really annoying because we were making large changes to the performance suite at the same time that a few PRs of this transformative sort were merged in, haha. So we'd normally have much clearer records of which PR caused the performance regression.
Yeah... I should ideally just code up new functions where there is no reference at all and it is completely compartmentalised... but this stuff is needed very badly... I think. I have already developed a lot of patience with these things, but this one is by now almost funny if you look at how difficult it is to get this TLS stuff in.
Yeah, sorry about that. The work we've been doing on the performance regression suite is supposed to make this an easier, more transparent process, but it's also the first time on the project where we've systematically measured performance across a breadth of models. So we're for the first time having to make hard choices where we now realize something has a performance impact where a year ago we wouldn't have noticed until a user reported a bug (or possibly never). I'm definitely open to suggestions on how to make this better, too!
"I'm not saying we shouldn't take a 15% performance hit. (I think we shouldn't, but this can and should be discussed.) If we do, we should be deliberate about it and understand what trade-off we're making."
If you're not saying it, I will. We shouldn't take a 15% performance hit!
@seantalts, thanks for articulating that so clearly. That's exactly what I'm thinking.
Exactly. To put it slightly differently, 0% of Windows users are negatively affected since the bug was never released. (They can always grab the branch and work on that knowing it's in flux.) There is an unknown percentage of users, but definitely greater than 0%, that would be affected by keeping this in. It does not give me good confidence that the exact conditions for when this happens can be specified. This was exactly the same behavior I saw when evaluating the PR, but was presented with enough benchmarks to convince me otherwise. Now that we see it again, I think it's reasonable to assume it's a real thing.
@wds15, could I ask you to be involved in putting together performance testing? The things you want to touch are deep in the stack and it'd be good to get your input on what needs to be measured and perhaps we can also come to a common understanding on what needs to be covered. And this is maybe the 2nd change to the autodiff stack. So we're in new territory for changing something so fundamental. It makes sense that we didn't create all the benchmarking for these things prior to this point.
Thanks, @bob-carpenter.
First, the blanket statement that things get a 15% performance hit is way too bold, I think. We have only seen a 15% performance hit on the schools models when compiling using the TLSv5 changes under Linux with gcc. As it looks, this is a very special case. We don't have evidence for a broader impact. A 15% performance hit in general is unacceptable, of course. I am still puzzled as to why we let our Windows users down wrt threading (it would work if TLSv5 stayed in) - but OK, seeing things relative to what has ever been released, then I start to see the logic. @syclik I can certainly give my input on performance testing, but I am not calling myself an expert on this - I do have some intuition by now, obviously. However, when we get there I will strongly suggest that threading performance is taken into consideration. Right now threading suffers from a huge performance regression (relative to TLSv5 being in). Above and beyond all that, I thought that the issues are settled in the TLSv6 PR, which reverts two changes from TLSv5 and even leads to speedups when using clang (single-core).
There's a little miscommunication. No one is stating there's a 15% performance hit across the board. We've seen it for certain combinations of hardware + compiler, and this has been verified independently multiple times. (If this is limited to a buggy version of g++, great... that's something we can work with, but we haven't determined that.) What we do know is that it happens, and for some, this will be a performance degradation.
If you think we can fix Windows separately, we should do that.
Yes - that's more accurate. The Windows problems are solved by using a pointer as storage for the AD tape. Using a pointer appears to make it harder for the compiler to do good optimisations in the single-core case under some circumstances. I think/hope that the TLSv6 PR solves that issue. At least my benchmarks, which I have documented on the TLSv6 PR, seem to indicate that the code put up there does not show the 15% performance hit on Linux with gcc and the schools model (on our system)... I need to be very specific here...
This reverts commit 5e76d67, reversing
changes made to ec52b8b.
Seems like this caused a 14% performance regression in example-models/bugs_examples/vol2/schools/schools-{2-4}.stan.
Ref: https://discourse.mc-stan.org/t/possible-performance-regression/8835