US Regression Test Optimization #388

ozmorph · 2020-06-15T17:09:50Z

Following up on my comment in #366 , I found low-hanging fruit inside the AssignPeopleToPlaces() function and MicroCellPosition code that results in a substantial (~42.8%) run-time reduction for the US regression test.

Merging this PR would allow for more frequent local testing.

Running time ./regressiontest_US_based.py on master on my workstation:

real    4m24.706s
user    4m52.256s
sys     0m7.381s

Running the same test on this PR:

real    2m31.731s
user    2m58.940s
sys     0m7.318s

ozmorph · 2020-06-16T00:20:45Z

Weird that it fails on Windows. I'll check locally with a MSVC compile tomorrow.

ozmorph · 2020-06-16T14:28:28Z

Local MSVC test was successful, perhaps a fluke with the container. Rebased with master to trigger another test run.

ozmorph · 2020-06-16T16:09:19Z

Local MSVC test was successful, perhaps a fluke with the container. Rebased with master to trigger another test run.

Found the issue, Windows regression test appears to be proceeding forward. I had only reserved space in the mcell_country vector and not initialized it. So in the proceeding lines when it performed an array access to initialize each element, it was reaching into uninitialized memory. Windows is much more particular about out-of-bounds memory access than OSX or Linux, so this explains why Linux/OSX were passing and Windows wasn't.

matt-gretton-dann

Mostly OK. What is the change in performance for the four tests (UK, US) x (j1, j2) (see updated tests on master)? Does this also improve on Linux? (I'm not too worried about as macOS).

Whilst it is significant in the integration test loop the timing for the model setup is overall a small part of the run. Running the models is what happens multiple times - so any improvements there will be more significant than shown in the integration tests.

matt-gretton-dann · 2020-06-18T15:34:51Z

src/MicroCellPosition.hpp

+			case Up: this->y -= 1; break;
+			case Left: this->x -= 1; break;
+			case Down: this->y += 1; break;
+			default: throw std::out_of_range("direction");


Exceptions and OpenMP don't play well together. Convert this into an ERR_CRITICAL please.

ozmorph · 2020-06-18T15:44:49Z

Since opening this PR, I have made several additional optimizations. I reran the same test between master and this branch to get updated numbers. 44% run-time reduction.

ozmorph@home:~/covid-sim/build$ make test ARGS="-j6 V"
Running tests...
Test project /home/ozmorph/covid-sim/build
    Start 1: inttest-us-based-j1
    Start 2: inttest-us-based-j2
1/2 Test #2: inttest-us-based-j2 ..............   Passed  146.94 sec
2/2 Test #1: inttest-us-based-j1 ..............   Passed  170.99 sec

100% tests passed, 0 tests failed out of 2

Total Test time (real) = 170.99 sec

vs.

ozmorph@home:~/covid-sim/build$ make test ARGS="-j6 V"
Running tests...
Test project /home/ozmorph/covid-sim/build
    Start 1: inttest-us-based-j1
    Start 2: inttest-us-based-j2
1/2 Test #2: inttest-us-based-j2 ..............   Passed   74.01 sec
2/2 Test #1: inttest-us-based-j1 ..............   Passed   95.28 sec

100% tests passed, 0 tests failed out of 2

Total Test time (real) =  95.28 sec

ozmorph · 2020-06-18T15:46:30Z

Mostly OK. What is the change in performance for the four tests (UK, US) x (j1, j2) (see updated tests on master)? Does this also improve on Linux? (I'm not too worried about as macOS).

Whilst it is significant in the integration test loop the timing for the model setup is overall a small part of the run. Running the models is what happens multiple times - so any improvements there will be more significant than shown in the integration tests.

I have started a run with all tests enabled and will respond shortly.

ozmorph · 2020-06-18T15:55:34Z

Mostly OK. What is the change in performance for the four tests (UK, US) x (j1, j2) (see updated tests on master)? Does this also improve on Linux? (I'm not too worried about as macOS).
Whilst it is significant in the integration test loop the timing for the model setup is overall a small part of the run. Running the models is what happens multiple times - so any improvements there will be more significant than shown in the integration tests.

I have started a run with all tests enabled and will respond shortly.

@matt-gretton-dann

I do not have sufficient memory on my current workstation (16 GB) to run all 4 tests at the same time. I am running the UK tests separately like I did with the US test.

In addition, I do recognize that the changes in this particular PR do not address code outside of model setup. So I will not be surprised to see only minor improvements to run-time for the UK tests. However as stated in #366, I have found locations within Sweep.cpp and Update.cpp that are ripe for better CPU caching behavior. But the changes to structures like Person and Household require significant refactors in the rest of the code that I thought would be better done in separate pull requests. I will address this in a new PR/Issue soon.

ozmorph · 2020-06-18T16:16:46Z

Looks like one of the last 3 commits is causing a failure on the UK tests. I'm diagnosing now.

ozmorph · 2020-06-18T18:00:37Z

Regression test fixed. Waiting on test run from master before posting run-time improvements for UK tests.

ozmorph · 2020-06-18T19:26:17Z

@matt-gretton-dann Running the UK tests, there is a ~14% run-time reduction between master and this PR.

ozmorph@home:~/covid-sim/build$ make test ARGS="-j6 V"
Running tests...
Test project /home/ozmorph/covid-sim/build
    Start 1: inttest-uk-based-j1
    Start 2: inttest-uk-based-j2
1/2 Test #2: inttest-uk-based-j2 ..............   Passed  674.82 sec
2/2 Test #1: inttest-uk-based-j1 ..............   Passed  2805.93 sec

100% tests passed, 0 tests failed out of 2

Total Test time (real) = 2805.94 sec

vs.

ozmorph@home:~/covid-sim/build$ make test ARGS="-j6 V"
Running tests...
Test project /home/ozmorph/covid-sim/build
    Start 1: inttest-uk-based-j1
    Start 2: inttest-uk-based-j2
1/2 Test #2: inttest-uk-based-j2 ..............   Passed  547.70 sec
2/2 Test #1: inttest-uk-based-j1 ..............   Passed  2413.15 sec

100% tests passed, 0 tests failed out of 2

Total Test time (real) = 2413.16 sec

ozmorph · 2020-06-18T21:49:48Z

I have a theory why Windows takes longer than MacOS and Linux. The first run of the single threaded UK integration test on Ubuntu 20.04 uses an average heap allocation of 3.5 GB, I will be doing a thorough heap profile on Windows tomorrow.

Seeing as how the GitHub Action runners only have 7 GB RAM total, this would indicate that the integration tests are probably hitting paged memory. I'm not certain yet as to why Linux and MacOS don't see a slow-down, but one theory is that Linux and MacOS containers have a lower RAM requirement for the base OS.

More to follow in perhaps a separate issue. I consider my development on this particular PR done and ready for final review.

matt-gretton-dann · 2020-06-19T08:26:34Z

I do not have sufficient memory on my current workstation (16 GB) to run all 4 tests at the same time. I am running the UK tests separately like I did with the US test.

Note that the -j option can be dropped to run tests serially: make test ARGS="-V" or even make test. This should enable you to run all tests in one go (but serially). (For reference the ARGS variable takes ctest command line options).

I have a theory why Windows takes longer than MacOS and Linux. The first run of the single threaded UK integration test on Ubuntu 20.04 uses an average heap allocation of 3.5 GB, I will be doing a thorough heap profile on Windows tomorrow.

Seeing as how the GitHub Action runners only have 7 GB RAM total, this would indicate that the integration tests are probably hitting paged memory. I'm not certain yet as to why Linux and MacOS don't see a slow-down, but one theory is that Linux and MacOS containers have a lower RAM requirement for the base OS.

Possibly - but the Windows tests were still significantly slower when we were running tests serially. Having just looked at the Actions Virtual Env setup I think Linux runs with a 4G swap space (actions/runner-images#965). I haven't worked this out for Windows/macOS - but I guess for Windows at least it'll be on the temp SSD which will give us somewhere up to 14GB of swap.

ozmorph · 2020-06-19T13:08:43Z

Thanks for the note on the -j option!

Possibly - but the Windows tests were still significantly slower when we were running tests serially. Having just looked at the Actions Virtual Env setup I think Linux runs with a 4G swap space (actions/virtual-environments#965). I haven't worked this out for Windows/macOS - but I guess for Windows at least it'll be on the temp SSD which will give us somewhere up to 14GB of swap.

I'm not sure that's necessarily true. The GitHub action runners page indicates that 14 GB of SSD disk space is provisioned. However, Microsoft's own page file documentation states that the maximum page file size is 1/8 the size of volume capacity. In this case, the page file would be 2GB which is half of Linux's swap space. There was a recent Azure Issue that pointed out that this could be problematic but had no public resolution.

But as you stated, this does not explain why the regression tests are slower on Windows when ran serially. I'll continue to diagnose.

matt-gretton-dann · 2020-06-20T13:32:30Z

I think I'm OK with this - but I'd like @dlaydon or @weshinsley to comment please.

weshinsley

Ok, I've had a careful look, and this looks good to me as best as I understand. I'll just "say out loud" what I understand the key bits are:-

Many uses of the functions get_number_of_micro_cells_high() and ...wide() replaced with a constant, since this never changes.
Relocating Hosts[x].quar_comply and .quar_start_time into HostsQuarantine[x].comply and .start_time - separating out the quarantine properties into a structure that's "parallel" to hosts in terms of the indexes.
Considerable amount of factoring out constants in the nested loops in AssignPeopleToPlaces().
Mcells[i].country relocated to mcell_country[i]
MicroCellPosition becomes a struct instead of a class

These all look good - I am not very clear where the biggest performance gains come from out of these, and I wonder whether amongst our internal documentation, there are learning points we should take away about what to encourage/what to avoid. Likely this has changed since the times of classic C, so we are learning a lot of "best practises" ...

ozmorph · 2020-06-24T02:39:32Z

@weshinsley Your analysis is correct. I'd be happy to create more in-depth/formal documentation of how I discovered these performance optimizations in the near future, but I'm a bit short for time over the next day or two.

What I can say in short is that I used perf on Linux to locate individual lines of code that were causing the biggest performance issues. Then I looked at memory dependencies associated with those lines and determined if small changes to the structs could achieve better CPU cache behavior. The idea behind moving towards parallel structs is not specific to C++, it is quite common in C code as well. https://en.wikipedia.org/wiki/AoS_and_SoA

My intention is to soon create a separate issue that will discuss this analysis technique in more detail and how to best apply it to Update.cpp and Sweep.cpp in order to further optimize the run-time performance of the code.

Thanks for the review!

matt-gretton-dann · 2020-06-24T16:47:22Z

@ozmorph : Do you mind merging in master/rebasing to resolve conflicts please - then @weshinsley or I will merge.

My working hypothesis is that the PRs that will improve performance most are those like these that changes structure layout/organisation to improve locality of sequential data accesses and have a regular pattern. This gives the CPU plenty of opportunity to predict correctly and not get blocked on the memory system

… a comparison between a host's country and a nearby place's country to utilize better CPU caching and removing a redundant computation for each loop in SetupModel.cpp

… Param::get_number_of_micro_cells_wide() and Param::get_number_of_micro_cells_high() with Param.total_microcells_wide_ and Param.total_microcells_high_; since the function was performing redudant calculations (the inputs never changed) it makes more sense to compute and store the values instead of spending a function call and loading two memory addresses to achieve the same effect

…ion code to use a standard compound operator implementation (https://en.cppreference.com/w/cpp/language/operators) and making it an inline call

…sition() to pass by const reference instead of by value

…opy-by-value initialized vector to ensure that the space not only allocated and initialized but the vector's size is not 0

… const values or const references for data in the main loop of AssignPeopleToPlaces(); because of the size of the loop, calls to other functions, and the lack of specificity that most of the arguments are const, it helps a lot to tell the compiler that many of the data accesses in the loop will not cause a side effect (even though we know they don't); this commit also makes the main loop look more readable

…he unsigned short country field in Microcell struct and replacing all references to use the new mcell_country vector; this likely increased SIMD/SIMT performance with Microcells and all increased performance with microcell country comparisons

…ction in regressiontest_US_based.py) by moving the 'quar_comply' and 'quar_start_time' fields out from 'struct Person' into a new 'struct PersonQuarantine'; this change provides much better CPU caching behavior

…ll to not duplicate twice

…h fixes regression tests

ozmorph · 2020-06-25T00:32:15Z

@ozmorph : Do you mind merging in master/rebasing to resolve conflicts please - then @weshinsley or I will merge.

@weshinsley @matt-gretton-dann: Rebased with master as requested.

My working hypothesis is that the PRs that will improve performance most are those like these that changes structure layout/organisation to improve locality of sequential data accesses and have a regular pattern. This gives the CPU plenty of opportunity to predict correctly and not get blocked on the memory system

I agree completely!

ozmorph requested review from dlaydon, gnedjati, igfoo, matt-gretton-dann, NeilFerguson and Z80coder as code owners June 15, 2020 17:09

ozmorph force-pushed the assign_people_to_places_optimization branch from 59c280a to 863f110 Compare June 16, 2020 14:27

ozmorph force-pushed the assign_people_to_places_optimization branch from 863f110 to 464eb1a Compare June 16, 2020 15:51

ozmorph force-pushed the assign_people_to_places_optimization branch from 464eb1a to bb3b5e1 Compare June 17, 2020 02:40

ozmorph mentioned this pull request Jun 17, 2020

Improve testing infrastructure #366

Merged

ozmorph force-pushed the assign_people_to_places_optimization branch from bb3b5e1 to 9713d7a Compare June 18, 2020 14:52

matt-gretton-dann requested changes Jun 18, 2020

View reviewed changes

ozmorph mentioned this pull request Jun 21, 2020

Reduce the number of for loops in CovidSim.cpp #400

Merged

ozmorph force-pushed the assign_people_to_places_optimization branch from 5e1982a to ec19433 Compare June 23, 2020 01:34

ozmorph requested a review from weshinsley as a code owner June 23, 2020 01:34

weshinsley approved these changes Jun 23, 2020

View reviewed changes

ozmorph added 13 commits June 24, 2020 20:04

25% reduction in run time for regressiontest_US_based.py by reworking…

5304a7b

… a comparison between a host's country and a nearby place's country to utilize better CPU caching and removing a redundant computation for each loop in SetupModel.cpp

deduplicated code

0970aa3

3% reduction in regressiontest_US_based.py by changing MicroCellPosit…

29f86e7

…ion code to use a standard compound operator implementation (https://en.cppreference.com/w/cpp/language/operators) and making it an inline call

changed Param::is_in_bounds() and Param::get_micro_cell_index_from_po…

5f8e3a5

…sition() to pass by const reference instead of by value

replaced mcell_country's vector.reserve() in SetupPopulation() to a c…

34cbece

…opy-by-value initialized vector to ensure that the space not only allocated and initialized but the vector's size is not 0

significant runtime improvement in RecordQuarNotInfected() (~10% redu…

8f79c57

…ction in regressiontest_US_based.py) by moving the 'quar_comply' and 'quar_start_time' fields out from 'struct Person' into a new 'struct PersonQuarantine'; this change provides much better CPU caching behavior

moved mcell_country initialization value to the vector constructor ca…

1e8faba

…ll to not duplicate twice

fixed MSVC build error

0658858

replaced exception throw with ERR_CRITICAL in MicroCellPosition.hpp

7124e8c

added re-initialization of HostsQuarantine back into InitModel() whic…

e8f7864

…h fixes regression tests

ozmorph force-pushed the assign_people_to_places_optimization branch from ec19433 to e8f7864 Compare June 25, 2020 00:09

weshinsley merged commit bcb429b into mrc-ide:master Jun 25, 2020

ozmorph deleted the assign_people_to_places_optimization branch June 25, 2020 12:52

ozmorph mentioned this pull request Jul 2, 2020

Regression Test Heap Optimization: Remove Extinct Result Arrays #433

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

US Regression Test Optimization #388

US Regression Test Optimization #388

ozmorph commented Jun 15, 2020

ozmorph commented Jun 16, 2020

ozmorph commented Jun 16, 2020

ozmorph commented Jun 16, 2020

matt-gretton-dann left a comment •

edited

Loading

matt-gretton-dann Jun 18, 2020

ozmorph Jun 18, 2020

ozmorph commented Jun 18, 2020

ozmorph commented Jun 18, 2020

ozmorph commented Jun 18, 2020

ozmorph commented Jun 18, 2020

ozmorph commented Jun 18, 2020

ozmorph commented Jun 18, 2020

ozmorph commented Jun 18, 2020

matt-gretton-dann commented Jun 19, 2020

ozmorph commented Jun 19, 2020

matt-gretton-dann commented Jun 20, 2020

weshinsley left a comment

ozmorph commented Jun 24, 2020

matt-gretton-dann commented Jun 24, 2020

ozmorph commented Jun 25, 2020

US Regression Test Optimization #388

US Regression Test Optimization #388

Conversation

ozmorph commented Jun 15, 2020

ozmorph commented Jun 16, 2020

ozmorph commented Jun 16, 2020

ozmorph commented Jun 16, 2020

matt-gretton-dann left a comment • edited Loading

Choose a reason for hiding this comment

matt-gretton-dann Jun 18, 2020

Choose a reason for hiding this comment

ozmorph Jun 18, 2020

Choose a reason for hiding this comment

ozmorph commented Jun 18, 2020

ozmorph commented Jun 18, 2020

ozmorph commented Jun 18, 2020

ozmorph commented Jun 18, 2020

ozmorph commented Jun 18, 2020

ozmorph commented Jun 18, 2020

ozmorph commented Jun 18, 2020

matt-gretton-dann commented Jun 19, 2020

ozmorph commented Jun 19, 2020

matt-gretton-dann commented Jun 20, 2020

weshinsley left a comment

Choose a reason for hiding this comment

ozmorph commented Jun 24, 2020

matt-gretton-dann commented Jun 24, 2020

ozmorph commented Jun 25, 2020

matt-gretton-dann left a comment •

edited

Loading