Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

US Regression Test Optimization #388

Merged

Conversation

ozmorph
Copy link
Contributor

@ozmorph ozmorph commented Jun 15, 2020

Following up on my comment in #366 , I found low-hanging fruit inside the AssignPeopleToPlaces() function and MicroCellPosition code that results in a substantial (~42.8%) run-time reduction for the US regression test.

Merging this PR would allow for more frequent local testing.

Running time ./regressiontest_US_based.py on master on my workstation:

real    4m24.706s
user    4m52.256s
sys     0m7.381s

Running the same test on this PR:

real    2m31.731s
user    2m58.940s
sys     0m7.318s

@ozmorph
Copy link
Contributor Author

ozmorph commented Jun 16, 2020

Weird that it fails on Windows. I'll check locally with a MSVC compile tomorrow.

@ozmorph ozmorph force-pushed the assign_people_to_places_optimization branch from 59c280a to 863f110 Compare June 16, 2020 14:27
@ozmorph
Copy link
Contributor Author

ozmorph commented Jun 16, 2020

Local MSVC test was successful, perhaps a fluke with the container. Rebased with master to trigger another test run.

@ozmorph ozmorph force-pushed the assign_people_to_places_optimization branch from 863f110 to 464eb1a Compare June 16, 2020 15:51
@ozmorph
Copy link
Contributor Author

ozmorph commented Jun 16, 2020

Local MSVC test was successful, perhaps a fluke with the container. Rebased with master to trigger another test run.

Found the issue, Windows regression test appears to be proceeding forward. I had only reserved space in the mcell_country vector and not initialized it. So in the proceeding lines when it performed an array access to initialize each element, it was reaching into uninitialized memory. Windows is much more particular about out-of-bounds memory access than OSX or Linux, so this explains why Linux/OSX were passing and Windows wasn't.

@ozmorph ozmorph force-pushed the assign_people_to_places_optimization branch from 464eb1a to bb3b5e1 Compare June 17, 2020 02:40
@ozmorph ozmorph force-pushed the assign_people_to_places_optimization branch from bb3b5e1 to 9713d7a Compare June 18, 2020 14:52
Copy link
Collaborator

@matt-gretton-dann matt-gretton-dann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly OK. What is the change in performance for the four tests (UK, US) x (j1, j2) (see updated tests on master)? Does this also improve on Linux? (I'm not too worried about as macOS).

Whilst it is significant in the integration test loop the timing for the model setup is overall a small part of the run. Running the models is what happens multiple times - so any improvements there will be more significant than shown in the integration tests.

case Up: this->y -= 1; break;
case Left: this->x -= 1; break;
case Down: this->y += 1; break;
default: throw std::out_of_range("direction");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exceptions and OpenMP don't play well together. Convert this into an ERR_CRITICAL please.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@ozmorph
Copy link
Contributor Author

ozmorph commented Jun 18, 2020

Since opening this PR, I have made several additional optimizations. I reran the same test between master and this branch to get updated numbers. 44% run-time reduction.

ozmorph@home:~/covid-sim/build$ make test ARGS="-j6 V"
Running tests...
Test project /home/ozmorph/covid-sim/build
    Start 1: inttest-us-based-j1
    Start 2: inttest-us-based-j2
1/2 Test #2: inttest-us-based-j2 ..............   Passed  146.94 sec
2/2 Test #1: inttest-us-based-j1 ..............   Passed  170.99 sec

100% tests passed, 0 tests failed out of 2

Total Test time (real) = 170.99 sec

vs.

ozmorph@home:~/covid-sim/build$ make test ARGS="-j6 V"
Running tests...
Test project /home/ozmorph/covid-sim/build
    Start 1: inttest-us-based-j1
    Start 2: inttest-us-based-j2
1/2 Test #2: inttest-us-based-j2 ..............   Passed   74.01 sec
2/2 Test #1: inttest-us-based-j1 ..............   Passed   95.28 sec

100% tests passed, 0 tests failed out of 2

Total Test time (real) =  95.28 sec

@ozmorph
Copy link
Contributor Author

ozmorph commented Jun 18, 2020

Mostly OK. What is the change in performance for the four tests (UK, US) x (j1, j2) (see updated tests on master)? Does this also improve on Linux? (I'm not too worried about as macOS).

Whilst it is significant in the integration test loop the timing for the model setup is overall a small part of the run. Running the models is what happens multiple times - so any improvements there will be more significant than shown in the integration tests.

I have started a run with all tests enabled and will respond shortly.

@ozmorph
Copy link
Contributor Author

ozmorph commented Jun 18, 2020

Mostly OK. What is the change in performance for the four tests (UK, US) x (j1, j2) (see updated tests on master)? Does this also improve on Linux? (I'm not too worried about as macOS).
Whilst it is significant in the integration test loop the timing for the model setup is overall a small part of the run. Running the models is what happens multiple times - so any improvements there will be more significant than shown in the integration tests.

I have started a run with all tests enabled and will respond shortly.

@matt-gretton-dann

I do not have sufficient memory on my current workstation (16 GB) to run all 4 tests at the same time. I am running the UK tests separately like I did with the US test.

In addition, I do recognize that the changes in this particular PR do not address code outside of model setup. So I will not be surprised to see only minor improvements to run-time for the UK tests. However as stated in #366, I have found locations within Sweep.cpp and Update.cpp that are ripe for better CPU caching behavior. But the changes to structures like Person and Household require significant refactors in the rest of the code that I thought would be better done in separate pull requests. I will address this in a new PR/Issue soon.

@ozmorph
Copy link
Contributor Author

ozmorph commented Jun 18, 2020

Looks like one of the last 3 commits is causing a failure on the UK tests. I'm diagnosing now.

@ozmorph
Copy link
Contributor Author

ozmorph commented Jun 18, 2020

Regression test fixed. Waiting on test run from master before posting run-time improvements for UK tests.

@ozmorph
Copy link
Contributor Author

ozmorph commented Jun 18, 2020

@matt-gretton-dann Running the UK tests, there is a ~14% run-time reduction between master and this PR.

ozmorph@home:~/covid-sim/build$ make test ARGS="-j6 V"
Running tests...
Test project /home/ozmorph/covid-sim/build
    Start 1: inttest-uk-based-j1
    Start 2: inttest-uk-based-j2
1/2 Test #2: inttest-uk-based-j2 ..............   Passed  674.82 sec
2/2 Test #1: inttest-uk-based-j1 ..............   Passed  2805.93 sec

100% tests passed, 0 tests failed out of 2

Total Test time (real) = 2805.94 sec

vs.

ozmorph@home:~/covid-sim/build$ make test ARGS="-j6 V"
Running tests...
Test project /home/ozmorph/covid-sim/build
    Start 1: inttest-uk-based-j1
    Start 2: inttest-uk-based-j2
1/2 Test #2: inttest-uk-based-j2 ..............   Passed  547.70 sec
2/2 Test #1: inttest-uk-based-j1 ..............   Passed  2413.15 sec

100% tests passed, 0 tests failed out of 2

Total Test time (real) = 2413.16 sec

@ozmorph
Copy link
Contributor Author

ozmorph commented Jun 18, 2020

I have a theory why Windows takes longer than MacOS and Linux. The first run of the single threaded UK integration test on Ubuntu 20.04 uses an average heap allocation of 3.5 GB, I will be doing a thorough heap profile on Windows tomorrow.

Seeing as how the GitHub Action runners only have 7 GB RAM total, this would indicate that the integration tests are probably hitting paged memory. I'm not certain yet as to why Linux and MacOS don't see a slow-down, but one theory is that Linux and MacOS containers have a lower RAM requirement for the base OS.

More to follow in perhaps a separate issue. I consider my development on this particular PR done and ready for final review.

@matt-gretton-dann
Copy link
Collaborator

I do not have sufficient memory on my current workstation (16 GB) to run all 4 tests at the same time. I am running the UK tests separately like I did with the US test.

Note that the -j option can be dropped to run tests serially: make test ARGS="-V" or even make test. This should enable you to run all tests in one go (but serially). (For reference the ARGS variable takes ctest command line options).

I have a theory why Windows takes longer than MacOS and Linux. The first run of the single threaded UK integration test on Ubuntu 20.04 uses an average heap allocation of 3.5 GB, I will be doing a thorough heap profile on Windows tomorrow.

Seeing as how the GitHub Action runners only have 7 GB RAM total, this would indicate that the integration tests are probably hitting paged memory. I'm not certain yet as to why Linux and MacOS don't see a slow-down, but one theory is that Linux and MacOS containers have a lower RAM requirement for the base OS.

Possibly - but the Windows tests were still significantly slower when we were running tests serially. Having just looked at the Actions Virtual Env setup I think Linux runs with a 4G swap space (actions/runner-images#965). I haven't worked this out for Windows/macOS - but I guess for Windows at least it'll be on the temp SSD which will give us somewhere up to 14GB of swap.

@ozmorph
Copy link
Contributor Author

ozmorph commented Jun 19, 2020

Thanks for the note on the -j option!

Possibly - but the Windows tests were still significantly slower when we were running tests serially. Having just looked at the Actions Virtual Env setup I think Linux runs with a 4G swap space (actions/virtual-environments#965). I haven't worked this out for Windows/macOS - but I guess for Windows at least it'll be on the temp SSD which will give us somewhere up to 14GB of swap.

I'm not sure that's necessarily true. The GitHub action runners page indicates that 14 GB of SSD disk space is provisioned. However, Microsoft's own page file documentation states that the maximum page file size is 1/8 the size of volume capacity. In this case, the page file would be 2GB which is half of Linux's swap space. There was a recent Azure Issue that pointed out that this could be problematic but had no public resolution.

But as you stated, this does not explain why the regression tests are slower on Windows when ran serially. I'll continue to diagnose.

@matt-gretton-dann
Copy link
Collaborator

I think I'm OK with this - but I'd like @dlaydon or @weshinsley to comment please.

Copy link
Collaborator

@weshinsley weshinsley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I've had a careful look, and this looks good to me as best as I understand. I'll just "say out loud" what I understand the key bits are:-

  1. Many uses of the functions get_number_of_micro_cells_high() and ...wide() replaced with a constant, since this never changes.

  2. Relocating Hosts[x].quar_comply and .quar_start_time into HostsQuarantine[x].comply and .start_time - separating out the quarantine properties into a structure that's "parallel" to hosts in terms of the indexes.

  3. Considerable amount of factoring out constants in the nested loops in AssignPeopleToPlaces().

  4. Mcells[i].country relocated to mcell_country[i]

  5. MicroCellPosition becomes a struct instead of a class

These all look good - I am not very clear where the biggest performance gains come from out of these, and I wonder whether amongst our internal documentation, there are learning points we should take away about what to encourage/what to avoid. Likely this has changed since the times of classic C, so we are learning a lot of "best practises" ...

@ozmorph
Copy link
Contributor Author

ozmorph commented Jun 24, 2020

@weshinsley Your analysis is correct. I'd be happy to create more in-depth/formal documentation of how I discovered these performance optimizations in the near future, but I'm a bit short for time over the next day or two.

What I can say in short is that I used perf on Linux to locate individual lines of code that were causing the biggest performance issues. Then I looked at memory dependencies associated with those lines and determined if small changes to the structs could achieve better CPU cache behavior. The idea behind moving towards parallel structs is not specific to C++, it is quite common in C code as well. https://en.wikipedia.org/wiki/AoS_and_SoA

My intention is to soon create a separate issue that will discuss this analysis technique in more detail and how to best apply it to Update.cpp and Sweep.cpp in order to further optimize the run-time performance of the code.

Thanks for the review!

@matt-gretton-dann
Copy link
Collaborator

@ozmorph : Do you mind merging in master/rebasing to resolve conflicts please - then @weshinsley or I will merge.

My working hypothesis is that the PRs that will improve performance most are those like these that changes structure layout/organisation to improve locality of sequential data accesses and have a regular pattern. This gives the CPU plenty of opportunity to predict correctly and not get blocked on the memory system

… a comparison between a host's country and a nearby place's country to utilize better CPU caching and removing a redundant computation for each loop in SetupModel.cpp
… Param::get_number_of_micro_cells_wide() and Param::get_number_of_micro_cells_high() with Param.total_microcells_wide_ and Param.total_microcells_high_; since the function was performing redudant calculations (the inputs never changed) it makes more sense to compute and store the values instead of spending a function call and loading two memory addresses to achieve the same effect
…ion code to use a standard compound operator implementation (https://en.cppreference.com/w/cpp/language/operators) and making it an inline call
…sition() to pass by const reference instead of by value
…opy-by-value initialized vector to ensure that the space not only allocated and initialized but the vector's size is not 0
… const values or const references for data in the main loop of AssignPeopleToPlaces(); because of the size of the loop, calls to other functions, and the lack of specificity that most of the arguments are const, it helps a lot to tell the compiler that many of the data accesses in the loop will not cause a side effect (even though we know they don't); this commit also makes the main loop look more readable
…he unsigned short country field in Microcell struct and replacing all references to use the new mcell_country vector; this likely increased SIMD/SIMT performance with Microcells and all increased performance with microcell country comparisons
…ction in regressiontest_US_based.py) by moving the 'quar_comply' and 'quar_start_time' fields out from 'struct Person' into a new 'struct PersonQuarantine'; this change provides much better CPU caching behavior
@ozmorph ozmorph force-pushed the assign_people_to_places_optimization branch from ec19433 to e8f7864 Compare June 25, 2020 00:09
@ozmorph
Copy link
Contributor Author

ozmorph commented Jun 25, 2020

@ozmorph : Do you mind merging in master/rebasing to resolve conflicts please - then @weshinsley or I will merge.

@weshinsley @matt-gretton-dann: Rebased with master as requested.

My working hypothesis is that the PRs that will improve performance most are those like these that changes structure layout/organisation to improve locality of sequential data accesses and have a regular pattern. This gives the CPU plenty of opportunity to predict correctly and not get blocked on the memory system

I agree completely!

@weshinsley weshinsley merged commit bcb429b into mrc-ide:master Jun 25, 2020
@ozmorph ozmorph deleted the assign_people_to_places_optimization branch June 25, 2020 12:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants