US Regression Test Optimization #388
Weird that it fails on Windows. I'll check locally with an MSVC build tomorrow.
The local MSVC test was successful, so perhaps it was a fluke with the container. Rebased onto master to trigger another test run.
Found the issue; the Windows regression test now appears to be proceeding. I had only reserved space in the
Mostly OK. What is the change in performance for the four tests, (UK, US) x (j1, j2) (see updated tests on master)? Does this also improve on Linux? (I'm not too worried about macOS.)
Whilst it is significant in the integration test loop, the timing for the model setup is overall a small part of the run. Running the models is what happens multiple times, so any improvements there will be more significant than those shown in the integration tests.
src/MicroCellPosition.hpp
case Up: this->y -= 1; break;
case Left: this->x -= 1; break;
case Down: this->y += 1; break;
default: throw std::out_of_range("direction");
Exceptions and OpenMP don't play well together. Convert this into an ERR_CRITICAL please.
Done.
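For context on the request above: the OpenMP specification requires that an exception thrown inside a parallel region be caught by the same thread within that region, so a `throw` that escapes a worker thread typically terminates the program without a useful message. A minimal sketch of the non-throwing shape follows. `fatal_error` is a hypothetical stand-in for the project's `ERR_CRITICAL` macro (its real definition is not shown in this thread), and the `Right` case is an assumption because the excerpt only shows `Up`, `Left` and `Down`.

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical stand-in for ERR_CRITICAL: report the problem and terminate
// the process instead of throwing across an OpenMP region boundary.
[[noreturn]] inline void fatal_error(const char* msg) {
    std::fprintf(stderr, "fatal: %s\n", msg);
    std::exit(EXIT_FAILURE);
}

enum Direction { Right, Up, Left, Down };

struct MicroCellPosition {
    int x;
    int y;

    // Compound assignment: move one micro-cell in direction d.
    MicroCellPosition& operator+=(Direction d) {
        switch (d) {
            case Right: this->x += 1; break;  // assumed; not in the excerpt
            case Up:    this->y -= 1; break;
            case Left:  this->x -= 1; break;
            case Down:  this->y += 1; break;
            default:    fatal_error("Unknown direction");
        }
        return *this;
    }
};

// Binary + implemented in terms of +=, the usual compound-operator pattern.
inline MicroCellPosition operator+(MicroCellPosition lhs, Direction d) {
    lhs += d;
    return lhs;
}
```

With this shape, `p += Up;` updates `p` in place and `p + Down` returns a moved copy, and neither path can unwind through a parallel loop.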
Since opening this PR, I have made several additional optimizations. I reran the same test between
ozmorph@home:~/covid-sim/build$ make test ARGS="-j6 V"
Running tests...
Test project /home/ozmorph/covid-sim/build
Start 1: inttest-us-based-j1
Start 2: inttest-us-based-j2
1/2 Test #2: inttest-us-based-j2 .............. Passed 146.94 sec
2/2 Test #1: inttest-us-based-j1 .............. Passed 170.99 sec
100% tests passed, 0 tests failed out of 2
Total Test time (real) = 170.99 sec
vs.
ozmorph@home:~/covid-sim/build$ make test ARGS="-j6 V"
Running tests...
Test project /home/ozmorph/covid-sim/build
Start 1: inttest-us-based-j1
Start 2: inttest-us-based-j2
1/2 Test #2: inttest-us-based-j2 .............. Passed 74.01 sec
2/2 Test #1: inttest-us-based-j1 .............. Passed 95.28 sec
100% tests passed, 0 tests failed out of 2
Total Test time (real) = 95.28 sec
I have started a run with all tests enabled and will respond shortly.
I do not have sufficient memory on my current workstation (16 GB) to run all 4 tests at the same time. I am running the UK tests separately, as I did with the US tests. In addition, I recognize that the changes in this particular PR do not address code outside of model setup, so I will not be surprised to see only minor run-time improvements for the UK tests. However, as stated in #366, I have found locations within
It looks like one of the last 3 commits is causing a failure on the UK tests. I'm diagnosing now.
Regression test fixed. Waiting on test run from
@matt-gretton-dann Running the UK tests, there is a ~14% run-time reduction between
ozmorph@home:~/covid-sim/build$ make test ARGS="-j6 V"
Running tests...
Test project /home/ozmorph/covid-sim/build
Start 1: inttest-uk-based-j1
Start 2: inttest-uk-based-j2
1/2 Test #2: inttest-uk-based-j2 .............. Passed 674.82 sec
2/2 Test #1: inttest-uk-based-j1 .............. Passed 2805.93 sec
100% tests passed, 0 tests failed out of 2
Total Test time (real) = 2805.94 sec
vs.
ozmorph@home:~/covid-sim/build$ make test ARGS="-j6 V"
Running tests...
Test project /home/ozmorph/covid-sim/build
Start 1: inttest-uk-based-j1
Start 2: inttest-uk-based-j2
1/2 Test #2: inttest-uk-based-j2 .............. Passed 547.70 sec
2/2 Test #1: inttest-uk-based-j1 .............. Passed 2413.15 sec
100% tests passed, 0 tests failed out of 2
Total Test time (real) = 2413.16 sec
I have a theory as to why Windows takes longer than macOS and Linux. The first run of the single-threaded UK integration test on Ubuntu 20.04 uses an average heap allocation of 3.5 GB; I will do a thorough heap profile on Windows tomorrow. Since the GitHub Actions runners only have 7 GB of RAM in total, this suggests that the integration tests are probably hitting paged memory. I'm not yet certain why Linux and macOS don't see a slow-down, but one theory is that the Linux and macOS containers have a lower RAM requirement for the base OS. More to follow, perhaps in a separate issue. I consider my development on this particular PR done and ready for final review.
Note that the -j option can be dropped to run tests serially:
Possibly - but the Windows tests were still significantly slower when we were running tests serially. Having just looked at the Actions Virtual Env setup, I think Linux runs with a 4 GB swap space (actions/runner-images#965). I haven't worked this out for Windows/macOS - but I guess for Windows at least it'll be on the temp SSD, which will give us somewhere up to 14 GB of swap.
Thanks for the note on the
I'm not sure that's necessarily true. The GitHub Actions runners page indicates that 14 GB of SSD disk space is provisioned. However, Microsoft's own page file documentation states that the maximum page file size is 1/8 of the volume capacity. In this case, the page file would be about 2 GB (14 GB / 8 = 1.75 GB), which is roughly half of Linux's swap space. There was a recent Azure issue that pointed out that this could be problematic, but it had no public resolution. But as you stated, this does not explain why the regression tests are slower on Windows when run serially. I'll continue to diagnose.
I think I'm OK with this - but I'd like @dlaydon or @weshinsley to comment please.
Ok, I've had a careful look, and this looks good to me as best as I understand. I'll just "say out loud" what I understand the key bits are:-
- Many uses of the functions get_number_of_micro_cells_high() and ...wide() replaced with a constant, since this never changes.
- Relocating Hosts[x].quar_comply and .quar_start_time into HostsQuarantine[x].comply and .start_time - separating out the quarantine properties into a structure that's "parallel" to hosts in terms of the indexes.
- Considerable amount of factoring out constants in the nested loops in AssignPeopleToPlaces().
- Mcells[i].country relocated to mcell_country[i]
- MicroCellPosition becomes a struct instead of a class
These all look good - I am not very clear where the biggest performance gains come from out of these, and I wonder whether amongst our internal documentation, there are learning points we should take away about what to encourage/what to avoid. Likely this has changed since the times of classic C, so we are learning a lot of "best practises" ...
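As a learning point, the quarantine relocation described above is an instance of what is often called hot/cold field splitting. The sketch below is illustrative only: the field types and the cut-down `Person` are assumptions, and only the names `comply`, `start_time` and `PersonQuarantine` come from this PR.

```cpp
#include <cstddef>
#include <vector>

// Cut-down stand-in for the large per-host record (assumed fields).
struct Person {
    int age = 0;
    int household = 0;
    double infection_time = 0.0;
};

// Quarantine state kept in its own small record, stored in a vector that
// is indexed exactly like the vector of Person ("parallel" arrays).
struct PersonQuarantine {
    unsigned char comply = 0;       // assumed type
    unsigned short start_time = 0;  // assumed type
};

// A loop that only needs quarantine data walks the small records, so each
// cache line holds many hosts rather than a fraction of one Person.
inline std::size_t count_compliant(const std::vector<PersonQuarantine>& hq) {
    std::size_t n = 0;
    for (const PersonQuarantine& q : hq) {
        if (q.comply) ++n;
    }
    return n;
}
```

The cost of this layout is that the two vectors must always be sized and indexed together, so it is worth doing only for fields that are read in tight loops on their own.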
@weshinsley Your analysis is correct. I'd be happy to create more in-depth/formal documentation of how I discovered these performance optimizations in the near future, but I'm a bit short on time over the next day or two. What I can say in short is that I used
My intention is to soon create a separate issue that will discuss this analysis technique in more detail and how to best apply it to
Thanks for the review!
@ozmorph: Do you mind merging in master/rebasing to resolve conflicts please - then @weshinsley or I will merge.
My working hypothesis is that the PRs that will improve performance most are those like this one, which change structure layout/organisation so that sequential data accesses have better locality and a regular pattern. This gives the CPU plenty of opportunity to predict correctly and not get blocked on the memory system.
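To make the locality hypothesis concrete, here is a sketch of the access pattern that a separate `mcell_country` vector enables. The function name is illustrative; the only detail taken from this PR is that country codes are `unsigned short` values stored in a vector indexed by microcell.

```cpp
#include <cstddef>
#include <vector>

// Country codes stored contiguously, one unsigned short per microcell.
// A scan reads memory sequentially with a fixed stride, which is easy for
// the hardware prefetcher to follow and wastes no cache space on
// unrelated microcell fields.
inline std::size_t count_microcells_in_country(
    const std::vector<unsigned short>& mcell_country,
    unsigned short country) {
    std::size_t count = 0;
    for (unsigned short c : mcell_country) {
        if (c == country) ++count;
    }
    return count;
}
```

Had the code stayed inside a larger microcell struct, the same loop would pull a whole struct's worth of bytes through the cache for every comparison.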
… a comparison between a host's country and a nearby place's country to utilize better CPU caching and removing a redundant computation for each loop in SetupModel.cpp
… Param::get_number_of_micro_cells_wide() and Param::get_number_of_micro_cells_high() with Param.total_microcells_wide_ and Param.total_microcells_high_; since the function was performing redundant calculations (the inputs never changed) it makes more sense to compute and store the values instead of spending a function call and loading two memory addresses to achieve the same effect
…ion code to use a standard compound operator implementation (https://en.cppreference.com/w/cpp/language/operators) and making it an inline call
…sition() to pass by const reference instead of by value
…opy-by-value initialized vector to ensure that the space is not only allocated and initialized but that the vector's size is not 0
… const values or const references for data in the main loop of AssignPeopleToPlaces(); because of the size of the loop, calls to other functions, and the lack of specificity that most of the arguments are const, it helps a lot to tell the compiler that many of the data accesses in the loop will not cause a side effect (even though we know they don't); this commit also makes the main loop look more readable
…he unsigned short country field in Microcell struct and replacing all references to use the new mcell_country vector; this likely increased SIMD/SIMT performance with Microcells and also increased performance with microcell country comparisons
…ction in regressiontest_US_based.py) by moving the 'quar_comply' and 'quar_start_time' fields out from 'struct Person' into a new 'struct PersonQuarantine'; this change provides much better CPU caching behavior
…ll to not duplicate twice
…h fixes regression tests
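One of the commits above concerns a vector that had space reserved but still had a size of 0. The sketch below shows that distinction in standard C++; the helper names are illustrative, and that this was the exact cause of the earlier Windows failure is an inference from the comments in this thread, not something the thread confirms in full.

```cpp
#include <cstddef>
#include <vector>

// reserve() only allocates capacity: size() stays 0, so v[i] is undefined
// behavior for every i, even though it may appear to work on some builds.
inline std::vector<int> make_reserved(std::size_t n) {
    std::vector<int> v;
    v.reserve(n);
    return v;
}

// The sized, fill-value constructor allocates AND creates n elements, so
// indexing 0..n-1 is valid immediately.
inline std::vector<int> make_sized(std::size_t n, int fill) {
    return std::vector<int>(n, fill);
}
```

Because out-of-range indexing is undefined rather than a guaranteed crash, a bug of this kind can pass on one toolchain and fail on another.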
@weshinsley @matt-gretton-dann: Rebased with master as requested.
I agree completely!
Following up on my comment in #366, I found low-hanging fruit inside the AssignPeopleToPlaces() function and MicroCellPosition code that results in a substantial (~42.8%) run-time reduction for the US regression test. Merging this PR would allow for more frequent local testing.
Running time ./regressiontest_US_based.py on master on my workstation:
Running the same test on this PR: