Regression Test Heap Optimization: Remove Extinct Result Arrays #433
base: master
Conversation
… they are not currently used in the final output. The size of the `Results` struct is around 800 KB, which, when accounting for `P.NumSamples` (720 in the regression tests), means that `TSMeanE` and `TSVarE` each require an allocation of around 580 MB. As a result, the regression tests, when run in parallel, will use 1160 x 4 = 4640 MB (4.53 GB) less memory than before. This allows the default GitHub Actions runners, which have only 7 GB total, to run all 4 tests without having to page memory to disk.
We still use them, but I like the idea of not initialising them if not used. Can we use predicates `need_mean()` and `need_var()` to determine whether to initialise/populate them? And assert that the predicates hold true when we produce output which makes use of them?
I'm surprised the regression tests don't make use of them. We should add a config that does - but we can do that in a way that uses less memory (fewer time-steps?) to check the main steps.
Also note that in the CI loop we only run -j2 as the Actions VMs are all dual-core.
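The predicate pattern suggested above could be sketched roughly as follows. This is illustrative only: the `Params`/`Results` fields and the `OutputMeanVar` flag are assumed names, not taken from the codebase, and the real structs differ.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch: gate allocation of the large time-series arrays
// behind predicates, and assert those predicates at output time.
struct Params {
    int NumSamples = 720;       // sample count used by the regression tests
    bool OutputMeanVar = false; // assumed config flag controlling mean/var output
};

struct Results {
    std::vector<double> TSMeanE; // allocated only when needed
    std::vector<double> TSVarE;
};

bool need_mean(const Params& p) { return p.OutputMeanVar; }
bool need_var(const Params& p)  { return p.OutputMeanVar; }

void init_results(Results& r, const Params& p) {
    if (need_mean(p)) r.TSMeanE.resize(p.NumSamples);
    if (need_var(p))  r.TSVarE.resize(p.NumSamples);
}

void output_mean_var(const Results& r, const Params& p) {
    // Fail loudly if output is requested but the arrays were never initialised.
    assert(need_mean(p) && need_var(p));
    assert(!r.TSMeanE.empty() && !r.TSVarE.empty());
    // ... write the arrays ...
}
```

When the flag is off, neither array is allocated, which is where the memory saving comes from.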
Sounds good! I like having a check as an alternative. I am surprised too that it did not affect run-time. It could also be that the paging algorithm and Windows allocators are good enough to hide this cost. I'm going to try to replicate the CI binaries on my Windows system and analyze the performance versus my Linux runs to try to determine where the real difference lies. Good to note about running only 2 tests at once on the CI.
@matt-gretton-dann Using the exact same compiler that GitHub Actions is using, I get similar performance on my local machine as Clang 9 on WSL Ubuntu 20.04 does. This led me down a rabbit hole, but I think your earlier theory may be right: it appears that I/O performance is a common issue for GitHub Actions (especially on Windows):
It appears this project is not the only one having this issue. It also explains why Windows performance is so drastically different from Linux and Mac. To conclude, reducing memory usage won't by itself make the regression tests faster, but I think it would provide the extra headroom necessary for more variable tracking in additional simulations in the future.
I have a WiP branch which enables us to control output verbosity - and so that should show whether I am right or not.
Objective
Reduce RAM usage in regression tests so that GitHub Action runners can run all 4 tests without paging memory.
Background
The size of the `Results` struct in `Model.h` is nearly 805 KB on Clang 9 for Linux, which, when accounting for an array of `P.NumSamples` (720 in the regression tests) elements, means that `TSMeanE` and `TSVarE` each require an allocation of around 580 MB. As a result, a single regression test will use over 1.1 GB of RAM for these two arrays alone. When all 4 tests are run in parallel, they consume 4.4 GB of RAM.
Rationale
As noted in #388, the default GitHub Actions runners have only 7 GB of total RAM and a 14 GB temporary storage device allocated to them. For Linux environments the swap space is set to 4 GB, and on Windows the swap space is most likely set to 2 GB.
Local runs using heaptrack indicate that about 3.5 GB is the average heap size of each run of `CovidSim`. Accounting for the fact that the US regression tests run for a fraction of the time that the UK tests do, let's assume that on average 7 GB of RAM is being used by the regression tests, with a maximum of 14 GB. Windows Server 2019 requires 512 MB of RAM to run. If 7 GB is the total amount of RAM given to the runners, this does not leave much room for the operating system itself. Therefore, I surmise that `CovidSim` is hitting paged memory for at least the first half of a `ctest -j6 -V` run.
Solution
This pull request removes the `TSMeanE` and `TSVarE` arrays from the code, since they haven't been used in the final output since the initial squash of commits back in April. The regression tests still pass.
Known Concerns
If this code is still used on the super-computers or outside of the regression tests listed on GitHub, then please close this PR and I will open a new PR that documents its use so that someone doesn't attempt to propose this change in the future.
Alternative/Additional Solution
The memory requirements of the `Results` arrays could also be optimized, either with heap-allocated C-style arrays or `std::vector`s, instead of static double arrays of `MAX_ADUNITS` size (which is set to `3200`, but the regression tests only use 1 or 2 elements). However, this would require far deeper and more extensive changes to the code base than this pull request.