Synchronize swaps, reduce memory used, SeedSequence for random numbers #29
Conversation
I haven't done any serious functionality tests, because I'm trusting this has been well tested as part of the enterprise review.
You should delete any obsolete code instead of just commenting it out. I've marked the few places I noticed. I also marked a method that changed names and is in need of a docstring.
You have to merge in the latest commit on the main branch, which updates some CI things. That should fix the failing tests.
I added some suggested directions in #32 for rebasing existing branches onto master following the merge. I also noticed some partial overlap of this PR with work I did in #30 (which I officially submitted after this PR, following a long wait for permission to release it publicly). I'm planning to look into this PR and can maybe give some useful comments on where I see overlaps.
@@ -431,7 +452,7 @@ def sample(
Neff = 0
while runComplete is False:
    iter += 1

    self.comm.barrier()  # make sure all processes are at the same iteration
I think this is unnecessary communication. Using bcast below should guarantee that all MPI processes are always running the same step after the first loop iteration.
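For illustration, here is a minimal sketch of that pattern (hypothetical variable names and stopping test, not this PR's actual code): rank 0 decides whether the run is finished and the collective bcast returns the same flag on every rank, so all processes leave the loop on the same iteration without an explicit barrier.

```python
# Hedged sketch: a root-broadcast termination flag keeps ranks in lockstep.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

run_complete = False
iteration = 0
while not run_complete:
    iteration += 1
    # ... one MCMC step on this process ...

    if rank == 0:
        run_complete = iteration >= 1000  # stand-in for the real stopping criterion
    # bcast is collective: every rank receives rank 0's decision each iteration.
    run_complete = comm.bcast(run_complete, root=0)
```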
if self.MPIrank > 0:
    runComplete = self.comm.Iprobe(source=0, tag=55)
    time.sleep(0.000001)  # trick to get around
runComplete = self.comm.bcast(runComplete, root=0)
Great! I did something very similar here (in parallel, it seems)
if logChainSwap > np.log(np.random.rand()):
    swapAccepted = 1
if self.MPIrank == 0:
This block looks a lot simpler (and so probably more maintainable / less error-prone) to me than the older iteration of this code I made in #30. It does introduce a dependency on the rank 0 process for all swaps, so there's a chance the code in #30 would perform better at large scale, since more work could happen in parallel and there would be no potential communication bottleneck at rank 0 exchanging data with the other processes. I'm relying on the previous review for correctness.
The motivation for locking everything to the rank 0 (cold) chain is to fix a problem that occurs when different parts of parameter space have different compute times for the posterior: unsynchronized swaps effectively break detailed balance, because PT swaps become more likely to propose moves into slow-compute regions.
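For readers unfamiliar with the approach, the following is a hypothetical minimal sketch of rank-0-coordinated swap decisions (the beta ladder, toy likelihoods, and variable names are assumptions, not this PR's implementation): every rank reports its log-likelihood, rank 0 decides all adjacent-pair swaps, and the decisions are broadcast so each process applies exactly the same swaps in the same iteration.

```python
# Hypothetical sketch of rank-0-coordinated parallel-tempering swaps
# (illustrative only; not the code in this PR).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
nchain = comm.Get_size()

# One chain per rank; rank 0 holds the cold (beta = 1) chain.
ladder = np.geomspace(1.0, 1e-2, nchain)            # inverse temperatures (assumed spacing)
rng = np.random.default_rng(1234 + rank)            # toy per-rank generator
my_logl = float(-0.5 * rng.standard_normal() ** 2)  # stand-in for the current log-likelihood

# Rank 0 gathers every chain's log-likelihood, decides all swaps, and
# broadcasts the decisions, so every rank sees the same outcome.
logl_all = comm.gather(my_logl, root=0)
if rank == 0:
    swaps = []
    for i in range(nchain - 1):
        # Standard PT acceptance for the adjacent pair (i, i + 1).
        log_accept = (ladder[i] - ladder[i + 1]) * (logl_all[i + 1] - logl_all[i])
        if log_accept > np.log(rng.random()):
            swaps.append((i, i + 1))
else:
    swaps = None
swaps = comm.bcast(swaps, root=0)
# ...each rank would then exchange states for any accepted pair it belongs to...
```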
getCovariance = self.comm.Iprobe(source=0, tag=111)
time.sleep(0.000001)
if getCovariance and self.MPIrank > 0:
    # update covariance matrix
Same here -- I think I did something very similar
):

    # MPI initialization
    self.comm = comm
    self.MPIrank = self.comm.Get_rank()
    self.nchain = self.comm.Get_size()

    if self.MPIrank == 0:
This is great! I wasn't aware of this API, but will start using it.
I ran a test of this branch using minorly-modified variants of my existing test scripts.

Repeatability output

Final checksum lines for two runs don't match.

$ ./bulk_gaussian_test.sh
Performing 100 runs, with 1 chains each
****************************************************************************************
Run 1...
****************************************************************************************
Started at 2022-10-27 21:07:17 UTC
Optional acor package is not installed. Acor is optionally used to calculate the effective chain length for output in the chain file.
Adding DE jump with weight 20
Finished 99.00 percent in 15.851856 s Acceptance rate = 0.314545
Run Complete in 15.93 s
Checksum: 7e213098e398c9f9ac5b89a100a35345 results/1_chains_run_1/chain_1.txt
****************************************************************************************
Run 2...
****************************************************************************************
Started at 2022-10-27 21:07:35 UTC
Optional acor package is not installed. Acor is optionally used to calculate the effective chain length for output in the chain file.
Adding DE jump with weight 20
Finished 99.00 percent in 15.923358 s Acceptance rate = 0.341828
Run Complete in 16.00 s
Checksum: e9ac50a026c880ecf00028d764952d87 results/1_chains_run_2/chain_1.txt

Multi-chain error

Also bumping up to 2 chains, a test raised an exception that I haven't observed in other branches:

  File "/code/PTMCMCSampler/PTMCMCSampler.py", line 588, in PTMCMCOneStep
p0, lnlike0, lnprob0 = self.PTswap(p0, lnlike0)
TypeError: PTSampler.PTswap() missing 2 required positional arguments: 'lnprob0' and 'iter'
@chimaerase thanks for looking through this! The last error that you noticed was a byproduct of my most recent commit. I think I've fixed the issue (some arguments were missing in the call to PTswap).
You're welcome! I'm happy to help, though I do need to be cautious about how much time I'm putting into direct work on this project, since it lies at the very edge of my scope of work. It's very unfortunate that approval delays have caused us to work in parallel, and have potentially created duplicate work compared with a more immediate submission. I was able to easily re-run my test with 1, 2, and 6 chains, and confirmed your fix for multiple-chain workflows. I am still concerned about the non-repeatability for a single chain, which is a significant change from my recent tests of master. After that, it might also be worth running a full test of 100+ sequential runs to verify whether the rare deadlocks I noticed in master have been addressed here, as they were in #30.
I'm not 100% sure, but I think the repeatability issue is that the test code generates a random covariance to initialize the sampler, and that draw isn't seeded in the test code.
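If that is the cause, seeding the test's covariance draw would make successive runs comparable. A minimal sketch of a hypothetical test setup (not the actual test script):

```python
import numpy as np

# Hypothetical test setup: build the random starting covariance from a seeded
# Generator rather than the unseeded legacy global np.random state, so two
# runs hand the sampler the same covariance.
ndim = 10
rng = np.random.default_rng(42)
A = rng.standard_normal((ndim, ndim))
cov = A @ A.T + ndim * np.eye(ndim)  # symmetric positive-definite covariance
```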
Great point, thank you! It's been a long time since I actively worked with this segment of the code. I hope to revisit it and comment again tomorrow.
With further updates to seeding in the test code, I was able to do seeded single-chain runs with repeatable results. I then also ran 100 sequential runs with 3 chains, and 200 sequential runs with 6 chains. Each sequence of runs produced identical checksums for the low-temperature result file (chain_1.txt).
The following changes have been made in this branch:

- Swaps are now synchronized, as needed for the HyperModel framework (see the discussion of detailed balance above).
- _AMBuffer no longer stores the entire chain: instead it stores pieces, which are shifted into _DEbuffer.
- The legacy np.random.x functions have been swapped for the Generator functions recommended in current versions of numpy, using a SeedSequence to generate seeds for each process (a minimal sketch follows this list). In principle, this should solve the issues with reproducibility. However, to fix this for NANOGrav, we would need to adopt a similar procedure in enterprise_extensions and enterprise.
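As a rough illustration of the seeding scheme described in the last bullet (the entropy value and variable names are placeholders, not this branch's API):

```python
# Minimal sketch: spawn one independent child seed per MPI process from a
# single SeedSequence, then build a Generator from it.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
nchain = comm.Get_size()

seed_seq = np.random.SeedSequence(8675309)   # single user-supplied entropy value
child_seed = seed_seq.spawn(nchain)[rank]    # statistically independent per-process seed
rng = np.random.default_rng(child_seed)

jump = rng.standard_normal(size=10)  # e.g. a proposal draw
u = rng.random()                     # e.g. an acceptance draw
```

Because every process derives its Generator from the same SeedSequence, a single fixed entropy value is enough to make all chains reproducible across runs.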