Synchronize swaps, reduce memory used, SeedSequence for random numbers #29

Merged 24 commits into nanograv:master on Oct 30, 2022

Conversation

AaronDJohnson (Collaborator)

The following changes have been made in this branch:

  • Chains now stay on the same iteration and swap synchronously from the highest temperature to the lowest. This resolves the asynchronous swapping issues in the HyperModel framework.
  • _AMBuffer no longer stores the entire chain; instead it stores pieces, which are shifted into _DEbuffer.
  • Legacy np.random.* functions have been replaced with the Generator functions recommended in current versions of numpy, using a SeedSequence to generate seeds for each process (see the sketch after this list). In principle, this should solve the issues with reproducibility. However, to fix this for NANOGrav, we would need to adopt a similar procedure in enterprise_extensions and enterprise.
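A minimal sketch of the SeedSequence-per-process pattern described above; the function name, seed value, and process count are illustrative, not taken from this PR:

    import numpy as np

    def make_generators(seed, nprocs):
        # One parent SeedSequence spawns independent child sequences,
        # one per MPI process; each child seeds its own Generator.
        parent = np.random.SeedSequence(seed)
        children = parent.spawn(nprocs)
        return [np.random.default_rng(child) for child in children]

    rngs = make_generators(seed=1234, nprocs=4)
    print(rngs[0].random())  # each process would draw only from its own generator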

@paulthebaker (Member) left a comment

I haven't done any serious functionality tests, because I'm trusting this has been well tested as part of the enterprise review.

You should delete any obsolete code instead of just commenting it out. I've marked the few places I noticed. I also marked a method that was renamed and needs a docstring.

Six inline review threads on PTMCMCSampler/PTMCMCSampler.py (all resolved; most marked outdated).
@paulthebaker (Member)

You have to merge in the latest commit on the main branch, which updates some CI things. That should fix the failing tests.

@chimaerase (Contributor)

I added some suggested directions in #32 for rebasing existing branches onto master following the merge.

I also noticed some partial overlap between this PR and work I did in #30 (officially submitted after this PR, following a long wait for permission to release it publicly). I'm planning to look into this PR and may be able to offer useful comments on where I see overlap.

@@ -431,7 +452,7 @@ def sample(
        Neff = 0
        while runComplete is False:
            iter += 1

            self.comm.barrier()  # make sure all processes are at the same iteration
Contributor

I think this is unnecessary communication. Using bcast below should guarantee that all MPI processes are always running the same step after the first loop.

if self.MPIrank > 0:
    runComplete = self.comm.Iprobe(source=0, tag=55)
    time.sleep(0.000001)  # trick to get around
runComplete = self.comm.bcast(runComplete, root=0)
Contributor

Great! I did something very similar here (in parallel, it seems)
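For context, a minimal self-contained mpi4py sketch of the bcast-based synchronization shown in the hunk above; the stopping rule here is a stand-in, not the sampler's actual convergence test:

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    for step in range(100):
        run_complete = None
        if rank == 0:
            run_complete = step >= 50  # stand-in stopping rule, decided on rank 0 only
        # bcast is collective: every rank blocks until it receives rank 0's value,
        # so all processes agree on run_complete at the same iteration.
        run_complete = comm.bcast(run_complete, root=0)
        if run_complete:
            break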


if logChainSwap > np.log(np.random.rand()):
    swapAccepted = 1
if self.MPIrank == 0:
Contributor

This block looks a lot simpler (and so probably more maintainable and less error-prone) than my iteration on the older code in #30. It does introduce a dependency on the rank 0 process for all swaps, so there's a chance the code in #30 would perform better at large scale, since more work could happen in parallel and there would be no potential communication bottleneck at rank 0. I'm relying on the previous review for correctness.

Member

The motivation for locking everything to the rank 0 (cold) chain is to fix a problem that occurs when different parts of parameter space have different compute times for the posterior; this effectively breaks detailed balance, since PT swaps become more likely to propose moves into slow-compute regions.

getCovariance = self.comm.Iprobe(source=0, tag=111)
time.sleep(0.000001)
if getCovariance and self.MPIrank > 0:
    # update covariance matrix
Contributor

Same here -- I think I did something very similar

):

    # MPI initialization
    self.comm = comm
    self.MPIrank = self.comm.Get_rank()
    self.nchain = self.comm.Get_size()

    if self.MPIrank == 0:
Contributor

This is great! I wasn't aware of this API, but will start using it.

@chimaerase (Contributor) commented Oct 27, 2022

I ran a test of this branch using lightly modified variants of bulk_gaussian.sh (included in the comments on #30) and gaussian_mpi_example.py (included in the branch from #30, and essentially just a repackaging of existing PTMCMCSampler code). Both were adjusted to use the new seed parameter and code from this branch, but sequential runs of 1 chain using this branch did not produce repeatable results. Earlier testing of master under #25 showed that 1 chain was repeatable, though multiple chains weren't. Without a full code review, I suspect this means there's a lingering seeding issue somewhere in this branch that causes non-determinism. It can't be a communication issue, since there's only 1 chain in this test.

Repeatability output

Final checksum lines for two runs don't match.

$ ./bulk_gaussian_test.sh 
Performing 100 runs, with 1 chains each
****************************************************************************************
Run 1...
****************************************************************************************
Started at 2022-10-27 21:07:17 UTC
Optional acor package is not installed. Acor is optionally used to calculate the effective chain length for output in the chain file.
Adding DE jump with weight 20
Finished 99.00 percent in 15.851856 s Acceptance rate = 0.314545
Run Complete in 15.93 s
Checksum: 7e213098e398c9f9ac5b89a100a35345  results/1_chains_run_1/chain_1.txt
****************************************************************************************
Run 2...
****************************************************************************************
Started at 2022-10-27 21:07:35 UTC
Optional acor package is not installed. Acor is optionally used to calculate the effective chain length for output in the chain file.
Adding DE jump with weight 20
Finished 99.00 percent in 15.923358 s Acceptance rate = 0.341828
Run Complete in 16.00 s
Checksum: e9ac50a026c880ecf00028d764952d87  results/1_chains_run_2/chain_1.txt

Multi-chain error

When bumping up to 2 chains, a test also raised an exception that I haven't observed in other branches:

File "/code/PTMCMCSampler/PTMCMCSampler.py", line 588, in PTMCMCOneStep
    p0, lnlike0, lnprob0 = self.PTswap(p0, lnlike0)
TypeError: PTSampler.PTswap() missing 2 required positional arguments: 'lnprob0' and 'iter'

@AaronDJohnson (Collaborator, Author)

@chimaerase thanks for looking through this! The last error that you noticed was a byproduct of my most recent commit. I think I've fixed the issue (some arguments were missing in the call to PTswap).

@chimaerase (Contributor)

> @chimaerase thanks for looking through this! The last error that you noticed was a byproduct of my most recent commit. I think I've fixed the issue (some arguments were missing in the call to PTswap).

You're welcome! I'm happy to help, though I do need to be cautious about how much time I put into direct work on this project, since it lies at the very edge of my scope of work. It's very unfortunate that approval delays have caused us to work in parallel and have potentially created duplicate work compared with a more immediate submission. I was able to easily re-run my test with 1, 2, and 6 chains, and confirmed your fix for multiple-chain workflows.

I am still concerned about non-repeatability for a single chain, which is a significant change from my recent tests of master. After that, it might also be worth running a full test of 100+ sequential runs to verify whether the rare deadlocks I noticed in master have been addressed here, as they were in #30.

@paulthebaker (Member)

> I am still concerned about non-repeatability for a single chain, which is a significant change from my recent tests of master. After that, it might also be worth running a full test of 100+ sequential runs to verify whether the rare deadlocks I noticed in master have been addressed here, as they were in #30.

I'm not 100% sure, but I think the repeatability issue is that the test code generates a random covariance to initialize the GaussianLikelihood. That numpy.random.rand() call is not linked to the same Generator object that the PTSampler is using via its SeedSequence. It is possible that the sampler is using the exact same random bits in each run, but the likelihoods going into the posterior being sampled are different.
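For illustration, a hedged sketch of the seeding gap described above; the array shapes and seed are made up, and the suggested fix is an assumption rather than something taken from this PR:

    import numpy as np

    seed = 42

    # Test setup: legacy global RNG, so the covariance differs from run to run
    # even when the sampler itself is seeded.
    cov = np.random.rand(3, 3)

    # Sampler: its own Generator, seeded via a SeedSequence, so its draws repeat.
    rng = np.random.default_rng(np.random.SeedSequence(seed))
    jump = rng.standard_normal(3)

    # Possible fix (assumed): derive the test's covariance from a seeded Generator too.
    cov_repeatable = np.random.default_rng(seed).random((3, 3))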

@chimaerase (Contributor)

> I am still concerned about non-repeatability for a single chain, which is a significant change from my recent tests of master. After that, it might also be worth running a full test of 100+ sequential runs to verify whether the rare deadlocks I noticed in master have been addressed here, as they were in #30.

> I'm not 100% sure, but I think the repeatability issue is that the test code generates a random covariance to initialize the GaussianLikelihood. That numpy.random.rand() call is not linked to the same Generator object that the PTSampler is using via its SeedSequence. It is possible that the sampler is using the exact same random bits in each run, but the likelihoods going into the posterior being sampled are different.

Great point, thank you! It's been a long time since I actively worked with this segment of the code. I hope to revisit and comment again tomorrow.

@chimaerase (Contributor)

> I am still concerned about non-repeatability for a single chain, which is a significant change from my recent tests of master. After that, it might also be worth running a full test of 100+ sequential runs to verify whether the rare deadlocks I noticed in master have been addressed here, as they were in #30.

> I'm not 100% sure, but I think the repeatability issue is that the test code generates a random covariance to initialize the GaussianLikelihood. That numpy.random.rand() call is not linked to the same Generator object that the PTSampler is using via its SeedSequence. It is possible that the sampler is using the exact same random bits in each run, but the likelihoods going into the posterior being sampled are different.

> Great point, thank you! It's been a long time since I actively worked with this segment of the code. I hope to revisit and comment again tomorrow.

With further updates to seeding in the test code, I was able to run seeded single-chain runs with repeatable results.

I then also ran 100 sequential runs with 3 chains, then 200 sequential runs with 6 chains. Each sequence of runs produced identical checksums for the low-temperature result file (chain_1.0.txt). It appears that, similar to #30, this branch likely resolves both the repeatability issue and the rare deadlocks from master (which never got past roughly 60 sequential runs before deadlocking in the tests documented in #30). This makes sense to me given the replacement of the unpredictable message exchange between chains in the earlier iprobe + recv sequences.
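A small Python sketch of the checksum comparison used in these tests; the exact result paths are illustrative:

    import hashlib
    from pathlib import Path

    def md5sum(path):
        # Hash a chain file so sequential runs can be compared byte for byte.
        return hashlib.md5(Path(path).read_bytes()).hexdigest()

    run1 = md5sum("results/3_chains_run_1/chain_1.0.txt")
    run2 = md5sum("results/3_chains_run_2/chain_1.0.txt")
    print("repeatable" if run1 == run2 else "not repeatable")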

@paulthebaker merged commit ff30699 into nanograv:master on Oct 30, 2022