Synchronize swaps, reduce memory used, SeedSequence for random numbers #29

Merged 24 commits into nanograv:master on Oct 30, 2022

Conversation

AaronDJohnson (Collaborator)

The following changes have been made in this branch:

  • Chains now stay on the same iteration and swap synchronously from the highest temperature to the lowest. This resolves the asynchronous swapping issues in the HyperModel framework.
  • _AMBuffer no longer stores the entire chain; instead it stores pieces, which are shifted into _DEbuffer.
  • Legacy np.random.* functions have been replaced with the Generator functions recommended in current versions of numpy, using a SeedSequence to generate seeds for each process (see the sketch after this list). In principle, this should solve the issues with reproducibility. However, to fix this for NANOGrav, we would need to adopt a similar procedure in enterprise_extensions and enterprise.
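A minimal sketch of the SeedSequence-per-process pattern described above; the function name, seed value, and process count are illustrative, not taken from this PR:

    import numpy as np

    def make_generators(seed, nprocs):
        # One parent SeedSequence spawns independent child sequences,
        # one per MPI process; each child seeds its own Generator.
        parent = np.random.SeedSequence(seed)
        children = parent.spawn(nprocs)
        return [np.random.default_rng(child) for child in children]

    rngs = make_generators(seed=1234, nprocs=4)
    print(rngs[0].random())  # each process would draw only from its own generator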

@paulthebaker (Member) left a comment

I haven't done any serious functionality tests, because I'm trusting this has been well tested as part of the enterprise review.

You should delete any obsolete code instead of just commenting it out. I've marked the few places I noticed. I also marked a method that was renamed and needs a docstring.

Six inline review threads on PTMCMCSampler/PTMCMCSampler.py (all resolved; most marked outdated).
@paulthebaker (Member)

You have to merge in the latest commit on the main branch, which updates some CI things. That should fix the failing tests.

@chimaerase (Contributor)

I added some suggested directions in #32 for rebasing existing branches onto master following the merge.

I also noticed some partial overlap between this PR and work I did in #30 (officially submitted after this PR, following a long wait for permission to release it publicly). I'm planning to look into this PR and may be able to offer useful comments on where I see overlap.

@@ -431,7 +452,7 @@ def sample(
        Neff = 0
        while runComplete is False:
            iter += 1

            self.comm.barrier()  # make sure all processes are at the same iteration
Contributor

I think this is unnecessary communication. Using bcast below should guarantee that all MPI processes are always running the same step after the first loop.

if self.MPIrank > 0:
    runComplete = self.comm.Iprobe(source=0, tag=55)
    time.sleep(0.000001)  # trick to get around
runComplete = self.comm.bcast(runComplete, root=0)
Contributor

Great! I did something very similar here (in parallel, it seems)
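For context, a minimal self-contained mpi4py sketch of the bcast-based synchronization shown in the hunk above; the stopping rule here is a stand-in, not the sampler's actual convergence test:

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    for step in range(100):
        run_complete = None
        if rank == 0:
            run_complete = step >= 50  # stand-in stopping rule, decided on rank 0 only
        # bcast is collective: every rank blocks until it receives rank 0's value,
        # so all processes agree on run_complete at the same iteration.
        run_complete = comm.bcast(run_complete, root=0)
        if run_complete:
            break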


if logChainSwap > np.log(np.random.rand()):
    swapAccepted = 1
if self.MPIrank == 0:
Contributor

This block looks a lot simpler (and so probably more maintainable and less error-prone) than my iteration on the older code in #30. It does introduce a dependency on the rank 0 process for all swaps, so there's a chance the code in #30 would perform better at large scale, since more work could happen in parallel and there would be no potential communication bottleneck at rank 0. I'm relying on the previous review for correctness.

Member

The motivation for locking everything to the rank 0 (cold) chain is to fix a problem that occurs when different parts of parameter space have different compute times for the posterior; this effectively breaks detailed balance, since PT swaps become more likely to propose moves into slow-compute regions.

getCovariance = self.comm.Iprobe(source=0, tag=111)
time.sleep(0.000001)
if getCovariance and self.MPIrank > 0:
    # update covariance matrix
Contributor

Same here -- I think I did something very similar

):

    # MPI initialization
    self.comm = comm
    self.MPIrank = self.comm.Get_rank()
    self.nchain = self.comm.Get_size()

    if self.MPIrank == 0:
Contributor

This is great! I wasn't aware of this API, but will start using it.

@chimaerase (Contributor) commented Oct 27, 2022

I ran a test of this branch using lightly modified variants of bulk_gaussian.sh (included in the comments on #30) and gaussian_mpi_example.py (included in the branch from #30, and essentially just a repackaging of existing PTMCMCSampler code). Both were adjusted to use the new seed parameter and code from this branch, but sequential runs of 1 chain using this branch did not produce repeatable results. Earlier testing of master under #25 showed that 1 chain was repeatable, though multiple chains weren't. Without a full code review, I suspect this means there's a lingering seeding issue somewhere in this branch that causes non-determinism. It can't be a communication issue, since there's only 1 chain in this test.

Repeatability output

Final checksum lines for two runs don't match.

$ ./bulk_gaussian_test.sh 
Performing 100 runs, with 1 chains each
****************************************************************************************
Run 1...
****************************************************************************************
Started at 2022-10-27 21:07:17 UTC
Optional acor package is not installed. Acor is optionally used to calculate the effective chain length for output in the chain file.
Adding DE jump with weight 20
Finished 99.00 percent in 15.851856 s Acceptance rate = 0.314545
Run Complete in 15.93 s
Checksum: 7e213098e398c9f9ac5b89a100a35345  results/1_chains_run_1/chain_1.txt
****************************************************************************************
Run 2...
****************************************************************************************
Started at 2022-10-27 21:07:35 UTC
Optional acor package is not installed. Acor is optionally used to calculate the effective chain length for output in the chain file.
Adding DE jump with weight 20
Finished 99.00 percent in 15.923358 s Acceptance rate = 0.341828
Run Complete in 16.00 s
Checksum: e9ac50a026c880ecf00028d764952d87  results/1_chains_run_2/chain_1.txt

Multi-chain error

When bumping up to 2 chains, a test also raised an exception that I haven't observed in other branches:

File "/code/PTMCMCSampler/PTMCMCSampler.py", line 588, in PTMCMCOneStep
    p0, lnlike0, lnprob0 = self.PTswap(p0, lnlike0)
TypeError: PTSampler.PTswap() missing 2 required positional arguments: 'lnprob0' and 'iter'

@AaronDJohnson (Collaborator, Author)

@chimaerase thanks for looking through this! The last error that you noticed was a byproduct of my most recent commit. I think I've fixed the issue (some arguments were missing in the call to PTswap).

@chimaerase (Contributor)

> @chimaerase thanks for looking through this! The last error that you noticed was a byproduct of my most recent commit. I think I've fixed the issue (some arguments were missing in the call to PTswap).

You're welcome! I'm happy to help, though I do need to be cautious about how much time I put into direct work on this project, since it lies at the very edge of my scope of work. It's very unfortunate that approval delays have caused us to work in parallel and have potentially created duplicate work compared with a more immediate submission. I was able to easily re-run my test with 1, 2, and 6 chains, and confirmed your fix for multiple-chain workflows.

I am still concerned about non-repeatability for a single chain, which is a significant change from my recent tests of master. After that, it might also be worth running a full test of 100+ sequential runs to verify whether the rare deadlocks I noticed in master have been addressed here, as they were in #30.

@paulthebaker (Member)

> I am still concerned about non-repeatability for a single chain, which is a significant change from my recent tests of master. After that, it might also be worth running a full test of 100+ sequential runs to verify whether the rare deadlocks I noticed in master have been addressed here, as they were in #30.

I'm not 100% sure, but I think the repeatability issue is that the test code generates a random covariance to initialize the GaussianLikelihood. That numpy.random.rand() call is not linked to the same Generator object that the PTSampler is using via its SeedSequence. It is possible that the sampler is using the exact same random bits in each run, but the likelihoods going into the posterior being sampled are different.
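For illustration, a hedged sketch of the seeding gap described above; the array shapes and seed are made up, and the suggested fix is an assumption rather than something taken from this PR:

    import numpy as np

    seed = 42

    # Test setup: legacy global RNG, so the covariance differs from run to run
    # even when the sampler itself is seeded.
    cov = np.random.rand(3, 3)

    # Sampler: its own Generator, seeded via a SeedSequence, so its draws repeat.
    rng = np.random.default_rng(np.random.SeedSequence(seed))
    jump = rng.standard_normal(3)

    # Possible fix (assumed): derive the test's covariance from a seeded Generator too.
    cov_repeatable = np.random.default_rng(seed).random((3, 3))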

@chimaerase (Contributor)

> I am still concerned about non-repeatability for a single chain, which is a significant change from my recent tests of master. After that, it might also be worth running a full test of 100+ sequential runs to verify whether the rare deadlocks I noticed in master have been addressed here, as they were in #30.

> I'm not 100% sure, but I think the repeatability issue is that the test code generates a random covariance to initialize the GaussianLikelihood. That numpy.random.rand() call is not linked to the same Generator object that the PTSampler is using via its SeedSequence. It is possible that the sampler is using the exact same random bits in each run, but the likelihoods going into the posterior being sampled are different.

Great point, thank you! It's been a long time since I actively worked with this segment of the code. I hope to revisit and comment again tomorrow.

@chimaerase (Contributor)

> I am still concerned about non-repeatability for a single chain, which is a significant change from my recent tests of master. After that, it might also be worth running a full test of 100+ sequential runs to verify whether the rare deadlocks I noticed in master have been addressed here, as they were in #30.

> I'm not 100% sure, but I think the repeatability issue is that the test code generates a random covariance to initialize the GaussianLikelihood. That numpy.random.rand() call is not linked to the same Generator object that the PTSampler is using via its SeedSequence. It is possible that the sampler is using the exact same random bits in each run, but the likelihoods going into the posterior being sampled are different.

> Great point, thank you! It's been a long time since I actively worked with this segment of the code. I hope to revisit and comment again tomorrow.

With further updates to seeding in the test code, I was able to run seeded single-chain runs with repeatable results.

I then also ran 100 sequential runs with 3 chains, then 200 sequential runs with 6 chains. Each sequence of runs produced identical checksums for the low-temperature result file (chain_1.0.txt). It appears that, similar to #30, this branch likely resolves both the repeatability issue and the rare deadlocks from master (which never got past roughly 60 sequential runs before deadlocking in the tests documented in #30). This makes sense to me given the replacement of the unpredictable message exchange between chains in the earlier iprobe + recv sequences.
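A small Python sketch of the checksum comparison used in these tests; the exact result paths are illustrative:

    import hashlib
    from pathlib import Path

    def md5sum(path):
        # Hash a chain file so sequential runs can be compared byte for byte.
        return hashlib.md5(Path(path).read_bytes()).hexdigest()

    run1 = md5sum("results/3_chains_run_1/chain_1.0.txt")
    run2 = md5sum("results/3_chains_run_2/chain_1.0.txt")
    print("repeatable" if run1 == run2 else "not repeatable")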

@paulthebaker merged commit ff30699 into nanograv:master on Oct 30, 2022