Fix bug related to Boost RNG #297

Merged
merged 17 commits into dev from rng-bug on Apr 9, 2014

Conversation

jeffdonahue
Contributor

I've observed some weird behavior with the new Boost RNG that I can't explain or fix, and am posting a PR here with test cases so others can hopefully reproduce the behavior. An explanation of the steps taken in the test TestRngGaussianTimesBernoulli to demonstrate the weird behavior:

  1. Take N (= 10K) samples from a 0-mean Gaussian using caffe_vRngGaussian -- check that the samples are roughly half positive and half negative (with no zeros), as expected. These checks pass.
  2. Take N samples from a Bernoulli with p = 0.3 using caffe_vRngBernoulli -- check that roughly N*p are ones and N*(1-p) are zeros (with no other values), as expected. These checks pass.
  3. Element-wise multiply the Gaussian samples with the Bernoulli samples. Of the resulting product, check that roughly N*(1-p) are zeros, as expected -- this check passes. Check that roughly N*p/2 are positive and N*p/2 are negative -- this check does NOT pass; instead, of the ~N*p non-zero elements, roughly 2/3 are positive and 1/3 are negative, indicating some kind of correlation between the Gaussian result and the Bernoulli result.

I also added the equivalent test, but replacing the Gaussian with a uniform distribution on the interval [-1, 1] using caffe_vRngUniform. In this case the result indicates an exact inverse correlation between the Bernoulli and Uniform results -- the product consists of ONLY negative values.
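For reference, here is a minimal sketch of that three-step check written against plain Boost rather than the actual caffe_vRng* wrappers or the gtest harness (names and structure here are illustrative, not the test code). With a single engine shared correctly between the two distributions, the non-zero products come out roughly half positive and half negative:

```cpp
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/normal_distribution.hpp>
#include <boost/random/bernoulli_distribution.hpp>
#include <boost/random/variate_generator.hpp>
#include <iostream>
#include <vector>

int main() {
  const int N = 10000;
  const double p = 0.3;
  boost::mt19937 rng(1701);

  // Step 1: N zero-mean Gaussian samples; the engine is passed by pointer so
  // its state actually advances.
  std::vector<double> gauss(N);
  boost::variate_generator<boost::mt19937*, boost::normal_distribution<double> >
      gauss_gen(&rng, boost::normal_distribution<double>(0.0, 1.0));
  for (int i = 0; i < N; ++i) gauss[i] = gauss_gen();

  // Step 2: N Bernoulli(p) samples from the same engine, without re-seeding.
  std::vector<int> bern(N);
  boost::variate_generator<boost::mt19937*, boost::bernoulli_distribution<double> >
      bern_gen(&rng, boost::bernoulli_distribution<double>(p));
  for (int i = 0; i < N; ++i) bern[i] = bern_gen() ? 1 : 0;

  // Step 3: element-wise product; expect ~N*(1-p) zeros and the remaining
  // ~N*p split roughly evenly between positive and negative.
  int zeros = 0, pos = 0, neg = 0;
  for (int i = 0; i < N; ++i) {
    const double prod = gauss[i] * bern[i];
    if (prod == 0.0) ++zeros;
    else if (prod > 0.0) ++pos;
    else ++neg;
  }
  std::cout << "zeros " << zeros << " pos " << pos << " neg " << neg << std::endl;
  return 0;
}
```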

If anyone has a suggestion for what the problem might be and/or how to fix it, please comment here (and/or send a PR with the fix if you like). Thanks!

@jeffdonahue added the bug label Apr 7, 2014
@kloudkl
Contributor

kloudkl commented Apr 7, 2014

The distributions are sharing the same random seed Caffe::set_random_seed(1701);.

@kloudkl
Contributor

kloudkl commented Apr 7, 2014

Possible explanations and fixes:

Hellekalek, P.: Good random number generators are (not so) easy to find. Mathematics and Computers in Simulation, 46: 485--505, 1998.

http://stackoverflow.com/questions/12779235/how-to-properly-choose-rng-seed-for-parallel-processes

@jeffdonahue
Contributor Author

The distributions are sharing the same random seed Caffe::set_random_seed(1701);.

Unless I'm missing your point, I don't think this is what's happening -- I only call set_random_seed(1701) once at the beginning of the test. The Gaussian samples are generated from that random seed, followed by the Bernoulli samples from the same stream (caffe_rng(), which returns Caffe's Boost RNG singleton) without re-seeding. By the time the Bernoulli sample generation begins, the state of the RNG has changed because it generated the Gaussian samples, so the behavior should differ from the behavior if you reseed right before generating the Bernoulli samples.

The two samples are not generated from two separate parallel processes; it is a single deterministic process first generating the Gaussian samples and then generating the Bernoulli samples.

@jeffdonahue
Contributor Author

^ The above PR just re-seeds the RNG in the test case before sampling from the Bernoulli. This is not a bug fix, it just makes the test pass -- you should be able to sample from a Gaussian followed by a Bernoulli without having the samples be correlated.

@jeffdonahue
Contributor Author

I was wrong though -- the behavior is the same as re-seeding with 1701 before generating the Bernoulli samples. Anyway, I'm too tired to think about this now.

@kloudkl
Contributor

kloudkl commented Apr 7, 2014

The correlation may have been introduced by combining two uncorrelated sequences. http://www.sitmo.com/article/generating-correlated-random-numbers/

@kloudkl
Contributor

kloudkl commented Apr 7, 2014

True random number generators (TRNGs) may not be achievable without physical simulation [1]. Boost.Random, as a pseudo-random number generator (PRNG), inevitably introduces serial correlation, even between different distributions.

If the bug has no significant practical influence on training or testing the network, then it can be temporarily ignored.

[1] Helmut G. Katzgraber. Random Numbers in Scientific Computing: An Introduction. arXiv:1005.4117 [physics.comp-ph]

@jeffdonahue
Contributor Author

I fixed this in 1dfdce8. The problem was with Boost's variate_generator. The caffe_vRng* functions were written under the assumption that variate_generator would modify the state of the underlying RNG passed into its constructor, but in fact the input RNG is passed to the variate_generator constructor by value, so variate_generator simply makes a copy of it and modifies that copy instead. The somewhat ugly workaround is to use caffe_set_rng to copy the RNG state from variate_generator after it is used.
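A stripped-down illustration of that Boost behavior (plain Boost only, not the Caffe code; the pointer-based generator at the end is just one way to make draws advance the caller's engine, not the caffe_set_rng workaround used in this PR):

```cpp
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/normal_distribution.hpp>
#include <boost/random/variate_generator.hpp>
#include <cassert>

int main() {
  boost::mt19937 rng(1701);
  const boost::mt19937 initial_state = rng;  // snapshot of the seeded state

  // Engine passed by value: variate_generator copies it, so the draws advance
  // only the internal copy and `rng` is left exactly where it started.
  boost::variate_generator<boost::mt19937, boost::normal_distribution<double> >
      by_value(rng, boost::normal_distribution<double>(0.0, 1.0));
  for (int i = 0; i < 100; ++i) by_value();
  assert(rng == initial_state);  // unchanged -- the next user replays the stream

  // Engine passed by pointer: draws advance the caller's engine directly.
  boost::variate_generator<boost::mt19937*, boost::normal_distribution<double> >
      by_pointer(&rng, boost::normal_distribution<double>(0.0, 1.0));
  for (int i = 0; i < 100; ++i) by_pointer();
  assert(rng != initial_state);  // the shared state has moved on

  return 0;
}
```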

@shelhamer
Member

Thanks for the fix Jeff!

The somewhat ugly workaround is to use caffe_set_rng to copy the RNG state from variate_generator after it is used.

I thought about this since we talked this morning and I don't have a better solution, so let's just live with it.

@jeffdonahue
Contributor Author

I'll add such a comment. Also there's at least one test_math_functions failure now -- looking into that.

@jeffdonahue
Contributor Author

@shelhamer added the comment in 51b2694

Re the failing test_math_functions: the only failure was TestGPUHammingDistance, and it would seem the only reason it previously passed was that the RNG wasn't working -- it would GaussianFill blob A and then GaussianFill blob B, but of course these had the same result, so they had Hamming distance = 0. Non-trivial Hamming distances don't seem to be computed correctly. I spent a while trying to debug this but failed, and since this function isn't used anywhere in the codebase (I'm not actually sure why it was added -- why do we need a Hamming distance function?) I removed it (and its test case) in 8e2285b. Happy to see it brought back if someone (e.g. @kloudkl, who added it) wants to fix it.

@shelhamer
Member

Thanks for commenting and investigating the failing math test case. Let's merge this fix once you're done with any last changes.

The fancy Hamming distances were introduced by @kloudkl for their proposed image retrieval demo. #243 should reintroduce a tested and passing version if it's needed.

@jeffdonahue
Contributor Author

I'd like to clean up the added RNG unit tests a bit (they're barely readable to me, and I wrote them). Will probably not happen until tomorrow, but should be good to go after that.

@jeffdonahue changed the title from "Strange bug(?) with Boost RNG" to "Fix bug related to Boost RNG" Apr 8, 2014
@jeffdonahue
Contributor Author

In my latest two commits I cleaned up the RNG unit tests (85974f5) and was less aggressive in my removal of caffe_gpu_hamming_distance; I marked it NOT_IMPLEMENTED with a TODO to fix and prefixed its unit test with DISABLED_ (gtest skips tests with this prefix unless one passes the flag --gtest_also_run_disabled_tests to the binary). Please merge if all looks good.
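For anyone unfamiliar with the gtest convention, a disabled test just carries the prefix in its name -- something like the following (illustrative only, not the actual Caffe test body):

```cpp
#include <gtest/gtest.h>

// gtest registers this test but skips it by default; run the binary with
// --gtest_also_run_disabled_tests to execute it anyway.
TEST(MathFunctionsTest, DISABLED_GPUHammingDistance) {
  // Placeholder body: the real test exercises caffe_gpu_hamming_distance,
  // which is currently marked NOT_IMPLEMENTED.
}
```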

@jeffdonahue
Contributor Author

OK, done for real now... Since my last comment I've:

- changed the RNG function names from, e.g., caffe_vRngGaussian to caffe_rng_gaussian for consistency with the other math function names (a sketch of the resulting interface follows this list);

- changed the RNG function argument ordering so the pointer to be filled by the RNG is last (per the Google C++ style guidelines and the other math functions);

- added analogous GPU RNG functions for Gaussian and Uniform (I thought about also adding Bernoulli, but curand doesn't have a native function for it), and changed all current uses of curandGenerate in the code to call these functions instead;

- added unit tests for the GPU RNG functions;

- confirmed that adding Caffe::set_random_seed(this->seed_); before the second RNG call in any unit test which calls RNG functions twice (to check that the results are not correlated -- any test with Times in its name) makes the test fail;

- confirmed that all tests pass.
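Roughly, the renamed CPU interface now reads like this (a sketch reconstructed from the description above -- treat the exact parameter names as assumptions rather than the committed signatures):

```cpp
// Names follow caffe_rng_* instead of caffe_vRng*, and the output pointer
// comes last, matching the other math functions.
template <typename Dtype>
void caffe_rng_gaussian(const int n, const Dtype mu, const Dtype sigma,
                        Dtype* r);

template <typename Dtype>
void caffe_rng_uniform(const int n, const Dtype a, const Dtype b, Dtype* r);

template <typename Dtype>
void caffe_rng_bernoulli(const int n, const Dtype p, int* r);
```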

@shelhamer
Member

Nice to have this figured out and the rng interface polished. Thanks Jeff.

shelhamer added a commit that referenced this pull request Apr 9, 2014
Fix persistence of random state with Boost RNG
@shelhamer merged commit 3e3980f into BVLC:dev Apr 9, 2014
@jeffdonahue deleted the rng-bug branch April 9, 2014 18:13
@sguada
Contributor

sguada commented Apr 15, 2014

@jeffdonahue I'm getting random core dumps during filler initialization after updating dev on durian2, using ATLAS as the BLAS.

@jeffdonahue
Contributor Author

@sguada I haven't seen that behavior - can you give any more information? Does it stop happening when you checkout the commit before this PR?

@shelhamer
Member

@sguada Did you do make clean to be certain you're building and linking against ATLAS instead of a partial MKL build? Did you update your Makefile.config (since only Makefile.config.example is updated by versioning)? Have you checked git status for local changes?

Note the Makefile.config format changed slightly in ed38827 to support more BLAS choices.

shelhamer added a commit that referenced this pull request Apr 19, 2014
@shelhamer mentioned this pull request Apr 21, 2014
mitmul pushed a commit to mitmul/caffe that referenced this pull request Sep 30, 2014
Fix persistence of random state with Boost RNG