Fix bug related to Boost RNG #297

Merged
merged 17 commits into dev from rng-bug on Apr 9, 2014

Conversation

jeffdonahue
Contributor

I've observed some weird behavior with the new Boost RNG that I can't explain or fix, and am posting a PR here with test cases so others can hopefully reproduce the behavior. An explanation of the steps taken in the test TestRngGaussianTimesBernoulli to demonstrate the weird behavior:

  1. Take N (= 10K) samples from a 0-mean Gaussian using caffe_vRngGaussian -- check that the samples are roughly half positive and half negative (with no zeros), as expected. These checks pass.
  2. Take N samples from a Bernoulli with p = 0.3 using caffe_vRngBernoulli -- check that roughly N*p are ones and N*(1-p) are zeros (with no other values), as expected. These checks pass.
  3. Element-wise multiply the Gaussian samples with the Bernoulli samples. Of the resulting product, check that roughly N*(1-p) are zeros, as expected -- this check passes. Check that roughly N*p/2 are positive and N*p/2 are negative -- this check does NOT pass; instead, of the ~N*p non-zero elements, roughly 2/3 are positive and 1/3 are negative, indicating some kind of correlation between the Gaussian result and the Bernoulli result.

I also added the equivalent test, but replacing the Gaussian with a uniform distribution on the interval [-1, 1] using caffe_vRngUniform. In this case the result indicates an exact inverse correlation between the Bernoulli and Uniform results -- the product consists of ONLY negative values.
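For reference, here is a minimal sketch of that three-step check written against plain Boost rather than the actual caffe_vRng* wrappers or the gtest harness (names and structure here are illustrative, not the test code). With a single engine shared correctly between the two distributions, the non-zero products come out roughly half positive and half negative:

```cpp
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/normal_distribution.hpp>
#include <boost/random/bernoulli_distribution.hpp>
#include <boost/random/variate_generator.hpp>
#include <iostream>
#include <vector>

int main() {
  const int N = 10000;
  const double p = 0.3;
  boost::mt19937 rng(1701);

  // Step 1: N zero-mean Gaussian samples; the engine is passed by pointer so
  // its state actually advances.
  std::vector<double> gauss(N);
  boost::variate_generator<boost::mt19937*, boost::normal_distribution<double> >
      gauss_gen(&rng, boost::normal_distribution<double>(0.0, 1.0));
  for (int i = 0; i < N; ++i) gauss[i] = gauss_gen();

  // Step 2: N Bernoulli(p) samples from the same engine, without re-seeding.
  std::vector<int> bern(N);
  boost::variate_generator<boost::mt19937*, boost::bernoulli_distribution<double> >
      bern_gen(&rng, boost::bernoulli_distribution<double>(p));
  for (int i = 0; i < N; ++i) bern[i] = bern_gen() ? 1 : 0;

  // Step 3: element-wise product; expect ~N*(1-p) zeros and the remaining
  // ~N*p split roughly evenly between positive and negative.
  int zeros = 0, pos = 0, neg = 0;
  for (int i = 0; i < N; ++i) {
    const double prod = gauss[i] * bern[i];
    if (prod == 0.0) ++zeros;
    else if (prod > 0.0) ++pos;
    else ++neg;
  }
  std::cout << "zeros " << zeros << " pos " << pos << " neg " << neg << std::endl;
  return 0;
}
```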

If anyone has a suggestion for what the problem might be and/or how to fix it, please comment here (and/or send a PR with the fix if you like). Thanks!

@jeffdonahue added the bug label Apr 7, 2014
@kloudkl
Contributor

kloudkl commented Apr 7, 2014

The distributions are sharing the same random seed Caffe::set_random_seed(1701);.

@kloudkl
Contributor

kloudkl commented Apr 7, 2014

Possible explanations and fixes:

Hellekalek, P.: Good random number generators are (not so) easy to find. Mathematics and Computers in Simulation, 46: 485--505, 1998.

http://stackoverflow.com/questions/12779235/how-to-properly-choose-rng-seed-for-parallel-processes

@jeffdonahue
Contributor Author

The distributions are sharing the same random seed Caffe::set_random_seed(1701);.

Unless I'm missing your point, I don't think this is what's happening -- I only call set_random_seed(1701) once at the beginning of the test. The Gaussian samples are generated from that random seed, followed by the Bernoulli samples from the same stream (caffe_rng(), which returns Caffe's Boost RNG singleton) without re-seeding. By the time the Bernoulli sample generation begins, the state of the RNG has changed because it generated the Gaussian samples, so the behavior should differ from the behavior if you reseed right before generating the Bernoulli samples.

The two samples are not generated from two separate parallel processes; it is a single deterministic process first generating the Gaussian samples and then generating the Bernoulli samples.

@jeffdonahue
Contributor Author

^ The above PR just re-seeds the RNG in the test case before sampling from the Bernoulli. This is not a bug fix, it just makes the test pass -- you should be able to sample from a Gaussian followed by a Bernoulli without having the samples be correlated.

@jeffdonahue
Contributor Author

I was wrong though -- the behavior is the same as re-seeding with 1701 before generating the Bernoulli samples. Anyway, I'm too tired to think about this now.

@kloudkl
Contributor

kloudkl commented Apr 7, 2014

The correlation may have been introduced by combining two uncorrelated sequences. http://www.sitmo.com/article/generating-correlated-random-numbers/

@kloudkl
Contributor

kloudkl commented Apr 7, 2014

True random number generators (TRNGs) may not be achievable without physical simulation [1]. Boost.Random, as a pseudo-random number generator (PRNG), inevitably introduces serial correlation, even between different distributions.

If the bug has no significant practical influence on training or testing the network, then it can be temporarily ignored.

[1] Helmut G. Katzgraber. Random Numbers in Scientific Computing: An Introduction. arXiv:1005.4117 [physics.comp-ph]

@jeffdonahue
Contributor Author

I fixed this in 1dfdce8. The problem was with Boost's variate_generator. The caffe_vRng* functions were written under the assumption that variate_generator would modify the state of the underlying RNG passed into its constructor, but in fact the input RNG is passed to the variate_generator constructor by value, so variate_generator simply makes a copy of it and modifies that copy instead. The somewhat ugly workaround is to use caffe_set_rng to copy the RNG state from variate_generator after it is used.
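A stripped-down illustration of that Boost behavior (plain Boost only, not the Caffe code; the pointer-based generator at the end is just one way to make draws advance the caller's engine, not the caffe_set_rng workaround used in this PR):

```cpp
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/normal_distribution.hpp>
#include <boost/random/variate_generator.hpp>
#include <cassert>

int main() {
  boost::mt19937 rng(1701);
  const boost::mt19937 initial_state = rng;  // snapshot of the seeded state

  // Engine passed by value: variate_generator copies it, so the draws advance
  // only the internal copy and `rng` is left exactly where it started.
  boost::variate_generator<boost::mt19937, boost::normal_distribution<double> >
      by_value(rng, boost::normal_distribution<double>(0.0, 1.0));
  for (int i = 0; i < 100; ++i) by_value();
  assert(rng == initial_state);  // unchanged -- the next user replays the stream

  // Engine passed by pointer: draws advance the caller's engine directly.
  boost::variate_generator<boost::mt19937*, boost::normal_distribution<double> >
      by_pointer(&rng, boost::normal_distribution<double>(0.0, 1.0));
  for (int i = 0; i < 100; ++i) by_pointer();
  assert(rng != initial_state);  // the shared state has moved on

  return 0;
}
```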

@shelhamer
Member

Thanks for the fix Jeff!

The somewhat ugly workaround is to use caffe_set_rng to copy the RNG state from variate_generator after it is used.

I thought about this since we talked this morning and I don't have a better solution, so let's just live with it.

@jeffdonahue
Contributor Author

I'll add such a comment. Also there's at least one test_math_functions failure now -- looking into that.

@jeffdonahue
Contributor Author

@shelhamer added the comment in 51b2694

Re the failing test_math_functions: the only failure was TestGPUHammingDistance, and it would seem the only reason it previously passed was that the RNG wasn't working -- it would GaussianFill blob A and then GaussianFill blob B, but of course these had the same result, so they had Hamming distance = 0. Non-trivial Hamming distances don't seem to be computed correctly. I spent a while trying to debug this but failed, and since this function isn't used anywhere in the codebase (I'm not actually sure why it was added -- why do we need a Hamming distance function?) I removed it (and its test case) in 8e2285b. Happy to see it brought back if someone (e.g. @kloudkl, who added it) wants to fix it.

@shelhamer
Member

Thanks for commenting and investigating the failing math test case. Let's merge this fix once you're done with any last changes.

The fancy Hamming distances were introduced by @kloudkl for their proposed image retrieval demo. #243 should reintroduce a tested and passing version if it's needed.

@jeffdonahue
Contributor Author

I'd like to clean up the added RNG unit tests a bit (they're barely readable to me, and I wrote them). Will probably not happen until tomorrow, but should be good to go after that.

@jeffdonahue changed the title from "Strange bug(?) with Boost RNG" to "Fix bug related to Boost RNG" Apr 8, 2014
@jeffdonahue
Contributor Author

In my latest two commits I cleaned up the RNG unit tests (85974f5) and was less aggressive in my removal of caffe_gpu_hamming_distance; I marked it NOT_IMPLEMENTED with a TODO to fix and prefixed its unit test with DISABLED_ (gtest skips tests with this prefix unless one passes the flag --gtest_also_run_disabled_tests to the binary). Please merge if all looks good.
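For anyone unfamiliar with the gtest convention, a disabled test just carries the prefix in its name -- something like the following (illustrative only, not the actual Caffe test body):

```cpp
#include <gtest/gtest.h>

// gtest registers this test but skips it by default; run the binary with
// --gtest_also_run_disabled_tests to execute it anyway.
TEST(MathFunctionsTest, DISABLED_GPUHammingDistance) {
  // Placeholder body: the real test exercises caffe_gpu_hamming_distance,
  // which is currently marked NOT_IMPLEMENTED.
}
```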

@jeffdonahue
Contributor Author

OK, done for real now... Since my last comment I've:

- changed the RNG function names from, e.g., caffe_vRngGaussian to caffe_rng_gaussian for consistency with the other math function names (a sketch of the resulting interface follows this list);

- changed the RNG function argument ordering so the pointer to be filled by the RNG is last (per the Google C++ style guidelines and the other math functions);

- added analogous GPU RNG functions for Gaussian and Uniform (I thought about also adding Bernoulli, but curand doesn't have a native function for it), and changed all current uses of curandGenerate in the code to call these functions instead;

- added unit tests for the GPU RNG functions;

- confirmed that adding Caffe::set_random_seed(this->seed_); before the second RNG call in any unit test which calls RNG functions twice (to check that the results are not correlated -- any test with Times in its name) makes the test fail;

- confirmed that all tests pass.
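Roughly, the renamed CPU interface now reads like this (a sketch reconstructed from the description above -- treat the exact parameter names as assumptions rather than the committed signatures):

```cpp
// Names follow caffe_rng_* instead of caffe_vRng*, and the output pointer
// comes last, matching the other math functions.
template <typename Dtype>
void caffe_rng_gaussian(const int n, const Dtype mu, const Dtype sigma,
                        Dtype* r);

template <typename Dtype>
void caffe_rng_uniform(const int n, const Dtype a, const Dtype b, Dtype* r);

template <typename Dtype>
void caffe_rng_bernoulli(const int n, const Dtype p, int* r);
```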

@shelhamer
Member

Nice to have this figured out and the rng interface polished. Thanks Jeff.

shelhamer added a commit that referenced this pull request Apr 9, 2014
Fix persistence of random state with Boost RNG
@shelhamer merged commit 3e3980f into BVLC:dev Apr 9, 2014
@jeffdonahue deleted the rng-bug branch April 9, 2014 18:13
@sguada
Contributor

sguada commented Apr 15, 2014

@jeffdonahue I'm getting random core dumps during filler initialization after updating dev on durian2, using ATLAS as the BLAS.

@jeffdonahue
Contributor Author

@sguada I haven't seen that behavior - can you give any more information? Does it stop happening when you checkout the commit before this PR?

@shelhamer
Member

@sguada Did you do make clean to be certain you're building and linking against ATLAS instead of a partial MKL build? Did you update your Makefile.config (since only Makefile.config.example is updated by versioning)? Have you checked git status for local changes?

Note the Makefile.config format changed slightly in ed38827 to support more BLAS choices.

shelhamer added a commit that referenced this pull request Apr 19, 2014
@shelhamer mentioned this pull request Apr 21, 2014
mitmul pushed a commit to mitmul/caffe that referenced this pull request Sep 30, 2014
Fix persistence of random state with Boost RNG