Does ebloom always fail? #774
Comments
Note: it has not been proven that it is the ebloom part that is failing here; it is just that the narrative of the symptoms fits the explanation neatly.
I upped the keycount in the repl_aae test in riak_test to 20k, and added some logging to …
Interestingly, the “random” number is always … I then changed the code so that …
Although the test is for …
I was also thinking about this line: if at some stage something goes wrong with the reference that wraps the NIF'd ebloom, might this become a badarg?
Weirdly, the test still passes (with a deliberate error on …).
Created a more manual test with basho_bench loading a cluster, and then running full-sync. This had no issues either, so this doesn't appear to be an ebloom issue. Trying to recreate in a production environment now with extra logging.
After some further investigation, the problem appears to be at https://github.com/basho/riak_repl/blob/develop/src/riak_repl_aae_source.erl#L580, where the Bucket passed in may be a typed-bucket 2-tuple rather than a binary. The proposed fix is to make typed buckets into a binary before bloom insertion, and also before the bloom membership check later in the code.
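For illustration, a minimal shell session, assuming basho/ebloom's documented API (ebloom:new/3, ebloom:insert/2, ebloom:contains/2); the NIF accepts only binaries, so a typed-bucket 2-tuple raises badarg:

```erlang
%% Illustrative Erlang shell session (hypothetical Ref value shown).
1> {ok, Bloom} = ebloom:new(20000, 0.01, 42).  % ~20k elements, 1% false positives
{ok,#Ref<0.1.2.3>}
2> ebloom:insert(Bloom, <<"plain-bucket">>).   % a binary: accepted
ok
3> ebloom:insert(Bloom, {<<"type">>, <<"bucket">>}).  % typed bucket 2-tuple
** exception error: bad argument
     in function  ebloom:insert/2
        called as ebloom:insert(#Ref<0.1.2.3>,{<<"type">>,<<"bucket">>})
```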
As per issue #774, in some instances the Bucket given to ebloom was a 2-tuple (typed bucket) rather than a binary, and so a `badarg` was thrown. Though this does not stop repl from working, it leads to crashes and retries, and pollutes the log. This commit fixes the problem by ensuring a binary is passed to ebloom:insert/2 and ebloom:contains/2.
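A sketch of the shape of that fix; the helper name bucket_to_binary/1 and the concatenation encoding are hypothetical here, not necessarily what the commit itself does:

```erlang
%% Hypothetical helper: flatten a typed bucket {Type, Bucket} into a single
%% binary before it reaches the ebloom NIF; plain binary buckets pass through.
bucket_to_binary({Type, Bucket}) when is_binary(Type), is_binary(Bucket) ->
    <<Type/binary, Bucket/binary>>;
bucket_to_binary(Bucket) when is_binary(Bucket) ->
    Bucket.

%% Applied at both call sites:
%%   ok   = ebloom:insert(Bloom, bucket_to_binary(Bucket)),
%%   Seen = ebloom:contains(Bloom, bucket_to_binary(Bucket)).
```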
As noted in #773, the repl code for AAE full-sync will switch to using a bloom filter once it calculates that more than 3% of the keys have been repaired.
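As a rough sketch of that switch-over decision (the function name and shape are illustrative, not the actual riak_repl code):

```erlang
%% Hypothetical sketch of the 3% threshold described above.
-define(BLOOM_THRESHOLD, 0.03).

should_switch_to_bloom(RepairedKeys, TotalKeys) when TotalKeys > 0 ->
    RepairedKeys / TotalKeys > ?BLOOM_THRESHOLD.
```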
However, running in production we keep seeing these errors:
The actual cause of the errors has not been proven, but the timing indicates it may be aligned with the switch to using the bloom. The crash appears to prompt a process restart, at which point the sync starts from the beginning, and a new 3% of the difference is repaired until eventually the threshold is reached and it crashes again.
The restarting process may eventually complete the synchronisation. However, if the restarts are frequent enough and the delta is big enough, instead you can get:
So the beam on the node crashes because the restart intensity limit has been breached. This then stops Riak on that node.
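For context, restart intensity in OTP is the supervisor's {MaxRestarts, MaxSeconds} pair; a minimal sketch with hypothetical values:

```erlang
%% If more than `intensity` restarts occur within `period` seconds, the
%% supervisor itself terminates, and the failure can cascade up until the
%% whole node stops.
init([]) ->
    SupFlags = #{strategy  => one_for_one,
                 intensity => 10,    % MaxRestarts (hypothetical)
                 period    => 10},   % MaxSeconds (hypothetical)
    {ok, {SupFlags, []}}.
```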