
Does ebloom always fail? #774

Closed · martinsumner opened this issue Dec 4, 2017 · 8 comments

Comments

@martinsumner (Contributor)

As noted in #773 the repl code for AAE full-sync will switch to using a bloom filter once it calculates that more than 3% of the keys have been repaired.
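
For context, a minimal sketch of that switch-over as I understand it (the module name, function names and the way the threshold is applied are illustrative assumptions, not the actual riak_repl code):

    -module(bloom_switch_sketch).
    -export([maybe_switch_to_bloom/2]).

    %% Illustrative only: once the repaired-key count exceeds 3% of the
    %% estimated total, later differences are accumulated in an ebloom
    %% filter instead of being repaired inline.
    maybe_switch_to_bloom(Repaired, EstimatedTotal)
      when Repaired > EstimatedTotal * 0.03 ->
        %% ebloom:new(PredictedElementCount, FalsePositiveRate, Seed)
        {ok, Bloom} = ebloom:new(EstimatedTotal, 0.01, 444),
        {bloom, Bloom};
    maybe_switch_to_bloom(_Repaired, _EstimatedTotal) ->
        inline.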

However, running in production we keep seeing these errors:

2017-11-19 02:42:00.570 [info] <0.2803.0>@riak_repl2_fscoordinator:handle_info:437 fssource <18828.26051.2517> exited abnormally: {{badarg,[{riak_repl_aae_source,maybe_accumulate_key,4,[{file,"src/riak_repl_aae_source.erl"},{line,579}]},{lists,foldl,3,[{file,"lists.erl"},{line,1248}]},{riak_kv_index_hashtree,'-do_compare/6-fun-2-',5,[{file,"src/riak_kv_index_hashtree.erl"},{line,964}]}]},{gen_server,call,[<18828.28077.2342>,{compare,{1392993748081016843912887106182707253109560705024,3},#Fun<riak_repl_aae_source.38.18255153>,#Fun<riak_repl_aae_source.39.18255153>,{exchange,inline,undefined,undefined,275878,0}},infinity]}}

The actual cause of the errors has not been proven, but the timing of the error suggests it may be aligned with the switch to using the bloom. This crash appears to prompt a process restart, at which point the sync starts from the beginning again and another 3% of the difference is repaired, until eventually the threshold is reached and it crashes again.

The restarting process may eventually complete the synchronisation. However, if the restarts are frequent enough and the delta is big enough, you can instead get:

2017-11-19 02:47:30 =SUPERVISOR REPORT====
     Supervisor: {local,riak_repl2_fssource_sup}
     Reason:     {{badarg,[{riak_repl_aae_source,maybe_accumulate_key,4,[{file,"src/riak_repl_aae_source.erl"},{line,579}]},{lists,foldl,3,[{file,"lists.erl"},{line,1248}]},{riak_kv_index_hashtree,'-do_compare/6-fun-2-',5,[{file,"src/riak_kv_index_hashtree.erl"},{line,964}]}]},{gen_server,call,[<0.24511.1293>,{compare,{616571003248974668617179538802181898917346541568,3},#Fun<riak_repl_aae_source.38.18255153>,#Fun<riak_repl_aae_source.39.18255153>,{exchange,inline,undefined,undefined,278720,42747}},infinity]}}
     Offender:   [{pid,<0.26073.2520>},{name,622279994019798508141412682679979879462877528064},{mfargs,{riak_repl2_fssource,start_link,undefined}},{restart_type,temporary},{shutdown,5000},{child_type,worker}]
2017-11-19 02:47:30.509 [error] emulator Error in process <0.26923.2523> on node 'riak@192.168.88.15' with exit value: {badarg,[{riak_repl_aae_source,maybe_accumulate_key,4,[{file,"src/riak_repl_aae_source.erl"},{line,579}]},{lists,foldl,3,[{file,"lists.erl"},{line,1248}]},{riak_kv_index_hashtree,'-do_compare/6-fun-2-',5,[{file,"src/riak_kv_... 
2017-11-19 02:47:30.509 [error] emulator Error in process <0.26923.2523> on node 'riak@192.168.88.15' with exit value: {badarg,[{riak_repl_aae_source,maybe_accumulate_key,4,[{file,"src/riak_repl_aae_source.erl"},{line,579}]},{lists,foldl,3,[{file,"lists.erl"},{line,1248}]},{riak_kv_index_hashtree,'-do_compare/6-fun-2-',5,[{file,"src/riak_kv_... 
2017-11-19 02:47:30.509 [info] <0.23398.2520>@riak_repl_aae_source:handle_info:127 Monitored pid <0.24511.1293>, AAE Hashtree process went down because: {badarg,[{riak_repl_aae_source,maybe_accumulate_key,4,[{file,"src/riak_repl_aae_source.erl"},{line,579}]},{lists,foldl,3,[{file,"lists.erl"},{line,1248}]},{riak_kv_index_hashtree,'-do_compare/6-fun-2-',5,[{file,"src/riak_kv_index_hashtree.erl"},{line,964}]}]}
2017-11-19 02:47:30.509 [error] <0.23398.2520> gen_fsm <0.23398.2520> in state compute_differences terminated with reason: no match of right hand value {error,closed} in riak_repl_aae_source:send_asynchronous_msg/2 line 723
2017-11-19 02:47:30.509 [info] <0.2803.0>@riak_repl2_fscoordinator:handle_info:437 fssource <0.26073.2520> exited abnormally: {{badarg,[{riak_repl_aae_source,maybe_accumulate_key,4,[{file,"src/riak_repl_aae_source.erl"},{line,579}]},{lists,foldl,3,[{file,"lists.erl"},{line,1248}]},{riak_kv_index_hashtree,'-do_compare/6-fun-2-',5,[{file,"src/riak_kv_index_hashtree.erl"},{line,964}]}]},{gen_server,call,[<0.24511.1293>,{compare,{616571003248974668617179538802181898917346541568,3},#Fun<riak_repl_aae_source.38.18255153>,#Fun<riak_repl_aae_source.39.18255153>,{exchange,inline,undefined,undefined,278720,42747}},infinity]}}
2017-11-19 02:47:30.509 [error] <0.23398.2520> gen_fsm <0.23398.2520> in state compute_differences terminated with reason: no match of right hand value {error,closed} in riak_repl_aae_source:send_asynchronous_msg/2 line 723

Also in the logs at 02:47:32 (1 second resolution timestamp):
[os_mon] memory supervisor port (memsup): Erlang has closed
[os_mon] cpu supervisor port (cpu_sup): Erlang has closed

So the beam on the node has crashed because the supervisor's restart intensity has been breached. This then stops Riak on that node.
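
For reference, a minimal sketch of the OTP restart-intensity mechanism at work here (the module and child spec are illustrative, not the riak_repl supervision tree):

    -module(intensity_sketch).
    -behaviour(supervisor).
    -export([start_link/0, init/1]).

    start_link() ->
        supervisor:start_link({local, ?MODULE}, ?MODULE, []).

    init([]) ->
        %% Allow at most MaxR restarts in any MaxT-second window. If
        %% children crash faster than that, the supervisor itself
        %% terminates; when that propagates up to a permanent
        %% application's top supervisor, the whole node goes down, as in
        %% the os_mon shutdown lines above.
        MaxR = 10, MaxT = 10,
        Child = {fssource_like_worker, {some_worker, start_link, []},
                 temporary, 5000, worker, [some_worker]},
        {ok, {{one_for_one, MaxR, MaxT}, [Child]}}.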

@martinsumner (Contributor, Author)

Note that it has not been proven that it is the ebloom part that is failing here; it is just that the narrative of symptoms fits the explanation neatly.

@russelldb (Member) commented Dec 6, 2017

I upped the keycount in the repl_aae test in riak_test to 20k, and added some logging to riak_repl_aae_source:maybe_create_bloom/2 and I got many lines like:

@riak_repl_aae_source:maybe_create_bloom:531 creating ebloom ebloom:new(10000, 0.01, 444)

Interestingly, the “random” number is always 444; presumably something to do with the default initial state of the random module. (I wonder if that matters?)
Anyway, we can see clearly that ebloom:new/3 does get called, and does not always crash. I think that to get a badarg there, the first argument would have to be a non-integer (even a float would do it). Looking through the code I can’t figure out how that would be anything but an integer, but there may be some path through the code where it is undefined (at a guess).
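
To illustrate both points, a minimal sketch, assuming basho/ebloom's new/3, insert/2 and contains/2 API; that a float first argument raises badarg is my assumption about how the NIF checks its arguments:

    -module(bloom_badarg_sketch).
    -export([demo/0]).

    demo() ->
        %% random:uniform/1 without a prior random:seed/3 call uses the
        %% module's fixed default seed, so a fresh process always draws
        %% the same first value; hence the constant 444 in the log above.
        Seed = random:uniform(1000000),
        {ok, B} = ebloom:new(10000, 0.01, Seed),
        ok = ebloom:insert(B, <<"key">>),
        true = ebloom:contains(B, <<"key">>),
        %% An ill-typed first argument (e.g. a float) should surface as
        %% badarg from the NIF rather than a function_clause:
        catch ebloom:new(10000.0, 0.01, Seed).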

I then changed the code so that ebloom:new/3 always raises a badarg, and the error is different from the one in your initial issue. You were right: the badarg isn't from that point in the code, but from elsewhere. It needs more digging. Sorry for the red herring.

@russelldb (Member)

Note, though, that the test is for repl_aae_fullsync as opposed to repl2. I don't know if that is relevant. I'll look into a repl2 AAE fullsync test (there isn't one named as such in the riak_test repo).

@martinsumner (Contributor, Author)

I was also thinking about this line:
https://github.com/basho/riak_repl/blob/develop/src/riak_repl_aae_source.erl#L530

If at some stage something goes wrong with the reference that wraps the NIF'd ebloom, might this become a badarg?
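
A small sketch of that hypothesis (check_bloom is a hypothetical helper; the assumption is that the ebloom NIF rejects any term that is not the resource it handed out):

    -module(bloom_ref_sketch).
    -export([check_bloom/2]).

    %% If the term passed as the bloom "reference" is no longer the NIF
    %% resource returned by ebloom:new/3 (e.g. it has been clobbered to
    %% undefined somewhere), the NIF call surfaces as badarg:
    check_bloom(MaybeBloom, Key) ->
        case catch ebloom:contains(MaybeBloom, Key) of
            {'EXIT', {badarg, _}} -> bad_reference;
            Bool when is_boolean(Bool) -> Bool
        end.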

@russelldb (Member) commented Dec 6, 2017

Weirdly, the test still passes (with the deliberate error on ebloom:new/3).

@martinsumner (Contributor, Author)

Created a more manual test with basho_bench loading a cluster, and then running full-sync. This had no issues either, so this doesn't appear to be an ebloom issue.

Trying to recreate in production environment now with extra logging.

@russelldb (Member)

After some further investigation, the badarg is coming from the ebloom insertion line:

https://github.com/basho/riak_repl/blob/develop/src/riak_repl_aae_source.erl#L580

The Bucket may be a two-tuple (a typed bucket) and not a binary. The badarg comes from the construction <<Bucket/binary, Key/binary>> with that Bucket being something like {<<"bucket-type">>, <<"bucket">>}.

The proposed fix is to convert typed buckets to a binary before bloom insertion, and also before the bloom membership check later in the code.
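
A minimal sketch of that fix (bucket_to_bin and bloom_insert are illustrative names, not the actual patch):

    %% Flatten a (possibly typed) bucket to a binary before building the
    %% bloom key, so <<B/binary, Key/binary>> cannot badarg on a tuple.
    bucket_to_bin({Type, Bucket}) when is_binary(Type), is_binary(Bucket) ->
        <<Type/binary, Bucket/binary>>;
    bucket_to_bin(Bucket) when is_binary(Bucket) ->
        Bucket.

    bloom_insert(Bloom, Bucket, Key) ->
        BucketBin = bucket_to_bin(Bucket),
        ebloom:insert(Bloom, <<BucketBin/binary, Key/binary>>).

The same conversion would be applied before ebloom:contains/2 when checking membership later in the sync.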

russelldb added a commit that referenced this issue Mar 1, 2018
As per issue #774, in some instances the Bucket given to ebloom was a 2-tuple (typed bucket), not a binary, and so a `badarg` was thrown. Though this does not stop repl from working, it leads to crashes and retries, is broken, and pollutes the log. This commit fixes this by ensuring a binary is passed to ebloom:insert/2 and ebloom:contains/2.