
Does ebloom always fail? #774

Closed · martinsumner opened this issue Dec 4, 2017 · 8 comments

Comments

@martinsumner (Contributor)

As noted in #773 the repl code for AAE full-sync will switch to using a bloom filter once it calculates that more than 3% of the keys have been repaired.
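
For context, a minimal sketch of that switch-over as I understand it (the module name, function names and the way the threshold is applied are illustrative assumptions, not the actual riak_repl code):

    -module(bloom_switch_sketch).
    -export([maybe_switch_to_bloom/2]).

    %% Illustrative only: once the repaired-key count exceeds 3% of the
    %% estimated total, later differences are accumulated in an ebloom
    %% filter instead of being repaired inline.
    maybe_switch_to_bloom(Repaired, EstimatedTotal)
      when Repaired > EstimatedTotal * 0.03 ->
        %% ebloom:new(PredictedElementCount, FalsePositiveRate, Seed)
        {ok, Bloom} = ebloom:new(EstimatedTotal, 0.01, 444),
        {bloom, Bloom};
    maybe_switch_to_bloom(_Repaired, _EstimatedTotal) ->
        inline.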

However, running in production we keep seeing these errors:

2017-11-19 02:42:00.570 [info] <0.2803.0>@riak_repl2_fscoordinator:handle_info:437 fssource <18828.26051.2517> exited abnormally: {{badarg,[{riak_repl_aae_source,maybe_accumulate_key,4,[{file,"src/riak_repl_aae_source.erl"},{line,579}]},{lists,foldl,3,[{file,"lists.erl"},{line,1248}]},{riak_kv_index_hashtree,'-do_compare/6-fun-2-',5,[{file,"src/riak_kv_index_hashtree.erl"},{line,964}]}]},{gen_server,call,[<18828.28077.2342>,{compare,{1392993748081016843912887106182707253109560705024,3},#Fun<riak_repl_aae_source.38.18255153>,#Fun<riak_repl_aae_source.39.18255153>,{exchange,inline,undefined,undefined,275878,0}},infinity]}}

The actual cause of the errors has not been proven, but the timing of the error suggests it may be aligned with the switch to using the bloom. This crash appears to prompt a process restart, at which point the sync starts from the beginning again and another 3% of the difference is repaired, until eventually the threshold is reached and it crashes again.

The restarting process may eventually complete the synchronisation. However, if the restarts are frequent enough and the delta is big enough, you can instead get:

2017-11-19 02:47:30 =SUPERVISOR REPORT====
     Supervisor: {local,riak_repl2_fssource_sup}
     Reason:     {{badarg,[{riak_repl_aae_source,maybe_accumulate_key,4,[{file,"src/riak_repl_aae_source.erl"},{line,579}]},{lists,foldl,3,[{file,"lists.erl"},{line,1248}]},{riak_kv_index_hashtree,'-do_compare/6-fun-2-',5,[{file,"src/riak_kv_index_hashtree.erl"},{line,964}]}]},{gen_server,call,[<0.24511.1293>,{compare,{616571003248974668617179538802181898917346541568,3},#Fun<riak_repl_aae_source.38.18255153>,#Fun<riak_repl_aae_source.39.18255153>,{exchange,inline,undefined,undefined,278720,42747}},infinity]}}
     Offender:   [{pid,<0.26073.2520>},{name,622279994019798508141412682679979879462877528064},{mfargs,{riak_repl2_fssource,start_link,undefined}},{restart_type,temporary},{shutdown,5000},{child_type,worker}]
2017-11-19 02:47:30.509 [error] emulator Error in process <0.26923.2523> on node 'riak@192.168.88.15' with exit value: {badarg,[{riak_repl_aae_source,maybe_accumulate_key,4,[{file,"src/riak_repl_aae_source.erl"},{line,579}]},{lists,foldl,3,[{file,"lists.erl"},{line,1248}]},{riak_kv_index_hashtree,'-do_compare/6-fun-2-',5,[{file,"src/riak_kv_... 
2017-11-19 02:47:30.509 [error] emulator Error in process <0.26923.2523> on node 'riak@192.168.88.15' with exit value: {badarg,[{riak_repl_aae_source,maybe_accumulate_key,4,[{file,"src/riak_repl_aae_source.erl"},{line,579}]},{lists,foldl,3,[{file,"lists.erl"},{line,1248}]},{riak_kv_index_hashtree,'-do_compare/6-fun-2-',5,[{file,"src/riak_kv_... 
2017-11-19 02:47:30.509 [info] <0.23398.2520>@riak_repl_aae_source:handle_info:127 Monitored pid <0.24511.1293>, AAE Hashtree process went down because: {badarg,[{riak_repl_aae_source,maybe_accumulate_key,4,[{file,"src/riak_repl_aae_source.erl"},{line,579}]},{lists,foldl,3,[{file,"lists.erl"},{line,1248}]},{riak_kv_index_hashtree,'-do_compare/6-fun-2-',5,[{file,"src/riak_kv_index_hashtree.erl"},{line,964}]}]}
2017-11-19 02:47:30.509 [error] <0.23398.2520> gen_fsm <0.23398.2520> in state compute_differences terminated with reason: no match of right hand value {error,closed} in riak_repl_aae_source:send_asynchronous_msg/2 line 723
2017-11-19 02:47:30.509 [info] <0.2803.0>@riak_repl2_fscoordinator:handle_info:437 fssource <0.26073.2520> exited abnormally: {{badarg,[{riak_repl_aae_source,maybe_accumulate_key,4,[{file,"src/riak_repl_aae_source.erl"},{line,579}]},{lists,foldl,3,[{file,"lists.erl"},{line,1248}]},{riak_kv_index_hashtree,'-do_compare/6-fun-2-',5,[{file,"src/riak_kv_index_hashtree.erl"},{line,964}]}]},{gen_server,call,[<0.24511.1293>,{compare,{616571003248974668617179538802181898917346541568,3},#Fun<riak_repl_aae_source.38.18255153>,#Fun<riak_repl_aae_source.39.18255153>,{exchange,inline,undefined,undefined,278720,42747}},infinity]}}
2017-11-19 02:47:30.509 [error] <0.23398.2520> gen_fsm <0.23398.2520> in state compute_differences terminated with reason: no match of right hand value {error,closed} in riak_repl_aae_source:send_asynchronous_msg/2 line 723

Also in the logs at 02:47:32 (1 second resolution timestamp):
[os_mon] memory supervisor port (memsup): Erlang has closed
[os_mon] cpu supervisor port (cpu_sup): Erlang has closed

So the beam on the node has crashed because the supervisor's restart intensity has been breached. This then stops Riak on that node.
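
For reference, a minimal sketch of the OTP restart-intensity mechanism at work here (the module and child spec are illustrative, not the riak_repl supervision tree):

    -module(intensity_sketch).
    -behaviour(supervisor).
    -export([start_link/0, init/1]).

    start_link() ->
        supervisor:start_link({local, ?MODULE}, ?MODULE, []).

    init([]) ->
        %% Allow at most MaxR restarts in any MaxT-second window. If
        %% children crash faster than that, the supervisor itself
        %% terminates; when that propagates up to a permanent
        %% application's top supervisor, the whole node goes down, as in
        %% the os_mon shutdown lines above.
        MaxR = 10, MaxT = 10,
        Child = {fssource_like_worker, {some_worker, start_link, []},
                 temporary, 5000, worker, [some_worker]},
        {ok, {{one_for_one, MaxR, MaxT}, [Child]}}.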

@martinsumner (Contributor, Author)

Note that it has not been proven that it is the ebloom part that is failing here; it is just that the narrative of symptoms fits the explanation neatly.

@russelldb (Member) commented Dec 6, 2017

I upped the keycount in the repl_aae test in riak_test to 20k, and added some logging to riak_repl_aae_source:maybe_create_bloom/2 and I got many lines like:

@riak_repl_aae_source:maybe_create_bloom:531 creating ebloom ebloom:new(10000, 0.01, 444)

Interestingly, the “random” number is always 444; presumably something to do with the default initial state of the random module. (I wonder if that matters?)
Anyway, we can see clearly that ebloom:new/3 does get called, and does not always crash. I think that to get a badarg there, the first argument would have to be a non-integer (even a float would do it). Looking through the code I can’t figure out how that would be anything but an integer, but there may be some path through the code where it is undefined (at a guess).
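
To illustrate both points, a minimal sketch, assuming basho/ebloom's new/3, insert/2 and contains/2 API; that a float first argument raises badarg is my assumption about how the NIF checks its arguments:

    -module(bloom_badarg_sketch).
    -export([demo/0]).

    demo() ->
        %% random:uniform/1 without a prior random:seed/3 call uses the
        %% module's fixed default seed, so a fresh process always draws
        %% the same first value; hence the constant 444 in the log above.
        Seed = random:uniform(1000000),
        {ok, B} = ebloom:new(10000, 0.01, Seed),
        ok = ebloom:insert(B, <<"key">>),
        true = ebloom:contains(B, <<"key">>),
        %% An ill-typed first argument (e.g. a float) should surface as
        %% badarg from the NIF rather than a function_clause:
        catch ebloom:new(10000.0, 0.01, Seed).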

I then changed the code so that ebloom:new/3 always raises a badarg, and the error is different from the one in your initial issue. You were right: the badarg isn't from that point in the code, but from elsewhere. It needs more digging. Sorry for the red herring.

@russelldb (Member)

Note, though, that the test is for repl_aae_fullsync as opposed to repl2. I don't know if that is relevant. I'll look into a repl2 AAE fullsync test (there isn't one named as such in the riak_test repo).

@martinsumner (Contributor, Author)

I was also thinking about this line:
https://github.com/basho/riak_repl/blob/develop/src/riak_repl_aae_source.erl#L530

If at some stage something goes wrong with the reference that wraps the NIF'd ebloom, might this become a badarg?
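
A small sketch of that hypothesis (check_bloom is a hypothetical helper; the assumption is that the ebloom NIF rejects any term that is not the resource it handed out):

    -module(bloom_ref_sketch).
    -export([check_bloom/2]).

    %% If the term passed as the bloom "reference" is no longer the NIF
    %% resource returned by ebloom:new/3 (e.g. it has been clobbered to
    %% undefined somewhere), the NIF call surfaces as badarg:
    check_bloom(MaybeBloom, Key) ->
        case catch ebloom:contains(MaybeBloom, Key) of
            {'EXIT', {badarg, _}} -> bad_reference;
            Bool when is_boolean(Bool) -> Bool
        end.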

@russelldb (Member) commented Dec 6, 2017

Weirdly, the test still passes (with the deliberate error on ebloom:new/3).

@martinsumner (Contributor, Author)

Created a more manual test with basho_bench loading a cluster, and then running full-sync. This had no issues either, so this doesn't appear to be an ebloom issue.

Trying to recreate in production environment now with extra logging.

@russelldb (Member)

After some further investigation, the badarg is coming from the ebloom insertion line:

https://github.com/basho/riak_repl/blob/develop/src/riak_repl_aae_source.erl#L580

The Bucket may be a two-tuple (a typed bucket) and not a binary. The badarg comes from the construction <<Bucket/binary, Key/binary>> with that Bucket being something like {<<"bucket-type">>, <<"bucket">>}.

The proposed fix is to convert typed buckets to a binary before bloom insertion, and also before the bloom membership check later in the code.
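
A minimal sketch of that fix (bucket_to_bin and bloom_insert are illustrative names, not the actual patch):

    %% Flatten a (possibly typed) bucket to a binary before building the
    %% bloom key, so <<B/binary, Key/binary>> cannot badarg on a tuple.
    bucket_to_bin({Type, Bucket}) when is_binary(Type), is_binary(Bucket) ->
        <<Type/binary, Bucket/binary>>;
    bucket_to_bin(Bucket) when is_binary(Bucket) ->
        Bucket.

    bloom_insert(Bloom, Bucket, Key) ->
        BucketBin = bucket_to_bin(Bucket),
        ebloom:insert(Bloom, <<BucketBin/binary, Key/binary>>).

The same conversion would be applied before ebloom:contains/2 when checking membership later in the sync.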

russelldb added a commit that referenced this issue Mar 1, 2018
As per issue #774, in some instances the Bucket given to ebloom was a 2-tuple (typed bucket), not a binary, and so a `badarg` was thrown. Though this does not stop repl from working, it leads to crashes and retries, is broken, and pollutes the log. This commit fixes this by ensuring a binary is passed to ebloom:insert/2 and ebloom:contains/2.