Put merge failure and crash of nodes allow_mult=false #1707
Comments
Reading through the code, it is hard to see how an object with content could end up as a merged object without content (or with an empty contents list). The only avenue appears to be: https://github.com/basho/riak_kv/blob/riak_kv-2.9.0p4/src/riak_object.erl#L435-L436
Could it be that prune_object_siblings could do a mutual prune in some bizarre circumstances? Would this be more likely if we get a fake object as part of the put_merge? In other words, could we have a fake object whose clock prunes all the updates in the new object, but which itself has empty contents: https://github.com/basho/riak_kv/blob/riak_kv-2.9.0p4/src/riak_kv_vnode.erl#L2371-L2374
This has changed in 2.9.0, but only to make the circumstances in which we get a fake object stricter ... see
However, this path was almost never trodden before: prior to the introduction of HEAD requests and leveled, the fake object path depended on the existence of the object in the metadata cache, and the metadata cache was disabled by default. Perhaps there is a hidden danger in the use of a fake object in PUT merge that has not yet been encountered.
Further investigation required - but it should be noted that the object being PUT has a vector clock where two of the entries have the same timestamp:
This appears to be true for each example. The dot of the PUT request singleton content is:
The object in the vnode store that was fetched could have a vclock of
When pruning the changes, though, the inbound change would be pruned because it is in the vclock of the "old" object, and we would end up with an empty list of contents.
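To make the suspected pruning outcome concrete, here is a minimal Python model of dot-based sibling pruning. It is only a sketch under stated assumptions: it is not the riak_object code, the names `seen` and `merge_contents` are hypothetical, clocks are reduced to actor-to-counter maps, and timestamps are ignored entirely.

```python
# Simplified, hypothetical model of dot-based sibling pruning.
# Not the riak_object implementation; timestamps are ignored.

def seen(clock, dot):
    """True if the clock already covers the event identified by dot."""
    actor, counter = dot
    return clock.get(actor, 0) >= counter


def merge_contents(new_contents, new_clock, old_contents, old_clock):
    """Merge two sibling sets, each given as a dict of dot -> value.

    A sibling survives if both sides hold it, or if the other side's
    clock has not yet seen its dot.
    """
    kept = {}
    for dot, value in new_contents.items():
        if dot in old_contents or not seen(old_clock, dot):
            kept[dot] = value
    for dot, value in old_contents.items():
        if dot in new_contents or not seen(new_clock, dot):
            kept[dot] = value
    return kept


# Identical clocks, but each side kept a different dot: both are pruned.
clock = {"a": 3, "b": 2}
inbound = {("a", 3): b"value-from-put"}
stored = {("b", 2): b"value-in-store"}
merged = merge_contents(inbound, clock, stored, clock)
print(merged)  # {}
```

In this model an empty result needs both sides to hold different dots under clocks that each cover the other side's dot, which is the mutual prune being speculated about; whether the real code can reach that state is exactly the open question.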
It should be noted that in this case the behaviour is ugly but safe - the vnode crashes before it does anything wrong. However, in other cases, might it not crash but prune a sibling instead?
If the theory is correct, the fix is relatively simple. Currently, when the PUT is being coordinated, no merge is performed if the inbound vclock dominates the old object. However, for non-coordinated PUTs the merge is still performed if the vclock dominates, which presents an issue if the object is fake. Should the merge not just be bypassed in the dominating case for non-coordinated PUTs as well, by changing the logic in
May need to better identify it as a fake object to make this easier (and in particular to not confuse a tombstone with a fake object).
This is a super old and mysterious bug. I remember spending a few hours poring over customer logs to try to figure out the object and how it got like that. I suspect it is where the bucket props have changed from lww=true to lww=false, or maybe allow_mult true -> false. It's a long time ago, and I don't remember well, but IIRC you have two identical clocks, but on one side DOT-A is kept, and on the other DOT-B is kept, and the result of merging is an empty list. I assumed it was a non-deterministic bug in how the "latest" timestamp is chosen for picking a value, but I was never able to repro it, so I couldn't fix it. I'll try to find the existing issue for you.
Thanks @russelldb, anything you can dig up would be really useful. As I was walking through the code last night, though, I couldn't resolve how
The issue we found here involves identical timestamps in the clock and I don't think this is a problem with the
Some good news and some bad news.
Good news: my eyeballing of the code was wrong, the fake_object is not passed into riak_object:merge. This means that this is not a general data-loss issue with fake objects, as I had feared. The code uses the riak_kv_vnode syntactic_merge, which will not merge contents if one vclock dominates another - and fake objects are only used when domination is an issue. There is no need to panic that 2.9.0 has a fundamental bug because of the activation of the fake_object code path. This has been confirmed through testing.
Bad news: I now don't know what the cause of the discovered issue is. It looks like this may well be a long-lost issue, as suggested by @russelldb, and not specific to this release or the changes in it.
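The "no content merge when one vclock dominates" behaviour can be sketched roughly as follows. This is a Python model under assumptions, not the riak_kv_vnode code: `descends` and this `syntactic_merge` signature are hypothetical, objects are plain `(clock, contents)` tuples, and returning the inbound object when the clocks are equal is an arbitrary choice made for the sketch.

```python
# Hypothetical model: skip the content merge whenever one clock dominates.
# Objects are (clock, contents) tuples; clocks map actor -> counter.

def descends(a, b):
    """True if clock a has seen every event recorded in clock b."""
    return all(a.get(actor, 0) >= n for actor, n in b.items())


def syntactic_merge(current, updated, merge):
    """Return the dominating object untouched; merge only when concurrent."""
    cur_clock, _ = current
    upd_clock, _ = updated
    if descends(upd_clock, cur_clock):
        return updated   # inbound dominates (or equals): contents not merged
    if descends(cur_clock, upd_clock):
        return current   # stored object dominates: inbound is ignored
    return merge(current, updated)  # concurrent clocks: siblings must merge


# A fake object (empty contents) that is dominated never reaches merge().
fake = ({"a": 2}, [])
inbound = ({"a": 3}, [b"new-value"])
result = syntactic_merge(fake, inbound, merge=lambda c, u: None)
print(result == inbound)  # True
```

Under this model the empty contents of a fake object cannot leak into the stored result, because the merge callback is only reached when neither clock dominates.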
Many, many tests have now been performed in pre-production environments stopping/starting nodes. During one test, and only one test, there were a number of errors.
These errors were the result of 7 different PUTs, each of which caused a crash of the vnode.
The crash occurred because of an unexpected case in the riak_kv_vnode:select_newest_content/1 function. This is used when allow_mult=false, to select the sibling content for storage with the highest last_modified_date. The function does nothing when the contents of the object are a single item list, and performs the selection when the contents of the object are anything else (assuming anything else to be a multi-item list).
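The shape of that function, as described above, can be modelled in a few lines of Python. This is a sketch of the described behaviour only, not the Erlang source: the `Content` type and its `last_modified` field are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Content:
    value: bytes
    last_modified: float  # assumed representation: seconds since the epoch


def select_newest_content(contents):
    """Model of the described logic for allow_mult=false.

    A single-item list is returned untouched; anything else is assumed
    to be a multi-item list and is reduced to its newest sibling.
    """
    if len(contents) == 1:
        return contents
    # An empty list also lands here, and max() raises ValueError on it.
    newest = max(contents, key=lambda c: c.last_modified)
    return [newest]


old = Content(b"old", 100.0)
new = Content(b"new", 200.0)
print(select_newest_content([old, new]))  # keeps only the newer sibling
```

In this model an empty contents list falls through to the selection branch and fails there, which mirrors the unexpected case reported: the function has no clause for a list that is neither a singleton nor a genuine multi-item list.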
However in this case: