Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Put merge failure and crash of nodes allow_mult=false #1707

Open
martinsumner opened this issue Aug 21, 2019 · 8 comments
Open

Put merge failure and crash of nodes allow_mult=false #1707

martinsumner opened this issue Aug 21, 2019 · 8 comments
Labels
2.9 Known Issue Known issues with the Riak 2.9 release Bug

Comments

@martinsumner
Copy link
Contributor

Many, many tests have now been performed in pre-production environments stopping/starting nodes. During one test, and only one test, there were a number of errors.

These errors were as a result of 7 different PUTs, each of which caused a crash off the vnode.

The crash occurred because of an unexpected case in the riak_kv_vnode:select_newest_content/1 function. This is used when allow_mult=false, to select the sibling content for storage with the highest last_modified_date. The function does nothing when the contents of the object are a single item list, and performs the selection when the contents of the object are anything else (assuming anything else to be a multi-item list).

However in this case:

"src/riak_kv_vnode.erl"},{line,2603}]},{riak_kv_vnode,prepare_put_existing_object,6,[{file,"src/riak_kv_vnode.erl"},{line,2399}]},{riak_kv_vnode,do_put,7,[{file
,"src/riak_kv_vnode.erl"},{line,2276}]},{riak_kv_vnode,handle_request,4,[{file,"src/riak_kv_vnode.erl"},{line,1318}]},{riak_core_vnode,vnode_command,3,[{file,"s
rc/riak_core_vnode.erl"},{line,361}]},{gen_fsm,handle_msg,7,[{file,"gen_fsm.erl"},{line,505}]}]}```

So it appears that the merged object being tested for LMD had contents which were an empty list ???

Looking at the last event in the stack trace, the object being PUT was a riak_object in the expected format, with a single item content list.  Retrieving the object following the crash, reveals the object (in the last event) stored as expected.
@martinsumner
Copy link
Contributor Author

martinsumner commented Aug 21, 2019

Reading through the code, it is hard to see how an object with a content could end up with a merged object without content (or an empty list).

The only avenue appears to be:

https://github.com/basho/riak_kv/blob/riak_kv-2.9.0p4/src/riak_object.erl#L435-L436

Could it be that prune_object_siblings could do a mutual prune in some bizarre circumstances?

Would this be more possible if we get a fake object as part of the put_merge. In other words could we have a fake object whose clock prunes all the updates in the new object, but itself has empty contents:

https://github.com/basho/riak_kv/blob/riak_kv-2.9.0p4/src/riak_kv_vnode.erl#L2371-L2374

This has changed in 2.9.0 - but to make the circumstances in which we get a fake object stricter ... see riak_kv_vnode:determine_requires_get/2 in this commit. The old object must now be dominated before a fake is used.

However, this path was almost never trodden before, as prior to the introduction of HEAD requests and leveled the fake object path would depend on the existence of the object in the metadata cache - and the metadata cache was by default disabled. Perhaps there is hidden danger in the use of a fake object in PUT merge that has not yet been encountered.

@martinsumner
Copy link
Contributor Author

martinsumner commented Aug 21, 2019

Further investigation required - but it should be noted that the object being PUT has a vector clock where two of the entries have the same timestamp:

[{<<93,8,50,2,157,164,165,154>>,{1,63733083419}},{<<202,218,235,138,86,98,97,118>>,{1,63733083419}}]

This appears to be true for each example.

The dot of the PUT request singleton content is: {<<93,8,50,2,157,164,165,154>>,{1,63733083419}}

The object in the vnode store that was fetched could have a vclock of , [{<<93,8,50,2,157,164,165,154>>,{1,63733083419}}] as it has already seen this change but not the other change. In this case a fake object would be used.

When pruning the changes though, the inbound change would be pruned as it is in the vclock of the "old" object, and we would end up with an empty list of contents.

@martinsumner martinsumner added the 2.9 Known Issue Known issues with the Riak 2.9 release label Aug 21, 2019
@martinsumner
Copy link
Contributor Author

It should be noted that in this case the behaviour is ugly but safe - the vnode crashes before it does anything wrong. However, in other cases might it not crash but prune a sibling?

@martinsumner
Copy link
Contributor Author

martinsumner commented Aug 22, 2019

If theory is correct the fix is relatively simple.

Currently when the PUT is being coordinated, no merge is performed if the inbound vclock dominates the old object. However, for non-coordinated PUTs the merge is still performed if the vclock dominates - which presents an issue if the object is fake.

Should the merge not just be bypassed in the dominate use-case for non-coordinated PUTs as well - through changing the logic in riak_kv_vnode:put_merge/6

May need to better identify it as a fake object to make this easier (and in particular to not confuse a tombstone with a fake object.

@russelldb
Copy link
Member

This is a super old and mysterious bug, I remember spending a few hours pouring over customer logs to try and figure out the object and how it got like that, I suspect it is where the bucket props have changed from lww=true to lww=false or maybe allow_mult true -> false. It's a long time ago, and I don't remember well, but IIRC you have two identical clocks, but on one side DOT-A is kept, and the other DOT-B is kept, and the results is merging is an empty list. I assumed it is a non-deterministic bug in how "latest" timestamp is chosen for picking a value, but never was able to repro it, so I couldn't fix it. I'll try and find the existing issue for you

@martinsumner
Copy link
Contributor Author

martinsumner commented Aug 22, 2019

Thanks @russelldb, anything you can dig up would be really useful.

As I walking through the code last night though, I couldn't resolve how riak_object:merge/2 would work safely if the OldObject is a fake object (as might be found if the metadata_cache or HEAD requests are used, and it is not a coordinating PUT). So I'm concerned I've introduced a more widespread problem with false pruning of siblings by having a configuration (leveled backend) that uses the fake objects in the PUT path by default.

The issue we found here involves identical timestamps in the clock and allow_mult = false. However, I'm not sure it wouldn't have a problem with non-identical timestamps and allow_mult = true.

I don't think this is a problem with the riak_object:merge/2 implementation - this shouldn't have to expect the existence of fake objects. I think the problem is in riak_kv_vnode. In the riak_kv_vnode coord=true PUT path, the merge is not called if the incoming vclock dominates the the existing vlock (so there is no merge with a fake object). However, if coord is not true, merge will be still called even if the inbound dominates the old clock and the oldobject is now a fake.

@martinsumner
Copy link
Contributor Author

Some good news and some bad news.

Good news: my eyeballing of the code was wrong, the fake_object is not passed into riak_object:merge.

This means that this is not a general data-loss causing issue with fake objects as I had feared. The code uses riak_kv_vnode code syntactic_merge, which will not merge contents if one vclock dominates another - and fake objects are only used when domination is an issue. There is no need to panic that 2.9.0 has a fundamental bug because of the activation of the fake_object code path. this has been confirmed through test.

Bad news: I now don't know what the cause of the discovered issue is.

It looks like this may well be a long-lost issue as suggested by @russelldb, and not specific to this release or the changes in it.

@martinsumner
Copy link
Contributor Author

#1708

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2.9 Known Issue Known issues with the Riak 2.9 release Bug
Projects
None yet
Development

No branches or pull requests

2 participants