Recent writes loss in the presence of heavy handoff activity #1754
Comments
Many thanks for the high quality investigation, and excellent write-up. I have another bug I'm focused on at the moment, but will try and look at this as soon as possible. @martincox would you, or someone in your team, be able to look too? My initial reaction, if your hunch is correct, is that we should just fail the PUT if the PUT is forwarded to a node that doesn't then find itself in the preflist. But this requires some thought.
Thanks @keynslug. Yep, I'll take a look at this tomorrow, @martinsumner.
In this case the problem occurs straight after a node leave, which makes me suspicious about problems we've seen in the shutdown process for leaving nodes. What we found previously is that the riak_core service is the last to shut down, so it will happily provide the incorrect ring to any service whose shutdown has not yet been triggered. I don't know if it is possible for a PUT to be forwarded to a node (or a GET to be received by a node) in this state, such that, because it sees all vnodes as local due to the fresh ring, it thinks PR or PW can be satisfied locally.

To get round this, I introduced (on 3.0 only) the concept of a last-gasp-ring (that is, the fresh ring created for the node as it shuts down after leaving), along with some changes to stop services reacting to a last-gasp ring - see the linked commits. One way of checking this would be to use those changes.

If there is a last-gasp-ring issue, then it might be worth back-porting the last-gasp-ring changes, and then getting the PUT_FSM and GET_FSM to check that any ring they're making decisions on isn't a last-gasp ring.
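For illustration only, a minimal sketch of the kind of guard described above, with the ring modelled as a plain list of {Index, Owner} pairs rather than riak_core's actual ring structure - treating a ring in which every partition appears to be owned by the local node as a suspect last-gasp ring:

```erlang
-module(last_gasp_guard).
-export([suspect_last_gasp/2]).

%% The ring is modelled here as a plain list of {PartitionIndex, OwnerNode}
%% pairs - not riak_core's real ring record. On a node that has just left,
%% a freshly created ring claims every partition for that node, so "every
%% partition owned locally" is used as a heuristic for a last-gasp ring.
-spec suspect_last_gasp(node(), [{non_neg_integer(), node()}]) -> boolean().
suspect_last_gasp(LocalNode, Ownership) ->
    Ownership =/= [] andalso
        lists:all(fun({_Idx, Owner}) -> Owner =:= LocalNode end, Ownership).
```

A GET or PUT path could then decline to treat PR/PW as locally satisfiable whenever `suspect_last_gasp(node(), Ownership)` returns true.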
Thank you for looking into this @martinsumner. I've given some thought to your hypothesis and skimmed through the linked commits and related work. However, dare I say, I believe there are a few contradictions to it. Let me explain.
As for this specific Jepsen test report I've provided, it seems that the anomalous read was observed right after those two extra nodes had just started leaving the cluster, long before they actually shut down. You can see it in the attached report.
Moreover, in our production setup we started to experience occasional lost writes, similar to those described in the report, after we joined a fresh node to a cluster consisting of 6 healthy nodes, with no leaving nodes at all. Sure, there may be more moving parts I have not considered, but I still think it's worth mentioning.
Yeah, I can certainly imagine that, but I still fail to see how a node with a fresh ring could respond to a read op with a stale value rather than not_found. Or do you envision a situation where both the write of value 31 and the subsequent read of 31 (which are, by the way, almost exactly three seconds apart) ended up on the same coordinating node? Anyway, I will certainly try to find some time to make a few runs with the scenario modified to exclude node leaving altogether, to see if I can still reproduce the issue. On the other hand, I might as well come up with a minimal PR fixing just what I still regard as my primary suspect, to better illustrate my point, since there's a body of evidence which is hard to explain otherwise.
Do you accept and review external PRs? As always, I may be wrong in my assumptions, so please feel free to point it out.
Ah, I had missed the fact that Jepsen had logged the leave completing some time after; I had assumed that, given the small size of the data set, the leave would have been much quicker. I was thinking along the lines of the two theories combining: when riak_kv stops it brings the API down first, so perhaps the fact that we forward PUTs allows the node to be hit in this interim state, when without forwarding it could not be hit by a normal request as the local API is down. But I can't satisfactorily explain how the reads get the old value without the node being resurrected in some way, and it doesn't explain why there should be problems on joins, not leaves. So treat my theory as a diversion for now.

Couple of questions though: are the errors in the logs you included in the report the only ones? And is there a consistent time gap between the PUT and its anomalous read?
In terms of a PR: at some stage we need to run through the full suite of riak_test tests. Ideally changes should have their own riak_test test, but in this case I wouldn't expect it to be reproducible in riak_test, so that step can be skipped. Everything gets done on a best-endeavours basis in terms of getting the PR into a release; I can't promise much more than that we will do our best.
Gosh, sorry for that confusion. Here are all of them, out of the node logs:

Notably, there's only one occurrence of it:

IIRC, in all of my successfully reproduced cases (including this report) the anomalous read:

This is how I actually started my investigation; there are not many other places where a 3-second interval is used as a default:
Quite fair, I'd say. I'll take a shot then, if I'm able to secure some time for it.
Sorry, I might not be understanding the Jepsen log entry - could you help me, please? Are the two values in that entry vclocks, and if so, which is which?
That's correct.
The first one is the pre-request vclock which the client used to make the request; the second one is the post-request vclock received along with the response.
Sorry, I got distracted with other events. So is it possible that:
...
The loss of the ack from the coordinator (on B) back to the original PUT_FSM (on A) would therefore have the potential for data loss with allow_mult=false. This doesn't explain why this would occur whilst a node leave or add is occurring - why that event should cause such a message loss. But regardless of node leave/join issues, it might suggest that retry_put_coordinator is not safe with allow_mult=false.
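To make that failure mode concrete, here is a toy sketch (illustrative names and messages, not the riak_kv_put_fsm implementation) of an originator that forwards a PUT to a candidate coordinator and, when no ack arrives within a timeout, retries with the next candidate - so an ack that is lost, or never sent, means the same PUT can end up coordinated more than once:

```erlang
-module(forwarded_put).
-export([coordinate_remote/3]).

%% The 3000 ms here just mirrors the gap discussed in this thread; the real
%% riak_kv timeout/default is not asserted.
-define(ACK_TIMEOUT, 3000).

%% Forward a PUT request to each candidate coordinator in turn. If a
%% candidate never acks - because the ack was lost, or because it forwarded
%% onwards without acking us - we retry with the next candidate, meaning the
%% same PUT may end up being coordinated more than once.
coordinate_remote(_Req, [], Tried) ->
    {error, {no_coordinator_acked, lists:reverse(Tried)}};
coordinate_remote(Req, [Candidate | Rest], Tried) ->
    Ref = make_ref(),
    Candidate ! {coordinate_put, self(), Ref, Req},
    receive
        {ack, Ref, Candidate} ->
            {ok, Candidate}
    after ?ACK_TIMEOUT ->
            %% No ack: assume the candidate failed and move on, even though
            %% it may in fact have started (or triggered) a live coordination
            %% of this very request.
            coordinate_remote(Req, Rest, [Candidate | Tried])
    end.
```

With allow_mult=false, the second coordination can then silently override the outcome of the first, which is the data-loss window described above.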
Put FSM did not account for the possibility of multiple coordinator forwards, hence failing to acknowledge request originator in such cases. Multiple forwards are presumably possible under high transfer activity in the cluster, when participating nodes may end up disagreeing on the ring state for a short amount of time. More details: basho#1754
Yes, I too believe that this is more or less what is happening.
My best bet is this: AFAIK, ring state cannot be strongly consistent across a cluster, only eventually consistent, given that its changes are gossiped. This means that in the presence of constant ownership transfers, two nodes may disagree on the ring state and consequently have different sets of primaries for some key.
...This is why I believe there was no message loss (since we would observe it somewhere in the operation history, or at least in the logs). Presumably, the ack was not returned to node A because node B had forwarded once again and did not ack node A afterwards. I made an attempt to fix it with #1755. I must confess it's not ready for review yet, but I think it may at least serve as an illustration of sorts.
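For illustration, a minimal sketch of that disagreement, under the simplifying assumption that each node computes primaries from its own (possibly stale) copy of the ring, modelled as a plain list of {Index, Owner} pairs; none of these functions are riak_core APIs:

```erlang
-module(ring_disagreement).
-export([primaries/3, should_forward/4]).

%% The ring is modelled as a non-empty, ordered list of {Index, OwnerNode}
%% pairs; each node gossips and holds its own copy, so two nodes can briefly
%% disagree on ownership. This is a simplification, not riak_core's ring.
primaries(Key, NVal, RingView) ->
    Start = erlang:phash2(Key, length(RingView)),
    {Before, From} = lists:split(Start, RingView),
    Owners = [Owner || {_Idx, Owner} <- From ++ Before],
    lists:sublist(Owners, NVal).

%% A candidate coordinator keeps the PUT only if its OWN ring view lists it
%% as a primary for the key; otherwise it forwards once more.
should_forward(Node, Key, NVal, RingView) ->
    not lists:member(Node, primaries(Key, NVal, RingView)).
```

If node A's view names node B a primary for the key but B's own view does not, B forwards the PUT once more instead of coordinating it - and, in the buggy path, never acks A.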
Fix now merged and released in both 2.9 and 3.0.
Riak KV may lose recent writes in the presence of heavy handoff activity.

I've put together a test with the help of the Jepsen framework specifically to investigate this kind of buggy behaviour, because it has hit us several times already in our production environment. To be honest, this test observes an occurrence of the bug on almost every run. An ordinary test run follows roughly these steps:
I have attached a Jepsen report archive of one such test run. The most notable observations are in `results.edn`; here's an excerpt, redacted for brevity:

Basically this all means that client 25, while operating on key `jepsen638479067-6`, read value `20` and subsequently read value `31` quite unexpectedly, all in the absence of concurrent writes to this key.

Skimming through `history.edn`, here's the relevant history fragment, again redacted a bit for brevity:

It can be seen clearly that the write of `31` at index 16232 is what surfaced in the read at index 16407, even though in the meantime 3 subsequent writes had succeeded (16245, 16356, 16361).

Affected version(s)
Though it won't hurt to mention that I have successfully reproduced this bug under riak 2.2.3 as well.
Steps to reproduce
1. Clone the aforementioned repo.
2. Start a test run with:
3. Wait for it to finish cleanly. Note that tearing down a node with `riak stop` on 2.9.1 may get stuck at the end of a run for some reason; a simple ^C is fine then.
4. Look at the analysis report in `store/latest/results.edn`.

Suspected cause
Essentially this is caused by the fact that the put FSM does not account for the possibility of multiple forwards to another coordinator. I'll try to illustrate; note that this assumes `retry_put_coordinator_failure` is `true`, which is the default per the cuttlefish schema. This is also why I believe there are occasional unrecognized messages in the logs, which are extraneous responses to original requests that were retried unwarrantedly.
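As a rough sketch of the suspected sequence (toy processes and message names only, nothing here is riak_kv code; the 3000 ms timeout merely mirrors the roughly 3-second gap discussed above):

```erlang
-module(double_forward_sim).
-export([run/0]).

%% 3000 ms mirrors the roughly 3-second gap discussed in this thread; the
%% actual riak_kv default is not asserted here.
-define(ACK_TIMEOUT, 3000).

%% Simulates the suspected sequence: originator A forwards a PUT to B; B
%% decides it is not a primary and forwards on to C, dropping A in the
%% process, so A never gets an ack; A then times out and retries, and the
%% same PUT ends up coordinated a second time. All names are illustrative.
run() ->
    A = self(),
    C = spawn(fun coordinator/0),
    B = spawn(fun() -> forwarder(C) end),
    B ! {coordinate_put, A, first_attempt},
    First =
        receive {ack, _FromB, first_attempt} -> acked
        after ?ACK_TIMEOUT -> no_ack          %% what actually happens here
        end,
    %% Retry after the timeout: A picks another candidate (C directly),
    %% even though C is already coordinating the first attempt.
    C ! {coordinate_put, A, retry_attempt},
    Second =
        receive {ack, _FromC, retry_attempt} -> acked
        after ?ACK_TIMEOUT -> no_ack
        end,
    {First, Second}.    %% expected result: {no_ack, acked}

%% The forwarding node: it re-forwards with self() as the sender, so the
%% original requester is lost and can never be acknowledged.
forwarder(Next) ->
    receive
        {coordinate_put, _Originator, Tag} ->
            Next ! {coordinate_put, self(), Tag}
    end.

%% The node that finally coordinates: it only acks its immediate sender.
coordinator() ->
    receive
        {coordinate_put, Sender, Tag} ->
            Sender ! {ack, self(), Tag},
            coordinator()
    end.
```

The first attempt is never acknowledged because the originator is dropped on the second forward, so the originator times out and retries, and the same PUT ends up coordinated twice.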
One more piece of evidence that strengthens my suspicion: turning off `retry_put_coordinator_failure` makes this buggy behaviour go away with quite high probability, if not completely, given the number of test runs I've performed so far.
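For anyone wanting to try the same workaround, something along these lines in advanced.config should do it, assuming the cuttlefish setting maps onto a riak_kv application variable of the same name (worth verifying against the schema for your release):

```erlang
%% advanced.config (sketch): disable coordinator-failure retries for PUTs.
%% The key name is taken from the discussion above; verify it against the
%% riak_kv cuttlefish schema before relying on it.
[
 {riak_kv, [
     {retry_put_coordinator_failure, false}
 ]}
].
```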
Additional observations

Moreover, it's not clear to me why the acknowledgement is sent in the execute state of the FSM rather than in the validate state, since there's a possibility that execute will be skipped altogether, in the event of a parameter violation for example.
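A schematic sketch of that concern, with plain functions standing in for the FSM states and a made-up parameter check (this is not riak_kv_put_fsm code): if the ack back to the forwarding originator is emitted only on entry to execute, a request that fails validation finishes without ever acknowledging its originator.

```erlang
-module(ack_placement).
-export([run/2]).

%% Schematic only - plain functions stand in for the FSM states to show the
%% concern above: the ack is produced only when execute/2 is reached.
run(Originator, Req) ->
    case validate(Req) of
        ok ->
            execute(Originator, Req);
        {error, Reason} ->
            %% execute/2 is skipped entirely, so no ack is sent from here.
            {finish, {error, Reason}}
    end.

%% Illustrative parameter check; the real validations and error terms in
%% riak_kv differ.
validate(#{w := W, n_val := N}) when W > N -> {error, invalid_w};
validate(_Req) -> ok.

execute(Originator, Req) ->
    %% The only place an ack is produced in this sketch.
    Originator ! {ack, self(), coordinating},
    {coordinating, Req}.
```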