Liveness: race condition when handling successive requests from two clients leads to deadlock #5
Where `config.pipelining_max` exceeds `config.io_depth_write` it's possible for a client request to be unable to acquire a write IOP if we have maxed out our IO depth. This can lead to deadlock for a cluster of one or two, since there is no other way for the leader to repair the dirty op because no other replica has it. The fix is for `on_prepare_timeout()` to retry the prepare. Reported-by: @ThreeFx Fixes: tigerbeetle/viewstamped-replication-made-famous#5
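To make the mechanism concrete, here is a minimal sketch of the idea behind the fix, under heavy assumptions: the `Replica` struct, its fields, and `write_prepare_retry()` below are placeholders for illustration, not the actual TigerBeetle implementation.

```zig
const std = @import("std");

// Hypothetical, simplified model of the fix described above.
const Replica = struct {
    replica_count: u8,
    journal_dirty: bool, // the op is still marked dirty in our own journal

    fn on_prepare_timeout(self: *Replica) void {
        if (self.journal_dirty) {
            // Before the fix: log "waiting for journal" and back off.
            // If the original write never acquired a write IOP (possible when
            // config.pipelining_max > config.io_depth_write), and the cluster has
            // only one or two replicas, no other replica can repair the op, so
            // waiting deadlocks the pipeline. Instead, retry the prepare write.
            self.write_prepare_retry();
            return;
        }
        // Otherwise, resend the prepare to lagging replicas as usual (elided).
    }

    fn write_prepare_retry(self: *Replica) void {
        // Placeholder for re-attempting journal.write_prepare() for the dirty op.
        std.debug.print("cluster of {}: retrying prepare write\n", .{self.replica_count});
        self.journal_dirty = false;
    }
};

pub fn main() void {
    var replica = Replica{ .replica_count = 1, .journal_dirty = true };
    replica.on_prepare_timeout();
}
```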
Congrats @ThreeFx on being the first person to earn a TigerBeetle liveness bounty (in less than 72 hours)! You've officially opened the leaderboard.

Your report was crystal clear, easy to reproduce, and you understood the issue (concurrent client requests exceed concurrent I/O write depth with no retry in `on_prepare_timeout()`).

This is a really nice find, because it would slowly shut down clients and then finally shut down the whole prepare pipeline for a cluster of one or two, and this would be triggered only under load.

You had a good instinct to explore the boundaries by setting clients higher, since this surfaced the issue, which depends on `config.pipelining_max` exceeding `config.io_depth_write`.

The suggested fix was not quite there, because the retry needs to happen in `on_prepare_timeout()` (see the fix commit above).

Considering the quality of your report overall, we've decided to award you with a $500 liveness bounty.

P.S. Thanks for sharing Liquicity Yearmix 2020, we've got it on rotation at the office now!
Congrats also on the bonus $50 that will now be going to the ZSF!
I'm pushing the fix now. Please let me know if it fixes your seed without introducing anything else.
Please drop us an email at info@tigerbeetle.com so we can arrange payment.
Done!
I've confirmed that the seed now works correctly for me as well.
71a35083fd1955377a699a61a672ce2c7c2366ee
Description and Impact
This bug occurs (only?) with one or two replicas. We assume one replica for all of the following. It seems like this issue is resolved by leader election following a view expiration for n>=3 replicas.
Successive client requests may lead to a deadlock: the write lock acquired by a replica in `journal.write_prepare` may be held until after the next op should be written, causing the write to fail and not be retried. In the one-replica case it seems that the replica never recovers from this error and deadlocks.

This leads to a dirty operation clogging the journal. The replica believes that the op is still going to be written, `on_prepare_timeout` reports `0: on_prepare_timeout: waiting for journal`, and backs off. In `repair_op(self, op)`, the replica cannot `choose_any_other_replica()`, so the uncommitted, dirty op stays uncommitted and dirty.

At some point the replica gets stuck in a `preparing` state for each `client`, and no new requests can be answered, thus the system deadlocks. In this exact example I have 14 clients and op 12 gets stuck, and the last op number assigned is 25, which makes sense.

This only happens sometimes during testing: it must be the case that two requests from clients i and j (we assume i < j wlog) are both en route to the target replica, and i's request is delivered immediately before j's request. In a distributed setting (i.e. with multiple replicas) this does not occur with high probability, because replicas have to exchange `prepare` messages before writing the log (maybe it does when setting the message delay to 0; I have not tested this). Still, I have no reason to believe that this is only a testing error, as this routing might occur naturally in the one-replica setting (imagine co-located clients issuing requests at the same time).
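As a back-of-the-envelope illustration of the hazard, here is a tiny sketch with assumed example values (not the project's actual configuration): whenever the prepare pipeline can outgrow the journal's write depth, some prepares cannot acquire a write IOP, and without a retry they stay dirty.

```zig
const std = @import("std");

// Hypothetical example values, for illustration only.
const clients_max = 14;
const pipelining_max = clients_max; // up to one in-flight prepare per client
const io_depth_write = 8; // concurrent journal writes available

pub fn main() void {
    if (pipelining_max > io_depth_write) {
        // These prepares fail to acquire a write IOP under full load; with one or
        // two replicas and no retry, the corresponding ops remain dirty forever.
        std.debug.print(
            "prepares that cannot acquire a write IOP under full load: {}\n",
            .{pipelining_max - io_depth_write},
        );
    }
}
```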
Steps to Reproduce the Bug
Run `./vopr.sh 14432206721076428670 -OReleaseSafe`, or `./vopr.sh 14432206721076428670` to get a detailed log.

Suggested Fix
A "obviously correct" fix is to always wait for writes, however this is infeasible.
I believe that the responsible path in `repair_prepare` could be adjusted to account for the one-replica case. I have not tested any modifications, but this seems like a promising start. I am available on the TigerBeetle Discord for further debugging if needed.
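For illustration only, a hedged sketch of what such an adjustment might look like. The struct, fields, and control flow below are assumptions about the shape of the code rather than the actual repair path, and the fix that eventually landed retries the prepare in `on_prepare_timeout()` instead.

```zig
const std = @import("std");

// Hypothetical sketch of the suggested adjustment to the repair path.
const Replica = struct {
    replica_count: u8,

    fn repair_prepare(self: *Replica, op: u64) void {
        if (self.replica_count == 1) {
            // Single-replica cluster: choose_any_other_replica() cannot succeed,
            // so fall back to rewriting the dirty op from our own copy.
            std.debug.print("op {}: rewriting locally\n", .{op});
            return;
        }
        // Otherwise, request the prepare from another replica (elided).
    }
};

pub fn main() void {
    var replica = Replica{ .replica_count = 1 };
    replica.repair_prepare(12);
}
```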
The Story Behind the Bug
I had (have?) a feeling that many clients may lead to problems/race conditions, so I increased the client count. Then I ran `./vopr.sh` with increased client numbers and stumbled upon this.

I was not specifically targeting the one-replica case, and was really lucky that I got such a small trace to analyze (only roughly the first 1.5k relevant log lines contain all the necessary info).
Songs Enjoyed During the Production of This Issue
Liquicity Yearmix 2020
Literature
No response
The Last Word
No response