Liveness: race condition when handling successive requests from two clients leads to deadlock #5
Where `config.pipelining_max` exceeds `config.io_depth_write` it's possible for a client request to be unable to acquire a write IOP if we have maxed out our IO depth. This can lead to deadlock for a cluster of one or two, since there is no other way for the leader to repair the dirty op because no other replica has it. The fix is for `on_prepare_timeout()` to retry the prepare. Reported-by: @ThreeFx Fixes: tigerbeetle/viewstamped-replication-made-famous#5
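To make the mechanism concrete, here is a minimal sketch of the idea behind the fix, under heavy assumptions: the `Replica` struct, its fields, and `write_prepare_retry()` below are placeholders for illustration, not the actual TigerBeetle implementation.

```zig
const std = @import("std");

// Hypothetical, simplified model of the fix described above.
const Replica = struct {
    replica_count: u8,
    journal_dirty: bool, // the op is still marked dirty in our own journal

    fn on_prepare_timeout(self: *Replica) void {
        if (self.journal_dirty) {
            // Before the fix: log "waiting for journal" and back off.
            // If the original write never acquired a write IOP (possible when
            // config.pipelining_max > config.io_depth_write), and the cluster has
            // only one or two replicas, no other replica can repair the op, so
            // waiting deadlocks the pipeline. Instead, retry the prepare write.
            self.write_prepare_retry();
            return;
        }
        // Otherwise, resend the prepare to lagging replicas as usual (elided).
    }

    fn write_prepare_retry(self: *Replica) void {
        // Placeholder for re-attempting journal.write_prepare() for the dirty op.
        std.debug.print("cluster of {}: retrying prepare write\n", .{self.replica_count});
        self.journal_dirty = false;
    }
};

pub fn main() void {
    var replica = Replica{ .replica_count = 1, .journal_dirty = true };
    replica.on_prepare_timeout();
}
```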
Congrats @ThreeFx on being the first person to earn a TigerBeetle liveness bounty (in less than 72 hours)! You've officially opened the leaderboard.

Your report was crystal clear, easy to reproduce, and you understood the issue (concurrent client requests exceed concurrent I/O write depth with no retry in `on_prepare_timeout()`).

This is a really nice find, because it would slowly shut down clients and then finally shut down the whole prepare pipeline for a cluster of one or two, and this would be triggered only under load.

You had a good instinct to explore the boundaries by setting clients higher, since this surfaced the issue, which depends on `config.pipelining_max` exceeding `config.io_depth_write`.

The suggested fix was not quite there, because the retry needs to happen in `on_prepare_timeout()` (see the fix commit above).

Considering the quality of your report overall, we've decided to award you with a $500 liveness bounty.

P.S. Thanks for sharing Liquicity Yearmix 2020, we've got it on rotation at the office now!
Congrats also on the bonus $50 that will now be going to the ZSF!
I'm pushing the fix now. Please let me know if it fixes your seed without introducing anything else.
Please drop us an email at info@tigerbeetle.com so we can arrange payment.
Done!
I've confirmed that the seed now works correctly for me as well.
71a35083fd1955377a699a61a672ce2c7c2366ee
Description and Impact
This bug occurs (only?) with one or two replicas. We assume one replica for all of the following. It seems like this issue is resolved by leader election following a view expiration for n>=3 replicas.
Successive client requests may lead to a deadlock: the write lock acquired by a replica in `journal.write_prepare` may be held until after the next op should be written, causing the write to fail and not be retried. In the one-replica case it seems that the replica never recovers from this error and deadlocks.

This leads to a dirty operation clogging the journal. The replica believes that the op is still going to be written, `on_prepare_timeout` reports `0: on_prepare_timeout: waiting for journal`, and backs off. In `repair_op(self, op)`, the replica cannot `choose_any_other_replica()`, so the uncommitted, dirty op stays uncommitted and dirty.

At some point the replica gets stuck in a `preparing` state for each `client`, and no new requests can be answered, thus the system deadlocks. In this exact example I have 14 clients and op 12 gets stuck, and the last op number assigned is 25, which makes sense.

This only happens sometimes during testing: it must be the case that two requests from clients i and j (we assume i < j wlog) are both en route to the target replica, and i's request is delivered immediately before j's request. In a distributed setting (i.e. with multiple replicas) this does not occur with high probability, because replicas have to exchange `prepare` messages before writing the log (maybe it does when setting the message delay to 0; I have not tested this). Still, I have no reason to believe that this is only a testing error, as this routing might occur naturally in the one-replica setting (imagine co-located clients issuing requests at the same time).
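As a back-of-the-envelope illustration of the hazard, here is a tiny sketch with assumed example values (not the project's actual configuration): whenever the prepare pipeline can outgrow the journal's write depth, some prepares cannot acquire a write IOP, and without a retry they stay dirty.

```zig
const std = @import("std");

// Hypothetical example values, for illustration only.
const clients_max = 14;
const pipelining_max = clients_max; // up to one in-flight prepare per client
const io_depth_write = 8; // concurrent journal writes available

pub fn main() void {
    if (pipelining_max > io_depth_write) {
        // These prepares fail to acquire a write IOP under full load; with one or
        // two replicas and no retry, the corresponding ops remain dirty forever.
        std.debug.print(
            "prepares that cannot acquire a write IOP under full load: {}\n",
            .{pipelining_max - io_depth_write},
        );
    }
}
```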
Steps to Reproduce the Bug
Run `./vopr.sh 14432206721076428670 -OReleaseSafe`, or `./vopr.sh 14432206721076428670` to get a detailed log.

Suggested Fix
A "obviously correct" fix is to always wait for writes, however this is infeasible.
I believe that the responsible path in `repair_prepare` could be adjusted to account for the one-replica case. I have not tested any modifications, but this seems like a promising start. I am available on the TigerBeetle Discord for further debugging if needed.
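For illustration only, a hedged sketch of what such an adjustment might look like. The struct, fields, and control flow below are assumptions about the shape of the code rather than the actual repair path, and the fix that eventually landed retries the prepare in `on_prepare_timeout()` instead.

```zig
const std = @import("std");

// Hypothetical sketch of the suggested adjustment to the repair path.
const Replica = struct {
    replica_count: u8,

    fn repair_prepare(self: *Replica, op: u64) void {
        if (self.replica_count == 1) {
            // Single-replica cluster: choose_any_other_replica() cannot succeed,
            // so fall back to rewriting the dirty op from our own copy.
            std.debug.print("op {}: rewriting locally\n", .{op});
            return;
        }
        // Otherwise, request the prepare from another replica (elided).
    }
};

pub fn main() void {
    var replica = Replica{ .replica_count = 1 };
    replica.repair_prepare(12);
}
```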
The Story Behind the Bug
I had (have?) a feeling that many clients may lead to problems/race conditions, so I increased the client count. Then I ran `./vopr.sh` with increased client numbers and stumbled upon this.

I was not specifically targeting the one-replica case, and was really lucky that I got such a small trace to analyze (only roughly the first 1.5k relevant log lines contain all the necessary info).
Songs Enjoyed During the Production of This Issue
Liquicity Yearmix 2020
Literature
No response
The Last Word
No response