I'm running a few tests in EC2. In this case I have a 3-server cluster (c3.xlarges, with the Consul data written to SSDs) and 392 m1.small clients. The clients all have one service registered with a health check.
In this test all of my client nodes are writing a timestamp to the same key (/v1/kv/time) every 500ms. That's roughly 800 writes per second to the same key across all nodes, all of which get RPC'd to the 3 servers. It's not a realistic scenario per se, but I think it's an interesting one.
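Roughly the shape of each client's write loop, as a minimal Go sketch (the endpoint and 500ms interval are from the description above; the payload format, port, and harness structure here are illustrative, not the exact test code):

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Each client agent forwards this PUT over RPC to one of the 3 servers.
	url := "http://127.0.0.1:8500/v1/kv/time"
	client := &http.Client{Timeout: 5 * time.Second}

	// One write every 500ms per node; 392 nodes gives ~800 writes/sec.
	for range time.Tick(500 * time.Millisecond) {
		body := bytes.NewBufferString(fmt.Sprint(time.Now().UnixNano()))
		req, err := http.NewRequest(http.MethodPut, url, body)
		if err != nil {
			panic(err)
		}
		resp, err := client.Do(req)
		if err != nil {
			fmt.Println("write failed:", err)
			continue
		}
		resp.Body.Close()
	}
}
```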
When I start the test everything just hums along, CPU on the servers stays under 8%, and nothing looks unusual, but then CPU drops off and things freeze up. If I try to write to a key on a server via curl, the request hangs. In this state I can still read successfully and all of the consul commands seem snappy. Writes to any key just hang indefinitely, though.
It looks like there was a compaction that correlates with the deadlock:
2014/05/09 22:20:41 [INFO] snapshot: Creating new snapshot at /mnt/consul/data/raft/snapshots/5-77915-2014-05-09T22:20:41.87297349Z.tmp
2014/05/09 22:20:41 [INFO] snapshot: reaping snapshot /mnt/consul/data/raft/snapshots/5-31805-2014-05-09T22:13:56.045582455Z
2014/05/09 22:20:41 [INFO] raft: Compacting logs from 52301 to 67804
2014/05/09 22:20:41 [INFO] raft: Snapshot to 77915 complete
If I stop and start the current leader, things get moving again.
Here was the consul info before the restart. I have GOMAXPROCS set to only 1; I was planning to bump that up incrementally:
I see why this is happening; I'm not sure how I never triggered this before. If you notice, last_log_index is 78044 but the last commit_index the leader sees is 77915. Not coincidentally, the difference is 129, or 1 more than our buffered inflight channel of 128 entries.
I'm not exactly sure how compaction relates here (slows the disk down enough to delay leader writes?).
In any case, there is a potential deadlock on that channel. Should be easy enough to fix.
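To illustrate the failure mode only (this is not the actual hashicorp/raft code): a buffered channel with 128 slots accepts 128 sends immediately, and the 129th send blocks until something drains the channel. If the goroutine responsible for draining is itself stuck, nothing ever moves and the write hangs.

```go
package main

import "fmt"

func main() {
	// Stand-in for the 128-entry inflight buffer described above.
	inflight := make(chan int, 128)

	// The first 128 sends succeed without a receiver.
	for i := 0; i < 128; i++ {
		inflight <- i
	}

	fmt.Println("buffer full; the next send will block")

	// With nothing draining the channel, this send never completes;
	// the Go runtime aborts with "all goroutines are asleep - deadlock!".
	inflight <- 128
}
```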
@mbrevoort I've pushed hashicorp/raft@5800ad5, which should resolve the deadlock issue here. As a safety measure, 686fac0 also changes our interaction with Raft to limit how long we wait. This will prevent writes from hanging forever if Raft is ever extremely busy or deadlocked (which would be a bug).
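A minimal sketch of the bounded-wait idea using the hashicorp/raft Apply API; the enqueueLimit value and helper below are illustrative, not necessarily what the commit uses:

```go
package example

import (
	"time"

	"github.com/hashicorp/raft"
)

// enqueueLimit is an illustrative bound, not Consul's actual constant.
const enqueueLimit = 30 * time.Second

// apply hands a command to Raft but bounds how long we block while the
// entry is being accepted into the log, so a stuck Raft surfaces as an
// error instead of an indefinite hang.
func apply(r *raft.Raft, cmd []byte) error {
	future := r.Apply(cmd, enqueueLimit)
	return future.Error()
}
```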
It happened again after ~5 minutes. Same thing, compaction on the leader: