I'm running a few tests in EC2. In this case I have a 3-server cluster (c3.xlarges, with the Consul data written to SSDs) and 392 m1.small clients. The clients all have one service registered with a health check.
In this test all of my client nodes are writing a timestamp to the same key (/v1/kv/time) every 500ms. That's roughly 800 writes per second to the same key across all nodes, all of which get RPC'd to the 3 servers. It's not a realistic scenario per se, but I think it's an interesting one.
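Roughly the shape of each client's write loop, as a minimal Go sketch (the endpoint and 500ms interval are from the description above; the payload format, port, and harness structure here are illustrative, not the exact test code):

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Each client agent forwards this PUT over RPC to one of the 3 servers.
	url := "http://127.0.0.1:8500/v1/kv/time"
	client := &http.Client{Timeout: 5 * time.Second}

	// One write every 500ms per node; 392 nodes gives ~800 writes/sec.
	for range time.Tick(500 * time.Millisecond) {
		body := bytes.NewBufferString(fmt.Sprint(time.Now().UnixNano()))
		req, err := http.NewRequest(http.MethodPut, url, body)
		if err != nil {
			panic(err)
		}
		resp, err := client.Do(req)
		if err != nil {
			fmt.Println("write failed:", err)
			continue
		}
		resp.Body.Close()
	}
}
```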
When I start the test everything just hums along, CPU on the servers stays under 8%, and nothing looks unusual, but then CPU drops off and things freeze up. If I try to write to a key on a server via curl, the request hangs. In this state I can still read successfully and all of the consul commands seem snappy. Writes to any key just hang indefinitely, though.
It looks like there was a compaction that correlates with the deadlock:
2014/05/09 22:20:41 [INFO] snapshot: Creating new snapshot at /mnt/consul/data/raft/snapshots/5-77915-2014-05-09T22:20:41.87297349Z.tmp
2014/05/09 22:20:41 [INFO] snapshot: reaping snapshot /mnt/consul/data/raft/snapshots/5-31805-2014-05-09T22:13:56.045582455Z
2014/05/09 22:20:41 [INFO] raft: Compacting logs from 52301 to 67804
2014/05/09 22:20:41 [INFO] raft: Snapshot to 77915 complete
If I stop and start the current leader, things get moving again.
Here was the consul info before the restart. I have GOMAXPROCS set to only 1; I was planning to bump that up incrementally:
I see why this is happening; I'm not sure how I never triggered this before. If you notice, last_log_index is 78044 but the last commit_index the leader sees is 77915. Not coincidentally, the difference is 129, or 1 more than our buffered inflight channel of 128 entries.
I'm not exactly sure how compaction relates here (slows the disk down enough to delay leader writes?).
In any case, there is a potential deadlock on that channel. Should be easy enough to fix.
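To illustrate the failure mode only (this is not the actual hashicorp/raft code): a buffered channel with 128 slots accepts 128 sends immediately, and the 129th send blocks until something drains the channel. If the goroutine responsible for draining is itself stuck, nothing ever moves and the write hangs.

```go
package main

import "fmt"

func main() {
	// Stand-in for the 128-entry inflight buffer described above.
	inflight := make(chan int, 128)

	// The first 128 sends succeed without a receiver.
	for i := 0; i < 128; i++ {
		inflight <- i
	}

	fmt.Println("buffer full; the next send will block")

	// With nothing draining the channel, this send never completes;
	// the Go runtime aborts with "all goroutines are asleep - deadlock!".
	inflight <- 128
}
```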
@mbrevoort I've pushed hashicorp/raft@5800ad5, which should resolve the deadlock issue here. As a safety measure, 686fac0 also changes our interaction with Raft to limit how long we wait. This will prevent writes from hanging forever if Raft is ever extremely busy or deadlocked (which would be a bug).
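A minimal sketch of the bounded-wait idea using the hashicorp/raft Apply API; the enqueueLimit value and helper below are illustrative, not necessarily what the commit uses:

```go
package example

import (
	"time"

	"github.com/hashicorp/raft"
)

// enqueueLimit is an illustrative bound, not Consul's actual constant.
const enqueueLimit = 30 * time.Second

// apply hands a command to Raft but bounds how long we block while the
// entry is being accepted into the log, so a stuck Raft surfaces as an
// error instead of an indefinite hang.
func apply(r *raft.Raft, cmd []byte) error {
	future := r.Apply(cmd, enqueueLimit)
	return future.Error()
}
```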
It happened again after ~5 minutes. Same thing, compaction on the leader: