This is a followup to my earlier comment on a loosely related issue.

Overview of the Issue

Autopilot may mark a failed node as left even if that node still has live sessions with a number of KV locks associated with them, effectively releasing those locks. This breaks a safety guarantee, namely that a lock cannot be reacquired as long as the session associated with it may still be alive.

Reproduction Steps
It's not easy to outline a straightforward reproducible scenario, since I have a system-under-test setup upon which I periodically wreak havoc with a Jepsen test. The SUT consists of 5 nodes (n1, ..., n5), with a Consul server and my app on each node. Every app instance makes heavy use of the sessions and locks APIs of its local Consul instance to make sure there is at most one process in the cluster with a given unique ID doing its job.
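
For context, each app instance follows roughly this pattern. This is a minimal sketch using the official Go client (github.com/hashicorp/consul/api); the key and session names are made up, but the TTL, the renewal interval and the lock delay match the timeline below:

```go
package main

import (
	"log"
	"time"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Each app instance talks to the Consul agent on its own node.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Create a session with a 30-second TTL and a 10-second lock delay.
	sessionID, _, err := client.Session().Create(&api.SessionEntry{
		Name:      "worker-lock-session", // illustrative name
		TTL:       "30s",
		LockDelay: 10 * time.Second,
		Behavior:  api.SessionBehaviorRelease,
	}, nil)
	if err != nil {
		log.Fatal(err)
	}

	// Keep renewing the session every 10 seconds; if a renewal fails,
	// the process stops making progress.
	go func() {
		for range time.Tick(10 * time.Second) {
			if _, _, err := client.Session().Renew(sessionID, nil); err != nil {
				log.Fatalf("session renewal failed: %v", err)
			}
		}
	}()

	// Acquire The Lock: only one live session may hold the key at a time.
	acquired, _, err := client.KV().Acquire(&api.KVPair{
		Key:     "locks/unique-worker-id", // illustrative key
		Value:   []byte("holder"),
		Session: sessionID,
	}, nil)
	if err != nil || !acquired {
		log.Fatalf("could not acquire the lock: acquired=%v, err=%v", acquired, err)
	}

	// ... the work that must run on at most one node at a time goes here ...
}
```
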
This one time I observed the following sequence of events. Note that I picked only the events relevant to this issue from the logs I have at hand.

At 17:08:59.855 the app instance on n3 created a session with TTL = 30 seconds and kept renewing it every 10 seconds.
At 17:09:43.952 the app instance on n3 acquired The Lock with its session.
Then a lot of bad things happened in the cluster, caused by Jepsen randomly partitioning it every 50 seconds or so, but The Lock was held by the app instance on n3 all this time.
At 17:11:45 every Consul instance reported that a new leader was elected: n1.
At 17:12:19.928 the app instance on n3 successfully renewed its session for another 30 seconds, with a lock delay of 10 seconds.
At 17:12:24.311 the Jepsen test runner isolated nodes (n1, n3) from (n2, n4, n5).
At 17:12:26 Consul on n2 rightfully proclaimed itself the leader after winning an election.
At 17:12:34 Consul on n2 marked n3 as failed.
At 17:12:36 Consul's autopilot on n2 removed the failed server node n3.
From then on, every Consul instance rejected lock acquisition attempts, noting that the lock delay would expire at 17:12:46.579.
At 17:12:44.311 the Jepsen runner healed the cluster partitions.
At 17:12:47.648 the app instance on n2 successfully acquired The Lock with its session and started making progress.
Up until 17:12:49.944 the app instance on n3 thought it was still in charge of The Lock and was also making progress.
Finally, at 17:12:49.944 the app instance on n3 failed to renew its session one more time and then died, thus ceasing to make progress.

So there was a window of about 2 seconds during which two processes with the same ID were up and running. I was quite bewildered that autopilot decided to consider n3 as left a mere 2 seconds after it was deemed failed, even though it still had a live session at that moment. I'm also curious about what exactly triggered autopilot's logic.
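
My best guess is that the removal comes from autopilot's dead-server cleanup (cleanup_dead_servers, which defaults to true). Assuming that is the mechanism, here is a sketch of how the relevant settings can be inspected, and how the cleanup could be switched off as a workaround, using the Go operator API; this is only an illustration of the knob I suspect is involved, not a claim about the root cause:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	operator := client.Operator()

	// Dump the current autopilot configuration (CleanupDeadServers,
	// LastContactThreshold, ServerStabilizationTime, ...).
	conf, err := operator.AutopilotGetConfiguration(nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("autopilot config: %+v\n", conf)

	// Stop autopilot from removing failed servers, so a partitioned node is not
	// marked as left while its sessions may still be alive.
	conf.CleanupDeadServers = false
	if err := operator.AutopilotSetConfiguration(conf, nil); err != nil {
		log.Fatal(err)
	}
}
```

The same can be done from the CLI with consul operator autopilot get-config and consul operator autopilot set-config -cleanup-dead-servers=false.
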
Consul info for Server
Server info
Operating system and Environment details
Docker-compose environment on Linux 5.3.10-arch1-1 #1 SMP PREEMPT.

Log Fragments
consul.log @ n2 / n3