Autopilot reaps dead servers too aggressively #6793

Open
keynslug opened this issue Nov 14, 2019 · 1 comment
Labels
type/bug Feature does not function as expected

Comments


keynslug commented Nov 14, 2019

This is a follow-up to my earlier comment on a loosely related issue.

Overview of the Issue

Autopilot may mark a failed node as left even if that node still has live sessions with a number of KV locks associated with them, effectively releasing those locks. This breaks a safety guarantee, namely that a lock cannot be reacquired as long as the session associated with it may still be alive.
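A minimal sketch of the session + lock association in question, for illustration only, assuming the official Go API client (github.com/hashicorp/consul/api); the key name and node value are made-up placeholders, not taken from my setup. It creates a TTL session and acquires a KV lock bound to that session, which is exactly the pairing whose safety guarantee is at stake.

session_lock.go (illustrative sketch)
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Talk to the local Consul agent (defaults to 127.0.0.1:8500).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Create a session with a 30-second TTL and a 10-second lock delay,
	// matching the numbers that show up in the timeline below.
	sessionID, _, err := client.Session().Create(&api.SessionEntry{
		TTL:       "30s",
		LockDelay: 10 * time.Second,
		Behavior:  api.SessionBehaviorRelease,
	}, nil)
	if err != nil {
		log.Fatal(err)
	}

	// Acquire "The Lock": a KV key bound to the session. The safety
	// guarantee is that no one else can acquire this key while the
	// session may still be alive.
	acquired, _, err := client.KV().Acquire(&api.KVPair{
		Key:     "service/my-app/leader", // hypothetical key
		Value:   []byte("n3"),
		Session: sessionID,
	}, nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("lock acquired:", acquired)
}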

Reproduction Steps

It's not easy to outline a straightforward reproducible scenario, since I have a system-under-test setup upon which I periodically wreak havoc with a Jepsen test. The SUT consists of 5 nodes (n1, ..., n5), with a Consul server and my app on each node. Every app instance heavily uses the sessions and locks APIs of its local Consul instance to make sure there is at most one process in the cluster with a given unique ID doing its job.

consul.json
{
    "server": true,
    "bind_addr": "0.0.0.0",
    "bootstrap_expect": 5,
    "retry_join": ["n1"],
    "performance": {
        "raft_multiplier": 1
    }
}

On one occasion I observed the following sequence of events. Please note that I picked out only the events relevant to the issue from the logs I have at hand.

  • At 17:08:59.855 the app instance on n3 created a session with TTL = 30 seconds and kept renewing it every 10 seconds (roughly as in the renewal sketch after this list).
  • At 17:09:43.952 the app instance on n3 acquired The Lock with its session.
  • Then a lot of bad things happened in the cluster, caused by Jepsen randomly partitioning it every 50 seconds or so, but The Lock was held by the app instance on n3 the whole time.
  • At 17:11:45 every Consul instance reported that a new leader was elected: n1.
  • At 17:12:19.928 the app instance on n3 successfully renewed its session for another 30 seconds, with a lock delay of 10 seconds.
  • At 17:12:24.311 the Jepsen test runner isolated nodes (n1 n3) from (n2 n4 n5).
  • At 17:12:26 Consul on n2 righteously proclaimed itself the leader after winning an election.
  • At 17:12:34 Consul on n2 marked n3 as failed.
  • At 17:12:36 Consul's autopilot on n2 removed the failed server node n3.
  • From then on every Consul instance rejected lock acquisition attempts, noting that the lock delay would expire at 17:12:46.579.
  • At 17:12:44.311 the Jepsen runner healed the cluster partitions.
  • At 17:12:47.648 the app instance on n2 successfully acquired The Lock with its session and started making progress.
  • Up until 17:12:49.944 the app instance on n3 thought that it was in charge of The Lock and was also making progress.
  • Finally, at 17:12:49.944 the app instance on n3 failed to renew its session one more time and then died, thus ceasing to make progress.
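For reference, here is a hedged sketch of what a renewal loop like the one in the first item might look like with the Go API client; it is an illustration of the pattern, not my app's actual code. Renewing every 10 seconds against a 30-second TTL means a session can outlive its last successful renewal by up to 30 seconds, which lines up with n3 only noticing the loss of its session at 17:12:49.944.

keep_session_alive.go (illustrative sketch)
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/hashicorp/consul/api"
)

// keepSessionAlive renews the session every 10 seconds; each successful
// renewal extends it by its TTL (30 seconds here). If a renewal fails or the
// session has disappeared, the caller must stop doing work right away.
func keepSessionAlive(client *api.Client, sessionID string, stop <-chan struct{}) error {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			entry, _, err := client.Session().Renew(sessionID, nil)
			if err != nil {
				return fmt.Errorf("renewing session %s: %v", sessionID, err)
			}
			if entry == nil {
				return fmt.Errorf("session %s no longer exists", sessionID)
			}
		case <-stop:
			return nil
		}
	}
}

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	// The session ID would come from Session().Create, as in the earlier sketch.
	stop := make(chan struct{})
	if err := keepSessionAlive(client, "placeholder-session-id", stop); err != nil {
		log.Println("stepping down:", err)
	}
}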

So there were ~2 seconds during which two processes with the same ID were up and running. I was quite bewildered that autopilot decided to consider n3 as left a mere 2 seconds after it was deemed failed, even though it still had a live session at that moment. I'm also kind of curious what exactly kicked autopilot's logic in.
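For what it's worth, autopilot's dead-server cleanup is governed by its operator configuration, which can be inspected and changed at runtime. Below is an illustrative sketch using the Go API client; disabling cleanup_dead_servers is shown only as a knob one could experiment with, not as a verified fix for this issue.

autopilot_config.go (illustrative sketch)
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	op := client.Operator()

	// Read the current autopilot settings (cleanup_dead_servers,
	// last_contact_threshold, server_stabilization_time, ...).
	conf, err := op.AutopilotGetConfiguration(nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("cleanup_dead_servers = %v\n", conf.CleanupDeadServers)

	// Possible knob to experiment with (an assumption, not a verified fix):
	// stop autopilot from reaping failed servers automatically, so a failed
	// server is not turned into "left" while it may still own live sessions.
	conf.CleanupDeadServers = false
	if err := op.AutopilotSetConfiguration(conf, nil); err != nil {
		log.Fatal(err)
	}
}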

Consul info for Server

Server info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 1
	services = 1
build:
	prerelease = 
	revision = 40cec984
	version = 1.5.1
consul:
	acl = disabled
	bootstrap = false
	known_datacenters = 1
	leader = false
	leader_addr = 172.25.0.4:8300
	server = true
raft:
	applied_index = 134
	commit_index = 134
	fsm_pending = 0
	last_contact = 46.196847ms
	last_log_index = 134
	last_log_term = 2
	last_snapshot_index = 0
	last_snapshot_term = 0
	latest_configuration = [{Suffrage:Voter ID:02b3980b-0827-ecfa-b0c5-72d75a05f34e Address:172.25.0.4:8300} {Suffrage:Voter ID:19e1c36d-b066-77b6-2e67-5198da4b8126 Address:172.25.0.9:8300} {Suffrage:Voter ID:ddb325a6-d6f3-38bb-207f-0d4fd3da256c Address:172.25.0.7:8300} {Suffrage:Voter ID:6d00f848-9f02-9601-2ec1-53d2dd917640 Address:172.25.0.11:8300} {Suffrage:Voter ID:a2134945-006d-d90c-55b9-1b22c27edd8c Address:172.25.0.5:8300}]
	latest_configuration_index = 1
	num_peers = 4
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Follower
	term = 2
runtime:
	arch = amd64
	cpu_count = 8
	goroutines = 94
	max_procs = 8
	os = linux
	version = go1.12.1
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 2
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 6
	members = 5
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 10
	members = 5
	query_queue = 0
	query_time = 1

Operating system and Environment details

Docker Compose environment on Linux 5.3.10-arch1-1 #1 SMP PREEMPT.

Log Fragments

consul.log @ n2 / n3

@hanshasselberg added the type/bug (Feature does not function as expected) label on Nov 14, 2019
@hanshasselberg (Member) commented

Thanks for reporting this and for the detailed explanation. This behavior seems problematic to me.
