Leader doesn't step down when establishLeadership fails #5047

kyhavlov · 2018-12-03T22:58:55Z

Currently the leaderLoop method will periodically retry the establishLeadership operations until successful, instead of stepping down immediately after a failure. This could theoretically cause problems because the server is still able to perform other normal leader operations, such as reconciling nodes from Serf or KV reads/writes, while it waits to retry. That violates the assumption that establishLeadership has to succeed before we can handle requests as the leader.

We should look into this and see whether there are any negative consequences to immediately stepping down as leader when the establishLeadership method returns an error.

hanshasselberg · 2019-01-15T10:11:22Z

It seems to me that we do not retry establishLeadership periodically. The leaderLoop reconciles periodically:

consul/agent/consul/leader.go

Lines 215 to 216 in 3c110d5

    case <-interval:
        goto RECONCILE

but establishLeadership is not being called from the loop, only once after acquiring leadership:

consul/agent/consul/leader.go

Line 169 in 3c110d5

    if !establishedLeader {

establishLeadership is however called by reassert after restoring a snapshot:

consul/agent/consul/leader.go

Lines 221 to 222 in 3c110d5

    case errCh := <-s.reassertLeaderCh:
        errCh <- reassert()

reassert revokes leadership before it tries to establish it again.
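To make the ordering concrete, here is a minimal sketch of a revoke-then-establish helper. It is not the code from leader.go: the function signature, the `established` guard, and the `errNotEstablished` sentinel are assumptions made for illustration only.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotEstablished is a hypothetical sentinel used only in this sketch.
var errNotEstablished = errors.New("leadership has not been established")

// reassert revokes leadership and then establishes it again, in that order.
// The established guard is an assumption of this sketch: it refuses to run
// if leadership was never established in the first place.
func reassert(established bool, revoke, establish func() error) error {
	if !established {
		return errNotEstablished
	}
	if err := revoke(); err != nil {
		return err
	}
	return establish()
}

func main() {
	var calls []string
	revoke := func() error { calls = append(calls, "revoke"); return nil }
	establish := func() error { calls = append(calls, "establish"); return nil }

	err := reassert(true, revoke, establish)
	fmt.Println(calls, err)
}
```

If revoke fails, establish is never called, so the caller sees the first error rather than a partially reasserted state.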

That being said, I think we are doing everything correctly and there is no need to change anything.

banks · 2019-01-17T15:00:40Z

I think the bug, @i0rek, is that establishLeadership does a bunch of work that must succeed for the leader to be in a healthy state. If it fails, then the leader is not actually behaving like a leader, and many of our invariants break for things like Connect CA, cleaning up state, replication, etc.

So I don't think the leader loop should retry establishLeadership so much as it should fail hard, exit the leader loop, and step down as leader when it fails.

Because it doesn't, specific errors that affect establishLeadership can leave the leader in a broken state where it is running with only part of the state and goroutines that we expect it to run. Some features work while others get into "impossible" states, such as something being nil when we assume an established leader must have X set up.

In general, an error from establishLeadership means that the leader cannot get into a healthy state because one or more of its expected subprocesses are not initialised properly, so I don't think it's ever right to continue trying to be leader in that state.
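A rough sketch of the fail-hard behaviour described above, under stated assumptions: `leaderHooks`, `runLeader`, and `errEstablish` are hypothetical names invented for this example, and the `transfer` hook stands in for whatever mechanism the consensus layer offers for giving up leadership. It is not how Consul implements this.

```go
package main

import (
	"errors"
	"fmt"
)

// errEstablish is a hypothetical error used to simulate a failed setup step.
var errEstablish = errors.New("CA initialisation failed")

// leaderHooks bundles the three operations the sketch needs. All names here
// are hypothetical and are not taken from the Consul code base.
type leaderHooks struct {
	establish func() error // sets up leader-only state and goroutines
	revoke    func()       // tears down whatever establish managed to start
	transfer  func() error // asks the consensus layer to hand leadership off
}

// runLeader attempts to establish leadership exactly once. On failure it
// cleans up, tries to hand leadership off, and returns an error so the
// caller leaves the leader loop instead of continuing half-initialised.
func runLeader(h leaderHooks) error {
	err := h.establish()
	if err == nil {
		return nil
	}
	h.revoke()
	if terr := h.transfer(); terr != nil {
		return fmt.Errorf("establish failed: %w (transfer also failed: %v)", err, terr)
	}
	return fmt.Errorf("establish failed, leadership handed off: %w", err)
}

func main() {
	h := leaderHooks{
		establish: func() error { return errEstablish },
		revoke:    func() { fmt.Println("revoked partial leader state") },
		transfer:  func() error { return nil },
	}
	if err := runLeader(h); err != nil {
		fmt.Println("leaving leader loop:", err)
	}
}
```

The point of returning an error in both failure branches is that there is no retry path: the caller can only exit, so the server never serves leader requests with partially initialised state.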

hanshasselberg · 2019-02-01T12:51:08Z

There is some progress. I implemented leadership transition in raft (hashicorp/raft#306), and as soon as that is merged and revendored in consul, we can finally step down if establishLeadership fails.

kyhavlov added the type/bug Feature does not function as expected label Dec 3, 2018

banks mentioned this issue Dec 7, 2018

Cluster becomes unresponsive and does not elect new leader after disk latency spike on leader #3552

Closed

banks added this to the 1.4.1 milestone Dec 7, 2018

pearkes assigned banks Dec 17, 2018

hanshasselberg assigned hanshasselberg and unassigned banks Jan 14, 2019

hanshasselberg mentioned this issue Jan 22, 2019

Transfer leadership when establishLeadership fails #5247

Merged

pearkes modified the milestones: 1.4.1, 1.4.2, 1.4.3 Jan 23, 2019

mkeeler modified the milestones: 1.4.3, 1.4.4 Mar 4, 2019

hanshasselberg modified the milestones: 1.4.4, 1.5 Mar 19, 2019

pearkes modified the milestones: 1.5.0, 1.5.1 Apr 29, 2019

mkeeler modified the milestones: 1.5.1, 1.5.2 May 23, 2019

hanshasselberg closed this as completed in #5247 Jun 19, 2019

notnoop mentioned this issue Jul 20, 2020

Step down leadership on establishLeader failures hashicorp/nomad#8470

Closed
