Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Leader doesn't step down when establishLeadership fails #5047

Closed
kyhavlov opened this issue Dec 3, 2018 · 3 comments · Fixed by #5247
Closed

Leader doesn't step down when establishLeadership fails #5047

kyhavlov opened this issue Dec 3, 2018 · 3 comments · Fixed by #5247
Assignees
Labels
type/bug Feature does not function as expected
Milestone

Comments

@kyhavlov
Copy link
Contributor

kyhavlov commented Dec 3, 2018

Currently the leaderLoop method will periodically retry the establishLeadership operations until successful, instead of stepping down immediately after a failure. This could theoretically cause problems because it's still able to do other normal leader operations like reconciling nodes from Serf or KV read/writes while it's waiting to retry, which violates the assumption that establishLeadership has to succeed before we can handle requests as the leader.

We should look into this and see if there's any negative consequences to immediately stepping down as leader when the establishLeadership method returns an error.

@hanshasselberg
Copy link
Member

hanshasselberg commented Jan 15, 2019

It seems to me that we do not retry establishLeadership periodically. The leaderLoop reconciles periodically:

case <-interval:
goto RECONCILE
but establishLeadership is not being called from the loop, only once after acquiring leadership:
if !establishedLeader {

establishLeadership is however called by reassert after restoring a snapshot:

case errCh := <-s.reassertLeaderCh:
errCh <- reassert()

reassert revokes leadership before it tries to establish it again.

That being said, I think we are doing everything correctly and there is no need to change anything.

@banks
Copy link
Member

banks commented Jan 17, 2019

The bug I think @i0rek is that establishLeadership does a bunch of work that must succeed for the leader to be in a healthy state. If it fails then the leader is not actually behaving like a leader and many of our invariants break for things like Connect CA, cleaning up state, replication etc.

So I don't think the leader loop should retry establishLeader so much as it should fail hard and exit the leader loop and step down as leader when it fails.

The fact it doesn't is what can leave some specific errors that impact establishLeader to cause the leader to be in a broken state where it is only running with part of it's state and goroutines that we expect it to run so some features work and others get into "impossible" states like being nil when we assume established leader must have X setup etc.

In general an error from establishLeader means that the leader is not able to get into a healthy state where one or more of its expected sub processes are not initialised properly so I don't think it's ever right to continue trying to be leader in that state.

@hanshasselberg
Copy link
Member

There is some progress. I implemented leadership transition in raft and as soon that is merged and revendored in consul, we can finally step down if establishleadership fails. hashicorp/raft#306.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/bug Feature does not function as expected
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants