Backport of docs - remove use of consul leave during upgrade instructions as it caused leadership changes into release/1.15.x #17772

hc-github-team-consul-core
Collaborator

Backport

This PR is auto-generated from #17758 to be assessed for backporting due to the inclusion of the label backport/1.15.

The below text is copied from the body of the original PR.


Note: this needs to be updated on all versions of the docs, so backport labels for 1.13 - 1.16 are in place. I will manually cherry-pick this into PRs against the docs release branches prior to 1.13.

Description

A customer ran into an issue where a leadership election occurred multiple times for each server they upgraded, even though the goal of the process is to ensure the leader is upgraded last. This was caused by the use of consul leave during the upgrade process as they upgraded from Consul to Consul Enterprise.

When upgrading, it is important that the leader goes last. That way the leader, still on the lower Consul version, replicates raft logs to servers that are at the same version or a higher one and are therefore aware of all fields within the raft log.
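As a rough illustration of the leader-last ordering, here is a minimal Python sketch. The `Server` type, the `upgrade_order` helper, and the server names are all hypothetical and are not part of Consul; in practice the leader would be identified from the cluster itself before starting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    # Hypothetical record for illustration; not a Consul type.
    name: str
    is_leader: bool


def upgrade_order(servers):
    """Return server names with all followers first and the leader last."""
    leaders = [s for s in servers if s.is_leader]
    if len(leaders) != 1:
        # Refuse to plan an upgrade if leadership is unclear.
        raise ValueError(f"expected exactly one leader, found {len(leaders)}")
    followers = [s.name for s in servers if not s.is_leader]
    return followers + [leaders[0].name]


servers = [
    Server("server-a", False),
    Server("server-b", True),
    Server("server-c", False),
]
print(upgrade_order(servers))  # ['server-a', 'server-c', 'server-b']
```

If leadership changes partway through the upgrade, an order computed up front like this is no longer valid, which is the problem described below.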

When using consul leave during the upgrade process, the following was observed.

Observed when shutting down

The following occurred when consul leave was issued:

  • shutdown starts
  • server leaves the cluster
  • server is draining and continues the shutdown process
  • raft is not turned off on the server, so it can experience heartbeat timeouts (since it left the cluster), start new elections, and drive up its term (ex: cluster has a term of 100 and server being upgraded has a term of 104) until it shuts down
  • server shuts down

This happened on multiple servers, and each server being upgraded ended up with a term several higher than that of the leader and the rest of the cluster.
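The term inflation in the steps above can be sketched with a small model. This is a simplified illustration, not Consul's Raft implementation: it only assumes that a server increments its own term each time a heartbeat timeout causes it to start an election. The function name `term_after_draining` and the timeout count are hypothetical.

```python
def term_after_draining(cluster_term, election_timeouts):
    """Model the term of a server that has left the cluster but is still running raft.

    Each heartbeat timeout makes the isolated server start a new election,
    and starting an election increments its own term by one.
    """
    term = cluster_term
    for _ in range(election_timeouts):
        term += 1  # new election started, term goes up
    return term


# Cluster is at term 100; the draining server times out 4 times before shutdown.
print(term_after_draining(100, 4))  # 104
```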

At this point the server is shut down and has the new consul binary.

Observed when restarting

The instructions then have the user start the server using something like systemctl start. At this point, the following was observed:

  • server starts up and joins cluster
  • leader replicates logs to the other servers successfully
  • leader reaches the upgraded server and finds that it has a higher term
  • leader steps down, assuming there could be a new leader
  • there is no leader and leadership is lost
  • the servers start a new election term and one of them is elected leader

This loop of losing leadership, starting a new election, and electing a new leader continues until the term of the cluster matches the term of the upgraded server. In the earlier example, where the cluster had a term of 100 and the upgraded server had a term of 104, this loop would occur 4 times.
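The arithmetic behind that count can be written out as a short sketch. It models the loop exactly as described in this PR (one lost leadership and one new election term per iteration) and is not meant to reproduce the real Raft state machine; `leadership_losses` is a hypothetical name.

```python
def leadership_losses(cluster_term, upgraded_term):
    """Count how many times leadership is lost before the terms match."""
    losses = 0
    while cluster_term < upgraded_term:
        losses += 1        # leader sees the higher term and steps down
        cluster_term += 1  # the election that follows starts a new term
    return losses


print(leadership_losses(100, 104))  # 4
```

Under this model, a server that rejoins with the same term as the cluster causes no leadership loss at all.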

At this point, the upgrade has gone through multiple leader elections and has been destabilized. It is highly probable that the leader is now a different server, so the plan to upgrade the leader last is compromised.


Overview of commits

@hc-github-team-consul-core hc-github-team-consul-core requested a review from a team as a code owner June 15, 2023 17:07
@hc-github-team-consul-core hc-github-team-consul-core force-pushed the backport/jm/NET-4477/urgently-eager-marmot branch from 7c81037 to 08e7c23 Compare June 15, 2023 17:07
@hc-github-team-consul-core hc-github-team-consul-core enabled auto-merge (squash) June 15, 2023 17:07
@github-actions github-actions bot added the type/docs Documentation needs to be created/updated/clarified label Jun 15, 2023
Auto approved Consul Bot automated PR

@vercel vercel bot temporarily deployed to Preview – consul-ui-staging June 15, 2023 17:10 Inactive
@vercel vercel bot temporarily deployed to Preview – consul June 15, 2023 17:16 Inactive
@hc-github-team-consul-core hc-github-team-consul-core merged commit 07d27a8 into release/1.15.x Jun 15, 2023
@hc-github-team-consul-core hc-github-team-consul-core deleted the backport/jm/NET-4477/urgently-eager-marmot branch June 15, 2023 17:44
Labels
type/docs Documentation needs to be created/updated/clarified

3 participants