fix: stop networkd before leaving etcd on 'reset' path #3590

Merged on May 7, 2021 (1 commit)

@smira smira commented May 7, 2021

The problem is with the VIP and the `reset` sequence. Previously, `etcd`
was stopped first while `networkd` was still running. If the node owned
the VIP at the time of the reset action, the lease was lost (as the
client connection was gone), so the VIP stayed unassigned for a pretty
long time.

This PR changes the order of operations: first stop `networkd` and
other pods, and leave `etcd` last, so that the VIP is released and
`kube-apiserver`, for example, isn't left hanging on the node while
`etcd` is gone.
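
The new ordering can be sketched roughly as below. This is only an illustration of the sequence described above, not the actual Talos sequencer code: the `step` type, `runSequence`, `resetSteps`, and the step names (`stopNetworkd`, `stopPods`, `leaveEtcd`) are all hypothetical, and each step is a no-op stand-in.

```go
package main

import "fmt"

// step is a hypothetical unit of work in a shutdown sequence.
type step struct {
	name string
	run  func() error
}

// runSequence runs the steps in order, stops at the first error,
// and returns the names of the steps that completed.
func runSequence(steps []step) ([]string, error) {
	done := make([]string, 0, len(steps))
	for _, s := range steps {
		if err := s.run(); err != nil {
			return done, fmt.Errorf("step %q failed: %w", s.name, err)
		}
		done = append(done, s.name)
	}
	return done, nil
}

// noop stands in for the real work of each step.
func noop() error { return nil }

// resetSteps lists the order described in this PR: networkd first
// (so the VIP is released), then the other pods, and etcd last.
// The step names are illustrative assumptions.
func resetSteps() []step {
	return []step{
		{name: "stopNetworkd", run: noop},
		{name: "stopPods", run: noop},
		{name: "leaveEtcd", run: noop},
	}
}

func main() {
	done, err := runSequence(resetSteps())
	if err != nil {
		fmt.Println("reset aborted:", err)
		return
	}
	for i, name := range done {
		fmt.Printf("%d. %s\n", i+1, name)
	}
}
```

Keeping the order in a single ordered list makes it visible in one place that `etcd` is left only after `networkd` has had a chance to release the VIP.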

Fixes #3500

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>

@smira smira added this to the 0.11 milestone May 7, 2021

smira commented May 7, 2021

/approve


smira commented May 7, 2021

/promote integration-qemu-encrypted-vip


smira commented May 7, 2021

/lgtm

@talos-bot talos-bot merged commit 4ffd7c0 into siderolabs:master May 7, 2021
smira added a commit to smira/talos that referenced this pull request May 14, 2021
The change is essentially the same as siderolabs#3590, but applied to the
upgrade path, which is very similar to the reset path.

We have to stop networkd (and remove the VIP and the lease on the VIP)
before we leave and stop etcd. We also stop kube-apiserver before etcd
is stopped, so that we don't have an unhealthy kube-apiserver.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
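
The ordering constraints stated in this commit message could be checked with a small helper like the one below. This is a sketch under assumptions, not code from the Talos repository: the step names and the functions `indexOf`, `mustPrecede`, and `checkUpgradeOrder` are hypothetical, the rule that "leave" comes before "stop" for etcd is inferred from the phrase "leave and stop etcd", and the "bad" order in `main` is an invented counterexample rather than the exact previous upgrade sequence.

```go
package main

import "fmt"

// indexOf returns the position of name in seq, or -1 if it is absent.
func indexOf(seq []string, name string) int {
	for i, s := range seq {
		if s == name {
			return i
		}
	}
	return -1
}

// mustPrecede returns an error unless both steps are present
// and first appears before second.
func mustPrecede(seq []string, first, second string) error {
	i, j := indexOf(seq, first), indexOf(seq, second)
	if i < 0 || j < 0 {
		return fmt.Errorf("missing step %q or %q", first, second)
	}
	if i >= j {
		return fmt.Errorf("%q must run before %q", first, second)
	}
	return nil
}

// checkUpgradeOrder encodes the constraints from the commit message,
// using hypothetical step names.
func checkUpgradeOrder(seq []string) error {
	rules := [][2]string{
		{"stopNetworkd", "leaveEtcd"},     // release the VIP before leaving etcd
		{"stopKubeAPIServer", "stopEtcd"}, // avoid an unhealthy kube-apiserver
		{"leaveEtcd", "stopEtcd"},         // assumed: leave the cluster, then stop
	}
	for _, r := range rules {
		if err := mustPrecede(seq, r[0], r[1]); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	bad := []string{"leaveEtcd", "stopEtcd", "stopNetworkd", "stopKubeAPIServer"}
	good := []string{"stopNetworkd", "stopKubeAPIServer", "leaveEtcd", "stopEtcd"}
	fmt.Println("bad order: ", checkUpgradeOrder(bad))
	fmt.Println("good order:", checkUpgradeOrder(good))
}
```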
talos-bot pushed a commit that referenced this pull request May 14, 2021
smira added a commit to smira/talos that referenced this pull request May 20, 2021
(cherry picked from commit 0825cf1)
smira added a commit that referenced this pull request May 20, 2021
(cherry picked from commit 0825cf1)
Successfully merging this pull request may close these issues.

e2e with VIP: resetting node fails with control plane endpoint being down