
Talos hangs on CordonAndDrain task (before upgrade) #3124

Closed
smira opened this issue Feb 8, 2021 · 1 comment · Fixed by #3213

@smira
Member

smira commented Feb 8, 2021

Bug Report

Logs:

147.75.198.5: user: warning: [2021-02-08T20:14:12.947164964Z]: [talos] upgrade request received: preserve false, staged false
147.75.198.5: user: warning: [2021-02-08T20:14:12.954269964Z]: [talos] validating "ghcr.io/talos-systems/installer:v0.8.1-3-g3a803927"
147.75.198.5: user: warning: [2021-02-08T20:14:20.523549964Z]: [talos] upgrade sequence: 10 phase(s)
147.75.198.5: user: warning: [2021-02-08T20:14:20.528943964Z]: [talos] phase drain (1/10): 1 tasks(s)
147.75.198.5: user: warning: [2021-02-08T20:14:20.534090964Z]: [talos] task cordonAndDrainNode (1/1): starting
147.75.198.5: user: warning: [2021-02-08T20:14:21.167726964Z]: [talos] skipping DaemonSet pod kube-proxy-xt8qd
147.75.198.5: user: warning: [2021-02-08T20:14:21.173534964Z]: [talos] skipping DaemonSet pod csi-cephfsplugin-fsfkk
147.75.198.5: user: warning: [2021-02-08T20:14:21.179839964Z]: [talos] skipping DaemonSet pod csi-rbdplugin-k65jx
147.75.198.5: user: warning: [2021-02-08T20:14:21.185873964Z]: [talos] skipping DaemonSet pod calico-node-zlq4k
147.75.198.5: user: warning: [2021-02-08T20:14:55.664897964Z]: [talos] WARNING: failed to evict pod: failed to evict pod argocd/argocd-redis-ha-server-0: pods "argocd-redis-ha-server-0" is forbidden: node infra-green-general-amd64-2dszt can only evict pods with spec.nodeName set to itself
147.75.198.5: user: warning: [2021-02-08T20:15:21.241713964Z]: [talos] WARNING: failed to evict pod: failed waiting on pod argocd/argocd-redis-ha-haproxy-7c598b8fb5-qrbvs to be deleted: 2 error(s) occurred:
147.75.198.5: user: warning: [2021-02-08T20:15:21.255807964Z]:     pod is still running on the node
147.75.198.5: user: warning: [2021-02-08T20:15:21.260395964Z]:     timeout

Description

The only pod left appears to have been:

rook-ceph                            rook-ceph-crashcollector-infra-green-general-amd64-2dszt-5h9nfs   0/1     Pending     0          12m

Even deleting that pod doesn't unblock Talos.


Environment

  • Talos version: [talosctl version --nodes <problematic nodes>]
  • Kubernetes version: [kubectl version --short]
  • Platform:
@smira
Member Author

smira commented Feb 8, 2021

It turned out the problem was the following pod:

tekton-pipelines                     tekton-pipelines-webhook-6479d769ff-rpnr4                         1/1     Running     1          46d     192.168.53.211    infra-green-general-amd64-2dszt   <none>           <none>

I think we should have a hard deadline for eviction.

@smira smira added this to the v0.9 milestone Feb 8, 2021
smira added a commit to smira/talos that referenced this issue Feb 25, 2021
The critical bug (I believe) was that the drain code re-entered the eviction loop after the wait for pod deletion had returned success, effectively evicting the pod once again after it had been rescheduled to a different node.

Add a global timeout to prevent draining code from running forever.

Filter more pod types which shouldn't be ever drained.

Fixes siderolabs#3124

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
talos-bot pushed a commit that referenced this issue Feb 25, 2021