
If node that runs the taskrun pod shutdown then retry will not work as expected #6558

Closed
yuzp1996 opened this issue Apr 19, 2023 · 20 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@yuzp1996
Contributor

yuzp1996 commented Apr 19, 2023

Expected Behavior

When the node shuts down, the taskrun pod running on that node fails. If we set retries on the taskrun, we would expect the retry pod to start on a healthy node so that the taskrun can continue to work.

Actual Behavior

The taskrun keeps using the failed pod as the work pod and does not create a new pod for the retry.

Steps to Reproduce the Problem

  1. Create a taskrun with retries in a k8s cluster that has multiple nodes
  2. Shut down the node that is running the pod
  3. Check the status of the taskrun

Additional Info

  • Kubernetes version:

    Output of kubectl version:

Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.6", GitCommit:"ff2c119726cc1f8926fb0585c74b25921e866a28", GitTreeState:"clean", BuildDate:"2023-01-18T19:22:09Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.6", GitCommit:"ff2c119726cc1f8926fb0585c74b25921e866a28", GitTreeState:"clean", BuildDate:"2023-01-18T19:15:26Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}
  • Tekton Pipeline version:
v0.41.0
@yuzp1996 yuzp1996 added the kind/bug Categorizes issue or PR as related to a bug. label Apr 19, 2023
@yuzp1996
Contributor Author

yuzp1996 commented Apr 19, 2023

This is a normal retry: each retry attempt runs in a different pod.

[screenshot: pod list showing a different pod for each retry attempt]

This is an abnormal retry: every retry attempt reuses the first pod.

[screenshot: pod list showing every retry attempt reusing the first pod]

@yuzp1996
Contributor Author

yuzp1996 commented Apr 19, 2023

I found that this is because we only check the container statuses; however, when the node shuts down, the container statuses do not change.

[screenshot: the container-status check in the code]

I also found that kubectl get po returns the correct pod status, Terminating, indicating that the pod is abnormal.
[screenshot: kubectl get po showing the pod in Terminating status]

Since k8s can recognise that the pod is in an abnormal state, I read the related code.

If pod.DeletionTimestamp != nil, k8s treats the pod as being in the Terminating state.

[screenshot: the Kubernetes code that marks a pod with a non-nil DeletionTimestamp as Terminating]
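
Roughly, that display behaviour looks like the following (a simplified paraphrase of the Kubernetes pod printer, not the verbatim code; displayedPodStatus is a hypothetical helper and corev1 is k8s.io/api/core/v1):

// displayedPodStatus paraphrases kubectl's pod printer (simplified): a pod
// whose DeletionTimestamp is set is shown as Terminating, unless the node is
// unreachable ("NodeLost"), in which case it is shown as Unknown.
func displayedPodStatus(pod *corev1.Pod) string {
	if pod.DeletionTimestamp != nil {
		if pod.Status.Reason == "NodeLost" {
			return "Unknown"
		}
		return "Terminating"
	}
	if pod.Status.Reason != "" {
		return pod.Status.Reason
	}
	return string(pod.Status.Phase)
}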

I think we could change the logic for detecting the taskrun pod status like this to resolve the problem:

// DidTaskRunFail checks the status of the pod to decide whether the related taskrun has failed
func DidTaskRunFail(pod *corev1.Pod) bool {
	f := pod.Status.Phase == corev1.PodFailed
	for _, s := range pod.Status.ContainerStatuses {
		if IsContainerStep(s.Name) {
			if s.State.Terminated != nil {
				f = f || s.State.Terminated.ExitCode != 0 || isOOMKilled(s)
			}
		}
	}
+	// when the node is lost, the pod stays stuck in Terminating
+	f = f || pod.DeletionTimestamp != nil

	return f
}
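
A minimal sketch of a unit test for that proposed check (a hypothetical test in pkg/pod, assuming the diff above is applied; imports are testing, time, corev1 "k8s.io/api/core/v1", and metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"):

// TestDidTaskRunFail_NodeLostPod exercises the proposed change: a pod stuck in
// Terminating (non-nil DeletionTimestamp, containers never terminated) should
// be treated as failed so that the retry logic creates a new pod.
func TestDidTaskRunFail_NodeLostPod(t *testing.T) {
	now := metav1.NewTime(time.Now())
	p := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:              "taskrun-pod",
			DeletionTimestamp: &now,
		},
		Status: corev1.PodStatus{
			// The node is gone, so the phase never moves to Failed and the
			// step container never reports a terminated state.
			Phase: corev1.PodRunning,
			ContainerStatuses: []corev1.ContainerStatus{{
				Name:  "step-echo",
				State: corev1.ContainerState{Running: &corev1.ContainerStateRunning{}},
			}},
		},
	}
	if !DidTaskRunFail(p) {
		t.Error("expected DidTaskRunFail to be true for a pod with a non-nil DeletionTimestamp")
	}
}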

@yuzp1996
Contributor Author

/assign

yuzp1996 added a commit to yuzp1996/pipeline that referenced this issue Apr 22, 2023
When the node shuts down, the retry will always reuse the same pod
because the controller cannot recognize that the pod no longer works,
and k8s cannot delete it before the node recovers, so the retry is
effectively not working.

I fix it by checking the DeletionTimestamp to know whether the pod is
actually no longer working.

Fix tektoncd#6558

Signed-off-by: yuzhipeng <zpyu@alauda.io>
yuzp1996 added a commit to yuzp1996/pipeline that referenced this issue Apr 22, 2023
When the node shuts down, the retry will always reuse the same pod
because the controller cannot recognize that the pod no longer works and
k8s cannot delete the pod before the node recovers, which means the retry
effectively does not work.

I fix it by checking the DeletionTimestamp to know whether the pod is
actually no longer working.

Fix tektoncd#6558

Signed-off-by: yuzhipeng <zpyu@alauda.io>
@lbernick
Member

lbernick commented Apr 24, 2023

I think it's worth considering whether we want a TaskRun to handle node failures in general or only in the case of retries. FYI @pritidesai @skaegi

Edit: discussed at API WG today; we only want to create a new pod if a taskrun has retries specified.

@pritidesai
Member

pritidesai commented Apr 24, 2023

Thank you @yuzp1996 for reporting this issue 🙏

[screenshot from the issue description: the same pod name listed for multiple retry attempts]

Why are these pod names the same across multiple attempts? @XinruZhang @lbernick can we please reproduce this 🙏

As per our retry strategies, the pod names should be unique for each retry attempt, if I am not mistaken.

@lbernick
Member

@yuzp1996 behavior of retries was modified in v0.43.0. Are you able to reproduce this bug on v0.43.0 or newer?

@XinruZhang
Member

I reproduced the error -- it seems like this behavior is unrelated to the "retries" feature: evicting the pod of the taskrun that is created out of the following pipelinerun has the same result:

apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: pr-retries-
spec:
  serviceAccountName: 'default'
  pipelineSpec:
    tasks:
      - name: task-abnormal-pod
        taskSpec:
          steps:
          - name: echo
            image: alpine
            script: |
                sleep 50000 && exit 1

We discussed this issue in the API WG today: do we want to retry a taskrun that contains an unfinished pod that is evicted or terminated unexpectedly (node deletion etc)? People in the WG meeting generally agree that we should not retry it.

cc @yuzp1996 @pritidesai @lbernick

@pritidesai
Member

thanks a bunch @XinruZhang 🙏

evicting the pod of the taskrun that is created out of the following pipelinerun has the same result

Please explain what "the same result" means here. What happens to such a pipelineRun without retries? Does the pipelineRun just hang until it times out?

@XinruZhang
Member

XinruZhang commented Apr 24, 2023

Most likely this is the case (I've tested the following behavior):

When the pod is scheduled on the same node that tekton-pipelines-controller and tekton-pipelines-webhook are running on, and we drain that node, all pods on the node are evicted and the reported behavior happens (the same pod is created for the taskrun, simply because the controller logic needs to be re-run [1]).

But if we only evict the pod created for the taskrun (see the eviction sketch at the end of this comment), the taskrun and pipelinerun fail:

Events:
  Type     Reason         Age                    From     Message
  ----     ------         ----                   ----     -------
  Normal   Started        15m                    TaskRun
  Normal   Pending        15m                    TaskRun  Pending
  Normal   Pending        15m                    TaskRun  pod status "Initialized":"False"; message: "containers with incomplete status: [prepare place-scripts]"
  Normal   Pending        15m                    TaskRun  pod status "Initialized":"False"; message: "containers with incomplete status: [place-scripts]"
  Normal   Pending        15m                    TaskRun  pod status "Ready":"False"; message: "containers with unready status: [step-echo]"
  Normal   Running        15m (x2 over 15m)      TaskRun  Not all Steps in the Task have finished executing
  Warning  Failed         6m27s                  TaskRun  "step-echo" exited with code 255 (image: "docker.io/library/alpine@sha256:124c7d2707904eea7431fffe91522a01e5a861a624ee31d03372cc1d138a3126"); for logs run: kubectl -n default logs pr-retries-z5mdd-task-abnormal-retries-pod -c step-echo
  Warning  InternalError  6m26s (x2 over 6m26s)  TaskRun  pods "pr-retries-z5mdd-task-abnormal-retries-pod" not found

Footnotes

  1. https://github.com/tektoncd/pipeline/blob/8e8c163446e8bef885bd628072320f9805529ac9/pkg/reconciler/taskrun/taskrun.go#L437-L504
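
For reference, evicting only the taskrun pod (the second scenario above) can be triggered through the Eviction API; a minimal client-go sketch, where the namespace and pod name are placeholders:

package main

import (
	"context"
	"log"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder namespace and pod name; point these at the taskrun pod under test.
	const ns, podName = "default", "pr-retries-xxxxx-task-abnormal-retries-pod"

	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	// Ask the API server to evict the pod -- the same per-pod operation kubectl drain performs.
	if err := cs.CoreV1().Pods(ns).EvictV1(context.TODO(), &policyv1.Eviction{
		ObjectMeta: metav1.ObjectMeta{Name: podName, Namespace: ns},
	}); err != nil {
		log.Fatal(err)
	}
}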

@XinruZhang
Member

I don't quite understand why "creating the pod" and "updating the podName in the taskrun" would happen in two different reconcile loops -- why is there a case where PodName is not empty while the Pod is not found?

if tr.Status.PodName != "" {
	pod, err = c.podLister.Pods(tr.Namespace).Get(tr.Status.PodName)
	if k8serrors.IsNotFound(err) {
		// Keep going, this will result in the Pod being created below.
	} else if err != nil {
		// This is considered a transient error, so we return error, do not update
		// the task run condition, and return an error which will cause this key to
		// be requeued for reconcile.
		logger.Errorf("Error getting pod %q: %v", tr.Status.PodName, err)
		return err
	}
} else {

@XinruZhang
Member

evicting the pod of the taskrun that is created out of the following pipelinerun has the same result

Please explain what "the same result" means here. What happens to such a pipelineRun without retries? Does the pipelineRun just hang until it times out?

The same result means: the pod will be recreated (with the same PodName as this issue reports), and the PipelineRun will eventually succeed (or fail if that's what's defined by the PipelineTask).

@pritidesai
Member

pritidesai commented Apr 24, 2023

Looking at the history, this check on tr.Status.PodName was removed at one point but then reverted.

The change was introduced in 0f20c35 to stop looking up the pod for a taskRun by name and instead look it up only by label selector:

This adds Reconciler.getPod, which looks up the Pod for a TaskRun by performing a label selector query on Pods, looking for the label we apply to Pods generated by TaskRuns.

If zero Pods are returned, it's the same as .status.podName being "". If multiple Pods are returned, that's an error.

That commit was reverted in #1944 because, back then, multiple pods could be associated with a single taskRun object and it was not easy to identify when to declare a taskRun done.

Further reference: #1689

We can certainly update the pod creation implementation since we have ownerReference implemented now, but I think that would be a nice-to-have and is not what is causing the issue here.
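
For illustration, a minimal sketch of the label-selector lookup described in that commit, assuming the "tekton.dev/taskRun" label that Tekton applies to the pods it creates (not the current implementation; getPodForTaskRun is a hypothetical helper):

// getPodForTaskRun sketches the label-selector lookup from 0f20c35. Zero pods
// is equivalent to .status.podName being ""; more than one pod is an error.
// (podLister is a k8s.io/client-go/listers/core/v1 PodLister; labels is
// k8s.io/apimachinery/pkg/labels; the label key is an assumption here.)
func getPodForTaskRun(podLister corev1listers.PodLister, namespace, taskRunName string) (*corev1.Pod, error) {
	sel := labels.SelectorFromSet(labels.Set{"tekton.dev/taskRun": taskRunName})
	pods, err := podLister.Pods(namespace).List(sel)
	if err != nil {
		return nil, err
	}
	switch len(pods) {
	case 0:
		return nil, nil // no pod yet: the reconciler should create one
	case 1:
		return pods[0], nil
	default:
		return nil, fmt.Errorf("found %d pods for TaskRun %s/%s, expected at most one", len(pods), namespace, taskRunName)
	}
}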

@pritidesai
Member

pritidesai commented Apr 25, 2023

When the pod is scheduled on the same node that tekton-pipelines-controller and tekton-pipelines-webhook are running on, and we drain that node, all pods on the node are evicted and the reported behavior happens (the same pod is created for the taskrun, simply because the controller logic needs to be re-run [1]).

Let's focus our testing for this bug report in which the controller and taskRun pods are on different nodes. There is generally a separation of duties and access controls in actual deployments.

But if we only evict the pod created for the taskrun, the taskrun and pipelinerun fail:

Warning  Failed         6m27s                  TaskRun  "step-echo" exited with code 255 (image: "docker.io/library/alpine@sha256:124c7d2707904eea7431fffe91522a01e5a861a624ee31d03372cc1d138a3126"); for logs run: kubectl -n default logs pr-retries-z5mdd-task-abnormal-retries-pod -c step-echo
Warning  InternalError  6m26s (x2 over 6m26s)  TaskRun  pods "pr-retries-z5mdd-task-abnormal-retries-pod" not found

I am interpreting the status updates here ⬆️ as the taskRun and pipelineRun failing without any retries, which is as expected, right?

How does this InternalError impact the pipelineRun status?

@yuzp1996 can you please try this with a controller newer than 0.43, as @lbernick requested?

@XinruZhang
Member

Thank you @pritidesai for tracing back to the history.

I am interpreting the status updates here ⬆️ as the taskRun and pipelineRun failing without any retries, which is as expected, right?

Yes, the TaskRun and PipelineRun failed directly, which is expected because no Retries were specified.

Once I specify the Retries field, the pod is retried; see the status transitions here:

pr-retries-vmbdz-task-abnormal-retries-pod   1/1     Running   0          41s
pr-retries-vmbdz-task-abnormal-retries-pod   1/1     Terminating   0          51s
pr-retries-vmbdz-task-abnormal-retries-pod   0/1     Terminating   0          52s
pr-retries-vmbdz-task-abnormal-retries-pod   0/1     Terminating   0          52s
pr-retries-vmbdz-task-abnormal-retries-pod   0/1     Terminating   0          52s
pr-retries-vmbdz-task-abnormal-retries-pod-retry1   0/1     Pending       0          0s
pr-retries-vmbdz-task-abnormal-retries-pod-retry1   0/1     Pending       0          0s
pr-retries-vmbdz-task-abnormal-retries-pod-retry1   0/1     Init:0/2      0          1s
pr-retries-vmbdz-task-abnormal-retries-pod-retry1   0/1     Init:1/2      0          2s
pr-retries-vmbdz-task-abnormal-retries-pod-retry1   0/1     PodInitializing   0          3s
pr-retries-vmbdz-task-abnormal-retries-pod-retry1   1/1     Running           0          4s
pr-retries-vmbdz-task-abnormal-retries-pod-retry1   1/1     Running           0          4s

P.S. I'm testing the behavior on the latest Tekton Pipelines code. cc @yuzp1996

@XinruZhang
Member

XinruZhang commented Apr 25, 2023

Let's focus our testing for this bug report in which the controller and taskRun pods are on different nodes. There is generally a separation of duties and access controls in actual deployments.

Agreed that generally the deployments should be separate. Curious: is there any related guideline for end users to follow (or is this an industry standard that we can assume most people know XD)?

Plus, what's the expected behavior if they are deployed on the same node and that node is drained? Should we fail the TaskRun? I would think so, because it's better to align the behavior across different deployment setups.

@pritidesai pritidesai changed the title If node that runs the taskrun pod shutdown then retry will not work as except If node that runs the taskrun pod shutdown then retry will not work as expected May 2, 2023
@yuzp1996
Contributor Author

yuzp1996 commented May 5, 2023

Thank you for your response! I'll try this with a controller above 0.43 and see if the problem persists.

@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 3, 2023
@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 2, 2023
@tekton-robot
Collaborator

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot
Collaborator

@tekton-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
