TestStepTimeout is flaky #8293

Closed
afrittoli opened this issue Sep 25, 2024 · 1 comment · Fixed by #8294
Assignees
Labels
kind/bug: Categorizes issue or PR as related to a bug.
kind/flake: Categorizes issue or PR as related to a flakey test.
priority/critical-urgent: Highest priority. Must be actively worked on as someone's top priority right now.

Comments

@afrittoli (Member)

Expected Behavior

TestStepTimeout works every time.

Actual Behavior

=== NAME  TestStepTimeout
    timeout_test.go:217: step-canceled should have been canceled after step-timeout timed out

Steps to Reproduce the Problem

  1. Run the test on a local cluster. The test creates a TaskRun like the following:
apiVersion: tekton.dev/v1
kind: TaskRun
metadata:
  labels:
    app.kubernetes.io/managed-by: tekton-pipelines
  name: step-timeout-jozymhbz
spec:
  serviceAccountName: default
  taskSpec:
    steps:
    - computeResources: {}
      image: mirror.gcr.io/busybox
      name: no-timeout
      script: sleep 1
      timeout: 2s
    - computeResources: {}
      image: mirror.gcr.io/busybox
      name: timeout
      script: sleep 1
      timeout: 1ms
    - computeResources: {}
      image: mirror.gcr.io/busybox
      name: canceled
      script: sleep 1
  timeout: 1h0m0s

After execution, the steps in the TaskRun status look like this:

  steps:
  - container: step-no-timeout
    imageID: mirror.gcr.io/busybox@sha256:c230832bd3b0be59a6c47ed64294f9ce71e91b327957920b6929a0caa8353140
    name: no-timeout
    terminated:
      containerID: containerd://6addccfa2e844ca5d18bb3da91cfee2f314d8db217dfe59e098895a6e3ff80b1
      exitCode: 0
      finishedAt: "2024-09-25T11:51:28Z"
      reason: Completed
      startedAt: "2024-09-25T11:51:27Z"
    terminationReason: Completed
  - container: step-timeout
    imageID: mirror.gcr.io/busybox@sha256:c230832bd3b0be59a6c47ed64294f9ce71e91b327957920b6929a0caa8353140
    name: timeout
    terminated:
      containerID: containerd://6c287a90e21ae152966cc8a51882055091c0d7c19595d881457b921e643fa63b
      exitCode: 1
      finishedAt: "2024-09-25T11:51:28Z"
      reason: Error
      startedAt: "2024-09-25T11:51:28Z"
    terminationReason: TimeoutExceeded
  - container: step-canceled
    imageID: mirror.gcr.io/busybox@sha256:c230832bd3b0be59a6c47ed64294f9ce71e91b327957920b6929a0caa8353140
    name: canceled
    running:
      startedAt: "2024-09-25T11:51:26Z"

And the pod still exists in the cluster:

step-timeout-jozymhbz-pod                       0/3     Error       0          40m

Additional Info

  • Kubernetes version: v1.29.7-gke.1104000
  • Tekton Pipeline version: main, after release v0.63 (the flake was also seen before that).

afrittoli added the kind/bug label on Sep 25, 2024
afrittoli self-assigned this on Sep 25, 2024
afrittoli added the kind/flake and priority/critical-urgent labels on Sep 25, 2024
afrittoli added this to the Pipeline v0.64 milestone on Sep 25, 2024
@afrittoli (Member, Author)

I was able to reproduce this behaviour on my local cluster.
The test fails because it expects all steps to be marked as "terminated", while one of them is not.

It is not a race between the test finishing and the status update happening: the update is part of the same reconcile cycle, so either all of the step updates are present or none of them is. The test waits until the Pod fails, so this should not be a timing flake.
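
For context, the check behind the failure message above is roughly the following (a hypothetical sketch, not the actual timeout_test.go code; the package and helper names are illustrative):

package test

import (
	"testing"

	v1 "github.com/tektoncd/pipeline/pkg/apis/pipeline/v1"
)

// assertAllStepsTerminated is a hypothetical helper: once the TaskRun has
// failed, every step in its status is expected to carry a terminated state.
// The "canceled" step above only has a running state, so a check like this
// reports it as not canceled.
func assertAllStepsTerminated(t *testing.T, tr *v1.TaskRun) {
	t.Helper()
	for _, step := range tr.Status.Steps {
		if step.Terminated == nil {
			t.Errorf("step %s should have been canceled after step-timeout timed out", step.Name)
		}
	}
}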

I checked my cluster and noticed that the Pod is still there (it was not removed, as normally happens on timeout) and the step status is still not updated after a while. Looking at the reconciler code, I think I found the source of the issue: if the Pod deletion fails (for whatever reason), the steps are never marked as failed:

// Excerpt from the reconciler's failure path: if the Pod deletion above
// returned an error (other than NotFound), the function returns early and
// terminateStepsInPod is never reached, so the steps stay "running".
if err != nil && !k8serrors.IsNotFound(err) {
	logger.Errorf("Failed to terminate pod %s: %v", tr.Status.PodName, err)
	return err
}
terminateStepsInPod(tr, reason)
return nil
}

I will make a PR to fix this. Since Tekton only tries once to delete the Pod to force-stop the execution, I think it is fair to mark the steps as terminated even if the Pod could not be deleted, since from Tekton's point of view they were terminated.
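
A minimal sketch of that change, reusing the excerpt above (the actual diff in the fix PR may differ): move the step update ahead of the error check, so a failed deletion can no longer skip it.

// Sketch only: mark the steps as terminated first, so that a failed Pod
// deletion cannot leave steps reported as still running in the TaskRun status.
terminateStepsInPod(tr, reason)
if err != nil && !k8serrors.IsNotFound(err) {
	logger.Errorf("Failed to terminate pod %s: %v", tr.Status.PodName, err)
	return err
}
return nil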

afrittoli added a commit to afrittoli/pipeline that referenced this issue Sep 25, 2024
Today this is done after the Pod is handled. If Pod deletion fails
for some reason, the steps are not updated.

Tekton only makes one attempt to delete the Pod to try and avoid
that it keeps running even if the TaskRun failed. If this deletion
fails for whatever reason, Tekton should still update the TaskRun
status to mark the steps as failed, as they are failed from
Tekton POV.

Fixes: tektoncd#8293

Signed-off-by: Andrea Frittoli <andrea.frittoli@uk.ibm.com>
tekton-robot pushed a commit that referenced this issue Sep 26, 2024
Today this is done after the Pod is handled. If Pod deletion fails
for some reason, the steps are not updated.

Tekton only makes one attempt to delete the Pod to try and avoid
that it keeps running even if the TaskRun failed. If this deletion
fails for whatever reason, Tekton should still update the TaskRun
status to mark the steps as failed, as they are failed from
Tekton POV.

Fixes: #8293

Signed-off-by: Andrea Frittoli <andrea.frittoli@uk.ibm.com>