TestStepTimeout is flaky #8293

Closed
afrittoli opened this issue Sep 25, 2024 · 1 comment · Fixed by #8294
Assignees
Labels
kind/bug: Categorizes issue or PR as related to a bug.
kind/flake: Categorizes issue or PR as related to a flakey test.
priority/critical-urgent: Highest priority. Must be actively worked on as someone's top priority right now.

Comments

@afrittoli (Member)

Expected Behavior

TestStepTimeout works every time.

Actual Behavior

=== NAME  TestStepTimeout
    timeout_test.go:217: step-canceled should have been canceled after step-timeout timed out

Steps to Reproduce the Problem

  1. Run the test on a local cluster. The test creates a TaskRun like the following:
apiVersion: tekton.dev/v1
kind: TaskRun
metadata:
  labels:
    app.kubernetes.io/managed-by: tekton-pipelines
  name: step-timeout-jozymhbz
spec:
  serviceAccountName: default
  taskSpec:
    steps:
    - computeResources: {}
      image: mirror.gcr.io/busybox
      name: no-timeout
      script: sleep 1
      timeout: 2s
    - computeResources: {}
      image: mirror.gcr.io/busybox
      name: timeout
      script: sleep 1
      timeout: 1ms
    - computeResources: {}
      image: mirror.gcr.io/busybox
      name: canceled
      script: sleep 1
  timeout: 1h0m0s

After execution, the steps in the TaskRun status look like this:

  steps:
  - container: step-no-timeout
    imageID: mirror.gcr.io/busybox@sha256:c230832bd3b0be59a6c47ed64294f9ce71e91b327957920b6929a0caa8353140
    name: no-timeout
    terminated:
      containerID: containerd://6addccfa2e844ca5d18bb3da91cfee2f314d8db217dfe59e098895a6e3ff80b1
      exitCode: 0
      finishedAt: "2024-09-25T11:51:28Z"
      reason: Completed
      startedAt: "2024-09-25T11:51:27Z"
    terminationReason: Completed
  - container: step-timeout
    imageID: mirror.gcr.io/busybox@sha256:c230832bd3b0be59a6c47ed64294f9ce71e91b327957920b6929a0caa8353140
    name: timeout
    terminated:
      containerID: containerd://6c287a90e21ae152966cc8a51882055091c0d7c19595d881457b921e643fa63b
      exitCode: 1
      finishedAt: "2024-09-25T11:51:28Z"
      reason: Error
      startedAt: "2024-09-25T11:51:28Z"
    terminationReason: TimeoutExceeded
  - container: step-canceled
    imageID: mirror.gcr.io/busybox@sha256:c230832bd3b0be59a6c47ed64294f9ce71e91b327957920b6929a0caa8353140
    name: canceled
    running:
      startedAt: "2024-09-25T11:51:26Z"

And the pod still exists in the cluster:

step-timeout-jozymhbz-pod                       0/3     Error       0          40m

Additional Info

  • Kubernetes version: v1.29.7-gke.1104000
  • Tekton Pipeline version: main, after release v0.63 (the flake was also seen before that).

afrittoli added the kind/bug label on Sep 25, 2024
afrittoli self-assigned this on Sep 25, 2024
afrittoli added the kind/flake and priority/critical-urgent labels on Sep 25, 2024
afrittoli added this to the Pipeline v0.64 milestone on Sep 25, 2024
@afrittoli (Member, Author)

I was able to reproduce this behaviour on my local cluster.
The test fails because it expects all steps to be marked as "terminated", while one of them is not.

It is not a race between the test finishing and the status update happening: the update is part of the same reconcile cycle, so either all of the step updates are present or none of them is. The test waits until the Pod fails, so this should not be a timing flake.
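
For context, the check behind the failure message above is roughly the following (a hypothetical sketch, not the actual timeout_test.go code; the package and helper names are illustrative):

package test

import (
	"testing"

	v1 "github.com/tektoncd/pipeline/pkg/apis/pipeline/v1"
)

// assertAllStepsTerminated is a hypothetical helper: once the TaskRun has
// failed, every step in its status is expected to carry a terminated state.
// The "canceled" step above only has a running state, so a check like this
// reports it as not canceled.
func assertAllStepsTerminated(t *testing.T, tr *v1.TaskRun) {
	t.Helper()
	for _, step := range tr.Status.Steps {
		if step.Terminated == nil {
			t.Errorf("step %s should have been canceled after step-timeout timed out", step.Name)
		}
	}
}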

I checked my cluster and noticed that the Pod is still there (it was not removed, as normally happens on timeout) and the step status is still not updated after a while. Looking at the reconciler code, I think I found the source of the issue: if the Pod deletion fails (for whatever reason), the steps are never marked as failed:

// Excerpt from the reconciler's failure path: if the Pod deletion above
// returned an error (other than NotFound), the function returns early and
// terminateStepsInPod is never reached, so the steps stay "running".
if err != nil && !k8serrors.IsNotFound(err) {
	logger.Errorf("Failed to terminate pod %s: %v", tr.Status.PodName, err)
	return err
}
terminateStepsInPod(tr, reason)
return nil
}

I will make a PR to fix this. Since Tekton only tries once to delete the Pod to force-stop the execution, I think it is fair to mark the steps as terminated even if the Pod could not be deleted, since from Tekton's point of view they were terminated.
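
A minimal sketch of that change, reusing the excerpt above (the actual diff in the fix PR may differ): move the step update ahead of the error check, so a failed deletion can no longer skip it.

// Sketch only: mark the steps as terminated first, so that a failed Pod
// deletion cannot leave steps reported as still running in the TaskRun status.
terminateStepsInPod(tr, reason)
if err != nil && !k8serrors.IsNotFound(err) {
	logger.Errorf("Failed to terminate pod %s: %v", tr.Status.PodName, err)
	return err
}
return nil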

afrittoli added a commit to afrittoli/pipeline that referenced this issue Sep 25, 2024
Today this is done after the Pod is handled. If Pod deletion fails
for some reason, the steps are not updated.

Tekton only makes one attempt to delete the Pod to try and avoid
that it keeps running even if the TaskRun failed. If this deletion
fails for whatever reason, Tekton should still update the TaskRun
status to mark the steps as failed, as they are failed from
Tekton POV.

Fixes: tektoncd#8293

Signed-off-by: Andrea Frittoli <andrea.frittoli@uk.ibm.com>
tekton-robot pushed a commit that referenced this issue Sep 26, 2024
Today this is done after the Pod is handled. If Pod deletion fails
for some reason, the steps are not updated.

Tekton only makes one attempt to delete the Pod to try and avoid
that it keeps running even if the TaskRun failed. If this deletion
fails for whatever reason, Tekton should still update the TaskRun
status to mark the steps as failed, as they are failed from
Tekton POV.

Fixes: #8293

Signed-off-by: Andrea Frittoli <andrea.frittoli@uk.ibm.com>