Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Timed Out or Cancelled TaskRun Pods are Deleted #3051

Closed
danielhelfand opened this issue Aug 3, 2020 · 15 comments
Closed

Timed Out or Cancelled TaskRun Pods are Deleted #3051

danielhelfand opened this issue Aug 3, 2020 · 15 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@danielhelfand
Copy link
Member

danielhelfand commented Aug 3, 2020

Expected Behavior

After a TaskRun is timed out or cancelled, I should still be able to view the logs of the failed or cancelled TaskRun.

Actual Behavior

It appears that the pods for TaskRuns are being deleted due to a change implemented in #2365. In the func, failTaskRun in taskrun.go, the pod associated with the TaskRun is being deleted as noted here.

Assuming this is not an expected feature, I would suggest checking for the failure reason before deleting the TaskRun pod:

if reason != v1beta1.TaskRunReasonTimedOut && reason != v1beta1.TaskRunReasonCancelled {
		err := c.KubeClientSet.CoreV1().Pods(tr.Namespace).Delete(tr.Status.PodName, &metav1.DeleteOptions{})

		if err != nil && !k8serrors.IsNotFound(err) {
			logger.Infof("Failed to terminate pod: %v", err)
			return err
		}
}

If this is expected, it would help to document this behavior for TaskRuns as well as recommended best practices for preserving logs.

Steps to Reproduce the Problem

  1. Create a TaskRun with a timeout such that the timeout will be violated or cancel a running TaskRun
  2. Run kubectl get pods to find if the pod of the TaskRun has been deleted

Additional Info

  • Kubernetes version:

N/A

  • Tekton Pipeline version:

Using what is in latest as of v0.15.0.

@danielhelfand danielhelfand added the kind/bug Categorizes issue or PR as related to a bug. label Aug 3, 2020
@danielhelfand
Copy link
Member Author

/assign

Will wait for confirmation on if this is expect for TaskRuns, but can pick this up if it is unintended.

@pritidesai
Copy link
Member

/cc @afrittoli @bobcatfish

@vdemeester
Copy link
Member

vdemeester commented Aug 4, 2020

It appears that the pods for TaskRuns are being deleted due to a change implemented in #2365. In the func, failTaskRun in taskrun.go, the pod associated with the TaskRun is being deleted as noted here.

#2365 didn't change that behavior (see https://github.com/tektoncd/pipeline/pull/2365/files#diff-0239ea30db655388fa943dffd3b7a2e6L49), the cancellation feature always worked by deleting the pod. That was, at the time, the only way to do it. As @imjasonh said somewhere, we may be able to do that now using the entrypoint and some signal (what docker kill was doing, not sure the equivalent in k8s yet).

@danielhelfand
Copy link
Member Author

Yes, my misunderstanding, but now it makes sense. I now see this is what is required to actually stop the TaskRun and guess I never noticed the behavior. So I guess what could be nice here are two things:

  1. This behavior should be documented under the TaskRun documentation. My original thought when this came up was to see if this was expected by looking at docs, but now nothing is there.
  2. This should be changed to a feature to find a way to preserve logs. By either the entrypoint approach or something like Jason mentioned on Slack:

In general we should have users expect to find logs in some log persistence solution (e.g., Stackdriver) and not rely on the Pod still being around at all

@bobcatfish
Copy link
Collaborator

Copying over from slack, #381 is when we first added this and it looks like we've been deleting the pods all along - though that surprised me too and I agree it's inconsistent with how we treat other pods (and makes it harder for dashboard folks :S)

@bobcatfish
Copy link
Collaborator

@imjasonh created an alternate proposal in #2559

@tekton-robot
Copy link
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 12, 2020
@ghost
Copy link

ghost commented Nov 13, 2020

/remove-lifecycle stale

@tekton-robot tekton-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 13, 2020
@tekton-robot
Copy link
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 11, 2021
@tekton-robot
Copy link
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 13, 2021
@tekton-robot
Copy link
Collaborator

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot
Copy link
Collaborator

@tekton-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@pritidesai
Copy link
Member

/reopen

I don't think we have reached to resolution for this yet ...

@pritidesai pritidesai reopened this May 7, 2021
@tekton-robot
Copy link
Collaborator

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot
Copy link
Collaborator

@tekton-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
Projects
None yet
Development

No branches or pull requests

5 participants