
Cancel taskrun using entrypoint binary #4618

Conversation

adshmh
Contributor

@adshmh adshmh commented Feb 24, 2022

Changes

TaskRun cancellation is now done through the entrypoint binary, via a new
flag called 'cancel_file'. This removes the need to delete pods in order to
cancel a taskrun, so the logs of cancelled taskruns remain available for
examination. Part of the work on issue #3238
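
Conceptually, the mechanism looks something like the following minimal sketch (illustrative only: the package and helper names, polling interval, and wiring are assumptions, not the PR's actual code). The reconciler writes a file into the pod instead of deleting it, and the entrypoint stops the step once that file appears:

```go
package cancelwatch

import (
	"context"
	"os"
	"time"
)

// waitForCancelFile cancels ctx once the cancel file appears on disk,
// which in turn stops the running step.
func waitForCancelFile(ctx context.Context, cancel context.CancelFunc, path string) {
	for {
		if _, err := os.Stat(path); err == nil {
			cancel() // cancel file written by the reconciler: stop the step
			return
		}
		select {
		case <-ctx.Done():
			return // the step finished on its own
		case <-time.After(100 * time.Millisecond):
		}
	}
}
```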

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Docs included if any changes are user facing
  • Tests included if any functionality added or changed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including
    functionality, content, code)
  • Release notes block below has been filled in or deleted (only if no user facing changes)

Release Notes

Pods corresponding to a cancelled taskrun are no longer deleted: they are stopped instead.

@tekton-robot tekton-robot added the release-note-action-required Denotes a PR that introduces potentially breaking changes that require user action. label Feb 24, 2022
@tekton-robot tekton-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 24, 2022
@tekton-robot
Collaborator

Hi @adshmh. Thanks for your PR.

I'm waiting for a tektoncd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@pritidesai
Member

thanks @adshmh for this! Welcome to the community! 🎉
/ok-to-test

@tekton-robot tekton-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 24, 2022
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File | Old Coverage | New Coverage | Delta
cmd/entrypoint/waiter.go | 80.0% | 82.4% | 2.4
pkg/entrypoint/entrypointer.go | 69.7% | 74.1% | 4.4
pkg/pod/entrypoint.go | 88.0% | 87.7% | -0.3
pkg/reconciler/taskrun/taskrun.go | 80.5% | 80.2% | -0.2

@adshmh adshmh force-pushed the 3238-Cancel-TaskRuns-using-entrypoint-binary branch from d56cac1 to e12f80f on February 24, 2022 at 19:35
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File | Old Coverage | New Coverage | Delta
cmd/entrypoint/waiter.go | 80.0% | 82.4% | 2.4
pkg/entrypoint/entrypointer.go | 69.7% | 74.7% | 5.0
pkg/pod/entrypoint.go | 88.0% | 87.7% | -0.3
pkg/reconciler/taskrun/taskrun.go | 80.5% | 80.2% | -0.2

@vdemeester
Member

vdemeester commented Feb 25, 2022

/assign
/hold

@adshmh thanks for this PR. From reading it at a high level, it doesn't change the way a user asks for cancellation (through updating spec.status), but only how the taskrun reconciler handles the cancel.

I am all for that idea, but it might affect current users somehow, so I wonder if we should have this behind a feature flag that we switch on by default after a few releases, wdyt?

/cc @tektoncd/core-maintainers

@tekton-robot tekton-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 25, 2022
@tekton-robot tekton-robot requested a review from a team February 25, 2022 08:48
@adshmh
Contributor Author

adshmh commented Feb 25, 2022

/assign /hold

@adshmh thanks for this PR. From reading it at a high level, it doesn't change the way a user asks for cancellation (through updating spec.status), but only how the taskrun reconciler handles the cancel.

I am all for that idea, but it might affect current users somehow, so I wonder if we should have this behind a feature flag that we switch on by default after a few releases, wdyt?

Thank you for pointing this out. I will add a feature flag for the new cancel behavior.

Member

@imjasonh imjasonh left a comment


Wow, what a great change! 🎉

Thanks for your contribution, +1 to Vincent's point about a feature flag, but otherwise, this looks great.

if file == "" {
return nil
}
for ; ; time.Sleep(rw.waitPollingInterval) {
if ctx.Err() != nil {
return nil
Member


Suggested change:
- return nil
+ return err

Or probably ignore it if !errors.Is(err, context.Canceled)

Contributor Author


Thank you for the review. Fixed.
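
For readers following the thread, a minimal sketch of what the adjusted loop could look like after this suggestion (the surrounding names come from the quoted snippet; the exact filtering of context.Canceled is an assumption):

```go
for ; ; time.Sleep(rw.waitPollingInterval) {
	if err := ctx.Err(); err != nil {
		// Surface cancellation to the caller instead of swallowing it;
		// callers that expect cancellation can filter it out with
		// errors.Is(err, context.Canceled).
		return err
	}
	// ... poll for the wait file as before ...
}
```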

@@ -38,7 +39,7 @@ func TestRealWaiterWaitMissingFile(t *testing.T) {
rw := realWaiter{}
doneCh := make(chan struct{})
go func() {
- err := rw.setWaitPollingInterval(testWaitPollingInterval).Wait(tmp.Name(), false, false)
+ err := rw.setWaitPollingInterval(testWaitPollingInterval).Wait(context.Background(), tmp.Name(), false, false)
Member


For readability, maybe ctx := context.Background() up at the top of this test, and reuse it throughout.

Contributor Author


Thank you for the review. Fixed.

@@ -114,7 +117,7 @@ func (e Entrypointer) Go() error {
}()

for _, f := range e.WaitFiles {
- if err := e.Waiter.Wait(f, e.WaitFileContent, e.BreakpointOnFailure); err != nil {
+ if err := e.Waiter.Wait(context.Background(), f, e.WaitFileContent, e.BreakpointOnFailure); err != nil {
Member


Could we have Go take a context.Context and use that here? If it's still a context.Background() inside cmd/entrypoint/main.go that's fine, it's just a bit cleaner and makes it easier to add other stuff like cancelling when we get SIGINT, which we might want to do later.

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the review. Fixed.
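
As an aside, the SIGINT idea the reviewer alludes to could later be wired up roughly like this (a hedged sketch using the standard library's signal.NotifyContext, with the os/signal and syscall imports added; this is not part of the PR):

```go
// Inside cmd/entrypoint/main.go, after constructing the Entrypointer e:
// a signal-aware root context makes e.Go(ctx) stop waiting when the
// entrypoint receives SIGINT or SIGTERM.
ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
defer stop()
if err := e.Go(ctx); err != nil {
	// handle the error as before (status, post files, exit code)
}
```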

@@ -160,6 +165,18 @@ func TestEntrypointer(t *testing.T) {
}, {
desc: "breakpointOnFailure to wait or not to wait ",
breakpointOnFailure: true,
}, {
desc: "Runner completes if not cancelled",
cancelFile: ".",
Member


Can we make this cancelfile or something? "." is an odd name for this file.

Contributor Author


Thank you for the review. Fixed.

@adshmh adshmh force-pushed the 3238-Cancel-TaskRuns-using-entrypoint-binary branch from e12f80f to d74bcf9 on March 6, 2022 at 19:08
@tekton-robot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign vdemeester
You can assign the PR to them by writing /assign @vdemeester in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 6, 2022
@adshmh
Contributor Author

adshmh commented Mar 6, 2022

/assign /hold

@adshmh thanks for this PR. From reading it at a high level, it doesn't change the way a user asks for cancellation (through updating spec.status), but only how the taskrun reconciler handles the cancel.

I am all for that idea, but it might affect current users somehow, so I wonder if we should have this behind a feature flag that we switch on by default after a few releases, wdyt?

/cc @tektoncd/core-maintainers

Thank you for the review. The new feature of cancelling using the entrypoint binary is now behind a feature flag, with the default set to false.

@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File | Old Coverage | New Coverage | Delta
cmd/entrypoint/main.go | 14.0% | 13.8% | -0.2
cmd/entrypoint/waiter.go | 80.0% | 82.4% | 2.4
pkg/apis/config/feature_flags.go | 87.8% | 86.0% | -1.8
pkg/entrypoint/entrypointer.go | 69.7% | 73.8% | 4.1
pkg/pod/entrypoint.go | 88.0% | 87.7% | -0.3
pkg/reconciler/taskrun/taskrun.go | 80.5% | 80.5% | 0.0

Member

@vdemeester vdemeester left a comment


2 small questions/comments; otherwise, looks good 👍🏼

if file == "" {
return nil
}
for ; ; time.Sleep(rw.waitPollingInterval) {
if err := ctx.Err(); err != nil {
Member


Do we also want to handle ctx.Done() ? 🤔 Not sure if that case would happen but…

Contributor Author


Thank you for the review. Fixed.
Looking at the code again, I think your suggestion, i.e. ctx.Done(), is more readable. Updated.

On a side note, I increased the polling interval in the unit tests from 10 to 25 milliseconds, as it seems to reduce flakes significantly (from about 5% to around 0.5% in my tests). If this needs to be reverted, please let me know.

const testWaitPollingInterval = 25 * time.Millisecond
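
The two checks are equivalent in effect; the select-based shape the reviewer hints at reads roughly like this (an illustrative sketch reusing names from the quoted snippet, not the PR's final code):

```go
for {
	select {
	case <-ctx.Done():
		return ctx.Err() // cancelled or deadline exceeded
	case <-time.After(rw.waitPollingInterval):
		// poll for the wait file here, as in the original loop body
	}
}
```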

errChan := make(chan error, 1)
go func() {
errChan <- e.Runner.Run(ctx, e.Command...)
cancel()
Member


Does it mean we'll run cancel twice? (here and in the defer)
I don't remember if it's a no-op or if it panics… If it's a no-op we are fine though.

Contributor Author


Thank you for the review. Yes, we will end up calling cancel twice on the context, but that is a no-op (I tested to be sure).
I added the defer cancel() both for readability and to ensure clean-up. If removing it is preferred, please let me know.
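
This matches the documented behavior of context.CancelFunc: after the first call, subsequent calls do nothing. A small self-contained sketch of the pattern under discussion:

```go
package main

import (
	"context"
	"fmt"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel() // second call; documented as a no-op, never a panic

	done := make(chan struct{})
	go func() {
		cancel() // simulates Runner.Run returning, then cancelling
		close(done)
	}()
	<-done
	fmt.Println(ctx.Err()) // context.Canceled
}
```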

@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 17, 2022
@adshmh adshmh force-pushed the 3238-Cancel-TaskRuns-using-entrypoint-binary branch from d74bcf9 to 56bfa1d on March 27, 2022 at 09:19
@tekton-robot tekton-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 27, 2022
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File | Old Coverage | New Coverage | Delta
cmd/entrypoint/main.go | 14.0% | 13.8% | -0.2
cmd/entrypoint/waiter.go | 80.0% | 82.4% | 2.4
pkg/apis/config/feature_flags.go | 88.5% | 87.0% | -1.4
pkg/entrypoint/entrypointer.go | 69.7% | 73.8% | 4.1
pkg/pod/entrypoint.go | 88.0% | 87.7% | -0.3
pkg/reconciler/taskrun/taskrun.go | 79.9% | 80.0% | 0.1

@adshmh adshmh force-pushed the 3238-Cancel-TaskRuns-using-entrypoint-binary branch from 56bfa1d to b5bfc5a on March 27, 2022 at 10:25
@osherdp

osherdp commented May 2, 2022

@adshmh @vdemeester any idea what's left for this change to get in?
If it has positive effects on #4035, it will solve lots of problems for us 😃

Member

@lbernick lbernick left a comment


Thanks for this @adshmh! When a pipelineRun is cancelled, it will also cancel its child TaskRuns. I'm not sure if any of the PipelineRun cancellation logic needs to be changed, but could you add some tests to the PipelineRun reconciler for this new cancellation strategy?

In addition, could you please add some docs on the new feature flag, and update the release note to specify that this behavior is controlled by the feature flag?

@@ -144,7 +147,8 @@ func main() {
log.Printf("non-fatal error copying credentials: %q", err)
}

if err := e.Go(); err != nil {
ctx := context.Background()
Member


should this value of ctx be passed to checkForBreakpointOnFailure, instead of creating a new one in that function?

@@ -59,6 +59,8 @@ const (
DefaultSendCloudEventsForRuns = false
// DefaultEmbeddedStatus is the default value for "embedded-status".
DefaultEmbeddedStatus = FullEmbeddedStatus
// DefaultEnableCancelUsingEntrypoint is the default value for "enable-cancel-using-entrypoint"
DefaultEnableCancelUsingEntrypoint = false
Member


A better name for this feature flag might be something that describes the user-facing behavior changes, e.g. "stopPodOnCancel"


var cancelled bool
if e.CancelFile != "" {
if err := e.Waiter.Wait(ctx, e.CancelFile, true, e.BreakpointOnFailure); err != nil {
Member


It seems a bit weird to reuse Waiter.Wait for a cancel file in this way; for example "breakpointOnFailure" isn't meaningful here.

Just to make sure I understand what this change is doing:

  • we were previously calling Runner.Run in the main thread (which runs the step entrypoint)
  • now we're running Runner.Run in a goroutine and cancelling its context when it returns
  • If there's a cancel file, the main thread waits for it to exist
  • If the cancel file exists, the Waiter returns an error, and the main thread cancels the original context
  • If the cancel file never exists, Runner.Run will eventually complete and this function will return
  • If there is no cancel file, this function will just wait for Runner.Run to complete in the goroutine

I am wondering if there's some opportunity to simplify this logic? It would be helpful to at least include a comment block explaining a bit about what this does because it's a bit hard to parse.
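
One hypothetical simplification (a sketch only, with an assumed watchForFile helper; not the PR's code) would turn the cancel file into a channel and select between the two outcomes:

```go
// stepDone receives the step's result; cancelRequested fires if the
// cancel file appears first.
stepDone := make(chan error, 1)
go func() { stepDone <- e.Runner.Run(ctx, e.Command...) }()

cancelRequested := watchForFile(ctx, e.CancelFile) // hypothetical helper

select {
case err := <-stepDone:
	return err // the step finished before any cancellation
case <-cancelRequested:
	cancel()          // stop the running step
	return <-stepDone // wait for the runner to observe the cancellation
}
```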

Contributor Author


Thank you for the review. Indeed, the logic here could benefit from some simplification. I will take another pass to see if it can be simplified without making too many modifications elsewhere.

if err == context.DeadlineExceeded {
output = append(output, v1beta1.PipelineResourceResult{
Key: "Reason",
Value: "TimeoutExceeded",
ResultType: v1beta1.InternalTektonResultType,
})
} else if cancelled {
Member


I don't think this is the right type of error to return-- this seems like something that should be handled by the reconciler. In particular, PipelineResourceResult doesn't seem appropriate as it's not related to pipelineresources.

Member


@lbernick I think the name is a bit unfortunate, as this is the struct that we use for returning Results 🙃
We could rename it at some point.

cancelFile: "cancelFile",
waiter: &contextWaiter{duration: 10 * time.Millisecond},
runner: &fakeLongRunner{duration: 30 * time.Millisecond},
shouldCancel: true,
}} {
t.Run(c.desc, func(t *testing.T) {
fw, fr, fpw := &fakeWaiter{}, &fakeRunner{}, &fakePostWriter{}
Member


would it be possible to update the fakes to reflect the new behavior options, instead of using different kinds of fakes depending on what the test case is testing?

// If a pod is associated with the TaskRun, it stops it.
// failTaskRun may return an error if the pod could not be deleted.
// failTaskRun may update the local TaskRun status, but it won't push the updates to etcd.
func (c *Reconciler) failTaskRun(ctx context.Context, tr *v1beta1.TaskRun, reason v1beta1.TaskRunReason, message string) error {
Member


Instead of two separate functions, could this be one function that takes a parameter determining whether to cancel or fail the taskrun?
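
A hedged sketch of the single-function shape being suggested, reusing names from the quoted snippet (the body is illustrative only, not the PR's code; finishTaskRun and stopPod are hypothetical names, and the errors package import is assumed):

```go
// finishTaskRun is a hypothetical merge of the cancel and fail paths:
// the caller picks the outcome via the TaskRunReason it passes in.
func (c *Reconciler) finishTaskRun(ctx context.Context, tr *v1beta1.TaskRun,
	reason v1beta1.TaskRunReason, message string) error {
	// Mark the Succeeded condition false with the supplied reason, then
	// stop the pod (or delete it, under the legacy behavior).
	tr.Status.MarkResourceFailed(reason, errors.New(message))
	return c.stopPod(ctx, tr) // hypothetical pod-stopping step
}
```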

@vdemeester
Member

Thanks for this @adshmh! When a pipelineRun is cancelled, it will also cancel its child TaskRuns. I'm not sure if any of the PipelineRun cancellation logic needs to be changed, but could you add some tests to the PipelineRun reconciler for this new cancellation strategy?

From the PipelineRun perspective, nothing changes 👼🏼

@vdemeester vdemeester added this to the Pipelines v0.36 milestone May 3, 2022
@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 9, 2022
@tekton-robot
Collaborator

@adshmh: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@afrittoli
Member

This is unfortunately not ready to merge, so I'll move it to the next milestone

@afrittoli afrittoli removed this from the Pipelines v0.38 milestone Jun 28, 2022
@afrittoli
Member

Removing from the milestone until someone is actively working on this.

@vdemeester
Member

/assign
I'll try to carry this patch later this week 👼🏼

@adshmh
Contributor Author

adshmh commented Jul 4, 2022

/assign
I'll try to carry this patch later this week 👼🏼

Sorry for the delay on this. I will follow up by addressing the remaining concerns in a few days.

@vdemeester
Member

Carrying on #5401

@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 29, 2022
@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 29, 2022
@jerop
Member

jerop commented Jan 17, 2023

doing a clean up of stale pull requests - feel free to reopen if you pick up this work again

newer pull request in #5401

/close

@tekton-robot
Collaborator

@jerop: Closed this PR.

In response to this:

doing a clean up of stale pull requests - feel free to reopen if you pick up this work again

newer pull request in #5401

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
  • do-not-merge/hold — Indicates that a PR should not merge because someone has issued a /hold command.
  • lifecycle/rotten — Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • needs-rebase — Indicates a PR cannot be merged because it has merge conflicts with HEAD.
  • ok-to-test — Indicates a non-member PR verified by an org member that is safe to test.
  • release-note-action-required — Denotes a PR that introduces potentially breaking changes that require user action.
  • size/XXL — Denotes a PR that changes 1000+ lines, ignoring generated files.
9 participants