
Cancel taskrun using entrypoint binary #4618

Conversation

adshmh
Contributor

@adshmh adshmh commented Feb 24, 2022

Changes

TaskRun cancellation is now done through the entrypoint binary, via a new
flag called 'cancel_file'. This removes the need to delete pods in order to
cancel a taskrun, so the logs of cancelled taskruns remain available for
examination. Part of the work on issue #3238
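
Conceptually, the mechanism looks something like the following minimal sketch (illustrative only: the package and helper names, polling interval, and wiring are assumptions, not the PR's actual code). The reconciler writes a file into the pod instead of deleting it, and the entrypoint stops the step once that file appears:

```go
package cancelwatch

import (
	"context"
	"os"
	"time"
)

// waitForCancelFile cancels ctx once the cancel file appears on disk,
// which in turn stops the running step.
func waitForCancelFile(ctx context.Context, cancel context.CancelFunc, path string) {
	for {
		if _, err := os.Stat(path); err == nil {
			cancel() // cancel file written by the reconciler: stop the step
			return
		}
		select {
		case <-ctx.Done():
			return // the step finished on its own
		case <-time.After(100 * time.Millisecond):
		}
	}
}
```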

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Docs included if any changes are user facing
  • Tests included if any functionality added or changed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including
    functionality, content, code)
  • Release notes block below has been filled in or deleted (only if no user facing changes)

Release Notes

Pods corresponding to a cancelled taskrun are no longer deleted: they are stopped instead.

@tekton-robot tekton-robot added the release-note-action-required Denotes a PR that introduces potentially breaking changes that require user action. label Feb 24, 2022
@tekton-robot tekton-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 24, 2022
@tekton-robot
Collaborator

Hi @adshmh. Thanks for your PR.

I'm waiting for a tektoncd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@pritidesai
Member

thanks @adshmh for this! Welcome to the community! 🎉
/ok-to-test

@tekton-robot tekton-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 24, 2022
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File | Old Coverage | New Coverage | Delta
cmd/entrypoint/waiter.go | 80.0% | 82.4% | 2.4
pkg/entrypoint/entrypointer.go | 69.7% | 74.1% | 4.4
pkg/pod/entrypoint.go | 88.0% | 87.7% | -0.3
pkg/reconciler/taskrun/taskrun.go | 80.5% | 80.2% | -0.2

@adshmh adshmh force-pushed the 3238-Cancel-TaskRuns-using-entrypoint-binary branch from d56cac1 to e12f80f on February 24, 2022 at 19:35
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File | Old Coverage | New Coverage | Delta
cmd/entrypoint/waiter.go | 80.0% | 82.4% | 2.4
pkg/entrypoint/entrypointer.go | 69.7% | 74.7% | 5.0
pkg/pod/entrypoint.go | 88.0% | 87.7% | -0.3
pkg/reconciler/taskrun/taskrun.go | 80.5% | 80.2% | -0.2

@vdemeester
Member

vdemeester commented Feb 25, 2022

/assign
/hold

@adshmh thanks for this PR. From reading it at a high level, it doesn't change the way a user asks for cancellation (through updating spec.status), but only how the taskrun reconciler handles the cancel.

I am all for that idea, but it might affect current users somehow, so I wonder if we should have this behind a feature flag that we switch on by default after a few releases, wdyt?

/cc @tektoncd/core-maintainers

@tekton-robot tekton-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 25, 2022
@tekton-robot tekton-robot requested a review from a team February 25, 2022 08:48
@adshmh
Contributor Author

adshmh commented Feb 25, 2022

/assign /hold

@adshmh thanks for this PR. From reading it at a high level, it doesn't change the way a user asks for cancellation (through updating spec.status), but only how the taskrun reconciler handles the cancel.

I am all for that idea, but it might affect current users somehow, so I wonder if we should have this behind a feature flag that we switch on by default after a few releases, wdyt?

Thank you for pointing this out. I will add a feature flag for the new cancel behavior.

Member

@imjasonh imjasonh left a comment


Wow, what a great change! 🎉

Thanks for your contribution, +1 to Vincent's point about a feature flag, but otherwise, this looks great.

if file == "" {
return nil
}
for ; ; time.Sleep(rw.waitPollingInterval) {
if ctx.Err() != nil {
return nil
Member


Suggested change:
- return nil
+ return err

Or probably ignore it if !errors.Is(err, context.Canceled)

Contributor Author


Thank you for the review. Fixed.
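
For readers following the thread, a minimal sketch of what the adjusted loop could look like after this suggestion (the surrounding names come from the quoted snippet; the exact filtering of context.Canceled is an assumption):

```go
for ; ; time.Sleep(rw.waitPollingInterval) {
	if err := ctx.Err(); err != nil {
		// Surface cancellation to the caller instead of swallowing it;
		// callers that expect cancellation can filter it out with
		// errors.Is(err, context.Canceled).
		return err
	}
	// ... poll for the wait file as before ...
}
```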

@@ -38,7 +39,7 @@ func TestRealWaiterWaitMissingFile(t *testing.T) {
rw := realWaiter{}
doneCh := make(chan struct{})
go func() {
- err := rw.setWaitPollingInterval(testWaitPollingInterval).Wait(tmp.Name(), false, false)
+ err := rw.setWaitPollingInterval(testWaitPollingInterval).Wait(context.Background(), tmp.Name(), false, false)
Member


For readability, maybe ctx := context.Background() up at the top of this test, and reuse it throughout.

Contributor Author


Thank you for the review. Fixed.

@@ -114,7 +117,7 @@ func (e Entrypointer) Go() error {
}()

for _, f := range e.WaitFiles {
- if err := e.Waiter.Wait(f, e.WaitFileContent, e.BreakpointOnFailure); err != nil {
+ if err := e.Waiter.Wait(context.Background(), f, e.WaitFileContent, e.BreakpointOnFailure); err != nil {
Member


Could we have Go take a context.Context and use that here? If it's still a context.Background() inside cmd/entrypoint/main.go that's fine, it's just a bit cleaner and makes it easier to add other stuff like cancelling when we get SIGINT, which we might want to do later.

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the review. Fixed.
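
As an aside, the SIGINT idea the reviewer alludes to could later be wired up roughly like this (a hedged sketch using the standard library's signal.NotifyContext, with the os/signal and syscall imports added; this is not part of the PR):

```go
// Inside cmd/entrypoint/main.go, after constructing the Entrypointer e:
// a signal-aware root context makes e.Go(ctx) stop waiting when the
// entrypoint receives SIGINT or SIGTERM.
ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
defer stop()
if err := e.Go(ctx); err != nil {
	// handle the error as before (status, post files, exit code)
}
```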

@@ -160,6 +165,18 @@ func TestEntrypointer(t *testing.T) {
}, {
desc: "breakpointOnFailure to wait or not to wait ",
breakpointOnFailure: true,
}, {
desc: "Runner completes if not cancelled",
cancelFile: ".",
Member


Can we make this cancelfile or something? "." is an odd name for this file.

Contributor Author


Thank you for the review. Fixed.

@adshmh adshmh force-pushed the 3238-Cancel-TaskRuns-using-entrypoint-binary branch from e12f80f to d74bcf9 on March 6, 2022 at 19:08
@tekton-robot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign vdemeester
You can assign the PR to them by writing /assign @vdemeester in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 6, 2022
@adshmh
Contributor Author

adshmh commented Mar 6, 2022

/assign /hold

@adshmh thanks for this PR. From reading it at a high level, it doesn't change the way a user asks for cancellation (through updating spec.status), but only how the taskrun reconciler handles the cancel.

I am all for that idea, but it might affect current users somehow, so I wonder if we should have this behind a feature flag that we switch on by default after a few releases, wdyt?

/cc @tektoncd/core-maintainers

Thank you for the review. The new feature of cancelling using the entrypoint binary is now behind a feature flag, with the default set to false.

@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File | Old Coverage | New Coverage | Delta
cmd/entrypoint/main.go | 14.0% | 13.8% | -0.2
cmd/entrypoint/waiter.go | 80.0% | 82.4% | 2.4
pkg/apis/config/feature_flags.go | 87.8% | 86.0% | -1.8
pkg/entrypoint/entrypointer.go | 69.7% | 73.8% | 4.1
pkg/pod/entrypoint.go | 88.0% | 87.7% | -0.3
pkg/reconciler/taskrun/taskrun.go | 80.5% | 80.5% | 0.0

Member

@vdemeester vdemeester left a comment


2 small questions/comments; otherwise, looks good 👍🏼

if file == "" {
return nil
}
for ; ; time.Sleep(rw.waitPollingInterval) {
if err := ctx.Err(); err != nil {
Member


Do we also want to handle ctx.Done() ? 🤔 Not sure if that case would happen but…

Contributor Author


Thank you for the review. Fixed.
Looking at the code again, I think your suggestion, i.e. ctx.Done(), is more readable. Updated.

On a side note, I increased the polling interval in the unit tests from 10 to 25 milliseconds, as it seems to reduce flakes significantly (from about 5% to around 0.5% in my tests). If this needs to be reverted, please let me know.

const testWaitPollingInterval = 25 * time.Millisecond
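
The two checks are equivalent in effect; the select-based shape the reviewer hints at reads roughly like this (an illustrative sketch reusing names from the quoted snippet, not the PR's final code):

```go
for {
	select {
	case <-ctx.Done():
		return ctx.Err() // cancelled or deadline exceeded
	case <-time.After(rw.waitPollingInterval):
		// poll for the wait file here, as in the original loop body
	}
}
```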

errChan := make(chan error, 1)
go func() {
errChan <- e.Runner.Run(ctx, e.Command...)
cancel()
Member


Does it mean we'll run cancel twice? (here and in the defer)
I don't remember if it's a no-op or if it panics… If it's a no-op we are fine though.

Contributor Author


Thank you for the review. Yes, we will end up calling cancel twice on the context, but that is a no-op (I tested to be sure).
I added the defer cancel() both for readability and to ensure clean-up. If removing it is preferred, please let me know.
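
This matches the documented behavior of context.CancelFunc: after the first call, subsequent calls do nothing. A small self-contained sketch of the pattern under discussion:

```go
package main

import (
	"context"
	"fmt"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel() // second call; documented as a no-op, never a panic

	done := make(chan struct{})
	go func() {
		cancel() // simulates Runner.Run returning, then cancelling
		close(done)
	}()
	<-done
	fmt.Println(ctx.Err()) // context.Canceled
}
```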

@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 17, 2022
@adshmh adshmh force-pushed the 3238-Cancel-TaskRuns-using-entrypoint-binary branch from d74bcf9 to 56bfa1d on March 27, 2022 at 09:19
@tekton-robot tekton-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 27, 2022
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File | Old Coverage | New Coverage | Delta
cmd/entrypoint/main.go | 14.0% | 13.8% | -0.2
cmd/entrypoint/waiter.go | 80.0% | 82.4% | 2.4
pkg/apis/config/feature_flags.go | 88.5% | 87.0% | -1.4
pkg/entrypoint/entrypointer.go | 69.7% | 73.8% | 4.1
pkg/pod/entrypoint.go | 88.0% | 87.7% | -0.3
pkg/reconciler/taskrun/taskrun.go | 79.9% | 80.0% | 0.1

@adshmh adshmh force-pushed the 3238-Cancel-TaskRuns-using-entrypoint-binary branch from 56bfa1d to b5bfc5a on March 27, 2022 at 10:25
@osherdp

osherdp commented May 2, 2022

@adshmh @vdemeester any idea what's left for this change to get in?
If it has positive effects on #4035, it will solve lots of problems for us 😃

Member

@lbernick lbernick left a comment


Thanks for this @adshmh! When a pipelineRun is cancelled, it will also cancel its child TaskRuns. I'm not sure if any of the PipelineRun cancellation logic needs to be changed, but could you add some tests to the PipelineRun reconciler for this new cancellation strategy?

In addition, could you please add some docs on the new feature flag, and update the release note to specify that this behavior is controlled by the feature flag?

@@ -144,7 +147,8 @@ func main() {
log.Printf("non-fatal error copying credentials: %q", err)
}

if err := e.Go(); err != nil {
ctx := context.Background()
Member


should this value of ctx be passed to checkForBreakpointOnFailure, instead of creating a new one in that function?

@@ -59,6 +59,8 @@ const (
DefaultSendCloudEventsForRuns = false
// DefaultEmbeddedStatus is the default value for "embedded-status".
DefaultEmbeddedStatus = FullEmbeddedStatus
// DefaultEnableCancelUsingEntrypoint is the default value for "enable-cancel-using-entrypoint"
DefaultEnableCancelUsingEntrypoint = false
Member


A better name for this feature flag might be something that describes the user-facing behavior changes, e.g. "stopPodOnCancel"


var cancelled bool
if e.CancelFile != "" {
if err := e.Waiter.Wait(ctx, e.CancelFile, true, e.BreakpointOnFailure); err != nil {
Member


It seems a bit weird to reuse Waiter.Wait for a cancel file in this way; for example "breakpointOnFailure" isn't meaningful here.

Just to make sure I understand what this change is doing:

  • we were previously calling Runner.Run in the main thread (which runs the step entrypoint)
  • now we're running Runner.Run in a goroutine and cancelling its context when it returns
  • If there's a cancel file, the main thread waits for it to exist
  • If the cancel file exists, the Waiter returns an error, and the main thread cancels the original context
  • If the cancel file never exists, Runner.Run will eventually complete and this function will return
  • If there is no cancel file, this function will just wait for Runner.Run to complete in the goroutine

I am wondering if there's some opportunity to simplify this logic? It would be helpful to at least include a comment block explaining a bit about what this does because it's a bit hard to parse.
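
One hypothetical simplification (a sketch only, with an assumed watchForFile helper; not the PR's code) would turn the cancel file into a channel and select between the two outcomes:

```go
// stepDone receives the step's result; cancelRequested fires if the
// cancel file appears first.
stepDone := make(chan error, 1)
go func() { stepDone <- e.Runner.Run(ctx, e.Command...) }()

cancelRequested := watchForFile(ctx, e.CancelFile) // hypothetical helper

select {
case err := <-stepDone:
	return err // the step finished before any cancellation
case <-cancelRequested:
	cancel()          // stop the running step
	return <-stepDone // wait for the runner to observe the cancellation
}
```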

Contributor Author


Thank you for the review. Indeed, the logic here could benefit from some simplification. I will take another pass to see if it can be simplified without making too many modifications elsewhere.

if err == context.DeadlineExceeded {
output = append(output, v1beta1.PipelineResourceResult{
Key: "Reason",
Value: "TimeoutExceeded",
ResultType: v1beta1.InternalTektonResultType,
})
} else if cancelled {
Member


I don't think this is the right type of error to return-- this seems like something that should be handled by the reconciler. In particular, PipelineResourceResult doesn't seem appropriate as it's not related to pipelineresources.

Member


@lbernick I think the name is a bit unfortunate, as this is the struct that we use for returning Results 🙃
We could rename it at some point.

cancelFile: "cancelFile",
waiter: &contextWaiter{duration: 10 * time.Millisecond},
runner: &fakeLongRunner{duration: 30 * time.Millisecond},
shouldCancel: true,
}} {
t.Run(c.desc, func(t *testing.T) {
fw, fr, fpw := &fakeWaiter{}, &fakeRunner{}, &fakePostWriter{}
Member


would it be possible to update the fakes to reflect the new behavior options, instead of using different kinds of fakes depending on what the test case is testing?

// If a pod is associated with the TaskRun, it stops it.
// failTaskRun may return an error if the pod could not be deleted.
// failTaskRun may update the local TaskRun status, but it won't push the updates to etcd.
func (c *Reconciler) failTaskRun(ctx context.Context, tr *v1beta1.TaskRun, reason v1beta1.TaskRunReason, message string) error {
Member


Instead of two separate functions, could this be one function that takes a parameter determining whether to cancel or fail the taskrun?
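
A hedged sketch of the single-function shape being suggested, reusing names from the quoted snippet (the body is illustrative only, not the PR's code; finishTaskRun and stopPod are hypothetical names, and the errors package import is assumed):

```go
// finishTaskRun is a hypothetical merge of the cancel and fail paths:
// the caller picks the outcome via the TaskRunReason it passes in.
func (c *Reconciler) finishTaskRun(ctx context.Context, tr *v1beta1.TaskRun,
	reason v1beta1.TaskRunReason, message string) error {
	// Mark the Succeeded condition false with the supplied reason, then
	// stop the pod (or delete it, under the legacy behavior).
	tr.Status.MarkResourceFailed(reason, errors.New(message))
	return c.stopPod(ctx, tr) // hypothetical pod-stopping step
}
```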

@vdemeester
Member

Thanks for this @adshmh! When a pipelineRun is cancelled, it will also cancel its child TaskRuns. I'm not sure if any of the PipelineRun cancellation logic needs to be changed, but could you add some tests to the PipelineRun reconciler for this new cancellation strategy?

From the PipelineRun perspective, nothing changes 👼🏼

@vdemeester vdemeester added this to the Pipelines v0.36 milestone May 3, 2022
@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 9, 2022
@tekton-robot
Collaborator

@adshmh: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@afrittoli
Member

This is unfortunately not ready to merge, so I'll move it to the next milestone

@afrittoli afrittoli removed this from the Pipelines v0.38 milestone Jun 28, 2022
@afrittoli
Member

Removing from the milestone until someone is actively working on this.

@vdemeester
Member

/assign
I'll try to carry this patch later this week 👼🏼

@adshmh
Contributor Author

adshmh commented Jul 4, 2022

/assign
I'll try to carry this patch later this week 👼🏼

Sorry for the delay on this. I will follow up by addressing the remaining concerns in a few days.

@vdemeester
Member

Carrying on #5401

@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 29, 2022
@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 29, 2022
@jerop
Member

jerop commented Jan 17, 2023

doing a clean up of stale pull requests - feel free to reopen if you pick up this work again

newer pull request in #5401

/close

@tekton-robot
Collaborator

@jerop: Closed this PR.

In response to this:

doing a clean up of stale pull requests - feel free to reopen if you pick up this work again

newer pull request in #5401

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
  • do-not-merge/hold — Indicates that a PR should not merge because someone has issued a /hold command.
  • lifecycle/rotten — Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • needs-rebase — Indicates a PR cannot be merged because it has merge conflicts with HEAD.
  • ok-to-test — Indicates a non-member PR verified by an org member that is safe to test.
  • release-note-action-required — Denotes a PR that introduces potentially breaking changes that require user action.
  • size/XXL — Denotes a PR that changes 1000+ lines, ignoring generated files.
9 participants