-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cancel taskrun using entrypoint binary #4618
Cancel taskrun using entrypoint binary #4618
Conversation
Hi @adshmh. Thanks for your PR. I'm waiting for a tektoncd member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
thanks @adshmh for this! Welcome to the community! 🎉 |
The following is the coverage report on the affected files.
|
d56cac1
to
e12f80f
Compare
The following is the coverage report on the affected files.
|
/assign @adshmh thanks for this PR. From reading it at a high level, it doesn't change the way a user ask for cancel (through update I am all for that idea, but it might affect current user somehow, so I wonder if we should have this behind a feature flag that we switch on by default after a few release, wdyt ? /cc @tektoncd/core-maintainers |
Thank you for pointing this out. I will add a feature flag for the new behavior of cancel. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, what a great change! 🎉
Thanks for your contribution, +1 to Vincent's point about a feature flag, but otherwise, this looks great.
cmd/entrypoint/waiter.go
Outdated
if file == "" { | ||
return nil | ||
} | ||
for ; ; time.Sleep(rw.waitPollingInterval) { | ||
if ctx.Err() != nil { | ||
return nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return nil | |
return err |
Or probably ignore it if !errors.Is(err, context.Canceled)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the review. Fixed.
cmd/entrypoint/waiter_test.go
Outdated
@@ -38,7 +39,7 @@ func TestRealWaiterWaitMissingFile(t *testing.T) { | |||
rw := realWaiter{} | |||
doneCh := make(chan struct{}) | |||
go func() { | |||
err := rw.setWaitPollingInterval(testWaitPollingInterval).Wait(tmp.Name(), false, false) | |||
err := rw.setWaitPollingInterval(testWaitPollingInterval).Wait(context.Background(), tmp.Name(), false, false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For readability, maybe ctx := context.Background()
up at the top of this test, and reuse it throughout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the review. Fixed.
pkg/entrypoint/entrypointer.go
Outdated
@@ -114,7 +117,7 @@ func (e Entrypointer) Go() error { | |||
}() | |||
|
|||
for _, f := range e.WaitFiles { | |||
if err := e.Waiter.Wait(f, e.WaitFileContent, e.BreakpointOnFailure); err != nil { | |||
if err := e.Waiter.Wait(context.Background(), f, e.WaitFileContent, e.BreakpointOnFailure); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we have Go
take a context.Context
and use that here? If it's still a context.Background()
inside cmd/entrypoint/main.go
that's fine, it's just a bit cleaner and makes it easier to add other stuff like cancelling when we get SIGINT, which we might want to do later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the review. Fixed.
pkg/entrypoint/entrypointer_test.go
Outdated
@@ -160,6 +165,18 @@ func TestEntrypointer(t *testing.T) { | |||
}, { | |||
desc: "breakpointOnFailure to wait or not to wait ", | |||
breakpointOnFailure: true, | |||
}, { | |||
desc: "Runner completes if not cancelled", | |||
cancelFile: ".", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make this cancelfile
or something? "."
is an odd name for this file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the review. Fixed.
e12f80f
to
d74bcf9
Compare
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Thank you for the review. The new feature of cancelling using the entrypoint binary is now behind a feature flag, with default set to false. |
The following is the coverage report on the affected files.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 small questions / comments, otherwise, looks good 👍🏼
cmd/entrypoint/waiter.go
Outdated
if file == "" { | ||
return nil | ||
} | ||
for ; ; time.Sleep(rw.waitPollingInterval) { | ||
if err := ctx.Err(); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we also want to handle ctx.Done()
? 🤔 Not sure if that case would happen but…
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the review. Fixed.
Looking at the code again, I think using your suggestion, i.e. ctx.Done()
is more readable. Updated.
On a side note, I increased the polling interval on the unit tests from 10 to 25 milliseconds, as it seems to reduce flakes significantly (from about 5% to around 0,5% in my tests). If this needs to be reverted, please let me know.
const testWaitPollingInterval = 25 * time.Millisecond
errChan := make(chan error, 1) | ||
go func() { | ||
errChan <- e.Runner.Run(ctx, e.Command...) | ||
cancel() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it mean we'll run cancel
twice ? (here and the defer
)
I don't remember if it's a no-op or if it panics.. If it's a no-op we are fine though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the review. Yes, we will end up calling cancel
twice on the context, but that is no-op (I tested to be sure).
I added the defer cancel()
for both readability and ensuring clean-up. If removing it is preferred, please let me know.
d74bcf9
to
56bfa1d
Compare
The following is the coverage report on the affected files.
|
56bfa1d
to
b5bfc5a
Compare
@adshmh @vdemeester any idea what's left for this change to get in? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this @adshmh! When a pipelineRun is cancelled, it will also cancel its child TaskRuns. I'm not sure if any of the PipelineRun cancellation logic needs to be changed, but could you add some tests to the PipelineRun reconciler for this new cancellation strategy?
In addition, could you please add some docs on the new feature flag, and update the release note to specify that this behavior is controlled by the feature flag?
@@ -144,7 +147,8 @@ func main() { | |||
log.Printf("non-fatal error copying credentials: %q", err) | |||
} | |||
|
|||
if err := e.Go(); err != nil { | |||
ctx := context.Background() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this value of ctx be passed to checkForBreakpointOnFailure
, instead of creating a new one in that function?
@@ -59,6 +59,8 @@ const ( | |||
DefaultSendCloudEventsForRuns = false | |||
// DefaultEmbeddedStatus is the default value for "embedded-status". | |||
DefaultEmbeddedStatus = FullEmbeddedStatus | |||
// DefaultEnableCancelUsingEntrypoint is the default value for "enable-cancel-using-entrypoint" | |||
DefaultEnableCancelUsingEntrypoint = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A better name for this feature flag might be something that describes the user-facing behavior changes, e.g. "stopPodOnCancel"
|
||
var cancelled bool | ||
if e.CancelFile != "" { | ||
if err := e.Waiter.Wait(ctx, e.CancelFile, true, e.BreakpointOnFailure); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems a bit weird to reuse Waiter.Wait for a cancel file in this way; for example "breakpointOnFailure" isn't meaningful here.
Just to make sure I understand what this change is doing:
- we were previously calling Runner.Run in the main thread (which runs the step entrypoint)
- now we're running Runner.Run in a goroutine and cancelling its context when it returns
- If there's a cancel file, the main thread waits for it to exist
- If the cancel file exists, the Waiter returns an error, and the main thread cancels the original context
- If the cancel file never exists, Runner.Run will eventually complete and this function will return
- If there is no cancel file, this function will just wait for Runner.Run to complete in the goroutine
I am wondering if there's some opportunity to simplify this logic? It would be helpful to at least include a comment block explaining a bit about what this does because it's a bit hard to parse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the review. Indeed the logic here could benefit from some simplification. I will do another review to see if it can be simplified without making too many modifications elsewhere.
if err == context.DeadlineExceeded { | ||
output = append(output, v1beta1.PipelineResourceResult{ | ||
Key: "Reason", | ||
Value: "TimeoutExceeded", | ||
ResultType: v1beta1.InternalTektonResultType, | ||
}) | ||
} else if cancelled { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is the right type of error to return-- this seems like something that should be handled by the reconciler. In particular, PipelineResourceResult doesn't seem appropriate as it's not related to pipelineresources.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lbernick I think the name is a bit infortunate as it is this struct that we use for returning Result
I think 🙃
We could rename it at some point.
cancelFile: "cancelFile", | ||
waiter: &contextWaiter{duration: 10 * time.Millisecond}, | ||
runner: &fakeLongRunner{duration: 30 * time.Millisecond}, | ||
shouldCancel: true, | ||
}} { | ||
t.Run(c.desc, func(t *testing.T) { | ||
fw, fr, fpw := &fakeWaiter{}, &fakeRunner{}, &fakePostWriter{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it be possible to update the fakes to reflect the new behavior options, instead of using different kinds of fakes depending on what the test case is testing?
// If a pod is associated to the TaskRun, it stops it | ||
// failTaskRun function may return an error in case the pod could not be deleted | ||
// failTaskRun may update the local TaskRun status, but it won't push the updates to etcd | ||
func (c *Reconciler) failTaskRun(ctx context.Context, tr *v1beta1.TaskRun, reason v1beta1.TaskRunReason, message string) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of two separate functions, could this be one function that takes a parameter determining whether to cancel or fail the taskrun?
From the |
@adshmh: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
This is unfortunately not ready to merge, so I'll move it to the next milestone |
Removing from the milestone until someone is actively working on this. |
/assign |
Sorry for the delay on this. I will follow up by addressing the remaining concerns in a few days. |
@adshmh did the rebase here if you need : https://github.com/vdemeester/tektoncd-pipeline/tree/3238-Cancel-TaskRuns-using-entrypoint-binary |
Carrying on #5401 |
Issues go stale after 90d of inactivity. /lifecycle stale Send feedback to tektoncd/plumbing. |
Stale issues rot after 30d of inactivity. /lifecycle rotten Send feedback to tektoncd/plumbing. |
doing a clean up of stale pull requests - feel free to reopen if you pick up this work again newer pull request in #5401 /close |
@jerop: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Changes
The cancellation of taskruns is now done through the entrypoint binary
through a new flag called 'cancel_file'. This removes the need for
deleting the pods to cancel a taskrun, allowing examination of the logs
on the pods from cancelled taskruns. Part of work on issue #3238
Submitter Checklist
As the author of this PR, please check off the items in this checklist:
functionality, content, code)
Release Notes