TEP-0039: Add variables `retries` and `retry-count` (#239)
/assign ImJasonH

/kind tep
First of all, apologies in advance! I kept saying "oh yeah that seems simple I will totally approve it" and then I started to think about it and now I have some questions:

- do we see these values being made available to when expressions? <-- I'm thinking we wouldn't do this, at least initially? but I bet that if we add this variable replacement, someone will ask for it 🤔
- can a task look at the retry count of another task? e.g. can TaskA access the retry count of TaskB? <-- this is something we explicitly avoided with TEP-0028 when we added access to a task's status (Adding TEP for accessing task execution status at runtime, #234 (comment))

I wonder if there's an alternative where instead of something like `tasks.<taskName>.retries` we have something like `context.retries`, and it only exists in the context of a PipelineTask (a new level of scoping to https://github.com/tektoncd/pipeline/blob/master/docs/variables.md#variables-available-in-a-pipeline).

I also wonder if we can run this by @IdanAdar, who opened #2725, and see if this would meet his needs.

Basically I'm thinking that:

- The alternative approach where you emit retry events is a better overall solution; so maybe the better answer is pursuing something like Actions and Notifications for Tekton (tektoncd/pipeline#1740) - making it simpler to express "notifications" in response to Pipeline lifecycle events?
- If we DO add this after all, let's try to keep the scope as small as possible

Interested in your thoughts on this also @afrittoli @vdemeester
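If the variable were scoped to a PipelineTask's own context, usage might look roughly like this (a sketch only; `context.retries` is the name floated above, not a settled API, and `retry-me` is a hypothetical task name):

```yaml
tasks:
  - name: retry-me
    retries: 3
    params:
      - name: max-retries
        # visible only within this PipelineTask's own context;
        # other tasks could not reference it
        value: $(context.retries)
```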
## Proposal

There are 2 variable substitutions that need to be supported: `tasks.<taskName>.retries` and `tasks.<taskName>.retry-count`.
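For illustration, both proposed substitutions reference a task by name; a hedged sketch (`build` is a hypothetical PipelineTask name, not from the TEP):

```yaml
params:
  - name: max-retries
    value: $(tasks.build.retries)      # the retries configured for "build"
  - name: current-attempt
    value: $(tasks.build.retry-count)  # how many retries have happened so far
```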
@pritidesai just wanted to bring this to your attention cuz it feels related to adding status info (https://github.com/tektoncd/community/blob/master/teps/0028-task-execution-status-at-runtime.md#proposal) - feels like `tasks.<taskName>.retries` and `tasks.<taskName>.status` are nicely aligned? (unless we wanted to consider `retries` part of the overall status info, e.g. something like `tasks.<taskName>.status.retries`?)
thanks @bobcatfish

> feels like `tasks.<taskName>.retries` and `tasks.<taskName>.status` are nicely aligned?

yup, `tasks.<taskName>.retries` fits well with `tasks.<taskName>.status`.

I like your suggestion of going with `context` to avoid exposing the retry count of a task to other tasks.
```yaml
- image: ubuntu
  script: |
    #!/usr/bin/env sh
    /path/to/<program-that-fails>
```
it's interesting that this example has to be custom-built to take these values as a param - if you wanted to meet the use case you mentioned and alert Slack in the case of failure, you'd need one Task that both a) does the thing that needs to be retried and b) alerts Slack - unless we let Tasks access the retry count of other running Tasks @_@
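For illustration, the "alerts Slack" half could be a separate Task that receives the count as a param; a rough sketch (the webhook wiring and the `notify-slack` name are assumptions, not part of the TEP):

```yaml
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: notify-slack
spec:
  params:
    - name: num-retries
      type: string
  steps:
    - name: post
      image: curlimages/curl
      script: |
        #!/usr/bin/env sh
        # SLACK_WEBHOOK_URL would come from a Secret-backed env var (not shown);
        # $(params.num-retries) is substituted by Tekton before the script runs
        curl -X POST -H 'Content-type: application/json' \
          --data "{\"text\": \"task failed after $(params.num-retries) retries\"}" \
          "${SLACK_WEBHOOK_URL}"
```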
Sorry for the late response! I agree with your suggestion of limiting `retries` to the task's own context. I should have added this implementation to the doc and made the doc more detailed. Did we decide not to move forward with this TEP, since users can use Pipelines + Triggers to send the notification for now, and Pipelines should have a better way to notify users (tektoncd/pipeline#1740)? I'm glad to help if there's anything I can do to make Pipelines support notifications.
> I'm glad to help if there's anything I can do to make pipelines support notifications.

@afrittoli has now opened #275 about proposing a notifications feature :D

> Did we decide to not move forward with this tep since users can use pipelines + trigger to send the notification for now and pipelines should have a more decent way to notify the users (tektoncd/pipeline#1740)

I don't think we decided definitively - it'd be interesting to hear from @IdanAdar, who opened tektoncd/pipeline#2725, about his thoughts. I think there's a good chance we could add the simplified version of this variable interpolation feature in the meantime. @yaoxiaoqi is this something you still want to work on, or do you want to close this PR for now?
I'd like to continue working on this issue if we decide it can be implemented. If that decision can't be made shortly, this PR can also be closed.
We do emit events when retries run. The body of the taskrun is sent in the payload of events, so as long as we store the retry count in the status (which we must do to keep track) the event receiver has access to that information.
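Since the whole TaskRun body rides along in the event payload, a receiver can derive the retry count from the status; simplified, the relevant fragment looks roughly like this (illustrative field values):

```yaml
status:
  conditions:
    - type: Succeeded
      status: "False"
  retriesStatus:        # one entry per previous failed attempt
    - conditions:
        - type: Succeeded
          status: "False"
  # retry count == len(retriesStatus)
```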
```yaml
name: failed-task
params:
  - name: num-retries
    value: $(tasks.retry-me.retry-count)
```
the task name in the variable is not necessary here. Dropping it would enforce what @bobcatfish mentioned: we must limit exposing the `retries` count of a task to other tasks, i.e. a task can only access its own `retries` count.

```yaml
- name: num-retries
  value: $(tasks.retry-count)
```

or rather

```yaml
- name: num-retries
  value: $(context.task.retry-count)
```

similar to `$(context.task.name)`.
Interesting thought, you mean evaluating I am now wondering whether task

Use case doesn't mandate this and also this should be something just limited to its own task. We might see use case to expose
> Basically I'm thinking that:
>
> 1. The alternative approach where you emit retry events is a better overall solution; so maybe the better answer is pursuing something like tektoncd/pipeline#1740 - making it simpler to express "notifications" in response to Pipeline lifecycle events?
> 2. If we DO add this after all, let's try to keep the scope as small as possible. Interested in your thoughts on this also @afrittoli @vdemeester

As written in the inlined comment, where do we draw the line between tektoncd/pipeline and a full-fledged CI/CD system? At what point do we make the tektoncd/pipeline API more complex (to grasp and to implement) instead of having several components that work well together?

Should Tekton work on an opinionated setup of all its components (to showcase, to be used instead of each component on its own, letting users deal with the "setup" and relations between the components)? Note: I think a Tekton GA (as opposed to a tektoncd/pipeline v1 API) would only make sense for this "opinionated" setup, not if we keep all components on their own.
### User Stories

A `PipelineTask` author wants to send a Slack notification only if the task fails after the specified `retries`, not every time it fails. The user wants fewer interruptions.
Is this the only use case we are envisioning for this?

Is there any other potential use for this feature? Off the top of my head, I could see a need to parametrize the task with the retry-count: for example, if the task provisions something (a cluster, a VM, …) and wants to name it according to the retry-count (in order to keep the previous one around, or something along those lines). This particular use case could be done using different variables though (using the uid, …).
Events seem a better way to satisfy the user's need. An event should be emitted on a retry. The user could have an event listener (from Triggers) that runs a task to send a notification when the desired event is caught. This way, Pipelines would not need to expose another variable.

The drawback is that it would require a lot more setup by the user, such as running Triggers when they might not use it for anything else.
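To make that setup cost concrete, the Triggers-based alternative might look something like this (a sketch, not tested; the CEL filter and cloud-event type string are assumptions, and `taskrun-details` / `send-slack-message` are hypothetical Binding/Template names):

```yaml
apiVersion: triggers.tekton.dev/v1alpha1
kind: EventListener
metadata:
  name: retry-notifier
spec:
  triggers:
    - name: on-taskrun-failed
      interceptors:
        - cel:
            # react only to TaskRun failure cloud events
            filter: "header.match('ce-type', 'dev.tekton.event.taskrun.failed.v1')"
      bindings:
        - ref: taskrun-details
      template:
        ref: send-slack-message
```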
This is where we need to look at the "mission" of Tekton: "Tekton is a powerful and flexible open-source framework for creating CI/CD systems" (from tekton.dev).

- What are we optimizing each component (like `tektoncd/pipeline`) for?
- Is Tekton meant to provide a full-fledged CI/CD system, or just a framework / set of components?
- If Tekton is meant to provide a full-fledged CI/CD system, should it be in `tektoncd/pipeline` or some opinionated "official" setup (that would be available in the `tektoncd` org)?
This is where "native" notifications in Tekton would help.
I need to create a TEP for this since the old doc is out of date.
Notifications with Tekton are possible today via cloud events + triggers, but they require a complex setup by the user. A "native" notification mechanism would provide a way to simplify the setup for the user.
@vdemeester I wonder if it might help at some point to try to draw up some more specific "missions" for each component 🤔 I'm not sure what the answer is for the notifications feature - @afrittoli do you see it being part of Pipelines or Triggers, or a separate project? I could see it either way - but I also could see Pipelines + Triggers being combined into one project (tektoncd/triggers#697)
## Alternatives

Events seem a better way to satisfy the user's need. An event should be emitted on a retry. The user could have an event listener (from Triggers) that runs a task to send a notification when the desired event is caught. This way, Pipelines would not need to expose another variable.
Events are available today.

/test pull-community-teps-lint
We have visited this TEP multiple times in the API WG and are not sure if this is something we want to implement. Please let us know if you have any use cases for this proposal 🙏

Sure thing. tektoncd/pipeline#2725 is the only use case for now. From my point of view, users could also customize their logs according to the exposed retries, which might be a trivial use case.

Closing this for now!

/close

@sbwsg: Closed this PR. In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I believe the code does parameter and context substitution (`ApplyParameters`, `ApplyContexts`) on the entire pipeline spec before determining the next set of TaskRuns to create (or retry). Currently a retry is done by "recycling" the TaskRun. The current TaskRun status is appended to

I suppose there could be code to redo substitution of a retry counter during the recycling of the TaskRun. This means the TaskRun params will only have the retry counter from the current/final retry.
We discussed this TEP during the API WG on Dec 7 (https://docs.google.com/document/d/17PodAxG8hV351fBhSu7Y_OIPhGTVgj6OJ2lPphYYRpU/edit#). We agreed that:

@yaoxiaoqi the TEP will probably have to be rebased for the lint to pass. It may require updating some parts to match the API WG discussion.
```yaml
/path/to/<program-that-fails>
```

In this example, Tekton will echo `This is the last retry` when the retry count is equal to 5.
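Reconstructing the example under discussion (a sketch; the `context.pipelineTask.*` names follow this revision of the TEP and are not a settled API):

```yaml
- name: retry-me
  retries: 5
  taskSpec:
    steps:
      - image: ubuntu
        script: |
          #!/usr/bin/env sh
          # both variables are substituted by Tekton before the script runs
          if [ "$(context.pipelineTask.retry-count)" = "$(context.pipelineTask.retries)" ]; then
            echo "This is the last retry"
          fi
          /path/to/<program-that-fails>
```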
My understanding is this is not possible. `$(context.pipelineTask.retries)` is 5 in this example since it's specified in the schema, but `$(context.pipelineTask.retry-count)` cannot be equal to the number of retries. `$(context.pipelineTask.retry-count)` represents the length of `RetriesStatus`, which would be 4 when the fifth retry starts, unless you always add one to it 🙃 So the very first execution, without any `RetriesStatus`, would set `$(context.pipelineTask.retry-count)` to 1?
the controller declares a task failure when the number of `RetriesStatus` entries is >= `retries`, i.e. the controller does not schedule that task once the number of `RetriesStatus` entries matches `retries`:
It's possible as far as I know. The replacement of contexts is meant to happen after the update of `RetriesStatus`. For example, let's say `tr` is a PipelineTask that is meant to fail, and its `Retries` is set to 3. We might assume that the controller would run `tr` 3 times in total, but according to our code, `tr` will run 4 times in total (because of this line, it executes before appending `RetriesStatus`).

Let's say the order of the TaskRun and its retries is `tr#0 -> tr_retry#1 -> tr_retry#2 -> tr_retry#3`. When creating `tr_retry#3`, the `RetriesStatus` would contain the statuses of `tr#0`, `tr_retry#1`, and `tr_retry#2`. So the retry-count is 3, and it equals the PipelineTask's `Retries`. That's why the example works.
> When creating tr_retry#3, the RetriesStatus would contain the status of tr#0, tr_retry#1, and tr_retry#2. So the retry-count is 3, and it equals the PipelineTask Retries. That's why the example works.

`tr#0`'s status is captured in `.status.taskRuns[].status`. After `tr#0` fails and before creating (updating) the TaskRun, i.e. `tr_retry#1`, the status is copied to `.status.taskRuns[].status.retriesStatus` and `.status.taskRuns[].status` is cleared, which makes `len(retriesStatus)` equal to 1.

But the context is applied before copying the status into `retriesStatus`, i.e. `.status.taskRuns[].status` has the failure but `.status.taskRuns[].status.retriesStatus` is empty.
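The sequencing being debated can be sketched as two snapshots of the status (simplified, illustrative values):

```yaml
# 1. tr#0 has just failed: the failure sits in the live status and
#    retriesStatus is still empty (what the context substitution sees)
status:
  conditions:
    - type: Succeeded
      status: "False"
  retriesStatus: []
---
# 2. just before tr_retry#1 runs: the failure has been copied into
#    retriesStatus and the live status cleared, so len(retriesStatus) == 1
status:
  retriesStatus:
    - conditions:
        - type: Succeeded
          status: "False"
```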
> But the context is applied before copying the status into retriesStatus i.e. `.status.taskRuns[].status` has failure but `.status.taskRuns[].status.retriesStatus` is empty.

I plan to only replace `context.pipelineTask.retries` in this `ApplyContexts`, since it's a constant variable. Once the PipelineRun is applied, it's not going to change.

As for `context.pipelineTask.retry-count`, the one that will change according to the number of retries, I'm going to replace it here, in the `ApplyContexts` for `TaskRun`. This happens after copying the status into `retriesStatus`, so we don't need to add one, because the length of `.status.taskRuns[].status.retriesStatus` is correct here. But the problem might be whether the name `context.pipelineTask.retry-count` is appropriate or not.

To be honest, I planned to replace both variables in the `ApplyContexts` for `TaskRun`. But it seems impossible to know whether a TaskRun is derived from a PipelineTask unless I add a flag or something similar to the TaskRun spec to identify it.
thanks @yaoxiaoqi 👍

/lgtm

> But the problem might be whether the name context.pipelineTask.retry-count is appropriate or not.

Right, since a TaskRun might not belong to any pipeline. Like we discussed earlier, a task will have access to its own retry count, so I think it's safe to rename the variable to `context.task.retry-count`, similar to `context.task.name` and `context.taskrun.name`.
/assign @pritidesai
This TEP proposes adding `retries` and the current retry count to Pipeline variables in order to send notifications after a PipelineTask has exhausted its retries.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbwsg

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
looks like this might be ready to go! need a non-google endorsement tho @pritidesai

Apologies for not catching this @yaoxiaoqi!! but for future notice we try to avoid commits with messages like "update" - see https://github.com/tektoncd/community/blob/main/standards.md#commits

sorry, my bad, didn't notice the commits as well 🙏 thanks @bobcatfish for catching it.
This TEP proposes adding `retries` and the current retry count to Pipeline variables in order to send notifications after a PipelineTask has exhausted its retries.

Related issue: tektoncd/pipeline#2725