
TEP-0039 Add Variable retries and retry-count #239

Merged

2 commits merged into tektoncd:main from the retry-count branch on Feb 4, 2021

Conversation

@yaoxiaoqi (Member) commented Oct 19, 2020

This TEP proposes adding retries and the current retry count to Pipeline variables, in order to send notifications after a PipelineTask has exhausted its retries.

Related issue: tektoncd/pipeline#2725

@tekton-robot tekton-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Oct 19, 2020
@yaoxiaoqi (Member, Author):

/assign ImJasonH
/assign bobcatfish
/assign sbwsg

@tekton-robot tekton-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 22, 2020
@ghost commented Nov 2, 2020

/kind tep

@tekton-robot tekton-robot added the kind/tep Categorizes issue or PR as related to a TEP (or needs a TEP). label Nov 2, 2020
@bobcatfish (Contributor) left a comment:

First of all, apologies in advance! I kept saying "oh yeah that seems simple I will totally approve it" and then I started to think about it and now I have some questions:

  1. Do we see these values being made available to when expressions? <-- I'm thinking we wouldn't do this, at least initially, but I bet that if we add this variable replacement, someone will ask for it 🤔
  2. Can a task look at the retry count of another task, e.g. can TaskA access the retry count of TaskB? <-- this is something we explicitly avoided in TEP-0028 when we added access to a task's status (#234 (comment))

I wonder if there's an alternative where, instead of something like tasks.<taskName>.retries, we have something like context.retries that only exists in the context of a PipelineTask (a new level of scoping for https://github.com/tektoncd/pipeline/blob/master/docs/variables.md#variables-available-in-a-pipeline)

I also wonder if we can run this by @IdanAdar who opened #2725 and see if this would meet his needs.

Basically I'm thinking that:

  1. The alternative approach where you emit retry events is a better overall solution; so maybe the better answer is pursuing something like tektoncd/pipeline#1740 (Actions and Notifications for Tekton) - making it simpler to express "notifications" in response to Pipeline lifecycle events?
  2. If we DO add this after all, let's try to keep the scope as small as possible

Interested in your thoughts on this also @afrittoli @vdemeester
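To make the two shapes being compared here concrete, a minimal sketch follows; the variable names are the ones floated in this thread, not a settled API, and `flaky-task` is a hypothetical Task:

```yaml
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: retry-variables-demo
spec:
  tasks:
    - name: retry-me
      retries: 5
      taskRef:
        name: flaky-task   # hypothetical Task that sometimes fails
      params:
        # the TEP's proposed form: a task's retries are addressed by task
        # name, so in principle another task could reference them too
        - name: budget
          value: $(tasks.retry-me.retries)
        # the context-scoped alternative: resolves only inside the enclosing
        # PipelineTask, so no task can read another task's retry count
        - name: budget-scoped
          value: $(context.retries)
```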


## Proposal

There are 2 variable substitutions, `tasks.<taskName>.retries` and `tasks.<taskName>.retry-count`, that need to be supported:
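As a hedged illustration of how these two substitutions could be consumed together (the Task name and parameter names below are hypothetical):

```yaml
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: retry-substitution-demo
spec:
  tasks:
    - name: retry-me
      retries: 3
      taskRef:
        name: flaky-task                       # hypothetical Task
      params:
        - name: max-retries
          value: $(tasks.retry-me.retries)     # substituted with "3"
        - name: attempts-so-far
          value: $(tasks.retry-me.retry-count) # "0" on the first run, then 1, 2, 3
```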
Contributor:

@pritidesai just wanted to bring this to your attention cuz it feels related to adding status info https://github.com/tektoncd/community/blob/master/teps/0028-task-execution-status-at-runtime.md#proposal

feels like tasks.<taskName>.retries and tasks.<taskName>.status are nicely aligned? (unless we wanted to consider retries part of the overall status info, e.g. something like tasks.<taskName>.status.retries?)

Member:

thanks @bobcatfish

> feels like tasks.<taskName>.retries and tasks.<taskName>.status are nicely aligned?

yup, tasks.<taskName>.retries fits well with tasks.<taskName>.status.

I like your suggestion of going with context to avoid exposing the retry count of a task to other tasks.

```yaml
- image: ubuntu
  script: |
    #!/usr/bin/env sh
    /path/to/<program-that-fails>
```
Contributor:

It's interesting that this example has to be custom-built to take these values as a param - if you wanted to meet the use case you mentioned and alert Slack in the case of failure, you'd need one Task that both a) does the thing that needs to be retried and b) alerts Slack - unless we let Tasks access the retry count of other running Tasks @_@

Member (Author):

Sorry for the late response! I agree with your suggestion of limiting retries to the task's own context. I should have added this implementation to the doc and made the doc more detailed. Did we decide not to move forward with this TEP, since users can use Pipelines + Triggers to send the notification for now, and Pipelines should eventually have a better way to notify users (tektoncd/pipeline#1740)? I'm glad to help if there's anything I can do to make Pipelines support notifications.

Contributor:

> I'm glad to help if there's anything I can do to make Pipelines support notifications.

@afrittoli has now opened #275 proposing a notifications feature :D

> Did we decide not to move forward with this TEP, since users can use Pipelines + Triggers to send the notification for now, and Pipelines should eventually have a better way to notify users (tektoncd/pipeline#1740)?

I don't think we decided definitively - it'd be interesting to hear the thoughts of @IdanAdar, who opened tektoncd/pipeline#2725. I think there's a good chance we could add the simplified version of this variable interpolation feature in the meantime. @yaoxiaoqi is this something you still want to work on, or do you want to close this PR for now?

Member (Author):

I'd like to continue working on this issue if we decide it can be implemented. If that decision can't be made soon, this PR can also be closed.

@afrittoli (Member):

> First of all, apologies in advance! I kept saying "oh yeah that seems simple I will totally approve it" and then I started to think about it and now I have some questions:
>
> 1. Do we see these values being made available to when expressions? <-- I'm thinking we wouldn't do this, at least initially, but I bet that if we add this variable replacement, someone will ask for it 🤔
> 2. Can a task look at the retry count of another task, e.g. can TaskA access the retry count of TaskB? <-- this is something we explicitly avoided in TEP-0028 when we added access to a task's status (#234 (comment))
>
> I wonder if there's an alternative where, instead of something like tasks.<taskName>.retries, we have something like context.retries that only exists in the context of a PipelineTask (a new level of scoping for https://github.com/tektoncd/pipeline/blob/master/docs/variables.md#variables-available-in-a-pipeline)
>
> I also wonder if we can run this by @IdanAdar who opened #2725 and see if this would meet his needs.
>
> Basically I'm thinking that:
>
> 1. The alternative approach where you emit retry events is a better overall solution; so maybe the better answer is pursuing something like tektoncd/pipeline#1740 - making it simpler to express "notifications" in response to Pipeline lifecycle events?

We do emit events when retries run. The body of the TaskRun is sent in the payload of the events, so as long as we store the retry count in the status (which we must do to keep track of it), the event receiver has access to that information.

> 2. If we DO add this after all, let's try to keep the scope as small as possible
>
> Interested in your thoughts on this also @afrittoli @vdemeester
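For reference, a rough sketch of the kind of event meant here, assuming the dev.tekton.event.taskrun.failed.v1 type from Tekton's events documentation; all field values are illustrative, not captured output:

```yaml
# abridged CloudEvent; the data field carries the TaskRun body
type: dev.tekton.event.taskrun.failed.v1
source: /apis/tekton.dev/v1beta1/namespaces/default/taskruns/retry-me
data:
  taskRun:
    metadata:
      name: retry-me
    status:
      # one entry per prior failed attempt, so a receiver can derive
      # the retry count as the length of this list
      retriesStatus:
        - conditions:
            - type: Succeeded
              status: "False"
              reason: Failed
```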

```yaml
name: failed-task
params:
  - name: num-retries
    value: $(tasks.retry-me.retry-count)
```
Member:

The task name in the variable is not necessary here. Dropping it would enforce what @bobcatfish mentioned: we must limit exposing the retry count of a task to other tasks, i.e. a task can only access its own retry count.

```yaml
- name: num-retries
  value: $(tasks.retry-count)
```

or rather

```yaml
- name: num-retries
  value: $(context.task.retry-count)
```

similar to $(context.task.name)

@pritidesai (Member):

> do we see these values being made available to when expressions? <-- I'm thinking we wouldn't do this at least initially? but I bet that if we add this variable replacement, someone will ask for it 🤔

Interesting thought - you mean evaluating the when expression before every retry? Is that possible? It might result in multiple failed TaskRuns along with skippedTasks 🤔 Also, it would skip the entire task instead of sending a notification after a specific number of retries.

I am now wondering whether task params are evaluated on every retry, i.e. the very first run with certain values and the next retry with a different set of values 🤔

> can a task look at the retry count of another task? e.g. TaskA can access the retry count of TaskB?

The use case doesn't mandate this, and it should be limited to the task's own context anyway. We might see a use case for exposing status, but I'm not sure about exposing retries.

@vdemeester (Member) left a comment:

> Basically I'm thinking that:
>
> 1. The alternative approach where you emit retry events is a better overall solution; so maybe the better answer is pursuing something like tektoncd/pipeline#1740 - making it simpler to express "notifications" in response to Pipeline lifecycle events?
> 2. If we DO add this after all, let's try to keep the scope as small as possible
>
> Interested in your thoughts on this also @afrittoli @vdemeester

As written in the inline comment, where do we draw the line between tektoncd/pipeline and a full-fledged CI/CD system? At which point do we make the tektoncd/pipeline API more complex (to grasp and to implement) instead of having several components that work well together?
Should Tekton work on an opinionated setup of all its components (to showcase, and to be used instead of each component on its own, leaving the user to deal with the setup and the relations between components)? Note: I think a Tekton GA (as opposed to a tektoncd/pipeline v1 API) would only make sense for this "opinionated" setup, not if we keep all components on their own.


### User Stories

A `PipelineTask` author wants to send a Slack notification only if it fails after the specified `retries`, not every time it fails. The user wants fewer interruptions.
Member:

Is this the only use case we are envisioning for this?
Is there any other potential use for this feature? Off the top of my head, I could see a need to parametrize the task with the retry-count: for example, if the task provisions something (a cluster, a VM, …) and wants to name it according to the retry-count, in order to keep the previous one around or something along those lines (see the sketch below). This particular use case could also be covered by different variables, though (using the uid, …).
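A minimal sketch of that provisioning idea, assuming the context-scoped variable name discussed later in this thread and a hypothetical create-cluster Task:

```yaml
tasks:
  - name: provision
    retries: 2
    taskRef:
      name: create-cluster   # hypothetical provisioning Task
    params:
      - name: cluster-name
        # each attempt gets its own name (ci-cluster-0, ci-cluster-1, ...),
        # so earlier clusters can be kept around for inspection
        value: ci-cluster-$(context.pipelineTask.retry-count)
```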


Events seem a better way to satisfy the user's need. An event should be emitted on a retry. The user could have an EventListener (from Triggers) that runs a task to send a notification when the desired event is caught. This way, Pipelines would not need to expose another variable.

The drawback is that it would require considerably more setup by the user, such as running Triggers when they might not use it for anything else.
Member:

This is where we need to look at the "mission" of Tekton: "Tekton is a powerful and flexible open-source framework for creating CI/CD systems" (from tekton.dev).

  • What are we optimizing each component (like tektoncd/pipeline) for?
  • Is Tekton meant to provide a full-fledged CI/CD system, or just a framework / set of components?
  • If Tekton is meant to provide a full-fledged CI/CD system, should it be in tektoncd/pipeline or in some opinionated "official" setup (that would be available in the tektoncd org)?

Member:

This is where "native" notifications in Tekton would help.
I need to create a TEP for this since the old doc is out of date.

Notifications with Tekton are possible today via cloud events + triggers, but they require a complex setup by the user. A "native" notification mechanism would provide a way to simplify the setup for the user.

Contributor:

@vdemeester I wonder if it might help at some point to try to draw up some more specific "missions" for each component 🤔 I'm not sure what the answer is for the notifications feature - @afrittoli do you see it being part of Pipelines or Triggers or a separate project? I could see it either way - but I could also see Pipelines + Triggers being combined into one project (tektoncd/triggers#697)


## Alternatives

Events seem a better way to satisfy the user's need. An event should be emitted on a retry. The user could have an EventListener (from Triggers) that runs a task to send a notification when the desired event is caught. This way, Pipelines would not need to expose another variable.
Member:

Events are available today

@afrittoli afrittoli changed the title Add Variable retries and retry-count TEP-0026 Add Variable retries and retry-count Nov 16, 2020
@afrittoli (Member):

/test pull-community-teps-lint

@pritidesai (Member):

We have visited this TEP multiple times in the API WG and are not sure whether this is something we want to implement. Please let us know if you have any use cases for this proposal 🙏

@yaoxiaoqi (Member, Author):

> We have visited this TEP multiple times in the API WG and are not sure whether this is something we want to implement. Please let us know if you have any use cases for this proposal 🙏

Sure thing. tektoncd/pipeline#2725 is the only use case for now. From my point of view, users could also customize their logs according to the exposed retries, though that might be a trivial use case.

@bobcatfish (Contributor):

Closing this for now!

@ghost commented Dec 2, 2020

/close

@tekton-robot (Contributor):

@sbwsg: Closed this PR.

In response to this:

> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@afrittoli afrittoli reopened this Dec 7, 2020
@GregDritschler (Contributor) commented Dec 7, 2020

> I am now wondering whether task params are evaluated every retry i.e. very first run with certain values and the next retry with different set of values

I believe the code does parameter and context substitution (ApplyParameters, ApplyContexts) to the entire pipeline spec before determining the next set of TaskRuns to create (or retry).

Currently a retry is done by "recycling" the TaskRun. The current TaskRun status is appended to retriesStatus and then cleared. The start time, completion time, and pod name are also cleared. The TaskRun spec is not updated, because currently there is no reason to do so; it should be the same each time.

I suppose there could be code to redo the substitution of a retry counter during the recycling of the TaskRun. This would mean the TaskRun params only ever hold the retry counter from the current/final retry.
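A sketch of that recycling, with illustrative values, showing a TaskRun status right after its first failed attempt has been moved into retriesStatus:

```yaml
# TaskRun status after one failed attempt has been recycled (illustrative)
status:
  podName: ""            # cleared before the next attempt
  startTime: null        # cleared
  completionTime: null   # cleared
  retriesStatus:         # the failed attempt's status, appended here
    - podName: retry-me-pod-x7k2q   # hypothetical pod name
      conditions:
        - type: Succeeded
          status: "False"
          reason: Failed
```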

@afrittoli (Member):

We discussed this TEP during the API WG on Dec 7 (https://docs.google.com/document/d/17PodAxG8hV351fBhSu7Y_OIPhGTVgj6OJ2lPphYYRpU/edit#). We agreed that:

  • It would be the first variable that is only available in the context of a pipeline task
  • It would be a variable ($(context.*)) to be passed to a PipelineTask parameter by the pipeline
  • It would not be implicitly available in the task, same as other context variables (see the sketch below)

@yaoxiaoqi the TEP will probably have to be rebased for the lint to pass. It may require updating some parts to match the API WG discussion.
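A hedged sketch of those three points, using the $(context.pipelineTask.*) naming that the rest of this thread settles on (the Task name is hypothetical):

```yaml
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: wg-agreed-shape
spec:
  tasks:
    - name: retry-me
      retries: 3
      taskRef:
        name: flaky-task   # hypothetical Task
      params:
        # not implicitly visible inside the Task: the Pipeline must pass
        # the context value down as an ordinary parameter
        - name: max-retries
          value: $(context.pipelineTask.retries)
```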

@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 18, 2020
@tekton-robot tekton-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 21, 2020
@yaoxiaoqi yaoxiaoqi force-pushed the retry-count branch 2 times, most recently from a98fcd6 to b255bba Compare December 21, 2020 11:36
@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 7, 2021
@tekton-robot tekton-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 11, 2021
```
/path/to/<program-that-fails>
```

In this example, Tekton will echo `This is the last retry` when the retry count is equal to 5.
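The opening lines of this example are not quoted above; a minimal reconstruction, under this thread's assumptions (retries: 5 and the $(context.pipelineTask.*) variables), would look roughly like:

```yaml
tasks:
  - name: retry-me
    retries: 5
    taskSpec:
      steps:
        - image: ubuntu
          script: |
            #!/usr/bin/env sh
            # both context variables below are substituted by the controller
            # before the step runs
            if [ "$(context.pipelineTask.retry-count)" = "$(context.pipelineTask.retries)" ]; then
              echo "This is the last retry"
            fi
            /path/to/<program-that-fails>
```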
@pritidesai (Member), Jan 13, 2021:

My understanding is that this is not possible. $(context.pipelineTask.retries) is 5 in this example, since it's specified in the schema, but $(context.pipelineTask.retry-count) cannot be equal to the number of retries: $(context.pipelineTask.retry-count) represents the length of RetriesStatus, which would be 4 when the fifth retry starts, unless you always add one to it 🙃 So would the very first execution, without any RetriesStatus, set $(context.pipelineTask.retry-count) to 1?

@pritidesai (Member), Jan 13, 2021:

The controller declares a task failure when the length of RetriesStatus is >= retries, i.e. the controller does not schedule that task again once the number of entries in RetriesStatus matches retries:

https://github.com/tektoncd/pipeline/blob/2e26eef219ddfabf505041867d4825b47da8cb36/pkg/reconciler/pipelinerun/resources/pipelinerunresolution.go#L96-L107

@yaoxiaoqi (Member, Author), Jan 20, 2021:

It's possible, as far as I know. The replacement of contexts is meant to happen after the update of RetriesStatus. For example, let's say tr is a PipelineTask that is meant to fail, and its Retries is set to 3. We might assume that the controller would run tr 3 times in total, but according to our code, tr will run 4 times in total (because of this line, which executes before appending to RetriesStatus).

https://github.com/tektoncd/pipeline/blob/2e26eef219ddfabf505041867d4825b47da8cb36/pkg/reconciler/pipelinerun/resources/pipelinerunstate.go#L232

Let's say the order of the TaskRun and its retries is tr#0 -> tr_retry#1 -> tr_retry#2 -> tr_retry#3.

When creating tr_retry#3, RetriesStatus would contain the status of tr#0, tr_retry#1, and tr_retry#2. So the retry-count is 3, and it equals the PipelineTask's Retries. That's why the example works.

Member:

> When creating tr_retry#3, RetriesStatus would contain the status of tr#0, tr_retry#1, and tr_retry#2. So the retry-count is 3, and it equals the PipelineTask's Retries. That's why the example works.

tr#0's status is captured in .status.taskRuns[].status. After tr#0 fails, and before creating (updating) the TaskRun for tr_retry#1, the status is copied to .status.taskRuns[].status.retriesStatus and .status.taskRuns[].status is cleared, which makes len(retriesStatus) equal to 1.

https://github.com/tektoncd/pipeline/blob/2e26eef219ddfabf505041867d4825b47da8cb36/pkg/reconciler/pipelinerun/pipelinerun.go#L656-L657

But the context is applied before copying the status into retriesStatus, i.e. .status.taskRuns[].status has the failure but .status.taskRuns[].status.retriesStatus is empty.

https://github.com/tektoncd/pipeline/blob/2e26eef219ddfabf505041867d4825b47da8cb36/pkg/reconciler/pipelinerun/pipelinerun.go#L456

Member (Author):

> But the context is applied before copying the status into retriesStatus, i.e. .status.taskRuns[].status has the failure but .status.taskRuns[].status.retriesStatus is empty.

I plan to only replace context.pipelineTask.retries in this ApplyContexts, since it's a constant value: once the PipelineRun is applied, it won't change.

https://github.com/tektoncd/pipeline/blob/9751b951b9f5a35540898db4fb01d1be192dafeb/pkg/reconciler/pipelinerun/pipelinerun.go#L472

As for context.pipelineTask.retry-count, the one that changes with the number of retries, I'm going to replace it here, in the ApplyContexts for TaskRun. This happens after copying the status into retriesStatus, so we don't need to add one: the length of .status.taskRuns[].status.retriesStatus is already correct at this point. But the question might be whether the name context.pipelineTask.retry-count is appropriate or not.

https://github.com/tektoncd/pipeline/blob/9751b951b9f5a35540898db4fb01d1be192dafeb/pkg/reconciler/taskrun/taskrun.go#L644

To be honest, I had planned to replace both variables in the ApplyContexts for TaskRun, but it seems impossible to know whether a TaskRun is derived from a PipelineTask unless I add a flag or something similar to the TaskRun spec to identify it.

Member:

thanks @yaoxiaoqi 👍

/lgtm

> But the question might be whether the name context.pipelineTask.retry-count is appropriate or not.

Right, since a TaskRun might not belong to any Pipeline. Like we discussed earlier, a task will have access to its own retry count, so I think it's safe to rename the variable to context.task.retry-count, similar to context.task.name and context.taskrun.name.

@bobcatfish (Contributor):

/assign @pritidesai
/assign @sbwsg

@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 28, 2021
@tekton-robot tekton-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 2, 2021
Base automatically changed from master to main February 3, 2021 16:34
@tekton-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbwsg

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Approvers can indicate their approval by writing /approve in a comment; approvers can cancel approval by writing /approve cancel in a comment.

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 4, 2021
@bobcatfish (Contributor):

Looks like this might be ready to go! Needs a non-Google endorsement though @pritidesai

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 4, 2021
@tekton-robot tekton-robot merged commit 8c30b1d into tektoncd:main Feb 4, 2021
@bobcatfish (Contributor):

Apologies for not catching this @yaoxiaoqi!! But for future notice, we try to avoid commits with messages like "update" - see https://github.com/tektoncd/community/blob/main/standards.md#commits

@pritidesai (Member) commented Feb 5, 2021

Sorry, my bad, I didn't notice the commits either 🙏 thanks @bobcatfish for catching it.
