
Handle dag in pipelineresolution #2821

Closed

Conversation

afrittoli (Member)

Changes

In the pipelinerun controller, today we follow this logic:

  • build the dag from the spec, using the dag module
  • resolve the pipelinerun state using the spec and status, using the
    resources module
  • get a list of candidate next tasks from the dag module, passing part
    of the state
  • get a list of next tasks from the resources module, using the list
    of candidates and the pipeline run status

The separation of concerns between the dag, resources and reconciler
modules feels a bit mixed up.

This is just a PoC that resolves part of the issue: it moves the
invocation of the dag building, as well as obtaining the list of
candidates, into the dag module, and aggregates the dag onto the
pipeline state struct.
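The aggregation can be sketched roughly as follows; this is a hypothetical, heavily simplified illustration (the stand-in `dag` and `pipelineRunState` types here are not the real Tekton ones), showing how a state struct that carries its own dag lets the reconciler ask the state directly for candidate tasks:

```go
package main

import "fmt"

// dag maps each task name to the names of the tasks it depends on.
type dag map[string][]string

// pipelineRunState aggregates the resolved run status together with its dag,
// so callers no longer juggle the two separately.
type pipelineRunState struct {
	graph dag
	done  map[string]bool
}

// nextTasks returns tasks that are not done and whose dependencies are all done.
func (s *pipelineRunState) nextTasks() []string {
	var out []string
	for name, parents := range s.graph {
		if s.done[name] {
			continue
		}
		ready := true
		for _, p := range parents {
			ready = ready && s.done[p]
		}
		if ready {
			out = append(out, name)
		}
	}
	return out
}

func main() {
	s := &pipelineRunState{
		graph: dag{"build": nil, "test": {"build"}},
		done:  map[string]bool{"build": true},
	}
	fmt.Println(s.nextTasks()) // prints [test]
}
```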

I did not update tests yet, this is for discussion purposes.

Submitter Checklist

These are the criteria that every PR should meet; please check them off as you
review them:

See the contribution guide for more details.

Double check this list of stuff that's easy to miss:

Reviewer Notes

If API changes are included, additive changes must be approved by at least two OWNERS and backwards incompatible changes must be approved by more than 50% of the OWNERS, and they must first be added in a backwards compatible way.

/kind cleanup

@tekton-robot tekton-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. labels Jun 15, 2020
@tekton-robot (Collaborator)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign afrittoli
You can assign the PR to them by writing /assign @afrittoli in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 15, 2020
@tekton-robot (Collaborator)

This PR cannot be merged: expecting exactly one kind/ label

Available kind/ labels are:

kind/bug: Categorizes issue or PR as related to a bug.
kind/flake: Categorizes issue or PR as related to a flaky test.
kind/cleanup: Categorizes issue or PR as related to cleaning up code, process, or technical debt.
kind/design: Categorizes issue or PR as related to design.
kind/documentation: Categorizes issue or PR as related to documentation.
kind/feature: Categorizes issue or PR as related to a new feature.
kind/misc: Categorizes issue or PR as a miscellaneous one.

@afrittoli (Member, Author)

/cc @pritidesai @vdemeester


@afrittoli afrittoli force-pushed the move_dag_out_of_reconciler branch 2 times, most recently from 5769f6b to 0c7d850 Compare June 15, 2020 14:51
@bobcatfish (Collaborator)

It feels like pipelinerunresolution already has so many responsibilities, I'm not sure that giving it another one will improve it - maybe we need a new package with new responsibilities?

I also think it might be worth stepping back and taking a look at the data structures we are using; I often find that when code is getting complex and confusing, changing the shapes of the data structures being used can really help (e.g. PipelineRunState might not be the best data structure to be passing around going forward)

@bobcatfish (Collaborator)

Sorry @afrittoli, after looking at #2661 some more, I do agree with moving the call to the DAG logic into the resolution logic

I do still think that 1) pipelineresolution does too much and needs to be broken up and that 2) the data structures need to change

But I agree this is an improvement!

@afrittoli (Member, Author)

Thanks for the review.
I also feel there is some room to simplify the data structures. I also wonder if the resolved resource could be abstracted out of task run and pipeline run. Maybe they're too different to share code? But it may help generalise the main controller code.

@bobcatfish (Collaborator)

I also wonder if the resolved resource could be abstracted out of task run and pipeline run. Maybe they're too different to share code?

That sounds amazing!

@tekton-robot tekton-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 27, 2020
@afrittoli (Member, Author)

afrittoli commented Jun 27, 2020

@bobcatfish @pritidesai I've tried to keep this PR reasonable in size, not very successfully, so some of the refactors are not included here.

The main objectives here are to move the dag handling into pipelinerunresolution, and to ensure the various helpers do what their names imply, so that the code is hopefully more readable and maintainable.

Things that are missing (maybe more), to be handled in separate PRs:

  • We defined IsStarted and IsCancelled, but throughout the code sometimes
    we use them, sometimes we use part of the conditions they include, and
    sometimes we even use something different, which needs to be fixed
    (see GetNextTasks)
  • IsDone on the pipeline task does not imply IsCancelled - I thought it
    should, but it's also ok as it is
  • GetNextTasks and GetFinalTasks can be combined, so that the pipelinerun
    controller only needs to ask for the next tasks, and all the logic is
    encapsulated in pipelinerunresource
  • The pipelineState variable in pipelinerun.go could be renamed to
    resolvedPipelineRun for consistency...
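The GetNextTasks / GetFinalTasks combination mentioned above could look roughly like this; a hypothetical sketch with simplified signatures (the real helpers work on the resolved state and dags, and respect dependency ordering, which is omitted here for brevity):

```go
package main

import "fmt"

// nextTasks is a stand-in for a single combined helper: while any DAG task
// is still pending it returns those; once the main DAG is done, the same
// call starts yielding the finally tasks.
func nextTasks(dagTasks, finalTasks []string, done map[string]bool) []string {
	var pending []string
	for _, t := range dagTasks {
		if !done[t] {
			pending = append(pending, t)
		}
	}
	if len(pending) > 0 {
		return pending // still work left in the main DAG
	}
	var finals []string
	for _, t := range finalTasks {
		if !done[t] {
			finals = append(finals, t)
		}
	}
	return finals
}

func main() {
	done := map[string]bool{"build": true, "test": true}
	// The controller only ever asks one question: "what runs next?"
	fmt.Println(nextTasks([]string{"build", "test"}, []string{"cleanup"}, done))
}
```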

One thing that I noticed is that today we:

  • build the pipeline state
  • get schedulable tasks and start them
  • build the pipeline condition from the original state
  • build the pipeline status from the original state

I wonder if we should change the order to this instead:

  • build the pipeline state
  • build the pipeline condition from the original state
  • build the pipeline status from the original state
  • get schedulable tasks and start them

Just to make it clear that the new taskruns do not have an impact on the state reported in the current reconcile cycle.

One last bit: I think it might be nice to signal the running of the finally tasks with a different reason.
With the new code it is very easy to do so (two lines of code), and it would trigger an event (k8s/cloud) for the transition from the main DAG to the final tasks.

@afrittoli afrittoli force-pushed the move_dag_out_of_reconciler branch 2 times, most recently from 22c2e9c to efc275f Compare June 27, 2020 16:47
@tekton-robot (Collaborator)

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1beta1/pipeline_validation.go 96.2% 95.9% -0.4
pkg/reconciler/pipelinerun/pipelinerun.go 85.2% 85.0% -0.2

 if err := pipelineSpec.Validate(ctx); err != nil {
 	// This Run has failed, so we need to mark it as failed and stop reconciling it
-	pr.Status.MarkFailed(ReasonFailedValidation,
+	var reason = ReasonFailedValidation
+	if err.Details == "Invalid Graph" {
afrittoli (Member, Author)

Because the graph is created in pipelinerunresources, validation happens first. In this context it was not clear that the error was an invalid graph one, so I added some extra context in the error to be able to catch it and use the specific reason.
The alternatives are:

  • skip validation here, but that's only a temporary solution
  • move validation into ResolvePipelineRun - which might make sense, but not
    in this PR
  • simply return a validation error and rely on the error stack to provide
    more insight to the user

I wonder if we really need to run validation here, but I guess we will need to once tasks / pipelines may come from a container registry or a git URL.

@pritidesai (Member) Jun 30, 2020

@afrittoli pipelineSpec is being validated here for many different checks/conditions in addition to invalidGraph. I think we do need to run validation here; I don't see validation happening elsewhere, am I missing something? 🤔

Member

We could certainly move the check on the invalid graph to ResolvePipelineRun.

Also, such a reason could be generated where the error is created in pipeline_validation.go, instead of being calculated here. It's a great idea to set the reason to the specific validation error in addition to returning the error.

afrittoli (Member, Author)

@afrittoli pipelineSpec is being validated here for many different checks/conditions in addition to invalidGraph. I think we do need to run validation here; I don't see validation happening elsewhere, am I missing something? 🤔

Validation is done by the webhook when resources are submitted to etcd.
However, it will not be performed for resources that are not stored in etcd, like tasks or pipelines coming from an external source (git/registry). I'm not sure about embedded specs - I'd expect them to go through the same validation as normal resources.

afrittoli (Member, Author)

We could certainly move the check on the invalid graph to ResolvePipelineRun.

That's the way I implemented it originally, but because we validate the spec here, we don't really catch the error in ResolvePipelineRun. The only way would be to move the whole validation of the spec into ResolvePipelineRun, which is something we could do eventually, but I didn't want to do that here either.

Also, such a reason could be generated where the error is created in pipeline_validation.go, instead of being calculated here. It's a great idea to set the reason to the specific validation error in addition to returning the error.

@@ -451,8 +451,8 @@ func TestReconcile_InvalidPipelineRuns(t *testing.T) {
))),
tb.PipelineRun("pipeline-invalid-final-graph", tb.PipelineRunNamespace("foo"), tb.PipelineRunSpec("", tb.PipelineRunPipelineSpec(
tb.PipelineTask("dag-task-1", "taskName"),
tb.FinalPipelineTask("final-task-1", "taskName"),
tb.FinalPipelineTask("final-task-1", "taskName")))),
afrittoli (Member, Author)

This error does not trigger an invalid dag when going through validation; it fails first on validatePipelineTaskName

// Check that the expected TaskRuns were created
actions := clients.Pipeline.Actions()
if !actions[1].Matches("create", "taskruns") || !actions[2].Matches("create", "taskruns") {
	t.Errorf("Expected two ConditionCheck TaskRuns to be created, but they weren't.")
afrittoli (Member, Author)

Doing a cast without checking first leads to an unmanaged failure when the test fails.
My code changes initially made this test fail, and the error was not very informative.
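The safe pattern here is Go's comma-ok form of a type assertion; a small self-contained illustration with stand-in types (not the real client-go action types):

```go
package main

import "fmt"

// Stand-in types, purely illustrative; the real test works with
// client-go's testing actions.
type action interface{ verb() string }

type createAction struct{}

func (createAction) verb() string { return "create" }

func main() {
	actions := []action{createAction{}}
	// An unchecked assertion such as actions[0].(someOtherType) panics and
	// aborts the whole test run with an uninformative error. The comma-ok
	// form lets the test report a useful failure message instead.
	if ca, ok := actions[0].(createAction); ok {
		fmt.Println("got action:", ca.verb())
	} else {
		fmt.Println("unexpected action type")
	}
}
```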

@vdemeester (Member)

/remove-lifecycle rotten
/remove-lifecycle stale
/reopen

@tekton-robot tekton-robot reopened this Aug 17, 2020
@tekton-robot (Collaborator)

@vdemeester: Reopened this PR.

In response to this:

/remove-lifecycle rotten
/remove-lifecycle stale
/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tekton-robot tekton-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Aug 17, 2020
@tekton-robot (Collaborator)

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 15, 2020
afrittoli added a commit to afrittoli/pipeline that referenced this pull request Nov 15, 2020
The pipelinerun state is made of resolved pipelinerun tasks (rprt),
which are built from the actual status of the associated taskruns.

It is computationally easy to know whether a taskrun started, or
completed successfully or unsuccessfully; however, determining whether
a taskrun has been skipped, or will be skipped in the pipeline run
execution, requires evaluating the entire pipeline run status and the
associated dag.

The Skip method used to apply to a single rprt: it evaluated the entire
pipeline run status and dag, returned whether that specific rprt was
going to be skipped, and threw the rest away.

We used to invoke the Skip method on every rprt in the pipeline state
to calculate candidate tasks for execution. To make things worse,
we also invoked "Skip" as part of "isDone", which is defined as
a logical OR between "isSuccessful", "isFailed" and "Skip".

With this change we compute the list of tasks to be skipped once, and
we store the result in a map on the pipelinerun facts, along with the
pipelinerun state and the associated dags. We introduce a new method on
the pipelinerun facts called "IsTaskSkipped". This method performs the
lazy computation of the skip map the first time it is invoked; any
following invocation provides the skip status of a specific pipeline
task by looking it up in the map.
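The lazy, memoized computation described above can be sketched as follows; this is a hypothetical illustration with stand-in names (the real Tekton types and signatures differ), showing only the caching idea:

```go
package main

import "fmt"

// pipelineRunFacts is a stand-in for the real facts struct; skip stands in
// for the expensive per-task evaluation (the rprt Skip walk over the dag).
type pipelineRunFacts struct {
	skip      func(task string) bool
	skipCache map[string]bool // memoized results, filled lazily
}

// IsTaskSkipped computes the skip status on first use and caches it, so the
// expensive dag evaluation runs at most once per task.
func (f *pipelineRunFacts) IsTaskSkipped(task string) bool {
	if f.skipCache == nil {
		f.skipCache = map[string]bool{}
	}
	if v, ok := f.skipCache[task]; ok {
		return v
	}
	v := f.skip(task)
	f.skipCache[task] = v
	return v
}

func main() {
	evaluations := 0
	f := &pipelineRunFacts{skip: func(task string) bool {
		evaluations++ // the expensive dag walk happens here in the real code
		return task == "skipped-task"
	}}
	f.IsTaskSkipped("skipped-task")
	f.IsTaskSkipped("skipped-task") // served from the cache
	fmt.Println(evaluations)        // the expensive evaluation ran only once
}
```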

This solution manages to hide the details of the skip logic from the
core reconciler logic. I believe further refactoring could help, but
I wanted to keep this PR as small as possible. I will pursue this work
further by reviving tektoncd#2821

I converted the unit test for "Skip" into a unit test for "SkipMap",
to ensure that the new logic gives the same results as before.
The test should be moved to a different module, as "SkipMap" lives
in a different module; however, leaving it in place very much helps
with the review. I will move it in a follow-up patch.

This change adds a unit test that reproduces the issue in tektoncd#3521,
which used to fail (with a 30s timeout) and now succeeds for pipelines
of roughly up to 120 tasks / 120 links. On my laptop, going beyond
120 tasks/links takes longer than 30s, so I left the unit test at 80
to avoid introducing a flaky test in CI. There is still work to do to
improve this further; some profiling / tracing work might help.

Breaking large pipelines into logical groups (branches or pipelines in
pipelines) would help reduce the complexity and computational cost for
very large pipelines.

Fixes tektoncd#3521

Signed-off-by: Andrea Frittoli <andrea.frittoli@uk.ibm.com>
@afrittoli afrittoli mentioned this pull request Nov 15, 2020
afrittoli added a commit to afrittoli/pipeline that referenced this pull request Nov 15, 2020
afrittoli added a commit to afrittoli/pipeline that referenced this pull request Nov 15, 2020
afrittoli added a commit to afrittoli/pipeline that referenced this pull request Nov 16, 2020
afrittoli added a commit to afrittoli/pipeline that referenced this pull request Nov 16, 2020
afrittoli added a commit to afrittoli/pipeline that referenced this pull request Nov 16, 2020
afrittoli added a commit to afrittoli/pipeline that referenced this pull request Nov 16, 2020
tekton-robot pushed a commit that referenced this pull request Nov 17, 2020
The pipelinerun state is made of resolved pipelinerun tasks (rprt),
which are built from the actual status of the associated taskruns.

It is computationally easy to know whether a taskrun started, or
completed successfully or unsuccessfully; however, determining whether
a taskrun has been skipped, or will be skipped in the pipeline run
execution, requires evaluating the entire pipeline run status and the
associated dag.

The Skip method used to apply to a single rprt: it evaluated the entire
pipeline run status and dag, returned whether that specific rprt was
going to be skipped, and threw the rest away.

We used to invoke the Skip method on every rprt in the pipeline state
to calculate candidate tasks for execution. To make things worse,
we also invoked "Skip" as part of "isDone", which is defined as
a logical OR between "isSuccessful", "isFailed" and "Skip".

With this change we compute the list of tasks to be skipped once,
incrementally, by caching the results of each invocation of 'Skip'.
We store the result in a map on the pipelinerun facts, along with the
pipelinerun state and the associated dags. We introduce a new method on
the pipelinerun facts called "ResetSkippedCache".

This solution manages to hide some of the details of the skip logic
from the core reconciler logic, but it still requires the cache to be
manually reset in a couple of places. I believe further refactoring
could help, but I wanted to keep this PR as small as possible.
I will pursue this work further by reviving #2821

This change adds a unit test that reproduces the issue in #3521, which
used to fail (with a 30s timeout) and now succeeds for pipelines of
roughly up to 120 tasks / 120 links. On my laptop, going beyond
120 tasks/links takes longer than 30s, so I left the unit test at 80
to avoid introducing a flaky test in CI. There is still work to do to
improve this further; some profiling / tracing work might help.

Breaking large pipelines into logical groups (branches or pipelines in
pipelines) would help reduce the complexity and computational cost for
very large pipelines.

Fixes #3521

Co-authored-by: Scott <sbws@google.com>

Signed-off-by: Andrea Frittoli <andrea.frittoli@uk.ibm.com>
pritidesai pushed a commit to pritidesai/pipeline that referenced this pull request Nov 17, 2020
(cherry picked from commit fda7a81)
tekton-robot pushed a commit that referenced this pull request Nov 17, 2020
@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 15, 2020
Base automatically changed from master to main March 11, 2021 15:28
@tekton-robot
Collaborator

@afrittoli: Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tekton-robot tekton-robot added the do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. label Mar 11, 2021
@tekton-robot
Collaborator

@afrittoli: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-tekton-pipeline-go-coverage 911d8de link /test pull-tekton-pipeline-go-coverage
pull-tekton-pipeline-build-tests 911d8de link /test pull-tekton-pipeline-build-tests
pull-tekton-pipeline-unit-tests 911d8de link /test pull-tekton-pipeline-unit-tests
pull-tekton-pipeline-integration-tests 911d8de link /test pull-tekton-pipeline-integration-tests

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@bobcatfish
Collaborator

@afrittoli do you have any strong feelings about keeping this open vs closing it for now?

@afrittoli
Member Author

I really want to have time to hack on the controller again :D but it's ok to close for now

@vdemeester vdemeester closed this May 7, 2021
Labels
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.