Fix bug where PipelineRun emits task results that were never produced #3472

ghost · 2020-10-29T20:02:18Z

Changes

Prior to this commit a Pipeline Result that references a Task Result of a failed TaskRun would end up with a value of the literal variable. For example: if Task "foo" failed, a Pipeline Result with value of "$(tasks.foo.results.bar)" would end up in the PipelineRun.Status.Results list with the literal value, "$(tasks.foo.results.bar)". It was therefore difficult to assess programatically whether results were populated correctly or were the result of some invalid TaskRun condition.

This commit fixes the bug by filtering out PipelineRun Results that reference failed TaskRuns. It was quite difficult to follow the flow of execution wrt PipelineRun Results and so ultimately I had to refactor the whole lot to figure out where the bug was. The final code is quite a bit shorter than the original and has improved test coverage to more robustly exercise the behaviour of PipelineResults in various failure scenarios.

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you
review them:

Includes tests (if functionality changed/added)
Includes docs (if user facing)
Commit messages follow commit message best practices
Release notes block has been filled in or deleted (only if no user facing changes)

Fix bug where PipelineRun emits task results that were never produced. Previously a Pipeline Result that contained an invalid variable would be added to the PipelineRun with that unreplaced variable intact. Now a Pipeline Result that contains an invalid variable will not be emitted by the PipelineRun at all.

jerop · 2020-10-29T21:53:05Z

docs/pipelines.md

+If a `Pipeline Result` references a failed `TaskRun` then the `Pipeline Result` is
+considered invalid and will not be emitted by the completed `PipelineRun`.
+


the result could also be missing because the task was skipped as well...not only when it failed...

also successful taskrun without producing task result 🤔

the result could also be missing because the task was skipped as well...not only when it failed...

Ah, good catch, thanks @jerop, updated docs to expand the list of reasons that a pipeline result would be invalid.

Regarding the successful taskrun w/o producing task result, I think that's accounted for - see my other comment.

pritidesai · 2020-10-30T00:19:13Z

@sbwsg I haven't looked at the code yet but we might get into a scenario where task could succeed without producing any result and the same literal might end up in pipelineRun results.

ghost · 2020-10-30T11:45:43Z

@sbwsg I haven't looked at the code yet but we might get into a scenario where task could succeed without producing any result and the same literal might end up in pipelineRun results.

Thanks for the feedback @pritidesai ! I believe this is accounted for - here's the code that decides on the validity of a PipelineResult's variable reference and a new test case that exercises the behaviour of a PipelineResult referencing a Task Result that is not produced by the TaskRun. The result variable will be considered invalid and the Pipeline Result won't be emitted. LMK if you see edge cases I'm not covering here - I want to account for as many of these rules as possible with unit tests so that we can better catch regressions.

ghost · 2020-10-30T12:11:30Z

pkg/reconciler/pipelinerun/pipelinerun.go

-		c.updatePipelineResults(ctx, pr, getPipelineFunc)
+		if _, pipelineSpec, err := resources.GetPipelineData(ctx, pr, getPipelineFunc); err != nil {
+			logger.Errorf("Failed to get Pipeline Spec to process Pipeline Results: %v", err)
+			pr.Status.MarkFailed(ReasonCouldntGetPipeline,


Question for @pritidesai or @afrittoli - given that we are in a block guarded by pr.IsDone(), is it too late to call pr.Status.MarkFailed()? I hadn't noticed this prior to making these changes but the old code did this too (see updatePipelineResults in the diff).

Good point, pr.IsDone returns true when the pipeline is either successful or failed. markFailed can switch the success to failure with setting its completionTime. This can definitely cause issue since the pipeline would be marked as finished with the completionTime but it's still running.

Its missing call to finishReconcileUpdateEmitEvents which will make sure the pipeline actually finishes and doesnt continue execution after MarkFailed.

pr.Status.MarkFailed(ReasonCouldntGetPipeline, "Failed to get Pipeline Spec to process Pipeline Results %s/%s: %s", pr.Namespace, pr.Name, err) return c.finishReconcileUpdateEmitEvents(ctx, pr, before, nil)

this brings us back to similar discussion with task results, all the tasks executed successfully but a particular pipeline result which was being referenced doesn't exist so declare pipeline failed 🤔

I'm going to remove the call to MarkFailed here I think.

I've removed the call to MarkFailed and now instead am using the controller's Recorder (if available) to record a warning event (in addition to logging an error from the controller).

FYI, as part of the Run integration with pipelines I tried to have the e2e test verify the pipeline results. I had to take that out because there was a race condition between the test case waking up (upon seeing the PipelineRun was done) and the controller updating the PipelineRun with the results. I think the results should be added to the PR at the same time it is marked done. I'm just bringing this up for your awareness; it's up to you whether you want to tackle it in this PR.

that's a good point @GregDritschler

yup, we dont have to tackle everything in this PR, sorry @sbwsg for polluting this PR.

If I am not mistaken, taskRun has the similar behavior where a taskRun is marked as successful along with its completion time before populating task results (might be fixed with #3497):

pipeline/pkg/pod/status.go

Lines 272 to 282 in 05d67e4

func updateCompletedTaskRunStatus(logger *zap.SugaredLogger, trs *v1beta1.TaskRunStatus, pod *corev1.Pod) {

if DidTaskRunFail(pod) {

msg := getFailureMessage(logger, pod)

MarkStatusFailure(trs, msg)

} else {

MarkStatusSuccess(trs)

}

// update tr completed time

trs.CompletionTime = &metav1.Time{Time: time.Now()}

}

pipeline/pkg/pod/status.go

Lines 109 to 113 in 05d67e4

complete := areStepsComplete(pod) || pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodFailed

if complete {

updateCompletedTaskRunStatus(logger, trs, pod)

} else {

After a taskRun is marked as successful, taskRun status is updated with task results (through setTaskRunStatusBasedOnStepStatus):

pipeline/pkg/pod/status.go

Lines 162 to 164 in 05d67e4

if tr.IsSuccessful() {

trs.TaskRunResults = append(trs.TaskRunResults, taskResults...)

trs.ResourcesResult = append(trs.ResourcesResult, pipelineResourceResults...)

may be worth a discussion in API WG? 😜

Thanks for the feedback, I'm going to back out the change and re-introduce MarkFailed since it's the existing behaviour and we can discuss further.

Done - MarkFailed is re-introduced.

bobcatfish · 2020-11-04T22:27:31Z

docs/pipelines.md

+- a `TaskRun` referenced by the `Pipeline Result` was skipped
+- a `TaskRun` referenced by the `Pipeline Result` didn't emit the referenced `Task Result`
+- the `Pipeline Result` uses a variable that doesn't point to a task in the Pipeline
+- the `Pipeline Result` uses a variable that doesn't point to a result of a task in the Pipeline


@sbwsg continuation of #3497 - these all seem like errors to me

the Pipeline Result uses a variable that doesn't point to a task in the Pipeline

the Pipeline Result uses a variable that doesn't point to a result of a task in the Pipeline

If I'm understanding correctly, these both mean that the Pipeline was created incorrectly - we could catch these at validation time before we start actually executing the Pipeline

a TaskRun referenced by the Pipeline Result failed <-- this would fail the pipelinerun anyway

a TaskRun referenced by the Pipeline Result was skipped

a TaskRun referenced by the Pipeline Result didn't emit the referenced Task Result

These would all be at runtime - I could imagine that someone might be annoyed if their entire Pipeline ran successfully but one of the above caused a pipeline level result not to be set - however I think that could be addressed by a combo of:

Tasks that claim to emit results but don't should fail #3497 <-- case (3) would fail after the Task ran

TEP-0048: task results without results - problem statement community#240 <-- we could include this use case if we want to be able to support defaults in the case of (2) (and maybe 3 if we let a Task specify a default result?)

Agreed on all points. Validation should catch incorrect variable usage and Tasks that don't emit any described results should fail, causing the Pipeline to also fail.

Regarding the validation, we do have this - it runs during DAG scheduling, so when we decide what to run next we find an invalid result reference and error the PipelineRun with InvalidTaskResultReference.

we could catch these at validation time before we start actually executing the Pipeline

We could push this earlier and validate all Task result references before execution begins but we would need to fetch and inspect every Task referenced by a result variable before starting the Pipeline. Otherwise we don't know what the correct result names are. So it's feasible and do-able. I think we need to decide if that's preferable vs whatever time it costs to fetch all the referenced Tasks and inspect their results before running the Pipeline.

Reading back I am realizing this morning that I was wrong in my comment on #3497. A PipelineRun was never put into a failed state just because a Pipeline Result reference is invalid. A PipelineRun is marked failed if it's unable to fetch the PipelineSpec needed to process Pipeline Results. Otherwise the way it's worked before and after this change is that missing or invalid results don't update the final Success/Fail state of the PipelineRun.

I've created #3499 to explore the possibility and costs of fetching and validating task result references before PipelineRun execution starts.

I think I'm still a bit confused about both what the current behavior is and what the behavior will be after this PR - would it make sense to maybe make this section of the docs also indicate what behavior we expect in each of these casess?

I agree and open to merge this since there is no functionality change compared to the existing behavior. And continue this discussion with #3499.

Updated the doc with more detail about the accompanying PipelineRun state when these conditions are hit.

pritidesai · 2020-11-05T20:18:42Z

pkg/reconciler/pipelinerun/resources/apply_test.go

+			Value: "$(tasks.pt1.results.foo)",
+		}, {
+			Name:  "pipeline-result-2",
+			Value: "$(tasks.pt1.results.foo), $(tasks.pt2.results.baz), $(tasks.pt1.results.bar), $(tasks.pt2.results.baz), $(tasks.pt1.results.foo)",


@sbwsg while your are updating the doc, might be worth mentioning this particular scenario.

I think without these changes (this PR):

"$(tasks.pt1.results.foo), $(tasks.pt2.results.baz), $(tasks.pt1.results.bar), $(tasks.pt2.results.baz), $(tasks.pt1.results.foo)"

if $(tasks.pt2.results.baz) was not initialized, this was resulting in:

"do, $(tasks.pt2.results.baz), mi, $(tasks.pt2.results.baz), do"

(I haven't confirmed yet)

But now with these changes, its not set at all and not part of the results any more.

yup just confirmed, on master, its producing "do, $(tasks.pt2.results.baz), mi, $(tasks.pt2.results.baz), do" 😢

Thanks @pritidesai 👍 right, that's the bad behaviour we're aiming to fix here.

I've added a Note after the list of reasons that a Pipeline Result will not be emitted. The note talks about a Pipeline Result that references multiple Taskrun Results and makes clear that any invalid reference from the Pipeline Result will mean the entire result is not emitted.

pritidesai · 2020-11-05T21:18:09Z

pkg/reconciler/pipelinerun/resources/apply.go

+// pointer is returned if the variable is invalid for any reason.
+func taskResultValue(variable string, taskStatuses map[string]*v1beta1.PipelineRunTaskRunStatus) *string {
+	variableParts := strings.Split(variable, ".")
+	if len(variableParts) != 4 || variableParts[0] != "tasks" || variableParts[2] != "results" {


NIT: tasks and results could be constants and shared with other result related logic (not part of this PR), can be done seperately

pritidesai · 2020-11-05T21:40:19Z

pkg/reconciler/pipelinerun/resources/resultrefresolution.go

-			return &result, nil
-		}
-	}
-	return nil, fmt.Errorf("Could not find result with name %s for task run %s", reference.Result, reference.PipelineTask)


This error is getting lost and not surfaced to user or logs, here is the chain of function calls: findTaskResultForPipelineResult <- resolveResultRefForPipelineResult <- extractResultRefsForPipelineResults <- convertPipelineResultToResultRefs

convertPipelineResultToResultRefs returns nil if error is not nil without logging it or returning the error along but not sure how the result reference as is makes it to the pipelineResults.

With, #3499, will able to pick up such errors and report them.

Good catch, this is the reason that I ultimately decided the new taskResultValue() func shouldn't bother returning an error and just a pointer to the string value if there is one.

pritidesai

thanks a bunch @sbwsg for fixing this issue and appreciate all the cleanup along. It's much easier to understand now.

bobcatfish

@sbwsg is it possible to update the release note to explain what the new expected behavior is? "Fix bug where PipelineRun emits task results that were never produced" <-- not clear to me what will happen now (from some discussions in the PR it sounds like this is not actually changing the functionality? but i think the update is that now task results that cannot be satisfied with be ""? Or won't be present at all?)

anyway in general it seems like this is better than the current state and it seems like a nice refactoring of the existing code - hopefully we can keep improving the error handling around results!

/approve

bobcatfish · 2020-11-25T21:17:16Z

docs/pipelines.md

+This will result in an `InvalidTaskResultReference` validation error during execution of the `Pipeline`.
+
+**Note:** Since a `Pipeline Result` can contain references to multiple `Task Results`, if any of those
+`Task Result` references are invalid the entire `Pipeline Result` is not emitted.


i think as a user i might find it hard to reason about what will happen based on this description, maybe it would be clearer to flip it around a bit, and i think it might be more clear to a user to say something like "task" or "pipeline task" instead of "task run":

PipelineRun execution will fail if a pipeline result requires any of the following:

A result from a Task that is skipped

A result from a Task that it declares but when executed it doesn't actually emit it. This should be considered a bug in the Task and may fail a TaskRun in future.

A result from a Task that it does not declare

A result from a Task that does not exit in the Pipeline

I didn't include these ones b/c i'm not sure if they WILL have that same error:

A TaskRun referenced by the Pipeline Result failed. The PipelineRun will also have failed. <-- since execution will stop in this case, i dont think there would be an error related to the result?

OK, I've updated the section to reference "Pipeline Task" throughout instead of "TaskRun".

Inre: your comments on PipelineRun execution behaviour, I think there are at least two discrete questions we're trying to answer in this secion of the doc:

What is the final "shape" of a pipeline result if it contains an invalid reference?

What happens to pipelinerun execution when pipeline results contain these invalid references?

I think that (1) is answered by the existing PR. Personally I think that (2) is independent of the results and shouldn't be documented in a section titled Emitting Results from a Pipeline since it's orthogonal: The Pipeline Results don't have any bearing on the execution status of the PipelineRun. That doesn't mean we shouldn't document it but it does seem to me a little out-of-scope to try and document PipelineRun execution state in a section dedicated to result data.

PipelineRun execution will fail if a pipeline result requires any of the following:

A result from a Task that is skipped

PipelineRun execution doesn't fail when a Task is skipped and a result references it. This is kinda an example of why I'm a tad hesitant to include documentation about pipelinerun execution state in a section dedicated to result data. I think this will come back to bite us if we try and document it here. One day we'll change the behaviour of PipelineRuns and we'll have to remember that we documented pipelinerun execution in multiple places that aren't pipelineruns.md and then comb back through to make sure all those places reflect the changed behaviour. Personally I'd prefer that we keep the documentation in pipelines.md as tightly focused as possible on how Pipeline Results behave.

bobcatfish · 2020-11-25T21:25:34Z

pkg/reconciler/pipelinerun/resources/apply.go

+// list of PipelineResults, returning the computed set of PipelineRunResults. References to
+// non-existent TaskResults or failed TaskRuns result in a PipelineResult being considered invalid
+// and omitted from the returned slice. A nil slice is returned if no results are passed in or all
+// results are invalid.


i apologize for probably having already asked this, but it seems weird to me that we just ignore the invalid values, feels like we should be returning an error?

I'm not sure it's quite right to say that we ignore the invalid values. We return a nil pointer if the value is invalid - in other words, no valid value was found for the result reference.

I don't totally agree that this is an error state but I don't feel strongly enough to dispute it. What should we do with that error when it's returned? In the old code it was silently ignored.

feels like we should be returning an error?

Returning an error might break pipelineRuns. There was no error reported in logs or to users at all. If we introduce error now, pipelineRuns which used to be successful will start failing with this error. And at the same time, the least we could do is log the error if needed but still exit with success.

We should eventually move the execution of this logic to before the pipelinerun state is set.
That way could still fail the pipeline if results are broken or invalid in some way.
Maybe something for next release.

tekton-robot · 2020-11-25T21:29:49Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bobcatfish, pritidesai

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [bobcatfish,pritidesai]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ghost · 2020-11-30T15:39:33Z

Updated release note to specify the old and new behaviour.

ghost · 2020-11-30T15:42:15Z

but i think the update is that now task results that cannot be satisfied with be ""? Or won't be present at all?)

Yup, that's exactly right. Before the result would be emitted by the Pipeline with "$(task.foo.results.bar)" left in it unreplaced.

ghost · 2020-11-30T16:00:59Z

Just to make things a tad more concrete, here's the yaml from a PipelineRun of a pipeline result that references a skipped task result:

  status:
    completionTime: "2020-11-30T15:55:17Z"
    conditions:
    - lastTransitionTime: "2020-11-30T15:55:17Z"
      message: 'Tasks Completed: 3 (Failed: 0, Cancelled 0), Skipped: 3'
      reason: Completed
      status: "True"
      type: Succeeded
    pipelineResults:
    - name: bar
      value: $(tasks.task-should-be-skipped-1.results.bar)

Notice that: 1) the PipelineRun is in a Succeeded/True condition and 2) the pipelineResults have the variable syntax printed verbatim because the referenced task was skipped. (2) is the thing we're actually attempting to fix here. We don't want variable syntax showing up in the pipeline results. Instead we don't want the pipeline result to be emitted at all.

tekton-robot · 2020-11-30T16:34:07Z