-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix bug where PipelineRun emits task results that were never produced #3472
Conversation
docs/pipelines.md
Outdated
If a `Pipeline Result` references a failed `TaskRun` then the `Pipeline Result` is | ||
considered invalid and will not be emitted by the completed `PipelineRun`. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the result could also be missing because the task was skipped as well...not only when it failed...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also successful taskrun without producing task result 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the result could also be missing because the task was skipped as well...not only when it failed...
Ah, good catch, thanks @jerop, updated docs to expand the list of reasons that a pipeline result would be invalid.
Regarding the successful taskrun w/o producing task result, I think that's accounted for - see my other comment.
@sbwsg I haven't looked at the code yet but we might get into a scenario where task could succeed without producing any result and the same literal might end up in pipelineRun results. |
Thanks for the feedback @pritidesai ! I believe this is accounted for - here's the code that decides on the validity of a PipelineResult's variable reference and a new test case that exercises the behaviour of a PipelineResult referencing a Task Result that is not produced by the TaskRun. The result variable will be considered invalid and the Pipeline Result won't be emitted. LMK if you see edge cases I'm not covering here - I want to account for as many of these rules as possible with unit tests so that we can better catch regressions. |
c.updatePipelineResults(ctx, pr, getPipelineFunc) | ||
if _, pipelineSpec, err := resources.GetPipelineData(ctx, pr, getPipelineFunc); err != nil { | ||
logger.Errorf("Failed to get Pipeline Spec to process Pipeline Results: %v", err) | ||
pr.Status.MarkFailed(ReasonCouldntGetPipeline, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question for @pritidesai or @afrittoli - given that we are in a block guarded by pr.IsDone()
, is it too late to call pr.Status.MarkFailed()
? I hadn't noticed this prior to making these changes but the old code did this too (see updatePipelineResults
in the diff).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, pr.IsDone
returns true when the pipeline is either successful
or failed
. markFailed
can switch the success
to failure
with setting its completionTime
. This can definitely cause issue since the pipeline
would be marked as finished with the completionTime
but it's still running.
Its missing call to finishReconcileUpdateEmitEvents
which will make sure the pipeline
actually finishes and doesnt continue execution after MarkFailed
.
pr.Status.MarkFailed(ReasonCouldntGetPipeline, "Failed to get Pipeline Spec to process Pipeline Results %s/%s: %s", pr.Namespace, pr.Name, err)
return c.finishReconcileUpdateEmitEvents(ctx, pr, before, nil)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this brings us back to similar discussion with task results, all the tasks executed successfully but a particular pipeline result which was being referenced doesn't exist so declare pipeline failed 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to remove the call to MarkFailed
here I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've removed the call to MarkFailed
and now instead am using the controller's Recorder (if available) to record a warning event (in addition to logging an error from the controller).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, as part of the Run integration with pipelines I tried to have the e2e test verify the pipeline results. I had to take that out because there was a race condition between the test case waking up (upon seeing the PipelineRun was done) and the controller updating the PipelineRun with the results. I think the results should be added to the PR at the same time it is marked done. I'm just bringing this up for your awareness; it's up to you whether you want to tackle it in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's a good point @GregDritschler
yup, we dont have to tackle everything in this PR, sorry @sbwsg for polluting this PR.
If I am not mistaken, taskRun has the similar behavior where a taskRun is marked as successful along with its completion time before populating task results (might be fixed with #3497):
Lines 272 to 282 in 05d67e4
func updateCompletedTaskRunStatus(logger *zap.SugaredLogger, trs *v1beta1.TaskRunStatus, pod *corev1.Pod) { | |
if DidTaskRunFail(pod) { | |
msg := getFailureMessage(logger, pod) | |
MarkStatusFailure(trs, msg) | |
} else { | |
MarkStatusSuccess(trs) | |
} | |
// update tr completed time | |
trs.CompletionTime = &metav1.Time{Time: time.Now()} | |
} |
Lines 109 to 113 in 05d67e4
complete := areStepsComplete(pod) || pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodFailed | |
if complete { | |
updateCompletedTaskRunStatus(logger, trs, pod) | |
} else { |
After a taskRun is marked as successful, taskRun status is updated with task results (through setTaskRunStatusBasedOnStepStatus):
Lines 162 to 164 in 05d67e4
if tr.IsSuccessful() { | |
trs.TaskRunResults = append(trs.TaskRunResults, taskResults...) | |
trs.ResourcesResult = append(trs.ResourcesResult, pipelineResourceResults...) |
may be worth a discussion in API WG? 😜
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback, I'm going to back out the change and re-introduce MarkFailed
since it's the existing behaviour and we can discuss further.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done - MarkFailed is re-introduced.
docs/pipelines.md
Outdated
- a `TaskRun` referenced by the `Pipeline Result` was skipped | ||
- a `TaskRun` referenced by the `Pipeline Result` didn't emit the referenced `Task Result` | ||
- the `Pipeline Result` uses a variable that doesn't point to a task in the Pipeline | ||
- the `Pipeline Result` uses a variable that doesn't point to a result of a task in the Pipeline |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sbwsg continuation of #3497 - these all seem like errors to me
- the
Pipeline Result
uses a variable that doesn't point to a task in the Pipeline - the
Pipeline Result
uses a variable that doesn't point to a result of a task in the Pipeline
If I'm understanding correctly, these both mean that the Pipeline was created incorrectly - we could catch these at validation time before we start actually executing the Pipeline
- a
TaskRun
referenced by thePipeline Result
failed <-- this would fail the pipelinerun anyway - a
TaskRun
referenced by thePipeline Result
was skipped - a
TaskRun
referenced by thePipeline Result
didn't emit the referencedTask Result
These would all be at runtime - I could imagine that someone might be annoyed if their entire Pipeline ran successfully but one of the above caused a pipeline level result not to be set - however I think that could be addressed by a combo of:
- Tasks that claim to emit results but don't should fail #3497 <-- case (3) would fail after the Task ran
- TEP-0048: task results without results - problem statement community#240 <-- we could include this use case if we want to be able to support defaults in the case of (2) (and maybe 3 if we let a Task specify a default result?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed on all points. Validation should catch incorrect variable usage and Tasks that don't emit any described results should fail, causing the Pipeline to also fail.
Regarding the validation, we do have this - it runs during DAG scheduling, so when we decide what to run next we find an invalid result reference and error the PipelineRun with InvalidTaskResultReference
.
we could catch these at validation time before we start actually executing the Pipeline
We could push this earlier and validate all Task result references before execution begins but we would need to fetch and inspect every Task
referenced by a result variable before starting the Pipeline. Otherwise we don't know what the correct result names are. So it's feasible and do-able. I think we need to decide if that's preferable vs whatever time it costs to fetch all the referenced Tasks and inspect their results before running the Pipeline.
Reading back I am realizing this morning that I was wrong in my comment on #3497. A PipelineRun was never put into a failed state just because a Pipeline Result reference is invalid. A PipelineRun is marked failed if it's unable to fetch the PipelineSpec needed to process Pipeline Results. Otherwise the way it's worked before and after this change is that missing or invalid results don't update the final Success/Fail state of the PipelineRun.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've created #3499 to explore the possibility and costs of fetching and validating task result references before PipelineRun execution starts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'm still a bit confused about both what the current behavior is and what the behavior will be after this PR - would it make sense to maybe make this section of the docs also indicate what behavior we expect in each of these casess?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree and open to merge this since there is no functionality change compared to the existing behavior. And continue this discussion with #3499.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the doc with more detail about the accompanying PipelineRun state when these conditions are hit.
Value: "$(tasks.pt1.results.foo)", | ||
}, { | ||
Name: "pipeline-result-2", | ||
Value: "$(tasks.pt1.results.foo), $(tasks.pt2.results.baz), $(tasks.pt1.results.bar), $(tasks.pt2.results.baz), $(tasks.pt1.results.foo)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sbwsg while your are updating the doc, might be worth mentioning this particular scenario.
I think without these changes (this PR):
"$(tasks.pt1.results.foo), $(tasks.pt2.results.baz), $(tasks.pt1.results.bar), $(tasks.pt2.results.baz), $(tasks.pt1.results.foo)"
if $(tasks.pt2.results.baz)
was not initialized, this was resulting in:
"do, $(tasks.pt2.results.baz), mi, $(tasks.pt2.results.baz), do"
(I haven't confirmed yet)
But now with these changes, its not set at all and not part of the results any more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup just confirmed, on master, its producing "do,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @pritidesai 👍 right, that's the bad behaviour we're aiming to fix here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a Note after the list of reasons that a Pipeline Result
will not be emitted. The note talks about a Pipeline Result that references multiple Taskrun Results and makes clear that any invalid reference from the Pipeline Result will mean the entire result is not emitted.
// pointer is returned if the variable is invalid for any reason. | ||
func taskResultValue(variable string, taskStatuses map[string]*v1beta1.PipelineRunTaskRunStatus) *string { | ||
variableParts := strings.Split(variable, ".") | ||
if len(variableParts) != 4 || variableParts[0] != "tasks" || variableParts[2] != "results" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: tasks
and results
could be constants and shared with other result related logic (not part of this PR), can be done seperately
return &result, nil | ||
} | ||
} | ||
return nil, fmt.Errorf("Could not find result with name %s for task run %s", reference.Result, reference.PipelineTask) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error
is getting lost and not surfaced to user or logs, here is the chain of function calls: findTaskResultForPipelineResult
<- resolveResultRefForPipelineResult
<- extractResultRefsForPipelineResults
<- convertPipelineResultToResultRefs
convertPipelineResultToResultRefs
returns nil
if error
is not nil
without logging it or returning the error along but not sure how the result reference as is makes it to the pipelineResults
.
With, #3499, will able to pick up such errors and report them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, this is the reason that I ultimately decided the new taskResultValue()
func shouldn't bother returning an error and just a pointer to the string value if there is one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks a bunch @sbwsg for fixing this issue and appreciate all the cleanup along. It's much easier to understand now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sbwsg is it possible to update the release note to explain what the new expected behavior is? "Fix bug where PipelineRun emits task results that were never produced" <-- not clear to me what will happen now (from some discussions in the PR it sounds like this is not actually changing the functionality? but i think the update is that now task results that cannot be satisfied with be ""? Or won't be present at all?)
anyway in general it seems like this is better than the current state and it seems like a nice refactoring of the existing code - hopefully we can keep improving the error handling around results!
/approve
This will result in an `InvalidTaskResultReference` validation error during execution of the `Pipeline`. | ||
|
||
**Note:** Since a `Pipeline Result` can contain references to multiple `Task Results`, if any of those | ||
`Task Result` references are invalid the entire `Pipeline Result` is not emitted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think as a user i might find it hard to reason about what will happen based on this description, maybe it would be clearer to flip it around a bit, and i think it might be more clear to a user to say something like "task" or "pipeline task" instead of "task run":
PipelineRun execution will fail if a pipeline result requires any of the following:
- A result from a Task that is skipped
- A result from a Task that it declares but when executed it doesn't actually emit it. This should be considered a bug in the
Task
and may fail aTaskRun
in future. - A result from a Task that it does not declare
- A result from a Task that does not exit in the Pipeline
I didn't include these ones b/c i'm not sure if they WILL have that same error:
- A
TaskRun
referenced by thePipeline Result
failed. ThePipelineRun
will also have failed. <-- since execution will stop in this case, i dont think there would be an error related to the result?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I've updated the section to reference "Pipeline Task" throughout instead of "TaskRun".
Inre: your comments on PipelineRun execution behaviour, I think there are at least two discrete questions we're trying to answer in this secion of the doc:
- What is the final "shape" of a pipeline result if it contains an invalid reference?
- What happens to pipelinerun execution when pipeline results contain these invalid references?
I think that (1) is answered by the existing PR. Personally I think that (2) is independent of the results and shouldn't be documented in a section titled Emitting Results from a Pipeline
since it's orthogonal: The Pipeline Results don't have any bearing on the execution status of the PipelineRun. That doesn't mean we shouldn't document it but it does seem to me a little out-of-scope to try and document PipelineRun execution state in a section dedicated to result data.
PipelineRun execution will fail if a pipeline result requires any of the following:
- A result from a Task that is skipped
PipelineRun execution doesn't fail when a Task is skipped and a result references it. This is kinda an example of why I'm a tad hesitant to include documentation about pipelinerun execution state in a section dedicated to result data. I think this will come back to bite us if we try and document it here. One day we'll change the behaviour of PipelineRuns and we'll have to remember that we documented pipelinerun execution in multiple places that aren't pipelineruns.md
and then comb back through to make sure all those places reflect the changed behaviour. Personally I'd prefer that we keep the documentation in pipelines.md
as tightly focused as possible on how Pipeline Results behave.
// list of PipelineResults, returning the computed set of PipelineRunResults. References to | ||
// non-existent TaskResults or failed TaskRuns result in a PipelineResult being considered invalid | ||
// and omitted from the returned slice. A nil slice is returned if no results are passed in or all | ||
// results are invalid. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i apologize for probably having already asked this, but it seems weird to me that we just ignore the invalid values, feels like we should be returning an error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure it's quite right to say that we ignore the invalid values. We return a nil
pointer if the value is invalid - in other words, no valid value was found for the result reference.
I don't totally agree that this is an error state but I don't feel strongly enough to dispute it. What should we do with that error when it's returned? In the old code it was silently ignored.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
feels like we should be returning an error?
Returning an error might break pipelineRuns
. There was no error reported in logs or to users at all. If we introduce error now, pipelineRuns
which used to be successful will start failing with this error. And at the same time, the least we could do is log the error if needed but still exit with success.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should eventually move the execution of this logic to before the pipelinerun state is set.
That way could still fail the pipeline if results are broken or invalid in some way.
Maybe something for next release.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bobcatfish, pritidesai The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Updated release note to specify the old and new behaviour. |
Yup, that's exactly right. Before the result would be emitted by the Pipeline with |
Just to make things a tad more concrete, here's the yaml from a PipelineRun of a pipeline result that references a skipped task result: status:
completionTime: "2020-11-30T15:55:17Z"
conditions:
- lastTransitionTime: "2020-11-30T15:55:17Z"
message: 'Tasks Completed: 3 (Failed: 0, Cancelled 0), Skipped: 3'
reason: Completed
status: "True"
type: Succeeded
pipelineResults:
- name: bar
value: $(tasks.task-should-be-skipped-1.results.bar) Notice that: 1) the PipelineRun is in a Succeeded/True condition and 2) the |
The following is the coverage report on the affected files.
|
The following is the coverage report on the affected files.
|
The following is the coverage report on the affected files.
|
The following is the coverage report on the affected files.
|
The following is the coverage report on the affected files.
|
Prior to this commit a Pipeline Result that references a Task Result of a failed TaskRun would end up with a value of the literal variable. For example: if Task "foo" failed, a Pipeline Result with value of "$(tasks.foo.results.bar)" would end up in the PipelineRun.Status.Results list with the literal value, "$(tasks.foo.results.bar)". It was therefore difficult to assess programatically whether results were populated correctly or were the result of some invalid TaskRun condition. This commit fixes the bug by filtering out PipelineRun Results that reference failed TaskRuns. It was quite difficult to follow the flow of execution wrt PipelineRun Results and so ultimately I had to refactor the whole lot to figure out where the bug was. The final code is quite a bit shorter than the original and has improved test coverage to more robustly exercise the behaviour of PipelineResults in various failure scenarios.
The following is the coverage report on the affected files.
|
/test pull-tekton-pipeline-integration-tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this, it is really good step forward.
And thanks for moving a lot of the results code out of pipelinerun.go!!
/lgtm
// list of PipelineResults, returning the computed set of PipelineRunResults. References to | ||
// non-existent TaskResults or failed TaskRuns result in a PipelineResult being considered invalid | ||
// and omitted from the returned slice. A nil slice is returned if no results are passed in or all | ||
// results are invalid. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should eventually move the execution of this logic to before the pipelinerun state is set.
That way could still fail the pipeline if results are broken or invalid in some way.
Maybe something for next release.
Looks like integration failed due to timeouts when communicating with the api server /test pull-tekton-pipeline-integration-tests |
/test pull-tekton-pipeline-integration-tests |
The integration job was marked as failed while the build log said it was still running. Not entirely sure what's up so tried again. |
Hmm. Might just need to sit and wait a bit for things to settle before rerunning. |
/test pull-tekton-pipeline-integration-tests |
Changes
Fixes #3465
Prior to this commit a Pipeline Result that references a Task Result of a failed TaskRun would end up with a value of the literal variable. For example: if Task "foo" failed, a Pipeline Result with value of "$(tasks.foo.results.bar)" would end up in the PipelineRun.Status.Results list with the literal value, "$(tasks.foo.results.bar)". It was therefore difficult to assess programatically whether results were populated correctly or were the result of some invalid TaskRun condition.
This commit fixes the bug by filtering out PipelineRun Results that reference failed TaskRuns. It was quite difficult to follow the flow of execution wrt PipelineRun Results and so ultimately I had to refactor the whole lot to figure out where the bug was. The final code is quite a bit shorter than the original and has improved test coverage to more robustly exercise the behaviour of PipelineResults in various failure scenarios.
Submitter Checklist
These are the criteria that every PR should meet, please check them off as you
review them: