Incorrect TaskRun status due to different Steps having the same StartedAt and FinishedAt times #3239
Comments
Thanks for the thorough investigation @Peaorl ! I like option 2 personally; unless we're including sidecars (and maybe we are? maybe that's why this is a problem? looking at Lines 111 to 137 in 76aff13
Option 3 makes a lot of sense too, but it seems like we should already know the order the steps are supposed to run in 🤔 Maybe a pro of option 3 is that it would reflect the actual reality, in case something went wrong? One con of option 3: I'm not sure how much control we have over the format of those fields, i.e. if they are expected to match the format of similar fields in pod status ( pipeline/vendor/k8s.io/api/core/v1/types.go Line 2335 in 76aff13
Of course! Thanking @sbwsg for the leads too.
Exactly,
Indeed I think option 3 supports actual reality as is the case now too (just not at a high enough resolution 😄). I think it would be interesting to know if this is why the timestamp solution was chosen in the first place.
I was thinking to write a
When I investigated this logic before, I think one of my concerns about using taskspec step order is that it wouldn't handle a failure of an internal step (i.e. the ones inserted by tekton). I thought that's why the timestamp approach was needed.
I see, would these internal
I think there might be an init container or 2 but they are definitely not all init containers, e.g. PipelineResources add containers before and after.
If we went this route, we'd need to take into account the order of all steps, including internal steps - we actually added the order of these "internal" steps to our api compatibility policy and tried to make it deterministic in #970
This commit closes tektoncd#3239 Tekton determines the TaskRun status message of a failed TaskRun based on the results of the first terminated Step (pod container). Until now, Tekton sorted pod container statuses based on the FinishedAt and StartedAt timestamps set by Kubernetes. Occasionally, a Step terminated in response to the first terminated Step could have the same timestamps as the first terminated Step. Therefore, Tekton was not always able to correctly determine what the first terminated Step was, and as a result, Tekton may set an incorrect TaskRun status message. In this commit, pod container statuses are sorted based on the Step order set in the taskSpec. This order ought to be correct as Tekton enforces Steps to be scheduled in this order. In case Tekton adds extra Steps (such as for pipelineresources), Tekton already updates the taskSpec with these Steps. Therefore, Tekton accounts for these internally added Steps when sorting.
Thanks all! I went ahead with option 2 in #3256
This commit closes tektoncd#3239 Tekton determines the TaskRun status message of a failed TaskRun based on the results of the first terminated Step (pod container). Until now, Tekton sorted pod container statuses based on the FinishedAt and StartedAt timestamps set by Kubernetes. Occasionally, a Step terminated in response to the first terminated Step could have the same timestamps as the first terminated Step. Therefore, Tekton was not always able to correctly determine what the first terminated Step was, and as a result, Tekton may set an incorrect TaskRun status message. In this commit, pod container statuses are sorted based on the Step order set in the taskSpec. This order ought to be correct as Tekton enforces Steps to be scheduled in this order. In case Tekton adds extra Steps (such as for pipelineresources), Tekton updates the taskSpec with these Steps and makes the taskSpec available for sorting. Therefore, Tekton accounts for these internally added Steps when sorting.
This commit closes tektoncd#3239 Tekton determines the TaskRun status message of a failed TaskRun based on the results of the first terminated Step (pod container). Until now, Tekton sorted pod container statuses based on the FinishedAt and StartedAt timestamps set by Kubernetes. Occasionally, a Step terminated in response to the first terminated Step could have the same timestamps as the first terminated Step. Therefore, Tekton was not always able to correctly determine what the first terminated Step was, and as a result, Tekton may set an incorrect TaskRun status message. In this commit, pod container statuses are sorted based on the container order as specified by Tekton in the podSpec. Tekton bases this order on the user provided taskSpec and Steps added internally by Tekton. Therefore, Tekton accounts for internally added Steps when sorting pod container statuses.
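The approach in the final commit message, ordering pod container statuses by the container order in the podSpec, could be sketched roughly as follows. This is only an illustration and not the code from the linked PR: `ContainerStatus`, `sortByPodSpecOrder`, and the container names are hypothetical stand-ins, and the real Kubernetes types carry many more fields.

```go
package main

import (
	"fmt"
	"sort"
)

// ContainerStatus is a hypothetical, trimmed-down stand-in for the
// Kubernetes container status type; only the name matters here.
type ContainerStatus struct {
	Name string
}

// sortByPodSpecOrder reorders statuses in place so they follow the order of
// container names declared in the pod spec. Statuses whose name is not in
// specOrder are placed last, keeping their relative order.
func sortByPodSpecOrder(specOrder []string, statuses []ContainerStatus) {
	index := make(map[string]int, len(specOrder))
	for i, name := range specOrder {
		index[name] = i
	}
	rank := func(s ContainerStatus) int {
		if i, ok := index[s.Name]; ok {
			return i
		}
		return len(specOrder)
	}
	sort.SliceStable(statuses, func(a, b int) bool {
		return rank(statuses[a]) < rank(statuses[b])
	})
}

func main() {
	// Assumed container names, in the order they appear in the pod spec.
	specOrder := []string{"step-setup", "step-build", "step-test"}
	// Statuses as they might be reported, in arbitrary order.
	statuses := []ContainerStatus{{"step-test"}, {"step-setup"}, {"step-build"}}
	sortByPodSpecOrder(specOrder, statuses)
	fmt.Println(statuses) // [{step-setup} {step-build} {step-test}]
}
```

Because the order comes from the spec rather than from timestamps, two containers with identical StartedAt and FinishedAt values can no longer swap places.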
Preface
Users expect that the status.conditions.message field of a failed TaskRun is based on the first terminated Step. For that reason, pod container statuses are sorted (#1905).
Tekton sorts pod container statuses based on FinishedAt and StartedAt times.
Initially, sorting happened solely based on FinishedAt times.
This is troubling if the FinishedAt times of different Steps are the same. Therefore, #2455 introduced sorting based on StartedAt times in case FinishedAt times are equal.
This issue is about the case when both FinishedAt and StartedAt times are the same.
Expected Behavior
I have observed situations where two Steps have the same FinishedAt and StartedAt timestamps. Notably in the logs for #3087 of which the relevant snippet is shown here:
What we would expect in the status.conditions.message field (before #3087 is actually merged) is:
Actual Behavior
Instead what we get is:
I.e., not the output of the first terminated Step.
Suggestion
Potential solutions:
1. Sort by:
   i. FinishedAt time
   ii. StartedAt time
   iii. TaskSpec Step order
OR
2. Sort by Step order.
OR
3. Steps write their own higher resolution (i.e. use milliseconds) FinishedAt times as results. Then, sort pod container statuses based on these results. These results are easily filtered from user exposed task results after "Introducing InternalTektonResultType as a ResultType" (#3138) is merged.
I'm leaning towards option 3 because it seems like the most reliable solution to always base the logs on the first terminated Step. Option 3 has briefly been mentioned before but lost traction, seemingly because of the somewhat higher complexity.
With option 3, the Step statuses in the TaskRun could either show the K8s or the Tekton written StartedAt and FinishedAt times.
Sorting just once?
Additionally, after the pod container statuses have been sorted, the Steps displayed under the TaskRun status are sorted according to the original Step order in the TaskSpec, regardless of FinishedAt or StartedAt times. Indeed, if Tekton properly serializes Steps, this should be the correct Step order anyway.
In short:
1. The status.conditions.message field for a terminated TaskRun is based on the first terminated Step, found by sorting pod container statuses based on FinishedAt and StartedAt times.
2. The order of Steps displayed under the TaskRun is based on the original Step order in the TaskSpec regardless of FinishedAt or StartedAt times.
Perhaps it would make sense to only sort the pod container statuses (based on one of the proposed solutions) and also use this order to display Step statuses in the TaskRun.
Steps to Reproduce the Problem
Alternatively, a test could mock Steps with the same StartedAt and FinishedAt times which should confirm this bug.
Additional Info