
Incorrect TaskRun status due to different Steps having the same StartedAt and FinishedAt times #3239

Closed
Peaorl opened this issue Sep 15, 2020 · 6 comments · Fixed by #3256
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/backlog Higher priority than priority/awaiting-more-evidence.

Comments

@Peaorl
Contributor

Peaorl commented Sep 15, 2020

Preface

Users expect the status.conditions.message field of a failed TaskRun to be based on the first terminated Step.
For that reason, pod container statuses are sorted (#1905).

Tekton sorts pod container statuses based on FinishedAt and StartedAt times.
Initially, sorting happened solely based on FinishedAt times.
This is troubling if the FinishedAt times of different Steps are the same.
Therefore, #2455 introduced sorting based on StartedAt times in case FinishedAt times are equal.

This issue is about the case when both FinishedAt and StartedAt times are the same.

Expected Behavior

I have observed situations where two Steps have the same FinishedAt and StartedAt timestamps, notably in the logs for #3087; the relevant snippet is shown here:

          steps:
          - container: step-timeout
            imageID: docker-pullable://busybox@sha256:2ca5e69e244d2da7368f7088ea3ad0653c3ce7aaccd0b8823d11b0d5de956002
            name: timeout
            terminated:
              containerID: docker://9ad19a910be8fd188b47cafebb57b6508c66da3317ac13dff7a605f27cdc9448
              exitCode: 1
              finishedAt: "2020-09-10T18:10:46Z"
              reason: Error
              startedAt: "2020-09-10T18:10:46Z"
          - container: step-canceled
            imageID: docker-pullable://busybox@sha256:2ca5e69e244d2da7368f7088ea3ad0653c3ce7aaccd0b8823d11b0d5de956002
            name: canceled
            terminated:
              containerID: docker://178b305cdbd4fe05b4df2d4d02c3242c664ec8d0fb0d3476680a533832b93c9f
              exitCode: 1
              finishedAt: "2020-09-10T18:10:46Z"
              reason: Error
              startedAt: "2020-09-10T18:10:46Z"

What we would expect in the status.conditions.message field (before #3087 is actually merged) is:

        status:
          completionTime: "2020-09-10T18:11:00Z"
          conditions:
          - lastTransitionTime: "2020-09-10T18:11:00Z"
            message: |
              "step-timeout" exited with code 1 (image: "docker-pullable://busybox@sha256:2ca5e69e244d2da7368f7088ea3ad0653c3ce7aaccd0b8823d11b0d5de956002"); for logs run: kubectl -n arendelle-c5fjk logs run-timeout-pod-ttgfq -c step-canceled
            reason: Failed
            status: "False"
            type: Succeeded

Actual Behavior

Instead what we get is:

        status:
          completionTime: "2020-09-10T18:11:00Z"
          conditions:
          - lastTransitionTime: "2020-09-10T18:11:00Z"
            message: |
              "step-canceled" exited with code 1 (image: "docker-pullable://busybox@sha256:2ca5e69e244d2da7368f7088ea3ad0653c3ce7aaccd0b8823d11b0d5de956002"); for logs run: kubectl -n arendelle-c5fjk logs run-timeout-pod-ttgfq -c step-canceled
            reason: Failed
            status: "False"
            type: Succeeded

I.e., not the output of the first terminated Step.

Suggestion

Potential solutions:

  1. Sort pod container statuses based on the following priority:
    i. FinishedAt time
    ii. StartedAt time
    iii. TaskSpec Step order
  2. Sort pod container statuses based on the TaskSpec Step order.
  3. Steps write their own higher-resolution (i.e., millisecond) FinishedAt times as results.
    Then, sort pod container statuses based on these results. These results can easily be filtered out of user-exposed task results once Introducing InternalTektonResultType as a ResultType #3138 is merged.

I'm leaning towards option 3 because it seems like the most reliable way to always base the logs on the first terminated Step. Option 3 has briefly been mentioned before but lost traction, seemingly because of its somewhat higher complexity.
With option 3, the Step statuses in the TaskRun could show either the K8s-provided or the Tekton-written StartedAt and FinishedAt times.
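Option 2 (sorting by TaskSpec Step order, the approach ultimately adopted in #3256) can be sketched as follows. The function name and string-based representation are illustrative, not Tekton's actual API; the point is that a rank map built from the taskSpec gives a total order that never depends on timestamps:

```go
package main

import (
	"fmt"
	"sort"
)

// sortByStepOrder orders container status names by the Step order in
// the taskSpec rather than by timestamps. Both parameters are
// hypothetical simplifications of Tekton's real types.
func sortByStepOrder(statusNames []string, taskSpecOrder []string) {
	rank := make(map[string]int, len(taskSpecOrder))
	for i, name := range taskSpecOrder {
		rank[name] = i
	}
	sort.SliceStable(statusNames, func(i, j int) bool {
		return rank[statusNames[i]] < rank[statusNames[j]]
	})
}

func main() {
	// The API server may report container statuses in any order...
	statuses := []string{"step-canceled", "step-timeout"}
	// ...but the taskSpec fixes the intended execution order.
	order := []string{"step-timeout", "step-canceled"}
	sortByStepOrder(statuses, order)
	// Deterministic regardless of StartedAt/FinishedAt values.
	fmt.Println(statuses) // prints: [step-timeout step-canceled]
}
```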

Sorting just once?

Additionally, after the pod container statuses have been sorted, the Steps displayed under the TaskRun status are sorted according to the original Step order in the TaskSpec, regardless of FinishedAt or StartedAt times. Indeed, if Tekton properly serializes Steps, this should be the correct Step order anyway.

In short:

  • Tekton populates the status.conditions.message field for a terminated TaskRun based on the first terminated Step by sorting pod container statuses based on FinishedAt and StartedAt times.
  • The ordering of Steps displayed under the TaskRun is based on the original Step order in the TaskSpec regardless of FinishedAt or StartedAt times.

Perhaps it would make sense to only sort the pod container statuses (based on one of the proposed solutions) and also use this order to display Step statuses in the TaskRun.

Steps to Reproduce the Problem

  1. Install Step timeout #3087
  2. Run the TestStepTimeout integration test until this situation occurs.

Alternatively, a test could mock Steps with the same StartedAt and FinishedAt times which should confirm this bug.

Additional Info

  • Tekton Pipeline version: #3087
@Peaorl Peaorl added the kind/bug Categorizes issue or PR as related to a bug. label Sep 15, 2020
@bobcatfish
Collaborator

Thanks for the thorough investigation @Peaorl !

I like option 2 personally; unless we're including sidecars (and maybe we are? maybe that's why this is a problem? looking at

pipeline/pkg/pod/status.go

Lines 111 to 137 in 76aff13

        for _, s := range pod.Status.ContainerStatuses {
            if IsContainerStep(s.Name) {
                if s.State.Terminated != nil && len(s.State.Terminated.Message) != 0 {
                    message, time, err := removeStartInfoFromTerminationMessage(s)
                    if err != nil {
                        logger.Errorf("error setting the start time of step %q in taskrun %q: %w", s.Name, tr.Name, err)
                    }
                    if time != nil {
                        s.State.Terminated.StartedAt = *time
                        s.State.Terminated.Message = message
                    }
                }
                trs.Steps = append(trs.Steps, v1beta1.StepState{
                    ContainerState: *s.State.DeepCopy(),
                    Name:           trimStepPrefix(s.Name),
                    ContainerName:  s.Name,
                    ImageID:        s.ImageID,
                })
            } else if isContainerSidecar(s.Name) {
                trs.Sidecars = append(trs.Sidecars, v1beta1.SidecarState{
                    ContainerState: *s.State.DeepCopy(),
                    Name:           TrimSidecarPrefix(s.Name),
                    ContainerName:  s.Name,
                    ImageID:        s.ImageID,
                })
            }
        }
I don't think we are, though) I would expect the order the steps are evaluated in to reflect the actual order the containers run in.

Option 3 makes a lot of sense too, but it seems like we should already know the order the steps are supposed to run in 🤔 Maybe a pro of option 3 is that that would reflect the actual reality, in the case that something went wrong?

One con of option 3: I'm not sure how much control we have over the format of those fields, i.e. whether they are expected to match the format of similar fields in pod status (

StartedAt metav1.Time `json:"startedAt,omitempty" protobuf:"bytes,1,opt,name=startedAt"`
)

@bobcatfish bobcatfish added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Sep 15, 2020
@Peaorl
Contributor Author

Peaorl commented Sep 16, 2020

Of course! Thanks to @sbwsg for the leads too.

unless we're including sidecars (and maybe we are? maybe that's why this is a problem?

Exactly; Sidecars are not included when formulating a TaskRun status. See the NB section here or the code.
I'm not entirely sure how that would affect this situation though?

Option 3 makes a lot of sense too, but it seems like we should already know the order the steps are supposed to run in 🤔 Maybe a pro of option 3 is that that would reflect the actual reality, in the case that something went wrong?

Indeed I think option 3 supports actual reality as is the case now too (just not at a high enough resolution 😄). I think it would be interesting to know if this is why the timestamp solution was chosen in the first place.
Otherwise I agree that option 2 should work as well assuming Tekton always schedules the Steps correctly.

One con of option 3, im not sure how much control we have over the format of those fields, i.e. if they are expected to match the format of similar fields in pod status

I was thinking to write a Task result right around here. Tekton could then use this result only to order pod containers while the timestamp in the pod status could stay the lower resolution K8s provided timestamp.

@Peaorl Peaorl mentioned this issue Sep 16, 2020
@GregDritschler
Contributor

Option 3 makes a lot of sense too, but it seems like we should already know the order the steps are supposed to run in 🤔 Maybe a pro of option 3 is that that would reflect the actual reality, in the case that something went wrong?

When I investigated this logic before, I think one of my concerns about using taskspec step order is that it wouldn't handle a failure of an internal step (i.e. the ones inserted by tekton). I thought that's why the timestamp approach was needed.

@Peaorl
Contributor Author

Peaorl commented Sep 16, 2020

I see, would these internal Steps be initContainers? Because it seems like there's a separate container statuses field for initContainers.

@bobcatfish
Collaborator

I see, would these internal Steps be initContainers? Because it seems like there's a separate container statuses field for initContainers.

I think there might be an init container or 2 but they are definitely not all init containers, e.g. PipelineResources add containers before and after.

using taskspec step order is that it wouldn't handle a failure of an internal step (i.e. the ones inserted by tekton).

If we went this route, we'd need to take into account the order of all steps, including internal steps - we actually added the order of these "internal" steps to our api compatibility policy and tried to make it deterministic in #970

Peaorl added a commit to Peaorl/pipeline that referenced this issue Sep 18, 2020
This commit closes tektoncd#3239

Tekton determines the TaskRun status message of a failed TaskRun based
on the results of the first terminated Step (pod container). Until now,
Tekton sorted pod container statuses based on the FinishedAt and
StartedAt timestamps set by Kubernetes. Occasionally, a Step terminated
in response to the first terminated Step could have the same timestamps
as the first terminated Step. Therefore, Tekton was not always able to
correctly determine what the first terminated Step was, and as a result,
Tekton may set an incorrect TaskRun status message.

In this commit, pod container statuses are sorted based on the Step
order set in the taskSpec. This order ought to be correct as Tekton
enforces Steps to be scheduled in this order. In case Tekton adds extra
Steps (such as for pipelineresources), Tekton already updates the
taskSpec with these Steps. Therefore, Tekton accounts for these
internally added Steps when sorting.
Peaorl added a commit to Peaorl/pipeline that referenced this issue Sep 18, 2020
@Peaorl
Contributor Author

Peaorl commented Sep 18, 2020

Thanks all! I went ahead with option 2 in #3256

Peaorl added a commit to Peaorl/pipeline that referenced this issue Sep 22, 2020
This commit closes tektoncd#3239

Tekton determines the TaskRun status message of a failed TaskRun based
on the results of the first terminated Step (pod container). Until now,
Tekton sorted pod container statuses based on the FinishedAt and
StartedAt timestamps set by Kubernetes. Occasionally, a Step terminated
in response to the first terminated Step could have the same timestamps
as the first terminated Step. Therefore, Tekton was not always able to
correctly determine what the first terminated Step was, and as a result,
Tekton may set an incorrect TaskRun status message.

In this commit, pod container statuses are sorted based on the Step
order set in the taskSpec. This order ought to be correct as Tekton
enforces Steps to be scheduled in this order. In case Tekton adds extra
Steps (such as for pipelineresources), Tekton updates the taskSpec with
these Steps and makes the taskSpec available for sorting. Therefore,
Tekton accounts for these internally added Steps when sorting.
Peaorl added a commit to Peaorl/pipeline that referenced this issue Sep 22, 2020
This commit closes tektoncd#3239

Tekton determines the TaskRun status message of a failed TaskRun based
on the results of the first terminated Step (pod container). Until now,
Tekton sorted pod container statuses based on the FinishedAt and
StartedAt timestamps set by Kubernetes. Occasionally, a Step terminated
in response to the first terminated Step could have the same timestamps
as the first terminated Step. Therefore, Tekton was not always able to
correctly determine what the first terminated Step was, and as a result,
Tekton may set an incorrect TaskRun status message.

In this commit, pod container statuses are sorted based on the container
order as specified by Tekton in the podSpec. Tekton bases this order on
the user provided taskSpec and Steps added internally by Tekton.
Therefore, Tekton accounts for internally added Steps when
sorting pod container statuses.
tekton-robot pushed a commit that referenced this issue Sep 24, 2020