
Incorrect TaskRun status due to different Steps having the same StartedAt and FinishedAt times #3239

Closed
Peaorl opened this issue Sep 15, 2020 · 6 comments · Fixed by #3256
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/backlog Higher priority than priority/awaiting-more-evidence.

Comments

@Peaorl
Contributor

Peaorl commented Sep 15, 2020

Preface

Users expect the status.conditions.message field of a failed TaskRun to be based on the first terminated Step.
For that reason, pod container statuses are sorted (#1905).

Tekton sorts pod container statuses based on FinishedAt and StartedAt times.
Initially, sorting happened solely based on FinishedAt times.
This is troubling if the FinishedAt times of different Steps are the same.
Therefore, #2455 introduced sorting based on StartedAt times in case FinishedAt times are equal.

This issue is about the case when both FinishedAt and StartedAt times are the same.

Expected Behavior

I have observed situations where two Steps have the same FinishedAt and StartedAt timestamps, notably in the logs for #3087; the relevant snippet is shown here:

          steps:
          - container: step-timeout
            imageID: docker-pullable://busybox@sha256:2ca5e69e244d2da7368f7088ea3ad0653c3ce7aaccd0b8823d11b0d5de956002
            name: timeout
            terminated:
              containerID: docker://9ad19a910be8fd188b47cafebb57b6508c66da3317ac13dff7a605f27cdc9448
              exitCode: 1
              finishedAt: "2020-09-10T18:10:46Z"
              reason: Error
              startedAt: "2020-09-10T18:10:46Z"
          - container: step-canceled
            imageID: docker-pullable://busybox@sha256:2ca5e69e244d2da7368f7088ea3ad0653c3ce7aaccd0b8823d11b0d5de956002
            name: canceled
            terminated:
              containerID: docker://178b305cdbd4fe05b4df2d4d02c3242c664ec8d0fb0d3476680a533832b93c9f
              exitCode: 1
              finishedAt: "2020-09-10T18:10:46Z"
              reason: Error
              startedAt: "2020-09-10T18:10:46Z"

What we would expect in the status.conditions.message field (before #3087 is actually merged) is:

        status:
          completionTime: "2020-09-10T18:11:00Z"
          conditions:
          - lastTransitionTime: "2020-09-10T18:11:00Z"
            message: |
              "step-timeout" exited with code 1 (image: "docker-pullable://busybox@sha256:2ca5e69e244d2da7368f7088ea3ad0653c3ce7aaccd0b8823d11b0d5de956002"); for logs run: kubectl -n arendelle-c5fjk logs run-timeout-pod-ttgfq -c step-canceled
            reason: Failed
            status: "False"
            type: Succeeded

Actual Behavior

Instead what we get is:

        status:
          completionTime: "2020-09-10T18:11:00Z"
          conditions:
          - lastTransitionTime: "2020-09-10T18:11:00Z"
            message: |
              "step-canceled" exited with code 1 (image: "docker-pullable://busybox@sha256:2ca5e69e244d2da7368f7088ea3ad0653c3ce7aaccd0b8823d11b0d5de956002"); for logs run: kubectl -n arendelle-c5fjk logs run-timeout-pod-ttgfq -c step-canceled
            reason: Failed
            status: "False"
            type: Succeeded

I.e., not the output of the first terminated Step.

Suggestion

Potential solutions:

  1. Sort pod container statuses based on the following priority:
    i. FinishedAt time
    ii. StartedAt time
    iii. TaskSpec Step order
  2. Sort pod container statuses based on the TaskSpec Step order.
  3. Steps write their own higher-resolution (i.e., millisecond) FinishedAt times as results.
    Then, sort pod container statuses based on these results. These results can easily be filtered out of user-exposed task results once Introducing InternalTektonResultType as a ResultType #3138 is merged.

I'm leaning towards option 3 because it seems like the most reliable way to always base the logs on the first terminated Step. Option 3 has briefly been mentioned before but lost traction, seemingly because of its somewhat higher complexity.
With option 3, the Step statuses in the TaskRun could show either the K8s-provided or the Tekton-written StartedAt and FinishedAt times.
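Option 2 (sorting by TaskSpec Step order, the approach ultimately adopted in #3256) can be sketched as follows. The function name and string-based representation are illustrative, not Tekton's actual API; the point is that a rank map built from the taskSpec gives a total order that never depends on timestamps:

```go
package main

import (
	"fmt"
	"sort"
)

// sortByStepOrder orders container status names by the Step order in
// the taskSpec rather than by timestamps. Both parameters are
// hypothetical simplifications of Tekton's real types.
func sortByStepOrder(statusNames []string, taskSpecOrder []string) {
	rank := make(map[string]int, len(taskSpecOrder))
	for i, name := range taskSpecOrder {
		rank[name] = i
	}
	sort.SliceStable(statusNames, func(i, j int) bool {
		return rank[statusNames[i]] < rank[statusNames[j]]
	})
}

func main() {
	// The API server may report container statuses in any order...
	statuses := []string{"step-canceled", "step-timeout"}
	// ...but the taskSpec fixes the intended execution order.
	order := []string{"step-timeout", "step-canceled"}
	sortByStepOrder(statuses, order)
	// Deterministic regardless of StartedAt/FinishedAt values.
	fmt.Println(statuses) // prints: [step-timeout step-canceled]
}
```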

Sorting just once?

Additionally, after the pod container statuses have been sorted, the Steps displayed under the TaskRun status are sorted according to the original Step order in the TaskSpec, regardless of FinishedAt or StartedAt times. Indeed, if Tekton properly serializes Steps, this should be the correct Step order anyway.

In short:

  • Tekton populates the status.conditions.message field for a terminated TaskRun based on the first terminated Step by sorting pod container statuses based on FinishedAt and StartedAt times.
  • The ordering of Steps displayed under the TaskRun is based on the original Step order in the TaskSpec regardless of FinishedAt or StartedAt times.

Perhaps it would make sense to only sort the pod container statuses (based on one of the proposed solutions) and also use this order to display Step statuses in the TaskRun.

Steps to Reproduce the Problem

  1. Install Step timeout #3087
  2. Run the TestStepTimeout integration test until this situation occurs.

Alternatively, a test could mock Steps with the same StartedAt and FinishedAt times which should confirm this bug.

Additional Info

  • Tekton Pipeline version: #3087
@Peaorl Peaorl added the kind/bug Categorizes issue or PR as related to a bug. label Sep 15, 2020
@bobcatfish
Collaborator

Thanks for the thorough investigation @Peaorl !

I like option 2 personally; unless we're including sidecars (and maybe we are? maybe that's why this is a problem? looking at

pipeline/pkg/pod/status.go

Lines 111 to 137 in 76aff13

        for _, s := range pod.Status.ContainerStatuses {
            if IsContainerStep(s.Name) {
                if s.State.Terminated != nil && len(s.State.Terminated.Message) != 0 {
                    message, time, err := removeStartInfoFromTerminationMessage(s)
                    if err != nil {
                        logger.Errorf("error setting the start time of step %q in taskrun %q: %w", s.Name, tr.Name, err)
                    }
                    if time != nil {
                        s.State.Terminated.StartedAt = *time
                        s.State.Terminated.Message = message
                    }
                }
                trs.Steps = append(trs.Steps, v1beta1.StepState{
                    ContainerState: *s.State.DeepCopy(),
                    Name:           trimStepPrefix(s.Name),
                    ContainerName:  s.Name,
                    ImageID:        s.ImageID,
                })
            } else if isContainerSidecar(s.Name) {
                trs.Sidecars = append(trs.Sidecars, v1beta1.SidecarState{
                    ContainerState: *s.State.DeepCopy(),
                    Name:           TrimSidecarPrefix(s.Name),
                    ContainerName:  s.Name,
                    ImageID:        s.ImageID,
                })
            }
        }
I don't think we are, though) I would expect the order the steps are evaluated in to reflect the actual order the containers run in.

Option 3 makes a lot of sense too, but it seems like we should already know the order the steps are supposed to run in 🤔 Maybe a pro of option 3 is that that would reflect the actual reality, in the case that something went wrong?

One con of option 3: I'm not sure how much control we have over the format of those fields, i.e. whether they are expected to match the format of similar fields in pod status (

StartedAt metav1.Time `json:"startedAt,omitempty" protobuf:"bytes,1,opt,name=startedAt"`
)

@bobcatfish bobcatfish added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Sep 15, 2020
@Peaorl
Contributor Author

Peaorl commented Sep 16, 2020

Of course! Thanks to @sbwsg for the leads too.

unless we're including sidecars (and maybe we are? maybe that's why this is a problem?

Exactly; Sidecars are not included when formulating a TaskRun status. See the NB section here or the code.
I'm not entirely sure how that would affect this situation though?

Option 3 makes a lot of sense too, but it seems like we should already know the order the steps are supposed to run in 🤔 Maybe a pro of option 3 is that that would reflect the actual reality, in the case that something went wrong?

Indeed I think option 3 supports actual reality as is the case now too (just not at a high enough resolution 😄). I think it would be interesting to know if this is why the timestamp solution was chosen in the first place.
Otherwise I agree that option 2 should work as well assuming Tekton always schedules the Steps correctly.

One con of option 3, im not sure how much control we have over the format of those fields, i.e. if they are expected to match the format of similar fields in pod status

I was thinking to write a Task result right around here. Tekton could then use this result only to order pod containers while the timestamp in the pod status could stay the lower resolution K8s provided timestamp.

@Peaorl Peaorl mentioned this issue Sep 16, 2020
@GregDritschler
Contributor

Option 3 makes a lot of sense too, but it seems like we should already know the order the steps are supposed to run in 🤔 Maybe a pro of option 3 is that that would reflect the actual reality, in the case that something went wrong?

When I investigated this logic before, I think one of my concerns about using taskspec step order is that it wouldn't handle a failure of an internal step (i.e. the ones inserted by tekton). I thought that's why the timestamp approach was needed.

@Peaorl
Contributor Author

Peaorl commented Sep 16, 2020

I see, would these internal Steps be initContainers? Because it seems like there's a separate container statuses field for initContainers.

@bobcatfish
Collaborator

I see, would these internal Steps be initContainers? Because it seems like there's a separate container statuses field for initContainers.

I think there might be an init container or 2 but they are definitely not all init containers, e.g. PipelineResources add containers before and after.

using taskspec step order is that it wouldn't handle a failure of an internal step (i.e. the ones inserted by tekton).

If we went this route, we'd need to take into account the order of all steps, including internal steps - we actually added the order of these "internal" steps to our api compatibility policy and tried to make it deterministic in #970

Peaorl added a commit to Peaorl/pipeline that referenced this issue Sep 18, 2020
This commit closes tektoncd#3239

Tekton determines the TaskRun status message of a failed TaskRun based
on the results of the first terminated Step (pod container). Until now,
Tekton sorted pod container statuses based on the FinishedAt and
StartedAt timestamps set by Kubernetes. Occasionally, a Step terminated
in response to the first terminated Step could have the same timestamps
as the first terminated Step. Therefore, Tekton was not always able to
correctly determine what the first terminated Step was, and as a result,
Tekton may set an incorrect TaskRun status message.

In this commit, pod container statuses are sorted based on the Step
order set in the taskSpec. This order ought to be correct as Tekton
enforces Steps to be scheduled in this order. In case Tekton adds extra
Steps (such as for pipelineresources), Tekton already updates the
taskSpec with these Steps. Therefore, Tekton accounts for these
internally added Steps when sorting.
Peaorl added a commit to Peaorl/pipeline that referenced this issue Sep 18, 2020
@Peaorl
Contributor Author

Peaorl commented Sep 18, 2020

Thanks all! I went ahead with option 2 in #3256

Peaorl added a commit to Peaorl/pipeline that referenced this issue Sep 22, 2020
This commit closes tektoncd#3239

Tekton determines the TaskRun status message of a failed TaskRun based
on the results of the first terminated Step (pod container). Until now,
Tekton sorted pod container statuses based on the FinishedAt and
StartedAt timestamps set by Kubernetes. Occasionally, a Step terminated
in response to the first terminated Step could have the same timestamps
as the first terminated Step. Therefore, Tekton was not always able to
correctly determine what the first terminated Step was, and as a result,
Tekton may set an incorrect TaskRun status message.

In this commit, pod container statuses are sorted based on the Step
order set in the taskSpec. This order ought to be correct as Tekton
enforces Steps to be scheduled in this order. In case Tekton adds extra
Steps (such as for pipelineresources), Tekton updates the taskSpec with
these Steps and makes the taskSpec available for sorting. Therefore,
Tekton accounts for these internally added Steps when sorting.
Peaorl added a commit to Peaorl/pipeline that referenced this issue Sep 22, 2020
This commit closes tektoncd#3239

Tekton determines the TaskRun status message of a failed TaskRun based
on the results of the first terminated Step (pod container). Until now,
Tekton sorted pod container statuses based on the FinishedAt and
StartedAt timestamps set by Kubernetes. Occasionally, a Step terminated
in response to the first terminated Step could have the same timestamps
as the first terminated Step. Therefore, Tekton was not always able to
correctly determine what the first terminated Step was, and as a result,
Tekton may set an incorrect TaskRun status message.

In this commit, pod container statuses are sorted based on the container
order as specified by Tekton in the podSpec. Tekton bases this order on
the user provided taskSpec and Steps added internally by Tekton.
Therefore, Tekton accounts for internally added Steps when
sorting pod container statuses.
tekton-robot pushed a commit that referenced this issue Sep 24, 2020