
Step start times are reported inaccurately #1512

Closed
imjasonh opened this issue Nov 1, 2019 · 23 comments · Fixed by #1722
Labels
good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines.
imjasonh (Member) commented Nov 1, 2019

Expected Behavior

Step start times are recorded and reported accurately.

Actual Behavior

Step start times are all reported to be the same as the start of the first step, because as far as Kubernetes is concerned all the containers started at once; the entrypoint binary is what ensures a step's actual work doesn't begin until the previous step finishes.

Steps to Reproduce the Problem

steps:
- image: ubuntu
  command: ['sleep', '1000']
- image: ubuntu
  command: ['echo', 'started']

After completing, this TaskRun's status incorrectly shows that the second step started at the same time as the first. It should be reported to have started roughly 1000 seconds after the first.

@imjasonh imjasonh added the good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. label Nov 1, 2019
othomann (Contributor) commented
How do you get the start time of each step? The only thing I could get using the taskrun describe command was:

Name:        echo-hello-world-run-k9xk2
Namespace:   default
Task Ref:    echo-hello-world

Status
STARTED          DURATION     STATUS
58 seconds ago   16 seconds   Succeeded

Input Resources
No resources

Output Resources
No resources

Params
No params

Steps
NAME    STATUS
sleep   Completed
echo    Completed

imjasonh (Member, Author) commented
kubectl get taskrun <foo> -oyaml and look in the .status.steps field. They’ll all have the ~same start time, and the same as the TaskRun, even if they started executing minutes apart.

othomann (Contributor) commented
OK, got it. Right now it looks like startedTime and finishedTime are taken from the corresponding container's status. When the step status is built, there is a deep copy of the container status, and then several fields are added. I am trying to locate where the actual container is executed.

othomann (Contributor) commented Dec 2, 2019

@imjasonh In order to report accurate times, the start and end time of a specific entrypoint should be saved to disk and then retrieved to update the status. Would it be possible for you to guide me on where this status update should take place?

imjasonh (Member, Author) commented Dec 2, 2019

@othomann That's a good question. One possible solution is to have the entrypoint binary write data to terminationMessagePath which would make it available to the controller without having to persist to some other location. The downside is that there's a 4KB size limit, and that this means steps themselves can't write to that path (or change it).

So, roughly:

  • Disallow user-specified terminationMessagePath and terminationMessagePolicy in steps (technically a breaking change, but probably fine)
  • Set that value ourselves when creating a Pod to run a TaskRun
  • Have the entrypoint binary write start time to that path.
  • Have the controller, when it sees Pod updates, pull out the start time from container state's message after it's terminated.

Code would live in pkg/pod/pod.go (to set the termination message path), pkg/pod/status.go (to pull out the start time from the status), and in cmd/entrypoint/waiter.go (to write to the path when execution starts).

ghost commented Dec 2, 2019

Careful overwriting the termination message path: it's utilized by pipeline resources to emit resource results from the steps they inject.

See https://github.com/tektoncd/pipeline/blob/master/pkg/termination/termination.go#L27 for the package that writes the messages and https://github.com/tektoncd/pipeline/blob/master/cmd/git-init/main.go#L60 for an example of its usage.

imjasonh (Member, Author) commented Dec 2, 2019

Ooh thanks for catching that!

So it looks like we'll either need to find some other way to handle this, or at least have it only append to data already present there, so it can play nicely with output resources.

Could WriteMessage look for an existing file, parse it as JSON, add whatever new key+value and write it back?

ghost commented Dec 2, 2019

Could WriteMessage look for an existing file, parse it as JSON, add whatever new key+value and write it back?

This sounds like a solid approach to me!

othomann (Contributor) commented Dec 3, 2019

So you would use the termination.WriteMessage(..) call to write a JSON file that contains the start and end time of the corresponding entrypoint? Do you have a particular file in mind that can be reused?

ghost commented Dec 3, 2019

termination.WriteMessage accepts a path string. Most of the time this will be /dev/termination-log, which is the Kubernetes default for the termination message, but it can be overridden by the user if they set the terminationMessagePath field of the step.

imjasonh (Member, Author) commented Dec 3, 2019

I think we should block users from setting a terminationMessagePath themselves, because we'll use it to signal start/end times (and probably other stuff in the future). We can block this in validation, and likely not break many real users.

With the user out of the way, we can point the path to wherever we want when running the user's steps. It only has to be a path that isn't likely to be used by anything else. We'll reserve /tekton/ paths from users, so we could point the termination message path to /tekton/termination or something.

othomann (Contributor) commented Dec 3, 2019

It would need to be a different path for each entrypoint, since we want to report accurate start and end times in the status for each step. So the path must be unique per step, right? Also, once this is done, it needs to be used to update the status of the corresponding step. What kind of events can we use to trigger a status update?
Sorry if I'm asking stupid questions; I am trying to get a better understanding of how taskruns are instantiated. Thanks for your comments.

imjasonh (Member, Author) commented Dec 3, 2019

It doesn't have to be a unique path, since the contents of that path won't be persisted across step containers (it's not backed by a volume). Each step only needs to report its own start time (step end times are already reported correctly), plus whatever output resource metadata the step writes.

These aren't stupid questions at all, we're just designing this as we go! 😄

othomann (Contributor) commented Dec 3, 2019

Is this something that I should expose as a new option for the entrypoint, or should it be kept hidden? Once the file is available, I don't quite see where the update needs to be done. Any hints for this?
Thanks.

imjasonh (Member, Author) commented Dec 3, 2019

I think we can just go with a convention that the entrypoint binary appends start time to a JSON file at /tekton/termination, and that we set that as each container's terminationMessagePath in pkg/pod/pod.go.

Let me know if you'd like to pair on this over Hangouts or Zoom, it might be easier.

othomann (Contributor) commented Dec 4, 2019

@imjasonh it might actually be great to pair on this. Let me start with the new JSON file first. I'll ping you tomorrow about this. Thanks.

othomann (Contributor) commented Dec 4, 2019

@imjasonh let me know how you want to proceed with Hangouts, if you have time for this of course :-).

othomann (Contributor) commented Dec 4, 2019

Right now I am saving this in the /tekton/termination file:

[
   {
      "name":"",
      "digest":"",
      "key":"StartedAt",
      "value":"2019-12-04 18:54:34.879767287 +0000 UTC m=+2.010932041",
      "resourceRef":{

      }
   },
   {
      "name":"",
      "digest":"",
      "key":"FinishedAt",
      "value":"2019-12-04 18:59:34.881511357 +0000 UTC m=+302.012676561",
      "resourceRef":{

      }
   }
]

In my case the two steps are 5 minutes apart, and we can see the finished time is 5 minutes after the start time. Now I think it is a matter of updating the step's status with these values.

imjasonh (Member, Author) commented Dec 4, 2019

@othomann We actually already get accurate step finish times, because completed step containers exit and report their finish time. So just reporting start time should be fine.

As for receiving the written information in the Pod status, first you'll need to set each container's terminationMessagePath to /tekton/termination, here-ish. Then when the Pod runs, its step statuses will report that information in .status.message. pkg/pod/status.go can pull that information out, parse the JSON and report the start time as that step's start time.

othomann (Contributor) commented Dec 4, 2019

As far as I can see, all finished times are identical for all containers. This is why I added the finished time as well in the entrypoint code.
This is what I have for one taskrun's output:

MacBook-Pro-de-Olivier:~ olivier$ k get taskrun echo-hello-world-run-58r4m -oyaml
apiVersion: tekton.dev/v1alpha1
kind: TaskRun
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"tekton.dev/v1alpha1","kind":"Task","metadata":{"annotations":{},"name":"echo-hello-world","namespace":"default"},"spec":{"steps":[{"command":["sleep","300"],"image":"ubuntu","name":"sleep"},{"args":["hello world"],"command":["echo"],"image":"ubuntu","name":"echo"}]}}
  creationTimestamp: "2019-12-04T17:28:59Z"
  generateName: echo-hello-world-run-
  generation: 1
  labels:
    tekton.dev/task: echo-hello-world
  name: echo-hello-world-run-58r4m
  namespace: default
  resourceVersion: "20833065"
  selfLink: /apis/tekton.dev/v1alpha1/namespaces/default/taskruns/echo-hello-world-run-58r4m
  uid: 8ebc0c8e-16bb-11ea-8208-b21158039d49
spec:
  inputs: {}
  outputs: {}
  podTemplate: {}
  serviceAccountName: ""
  taskRef:
    kind: Task
    name: echo-hello-world
  timeout: 1h0m0s
status:
  completionTime: "2019-12-04T17:34:06Z"
  conditions:
  - lastTransitionTime: "2019-12-04T17:34:06Z"
    message: All Steps have completed executing
    reason: Succeeded
    status: "True"
    type: Succeeded
  podName: echo-hello-world-run-58r4m-pod-vljhj
  startTime: "2019-12-04T17:28:59Z"
  steps:
  - container: step-sleep
    imageID: docker.io/library/ubuntu@sha256:6e9f67fa63b0323e9a1e587fd71c561ba48a034504fb804fd26fd8800039835d
    name: sleep
    terminated:
      containerID: containerd://5ddf7cd2c3dc40d79e97583fdf5b46c95939c487ba36a8092562ce4d9582274b
      exitCode: 0
      finishedAt: "2019-12-04T17:34:05Z"
      reason: Completed
      startedAt: "2019-12-04T17:29:04Z"
  - container: step-echo
    imageID: docker.io/library/ubuntu@sha256:6e9f67fa63b0323e9a1e587fd71c561ba48a034504fb804fd26fd8800039835d
    name: echo
    terminated:
      containerID: containerd://42572d1ab6632c58fb76dc11035aefd501658eb9f4e94168f03b68ad7ce454fa
      exitCode: 0
      finishedAt: "2019-12-04T17:34:05Z"
      reason: Completed
      startedAt: "2019-12-04T17:29:04Z"

I believe the finished times are not accurate.

imjasonh (Member, Author) commented Dec 4, 2019

Hmm, I'm seeing step finishedAt reported correctly. Specifying these steps:

apiVersion: tekton.dev/v1alpha1
kind: TaskRun
metadata:
  generateName: step-finish-time-
spec:
  taskSpec:
    steps:
    - image: busybox
      args: ['-c', 'sleep 10']
    - image: busybox
      args: ['-c', 'sleep 10']
    - image: busybox
      args: ['-c', 'sleep 10']
    - image: busybox
      args: ['-c', 'sleep 10']
    - image: busybox
      args: ['-c', 'sleep 10']

Produced these statuses (kubectl get taskrun step-finish-time-2nb82 -oyaml, fields elided):

status:
  startTime: "2019-12-04T20:06:27Z"
  completionTime: "2019-12-04T20:07:28Z" <-- took ~1m total
  steps:
  - container: step-unnamed-0
    terminated:
      startedAt: "2019-12-04T20:06:31Z" <-- 4s after Pod start time
      finishedAt: "2019-12-04T20:06:45Z" <-- 14s after start time
  - container: step-unnamed-1
    terminated:
      startedAt: "2019-12-04T20:06:31Z" <-- (same start time throughout)
      finishedAt: "2019-12-04T20:06:55Z" <-- finished 10s later
  - container: step-unnamed-2
    terminated:
      startedAt: "2019-12-04T20:06:32Z"
      finishedAt: "2019-12-04T20:07:06Z" <-- finished 11s later
  - container: step-unnamed-3
    terminated:
      startedAt: "2019-12-04T20:06:32Z"
      finishedAt: "2019-12-04T20:07:16Z" <-- finished 10s later
  - container: step-unnamed-4
    terminated:
      startedAt: "2019-12-04T20:06:33Z"
      finishedAt: "2019-12-04T20:07:27Z" <-- finished 11s later

othomann (Contributor) commented Dec 6, 2019

@imjasonh could you please assign this issue to me so that nobody else tries to fix it? I have the code working now and am looking at adding some tests for it.

imjasonh (Member, Author) commented Dec 6, 2019

/assign othomann
