
Step start times are reported inaccurately #1512

Closed
imjasonh opened this issue Nov 1, 2019 · 23 comments · Fixed by #1722
Labels
good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines.
imjasonh (Member) commented Nov 1, 2019

Expected Behavior

Step start times are recorded and reported accurately.

Actual Behavior

Step start times are all reported to be the same as the start of the first step, because as far as Kubernetes is concerned all the containers started at once; the entrypoint binary is what ensures a step's actual work doesn't begin until the previous step finishes.

Steps to Reproduce the Problem

steps:
- image: ubuntu
  command: ['sleep', '1000']
- image: ubuntu
  command: ['echo', 'started']

After completing, this TaskRun's status incorrectly shows that the second step started at the same time as the first. It should be reported to have started roughly 1000 seconds after the first.

@imjasonh imjasonh added the good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. label Nov 1, 2019
othomann (Contributor) commented
How do you get the start time of each step? The only thing I could get using the taskrun describe command was:

Name:        echo-hello-world-run-k9xk2
Namespace:   default
Task Ref:    echo-hello-world

Status
STARTED          DURATION     STATUS
58 seconds ago   16 seconds   Succeeded

Input Resources
No resources

Output Resources
No resources

Params
No params

Steps
NAME    STATUS
sleep   Completed
echo    Completed

imjasonh (Member, Author) commented
kubectl get taskrun <foo> -oyaml and look in the .status.steps field. They’ll all have the ~same start time, and the same as the TaskRun, even if they started executing minutes apart.

othomann (Contributor) commented
OK, got it. Right now it looks like startedTime and finishedTime are taken from the corresponding container's status. When the step status is built, there is a deep copy of the container status, and then several fields are added. I am trying to locate where the actual container is executed.

othomann (Contributor) commented Dec 2, 2019

@imjasonh In order to report accurate times, the start and end time of a specific entrypoint should be saved to disk and then retrieved to update the status. Would it be possible for you to guide me on where this status update should take place?

imjasonh (Member, Author) commented Dec 2, 2019

@othomann That's a good question. One possible solution is to have the entrypoint binary write data to terminationMessagePath which would make it available to the controller without having to persist to some other location. The downside is that there's a 4KB size limit, and that this means steps themselves can't write to that path (or change it).

So, roughly:

  • Disallow user-specified terminationMessagePath and terminationMessagePolicy in steps (technically a breaking change, but probably fine)
  • Set that value ourselves when creating a Pod to run a TaskRun
  • Have the entrypoint binary write start time to that path.
  • Have the controller, when it sees Pod updates, pull out the start time from container state's message after it's terminated.

Code would live in pkg/pod/pod.go (to set the termination message path), pkg/pod/status.go (to pull out the start time from the status), and in cmd/entrypoint/waiter.go (to write to the path when execution starts).

ghost commented Dec 2, 2019

Careful overwriting the termination message path: it's utilized by pipeline resources to emit resource results from the steps they inject.

See https://github.com/tektoncd/pipeline/blob/master/pkg/termination/termination.go#L27 for the package that writes the messages and https://github.com/tektoncd/pipeline/blob/master/cmd/git-init/main.go#L60 for an example of its usage.

imjasonh (Member, Author) commented Dec 2, 2019

Ooh thanks for catching that!

So it looks like we'll either need to find some other way to handle this, or at least have it only append to data already present there, so it can play nicely with output resources.

Could WriteMessage look for an existing file, parse it as JSON, add whatever new key+value and write it back?

ghost commented Dec 2, 2019

Could WriteMessage look for an existing file, parse it as JSON, add whatever new key+value and write it back?

This sounds like a solid approach to me!

othomann (Contributor) commented Dec 3, 2019

So you would use the termination.WriteMessage(..) call to write a JSON file that contains the start and end time of the corresponding entrypoint? Do you have a particular file in mind that can be reused?

ghost commented Dec 3, 2019

termination.WriteMessage accepts a path string. Most of the time this will be /dev/termination-log, which is the Kubernetes default for the termination message, but it can be overridden by the user if they set the terminationMessagePath field of the step.

imjasonh (Member, Author) commented Dec 3, 2019

I think we should block users from setting a terminationMessagePath themselves, because we'll use it to signal start/end times (and probably other stuff in the future). We can block this in validation, and likely not break many real users.

With the user out of the way, we can point the path to wherever we want when running the user's steps. It only has to be a path that isn't likely to be used by anything else. We'll reserve /tekton/ paths from users, so we could point the termination message path to /tekton/termination or something.

othomann (Contributor) commented Dec 3, 2019

It would need to be a different path for each entrypoint, since we want to report accurate start and end times in the status for each step. So the path must be unique per step, right? Also, once this is done, it needs to be used to update the status of the corresponding step. What kind of events can we use to trigger a status update?
Sorry if I'm asking stupid questions; I am trying to get a better understanding of how taskruns are instantiated. Thanks for your comments.

imjasonh (Member, Author) commented Dec 3, 2019

It doesn't have to be a unique path, since the contents of that path won't be persisted across step containers (it's not backed by a volume). Each step only needs to report its own start time (step end times are already reported correctly), plus whatever output resource metadata the step writes.

These aren't stupid questions at all, we're just designing this as we go! 😄

othomann (Contributor) commented Dec 3, 2019

Is this something that I should expose as a new option for the entrypoint, or should it be kept hidden? Once the file is available, I don't quite see where the update needs to be done. Any hints for this?
Thanks.

imjasonh (Member, Author) commented Dec 3, 2019

I think we can just go with a convention that the entrypoint binary appends start time to a JSON file at /tekton/termination, and that we set that as each container's terminationMessagePath in pkg/pod/pod.go.

Let me know if you'd like to pair on this over Hangouts or Zoom, it might be easier.

othomann (Contributor) commented Dec 4, 2019

@imjasonh it might actually be great to pair on this. Let me start with the new JSON file first. I'll ping you tomorrow about this. Thanks.

othomann (Contributor) commented Dec 4, 2019

@imjasonh let me know how you want to proceed with Hangouts, if you have time for this of course :-).

othomann (Contributor) commented Dec 4, 2019

Right now I am saving this in the /tekton/termination file:

[
   {
      "name":"",
      "digest":"",
      "key":"StartedAt",
      "value":"2019-12-04 18:54:34.879767287 +0000 UTC m=+2.010932041",
      "resourceRef":{

      }
   },
   {
      "name":"",
      "digest":"",
      "key":"FinishedAt",
      "value":"2019-12-04 18:59:34.881511357 +0000 UTC m=+302.012676561",
      "resourceRef":{

      }
   }
]

In my case the two steps are 5 minutes apart, and we can see the finished time is 5 minutes after the start time. Now I think it is a matter of updating the step's status with these values.

imjasonh (Member, Author) commented Dec 4, 2019

@othomann We actually already get accurate step finish times, because completed step containers exit and report their finish time. So just reporting start time should be fine.

As for receiving the written information in the Pod status, first you'll need to set each container's terminationMessagePath to /tekton/termination, here-ish. Then when the Pod runs, its step statuses will report that information in .status.message. pkg/pod/status.go can pull that information out, parse the JSON and report the start time as that step's start time.

othomann (Contributor) commented Dec 4, 2019

As far as I can see, all finished times are identical for all containers. This is why I added the finished time as well in the entrypoint code.
This is what I have for one taskrun's output:

MacBook-Pro-de-Olivier:~ olivier$ k get taskrun echo-hello-world-run-58r4m -oyaml
apiVersion: tekton.dev/v1alpha1
kind: TaskRun
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"tekton.dev/v1alpha1","kind":"Task","metadata":{"annotations":{},"name":"echo-hello-world","namespace":"default"},"spec":{"steps":[{"command":["sleep","300"],"image":"ubuntu","name":"sleep"},{"args":["hello world"],"command":["echo"],"image":"ubuntu","name":"echo"}]}}
  creationTimestamp: "2019-12-04T17:28:59Z"
  generateName: echo-hello-world-run-
  generation: 1
  labels:
    tekton.dev/task: echo-hello-world
  name: echo-hello-world-run-58r4m
  namespace: default
  resourceVersion: "20833065"
  selfLink: /apis/tekton.dev/v1alpha1/namespaces/default/taskruns/echo-hello-world-run-58r4m
  uid: 8ebc0c8e-16bb-11ea-8208-b21158039d49
spec:
  inputs: {}
  outputs: {}
  podTemplate: {}
  serviceAccountName: ""
  taskRef:
    kind: Task
    name: echo-hello-world
  timeout: 1h0m0s
status:
  completionTime: "2019-12-04T17:34:06Z"
  conditions:
  - lastTransitionTime: "2019-12-04T17:34:06Z"
    message: All Steps have completed executing
    reason: Succeeded
    status: "True"
    type: Succeeded
  podName: echo-hello-world-run-58r4m-pod-vljhj
  startTime: "2019-12-04T17:28:59Z"
  steps:
  - container: step-sleep
    imageID: docker.io/library/ubuntu@sha256:6e9f67fa63b0323e9a1e587fd71c561ba48a034504fb804fd26fd8800039835d
    name: sleep
    terminated:
      containerID: containerd://5ddf7cd2c3dc40d79e97583fdf5b46c95939c487ba36a8092562ce4d9582274b
      exitCode: 0
      finishedAt: "2019-12-04T17:34:05Z"
      reason: Completed
      startedAt: "2019-12-04T17:29:04Z"
  - container: step-echo
    imageID: docker.io/library/ubuntu@sha256:6e9f67fa63b0323e9a1e587fd71c561ba48a034504fb804fd26fd8800039835d
    name: echo
    terminated:
      containerID: containerd://42572d1ab6632c58fb76dc11035aefd501658eb9f4e94168f03b68ad7ce454fa
      exitCode: 0
      finishedAt: "2019-12-04T17:34:05Z"
      reason: Completed
      startedAt: "2019-12-04T17:29:04Z"

I believe the finished times are not accurate.

imjasonh (Member, Author) commented Dec 4, 2019

Hmm, I'm seeing step finishedAt reported correctly. Specifying these steps:

apiVersion: tekton.dev/v1alpha1
kind: TaskRun
metadata:
  generateName: step-finish-time-
spec:
  taskSpec:
    steps:
    - image: busybox
      args: ['-c', 'sleep 10']
    - image: busybox
      args: ['-c', 'sleep 10']
    - image: busybox
      args: ['-c', 'sleep 10']
    - image: busybox
      args: ['-c', 'sleep 10']
    - image: busybox
      args: ['-c', 'sleep 10']

Produced these statuses (kubectl get taskrun step-finish-time-2nb82 -oyaml, fields elided):

status:
  startTime: "2019-12-04T20:06:27Z"
  completionTime: "2019-12-04T20:07:28Z" <-- took ~1m total
  steps:
  - container: step-unnamed-0
    terminated:
      startedAt: "2019-12-04T20:06:31Z" <-- 4s after Pod start time
      finishedAt: "2019-12-04T20:06:45Z" <-- 14s after start time
  - container: step-unnamed-1
    terminated:
      startedAt: "2019-12-04T20:06:31Z" <-- (same start time throughout)
      finishedAt: "2019-12-04T20:06:55Z" <-- finished 10s later
  - container: step-unnamed-2
    terminated:
      startedAt: "2019-12-04T20:06:32Z"
      finishedAt: "2019-12-04T20:07:06Z" <-- finished 11s later
  - container: step-unnamed-3
    terminated:
      startedAt: "2019-12-04T20:06:32Z"
      finishedAt: "2019-12-04T20:07:16Z" <-- finished 10s later
  - container: step-unnamed-4
    terminated:
      startedAt: "2019-12-04T20:06:33Z"
      finishedAt: "2019-12-04T20:07:27Z" <-- finished 11s later

othomann (Contributor) commented Dec 6, 2019

@imjasonh could you please assign this issue to me so that nobody else tries to fix it? I have the code working now and am looking at adding some tests for it.

imjasonh (Member, Author) commented Dec 6, 2019

/assign othomann
