
resource hoarding for steps of a taskrun #4347

Closed
Fabian-K opened this issue Nov 2, 2021 · 11 comments · Fixed by #4472
Labels: kind/documentation

Fabian-K (Contributor) commented Nov 2, 2021:

Hi, we noticed the following issue after upgrading from v0.26.0 to v0.28.2.

Expected Behavior

When creating a TaskRun with multiple steps where the steps have resource requests defined, Tekton should not "hoard resources": the created Pod should only request the maximum of the step containers, not the sum of all step containers. This is also described in https://github.com/tektoncd/pipeline/blob/main/docs/tasks.md#specifying-resources: "The CPU, memory, and ephemeral storage resource requests will be set to zero (also known as BestEffort), or, if specified, the minimums set through LimitRanges in that Namespace, if the container image does not have the largest resource request out of all container images in the Task. This ensures that the Pod that executes the Task only requests enough resources to run a single container image in the Task rather than hoard resources for all container images in the Task at once."

Actual Behavior

Tekton does hoard resources in v0.28.2.

Steps to Reproduce the Problem

  1. Create a TaskRun with multiple steps and resource requests
  2. Check the resource requests of the created pod

TaskRun

apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  generateName: run-
spec:
  taskSpec:
    steps:
      - name: step-a
        image: alpine
        script: echo "step-a"
        resources:
          requests:
            cpu: 1
      - name: step-b
        image: alpine
        script: echo "step-b"
        resources:
          requests:
            cpu: 1

Resulting Pod using Tekton v0.28.2 - unexpected resource hoarding

apiVersion: v1
kind: Pod
metadata:
  name: run-w5m78-pod-lnl5t
spec:
  containers:
    - name: step-step-a
      resources:
        requests:
          cpu: "1"
    - name: step-step-b
      resources:
        requests:
          cpu: "1"

Resulting Pod using Tekton v0.26.0 - expected behavior, no resource hoarding

apiVersion: v1
kind: Pod
metadata:
  name: run-ttnjl-pod-rwj5x
spec:
  containers:
    - name: step-step-a
      resources:
        requests:
          cpu: "1"
          ephemeral-storage: "0"
          memory: "0"
    - name: step-step-b
      resources:
        requests:
          cpu: "0"
          ephemeral-storage: "0"
          memory: "0"

Additional Info

  • Kubernetes version: v1.21.1
  • Tekton Pipeline version: v0.28.2
@Fabian-K added the kind/bug label Nov 2, 2021
vdemeester (Member):

Hey @Fabian-K,

Thanks for the issue.

When creating a TaskRun with multiple steps where the steps have resource requests defined, Tekton should not "hoard resources": the created Pod should only request the maximum of the step containers, not the sum of all step containers.

This is not really a "choice" we have; it is how Kubernetes and Pods work, and we can only do our best to work around it.

This is also described in https://github.com/tektoncd/pipeline/blob/main/docs/tasks.md#specifying-resources with "The CPU, memory, and ephemeral storage resource requests will be set to zero (also known as BestEffort), or, if specified, the minimums set through LimitRanges in that Namespace, […]

Right, I did update some documentation (mainly around LimitRange), but I think I missed updating the part about the Task and steps.

But one thing is certain: we do not (want to) modify what the user asks for. If a user asks for 1 CPU for each step, we "require" 1 CPU for each container (i.e. we are not trying to be smart here). One reason is that some users are used to Pods and Containers and would want (in your example) 2 CPU for this Task, while others (like you) would want 1 CPU for it; it is very tricky to get this right for everybody, so the idea, at least on the step side, is to treat things specified by the user as non-modifiable. That said, a way forward here would be to let the user specify resource limits/requests at the Task level instead; then you would be able to say "however it is done on steps, I want this Task to hold 1 CPU for the whole Task".
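
To make that idea concrete, here is a rough sketch of what such a task-level request could look like. This is purely illustrative: the resources block directly under taskSpec below is a hypothetical field and does not exist in Tekton at the time of this issue.

apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  generateName: run-
spec:
  taskSpec:
    # Hypothetical field: one request for the whole Task, applied once to the
    # Pod instead of being summed across all step containers.
    resources:
      requests:
        cpu: 1
    steps:
      - name: step-a
        image: alpine
        script: echo "step-a"
      - name: step-b
        image: alpine
        script: echo "step-b"

With something like this, the per-step requests the user wrote would stay untouched, and the controller would be free to distribute the single task-level request across the Pod's containers however it needs to.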

Fabian-K (Contributor, Author) commented Nov 2, 2021:

Hi @vdemeester, thanks for the fast feedback! To me, the "smart" behavior of v0.26.0 made sense: when there are two steps each requesting 1 CPU, they run sequentially in Tekton, so it is sufficient for the Pod to request only 1 CPU and not 2. Yes, both containers still run at the same time, but while a step is not active it should not use many resources, and because Tekton adjusted the other requests to 0, the overall requested resources of the Pod matched. Because of the sequential execution, each step still had the amount of CPU available that it requested. The behavior of Tekton v0.28.2 effectively "blocks" 1 CPU that could be used for other workloads.

Specifying the resource requests at the Task level sounds like a good idea! And let me also add: the new behavior of being "stupid" and not adjusting the requests is fine. I was just surprised by it because I did not notice the behavior change in the release notes, and suddenly after the upgrade TaskRuns could not be scheduled any more because they effectively requested much more resources than before and the nodes were not big enough 😅

vdemeester (Member):

Specifying the resource requests at the Task level sounds like a good idea! And let me also add: the new behavior of being "stupid" and not adjusting the requests is fine. I was just surprised by it because I did not notice the behavior change in the release notes, and suddenly after the upgrade TaskRuns could not be scheduled any more because they effectively requested much more resources than before and the nodes were not big enough

Yeah, re-reading the release notes... I think I also did a poor job highlighting the effect on resource requests (without the LimitRange use case in mind), sorry 😅 😓.

Fabian-K (Contributor, Author) commented Nov 3, 2021:

So if I understand correctly, starting with 0.28 the expected behavior is "simple", i.e. don't change/update any requests that are present. With that, this is not a bug and can be closed.

However, I think it would make sense to update the docs, as the following is not valid anymore:

pipeline/docs/tasks.md, lines 166 to 174 at f2e25cf:

- The CPU, memory, and ephemeral storage resource requests will be set
to zero (also known as
[BestEffort](https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/#create-a-pod-that-gets-assigned-a-qos-class-of-besteffort)),
or, if specified, the minimums set through `LimitRanges` in that
`Namespace`, if the container image does not have the largest
resource request out of all container images in the `Task.` This
ensures that the Pod that executes the `Task` only requests enough
resources to run a single container image in the `Task` rather than
hoard resources for all container images in the `Task` at once.

vdemeester (Member):

Yes, we need to fix the docs (so... I consider it a "doc" bug 😛).

@jerop added the kind/documentation label Nov 15, 2021
@jerop removed the kind/bug label Jan 10, 2022
lbernick (Member):

Docs updated in #4472.

/close

@Fabian-K you might be interested in #4470

tekton-robot (Collaborator):

@lbernick: You can't close an active issue/PR unless you authored it or you are a collaborator.

In response to this:

Docs updated in #4472.

/close

@Fabian-K you might be interested in #4470

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

lbernick (Member):

/assign

lbernick (Member):

/close

tekton-robot (Collaborator):

@lbernick: You can't close an active issue/PR unless you authored it or you are a collaborator.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jerop linked a pull request Jan 24, 2022 that will close this issue
jerop (Member) commented Jan 24, 2022:

fixed in #4472

thanks @lbernick 🎉
