resource hoarding for steps of a taskrun #4347
Hey @Fabian-K, thanks for the issue.
This is not a "choice" we made; it is how Kubernetes and Pods work, and we can only do our best to work around it.
Right, I did update some documentation (mainly around LimitRange), but I think I missed updating the part about Tasks and steps. One thing is certain, though: we do not (want to) modify what the user asks for. If a user asks for 1 CPU for each step, we "require" 1 CPU for each container (i.e. we are not trying to get smart here). One reason: some users may be used to Pods and containers and will want (in your example) 2 CPU for this Task, while others (like you) will want 1 CPU for it. It's very tricky to get this right for everybody, so the idea, at least on the step side, is to treat things specified by the user as non-modifiable. That said, a way forward here would be to let the user specify resource limits/requests at the Task level instead; then you would be able to say "however it is done on steps, I want this Task to hold 1 CPU for the whole Task".
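(Purely illustrative: a sketch of what such a task-level specification could look like. No such field existed at the time of this discussion; the `computeResources` name below is a hypothetical placeholder, not an existing API.)

```yaml
apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  name: task-level-resources-example   # hypothetical name
spec:
  taskRef:
    name: my-task                      # hypothetical Task
  # Hypothetical field: "however it is done on steps, hold 1 CPU
  # for the whole Task".
  computeResources:
    requests:
      cpu: "1"
```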
Hi @vdemeester, thanks for the fast feedback! To me, the "smart" behavior of 0.26.0 made sense: when two steps each request 1 CPU, and they run sequentially in Tekton, it is sufficient for the Pod to request only 1 CPU rather than 2. Yes, both containers still run at the same time, but a step that is not active should not use many resources, and because Tekton adjusted the other requests to 0, the overall requested resources of the Pod matched. Thanks to the sequential execution, each step still had the amount of CPU available that it requested. The behavior of Tekton 0.28.2 effectively "blocks" 1 CPU that could be used for other workloads. Specifying the resource requests at the Task level sounds like a good idea! And let me add: the new behavior of being "stupid" and not adjusting the requests is fine; I was just surprised by it because I did not notice the behavior change in the release notes, and suddenly after the upgrade TaskRuns could not be scheduled any more because they effectively requested much more resources than before and the nodes were not big enough 😅
Yeah, re-reading the release note, I think I also did a poor job highlighting the effect on resource requests (without the LimitRange use case in mind), sorry 😅 😓.
So if I understand correctly, starting with 0.28 the expected behavior is "simple", i.e. don't change/update any requests that are present. With that, this is not a bug and can be closed. I think, however, it would make sense to update the docs, as the following is no longer valid: Lines 166 to 174 in f2e25cf
Yes, we need to fix the docs (so I consider it a "doc" bug 😛).
/assign |
/close |
@lbernick: You can't close an active issue/PR unless you authored it or you are a collaborator.
Hi, we noticed the following issue after upgrading from v0.26.0 to v0.28.2.
Expected Behavior
When creating a TaskRun with multiple steps that have resource requests defined, Tekton should not "hoard resources": the created Pod should only request the maximum across the step containers, not the sum of all of them. This is also described in https://github.com/tektoncd/pipeline/blob/main/docs/tasks.md#specifying-resources: "The CPU, memory, and ephemeral storage resource requests will be set to zero (also known as BestEffort), or, if specified, the minimums set through LimitRanges in that Namespace, if the container image does not have the largest resource request out of all container images in the Task. This ensures that the Pod that executes the Task only requests enough resources to run a single container image in the Task rather than hoard resources for all container images in the Task at once."
Actual Behavior
Tekton does hoard resources in v0.28.2: the created Pod requests the sum of all step containers' resource requests.
Steps to Reproduce the Problem
TaskRun
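(The original YAML is not preserved here; the following is a minimal sketch, with hypothetical names and images, of a TaskRun with two steps each requesting 1 CPU.)

```yaml
apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  name: resource-hoarding-example   # hypothetical name
spec:
  taskSpec:
    steps:
      - name: step-1
        image: busybox
        script: echo "step 1"
        resources:
          requests:
            cpu: "1"
      - name: step-2
        image: busybox
        script: echo "step 2"
        resources:
          requests:
            cpu: "1"
```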
Resulting Pod using Tekton v0.28.2 - unexpected resource hoarding
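(Illustrative excerpt, not the original output: with v0.28.2, each step container keeps the request as specified, so the Pod requests 2 CPU in total.)

```yaml
# Pod spec excerpt (illustrative)
containers:
  - name: step-step-1
    resources:
      requests:
        cpu: "1"   # kept as specified
  - name: step-step-2
    resources:
      requests:
        cpu: "1"   # kept as specified; the Pod requests 2 CPU in total
```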
Resulting Pod using Tekton v0.26.0 - expected behavior, no resource hoarding
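(Illustrative excerpt, not the original output: with v0.26.0, only the step with the largest request keeps it; the other requests are zeroed, so the Pod requests 1 CPU in total.)

```yaml
# Pod spec excerpt (illustrative)
containers:
  - name: step-step-1
    resources:
      requests:
        cpu: "1"   # largest request is kept
  - name: step-step-2
    resources:
      requests:
        cpu: "0"   # zeroed out; the Pod requests 1 CPU in total
```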
Additional Info
Kubernetes version: v1.21.1
Tekton Pipeline version: v0.28.2