
We need the pods that are executing a taskRun to be in qosClass Guaranteed #4046

Closed
dfuessl opened this issue Jun 16, 2021 · 6 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@dfuessl

dfuessl commented Jun 16, 2021

Expected Behavior

It should be possible to get the pods executing a taskRun in Quality of Service Class "Guaranteed".

Actual Behavior

Currently we see no way to get the Quality of Service Class of these pods higher than "Burstable".

Steps to Reproduce the Problem

  1. Create a simple task
  2. Specify resource requests and resource limits for all steps and make requests and limits equal (see the sketch after this list).
  3. Start a taskRun using this task
  4. Find the k8s pod that executes the task
  5. Get the pod definition: kubectl get pod -o yaml ...
  6. Look for the string "qosClass" in the pod definition
  7. We find "qosClass: Burstable"
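
For reference, a minimal Task and TaskRun of the shape described above could look like this (the names, image and resource values are illustrative, not taken from a real pipeline):

    apiVersion: tekton.dev/v1beta1
    kind: Task
    metadata:
      name: qos-demo                 # hypothetical name
    spec:
      steps:
        - name: build
          image: alpine
          script: |
            echo "pretend build step"
          resources:                 # requests == limits for this step
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: 500m
              memory: 512Mi
    ---
    apiVersion: tekton.dev/v1beta1
    kind: TaskRun
    metadata:
      name: qos-demo-run             # hypothetical name
    spec:
      taskRef:
        name: qos-demo

Tekton labels the pod with the TaskRun name, so something like the following shows the class:

    kubectl get pod -l tekton.dev/taskRun=qos-demo-run -o yaml | grep qosClass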

Additional Info

In k8s, when we make the resource limits equal to the requests, we get a pod with qosClass Guaranteed. We would like the same for a Tekton TaskRun pod. We run Continuous Integration/Continuous Delivery with Tekton and we do not want the TaskRun pods to be evicted when there is heavy load on the system. But apparently this is currently not possible.
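
For comparison, this is the plain-Kubernetes behaviour we refer to: a pod whose containers all have requests equal to limits is classified as Guaranteed (illustrative example, not from our cluster):

    apiVersion: v1
    kind: Pod
    metadata:
      name: guaranteed-demo          # hypothetical name
    spec:
      containers:
        - name: app
          image: alpine
          command: ["sleep", "3600"]
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: 250m
              memory: 256Mi

    kubectl get pod guaranteed-demo -o jsonpath='{.status.qosClass}'
    # prints: Guaranteed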

One possible cause may be that the initContainers (place-tools and place-scripts) have no resource specification at all. We do not see a way to make them run with resource requests and resource limits.

Another problem may be that all step containers but the largest run with zero resource requests. We understand that this may be intended to solve issue #598. But it means that there is currently no way to make requests and limits equal for every container in the pod, so we cannot get the qosClass Guaranteed.
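
Concretely, the step containers in the generated pod end up looking roughly like this (simplified, illustrative excerpt of kubectl get pod -o yaml; the container names are made up):

    containers:
      - name: step-small
        resources:
          requests:                  # zeroed out by Tekton
            cpu: "0"
            memory: "0"
          limits:                    # limits stay as specified in the Task
            cpu: 500m
            memory: 512Mi
      - name: step-big               # only the largest step keeps its requests
        resources:
          requests:
            cpu: "1"
            memory: 2Gi
          limits:
            cpu: "1"
            memory: 2Gi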

As all steps in the TaskRun are executed sequentially, would it be possible to run them all as initContainers? Currently only place-tools and place-scripts run as initContainers. This is only an idea; I cannot estimate what the implications of a change like this would be.

  • Kubernetes version:

    Output of kubectl version:

    Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.7", GitCommit:"1dd5338295409edcfff11505e7bb246f0d325d15", GitTreeState:"clean", BuildDate:"2021-01-13T13:23:52Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:12:29Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
    
    
  • Tekton Pipeline version:

  Client version: 0.18.0
  Pipeline version: v0.24.1
  Triggers version: v0.14.1
@dfuessl dfuessl added the kind/bug Categorizes issue or PR as related to a bug. label Jun 16, 2021
@afrittoli
Member

Thank you @dfuessl for the detailed bug report.

The current behaviour in Tekton is indeed related to the fact that steps are executed sequentially, and we do not want to request from the k8s scheduler the sum of the resources needed by all steps (and init containers). See the docs for reference too.

We attempted to use initContainers for steps in the beginning of the Tekton project, but it did not work well - see this issue from 2018 (!) for more background.

Would it be possible to expand a bit more on your use case?
Are your tasks idempotent or do they require a cleanup before retry?

Some ideas:

  • You could use retries on tasks (see the sketch after this list). This way an evicted pod would fail the TaskRun, but the TaskRun would be re-executed for you. We could introduce some mechanism in the TaskRun / PipelineRun controller to do retries only in case of pod eviction.
  • We could spread the resources required across steps and init containers, so that the sum is what we need. It would then be possible to have requests and limits matching and to get the Guaranteed qosClass. I don't know yet what the implications of this would be; it's just an idea and I'm not sure if it's viable.
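
A rough sketch of the first idea, using the existing retries field on a pipeline task (names are illustrative):

    apiVersion: tekton.dev/v1beta1
    kind: Pipeline
    metadata:
      name: ci-pipeline              # hypothetical name
    spec:
      tasks:
        - name: build
          retries: 2                 # re-run the TaskRun up to two more times on failure (including eviction)
          taskRef:
            name: build-task         # hypothetical Task

Note that retries today kick in on any failure; retrying only on eviction would be the new mechanism mentioned above.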

/cc @bobcatfish @imjasonh

I'm going to convert this into a feature request, as the current behaviour is as designed - even though this design does not currently meet your needs 🙏

@afrittoli afrittoli added kind/feature Categorizes issue or PR as related to a new feature. and removed kind/bug Categorizes issue or PR as related to a bug. labels Jun 17, 2021
@dfuessl
Author

dfuessl commented Jun 17, 2021

Thank you very much, @afrittoli, for your answer.

I have read through #224 and I understand now that it is not a good idea to use initContainers for all the steps in a task.

Our use case: some pushes to (and merges on) the git repository that contains the code base of our Java application trigger a pipelineRun. We clone the git repo, build the system, do static code analysis, execute several test suites (using web application servers, ephemeral databases and mock web services) and do other quality checks. We push test results to dedicated git repositories. When everything is OK, we push the produced Java artifacts to an artifact repository. Then we deploy the application to a server and do some additional tests. Finally an e-mail is sent to the developers. Unfortunately, some of the tasks are not idempotent.

I think re-executing a taskRun that has been evicted due to high workload is not the best idea, because this would make the workload even higher. Also, some tasks are not really idempotent. On the other hand, our Java developers would like to be able to rely on our pipelines. In particular, the merges in the git repository are not repeatable for them, so we have a problem.

Spreading resources across steps and init containers? I am not quite sure if I understand this correctly, but I think the limits apply to each step individually. So the resulting limit of the pod would be equal to the sum of the limits. And as we need the requests to be equal to the limits, the requests for the pod would also be equal to the sum of the requests. Hence we would be back to the pre-#598 situation.

The key point, in my opinion, is that k8s expects all containers to execute in parallel and therefore reserves the sum of the resources of the containers (init containers aside), whereas Tekton executes all containers (except sidecars) sequentially.
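
To put illustrative numbers on it: with three steps that each need 1 CPU and 1Gi of memory, scheduling them as ordinary parallel containers would make the pod reserve 3 CPU and 3Gi, while the sequential execution never actually uses more than 1 CPU and 1Gi at a time (plus whatever the sidecars need).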

In our pipelines it would currently not be a big problem to go back to the pre-#598 behaviour (as far as memory usage is concerned). We usually have big sidecars and one big step which consume lots of memory. The other steps are auxiliary and do not need many resources. But in general it would, of course, be a real problem.

Maybe we must redesign our pipelines so that each task has only one step. But this would produce a lot of ugly boilerplate code. We would have to rethink the modularity and the design of our container images. And then we would still have the problem with the init containers (place-tools and place-scripts). Is there a way (e.g. via podTemplates) to give them resource requests and limits?

@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 15, 2021
@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 14, 2021
@tekton-robot
Collaborator

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot
Collaborator

@tekton-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
