
We need the pods that are executing a taskRun to be in qosClass Guaranteed #4046

Closed
dfuessl opened this issue Jun 16, 2021 · 6 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@dfuessl

dfuessl commented Jun 16, 2021

Expected Behavior

It should be possible to get the pods executing a taskRun in Quality of Service Class "Guaranteed".

Actual Behavior

Currently we see no way to get the Quality of Service Class of these pods higher than "Burstable".

Steps to Reproduce the Problem

  1. Create a simple task
  2. Specify resource requests and resource limits for all steps and make requests and limits equal (see the sketch after this list).
  3. Start a taskRun using this task
  4. Find the k8s pod that executes the task
  5. Get the pod definition: kubectl get pod -o yaml ...
  6. Look for the string "qosClass" in the pod definition
  7. We find "qosClass: Burstable"
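
For reference, a minimal Task and TaskRun of the shape described above could look like this (the names, image and resource values are illustrative, not taken from a real pipeline):

    apiVersion: tekton.dev/v1beta1
    kind: Task
    metadata:
      name: qos-demo                 # hypothetical name
    spec:
      steps:
        - name: build
          image: alpine
          script: |
            echo "pretend build step"
          resources:                 # requests == limits for this step
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: 500m
              memory: 512Mi
    ---
    apiVersion: tekton.dev/v1beta1
    kind: TaskRun
    metadata:
      name: qos-demo-run             # hypothetical name
    spec:
      taskRef:
        name: qos-demo

Tekton labels the pod with the TaskRun name, so something like the following shows the class:

    kubectl get pod -l tekton.dev/taskRun=qos-demo-run -o yaml | grep qosClass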

Additional Info

In k8s, when we make the resource limits equal to the requests, we get a pod with qosClass Guaranteed. We would like the same for a Tekton TaskRun pod. We run Continuous Integration/Continuous Delivery with Tekton and we do not want the TaskRun pods to be evicted when there is heavy load on the system. But apparently this is currently not possible.
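
For comparison, this is the plain-Kubernetes behaviour we refer to: a pod whose containers all have requests equal to limits is classified as Guaranteed (illustrative example, not from our cluster):

    apiVersion: v1
    kind: Pod
    metadata:
      name: guaranteed-demo          # hypothetical name
    spec:
      containers:
        - name: app
          image: alpine
          command: ["sleep", "3600"]
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: 250m
              memory: 256Mi

    kubectl get pod guaranteed-demo -o jsonpath='{.status.qosClass}'
    # prints: Guaranteed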

One possible cause may be that the initContainers (place-tools and place-scripts) have no resource specification at all. We do not see a way to make them run with resource requests and resource limits.

Another problem may be that all step containers but the largest run with zero resource requests. We understand that this may be intended to solve issue #598. But it means that there is currently no way to make requests and limits equal for every container in the pod, so we cannot get the qosClass Guaranteed.
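
Concretely, the step containers in the generated pod end up looking roughly like this (simplified, illustrative excerpt of kubectl get pod -o yaml; the container names are made up):

    containers:
      - name: step-small
        resources:
          requests:                  # zeroed out by Tekton
            cpu: "0"
            memory: "0"
          limits:                    # limits stay as specified in the Task
            cpu: 500m
            memory: 512Mi
      - name: step-big               # only the largest step keeps its requests
        resources:
          requests:
            cpu: "1"
            memory: 2Gi
          limits:
            cpu: "1"
            memory: 2Gi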

As all steps in the TaskRun are executed sequentially, would it be possible to run them all as initContainers? Currently only place-tools and place-scripts run as initContainers. This is only an idea; I cannot estimate what the implications of a change like this would be.

  • Kubernetes version:

    Output of kubectl version:

    Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.7", GitCommit:"1dd5338295409edcfff11505e7bb246f0d325d15", GitTreeState:"clean", BuildDate:"2021-01-13T13:23:52Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:12:29Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
    
    
  • Tekton Pipeline version:

  Client version: 0.18.0
  Pipeline version: v0.24.1
  Triggers version: v0.14.1
@dfuessl dfuessl added the kind/bug Categorizes issue or PR as related to a bug. label Jun 16, 2021
@afrittoli
Member

Thank you @dfuessl for the detailed bug report.

The current behaviour in Tekton is indeed related to the fact that steps are executed sequentially, and we do not want to request from the k8s scheduler the sum of the resources needed by all steps (and init containers). See the docs for reference too.

We attempted to use initContainers for steps in the beginning of the Tekton project, but it did not work well - see this issue from 2018 (!) for more background.

Would it be possible to expand a bit more on your use case?
Are your tasks idempotent or do they require a cleanup before retry?

Some ideas:

  • You could use retries on tasks (see the sketch after this list). This way an evicted pod would fail the TaskRun, but the TaskRun would be re-executed for you. We could introduce some mechanism in the TaskRun / PipelineRun controller to do retries only in case of pod eviction.
  • We could spread the resources required across steps and init containers, so that the sum is what we need. It would then be possible to have requests and limits matching and to get the Guaranteed qosClass. I don't know yet what the implications of this would be; it's just an idea and I'm not sure if it's viable.
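
A rough sketch of the first idea, using the existing retries field on a pipeline task (names are illustrative):

    apiVersion: tekton.dev/v1beta1
    kind: Pipeline
    metadata:
      name: ci-pipeline              # hypothetical name
    spec:
      tasks:
        - name: build
          retries: 2                 # re-run the TaskRun up to two more times on failure (including eviction)
          taskRef:
            name: build-task         # hypothetical Task

Note that retries today kick in on any failure; retrying only on eviction would be the new mechanism mentioned above.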

/cc @bobcatfish @imjasonh

I'm going to convert this into a feature request, as the current behaviour is as designed - even though this design does not currently meet your needs 🙏

@afrittoli afrittoli added kind/feature Categorizes issue or PR as related to a new feature. and removed kind/bug Categorizes issue or PR as related to a bug. labels Jun 17, 2021
@dfuessl
Author

dfuessl commented Jun 17, 2021

Thank you very much, @afrittoli, for your answer.

I have read through #224 and I understand now that it is not a good idea to use initContainers for all the steps in a task.

Our use case: some pushes to (and merges on) the git repository that contains the code base of our Java application trigger a pipelineRun. We clone the git repo, build the system, do static code analysis, execute several test suites (using web application servers, ephemeral databases and mock web services) and do other quality checks. We push test results to dedicated git repositories. When everything is OK, we push the produced Java artifacts to an artifact repository. Then we deploy the application to a server and do some additional tests. Finally an e-mail is sent to the developers. Unfortunately, some of the tasks are not idempotent.

I think re-executing a taskRun that has been evicted due to high workload is not the best idea, because this would make the workload even higher. Also, some tasks are not really idempotent. On the other hand, our Java developers would like to be able to rely on our pipelines. In particular, the merges in the git repository are not repeatable for them, so we have a problem.

Spreading resources across steps and init containers? I am not quite sure if I understand this correctly, but I think the limits apply to each step individually. So the resulting limit of the pod would be equal to the sum of the limits. And as we need the requests to be equal to the limits, the requests for the pod would also be equal to the sum of the requests. Hence we would be back to the pre-#598 situation.

The key point, in my opinion, is that k8s expects all containers to execute in parallel and therefore reserves the sum of the resources of the containers (init containers aside), whereas Tekton executes all containers (except sidecars) sequentially.
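
To put illustrative numbers on it: with three steps that each need 1 CPU and 1Gi of memory, scheduling them as ordinary parallel containers would make the pod reserve 3 CPU and 3Gi, while the sequential execution never actually uses more than 1 CPU and 1Gi at a time (plus whatever the sidecars need).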

In our pipelines it would currently not be a big problem to go back to the pre-#598 behaviour (as far as memory usage is concerned). We usually have big sidecars and one big step which consume lots of memory. The other steps are auxiliary and do not need many resources. But in general it would, of course, be a real problem.

Maybe we must redesign our pipelines so that each task has only one step. But this would produce a lot of ugly boilerplate code. We would have to rethink the modularity and the design of our container images. And then we would still have the problem with the init containers (place-tools and place-scripts). Is there a way (e.g. via podTemplates) to give them resource requests and limits?

@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 15, 2021
@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 14, 2021
@tekton-robot
Collaborator

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot
Collaborator

@tekton-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
