
Pipeline controller shouldn't retry creating pod when the error cannot be mitigated by retry #4092

Closed
jialindai opened this issue Jul 13, 2021 · 5 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@jialindai

Expected Behavior

When executing a PipelineRun, pod creation sometimes fails with an error that cannot be mitigated by retrying. In that case, the pipeline controller should simply fail the PipelineRun instead of retrying pod creation.

In my case, the error is a pod creation failure caused by insufficient resource quota in the namespace.

Actual Behavior

The pipeline controller keeps trying to create the pod even though there is not enough quota in the namespace.

Steps to Reproduce the Problem

  1. Create a namespace with a limited resource quota
  2. Create a pipeline whose pods request more resources than the namespace allows
  3. Observe that the pipeline controller keeps retrying to create the pod
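
The reproduction can be sketched with manifests along these lines (namespace, names, and amounts are illustrative, not taken from the reporter's setup):

```yaml
# Illustrative reproduction manifests (names and amounts are made up).
# The quota caps the namespace well below what the Task's step requests,
# so the pod is rejected with an "exceeded quota" error on creation.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tiny-quota
  namespace: quota-test
spec:
  hard:
    requests.cpu: "100m"
    requests.memory: 128Mi
---
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: over-quota-task
  namespace: quota-test
spec:
  steps:
    - name: big-step
      image: busybox
      script: echo hello
      resources:
        requests:
          cpu: "1"   # more than the 100m the quota allows
```

Running this Task from a pipeline should trigger the retry loop described above.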

Additional Info

  • Kubernetes version (output of kubectl version): 1.18
  • Tekton Pipeline version: v0.22.0

@jialindai jialindai added the kind/bug Categorizes issue or PR as related to a bug. label Jul 13, 2021
@bobcatfish
Collaborator

hey @jialindai !

The pipeline controller keeps trying to create the pod even though there is not enough quota in the namespace.

I'm wondering: how would the pipeline controller know that the situation couldn't be mitigated? i.e., what if more quota became available in the namespace later? It sounds like that won't happen in your case, but I think it could for someone else (if I'm wrong, maybe you can provide some more details about your setup - e.g. is there some way to conclusively know that quota won't become available?)

You might also find #734 interesting, which is all about scheduling in resource-constrained environments - in that case we intentionally retry with backoff, waiting until resources are available.

For your specific case, it might make sense for you to create a controller (or maybe even cron?) which observes PipelineRuns in the state you are describing and cancels them.
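
A minimal sketch of the decision logic such an out-of-band controller (or cron job) might use. The condition structure mirrors a PipelineRun's `status.conditions`; the `FATAL_MARKERS` list and `should_cancel()` are illustrative names, not part of the Tekton API, and the marker strings are assumptions about what the error message contains:

```python
# Sketch of the cancellation decision an out-of-band controller could make.
# The dict shape mirrors a PipelineRun's status.conditions; FATAL_MARKERS
# and should_cancel() are illustrative, not Tekton API.

FATAL_MARKERS = (
    "exceeded quota",   # namespace ResourceQuota rejection (assumed wording)
    "forbidden",        # admission/policy denial (assumed wording)
)

def should_cancel(pipelinerun: dict) -> bool:
    """Return True if the run is still in progress but stuck on an error
    that retrying pod creation is unlikely to fix."""
    for cond in pipelinerun.get("status", {}).get("conditions", []):
        if cond.get("type") != "Succeeded":
            continue
        message = cond.get("message", "").lower()
        if cond.get("status") == "Unknown" and any(m in message for m in FATAL_MARKERS):
            return True
    return False

stuck_run = {
    "status": {
        "conditions": [{
            "type": "Succeeded",
            "status": "Unknown",
            "message": 'pods "build-pod" is forbidden: exceeded quota: compute-quota',
        }]
    }
}
print(should_cancel(stuck_run))  # True -> patch spec.status to cancel the run
```

A real controller would watch PipelineRuns via the Kubernetes API and, when this returns True, patch the run's `spec.status` to cancel it.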

@bobcatfish bobcatfish added kind/feature Categorizes issue or PR as related to a new feature. triage/needs-information Indicates an issue needs more information in order to work on it. and removed kind/bug Categorizes issue or PR as related to a bug. labels Jul 15, 2021
@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 15, 2021
@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 14, 2021
@tekton-robot
Collaborator

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot
Collaborator

@tekton-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
