Reattempt pod creation in the face of ResourceQuota errors #905
Conversation
/lgtm
A few minor nits on tests.
c, _ := test.SeedTestData(t, d)
observer, _ := observer.New(zap.InfoLevel)
testHandler := NewTimeoutHandler(c.Kube, c.Pipeline, stopCh, zap.New(observer).Sugar())
dur := 50 * time.Millisecond
where does this duration come from? Is it just random?
Renamed variable to be more clear on its purpose here - it's the timer's duration. Added another variable beneath this one to document the timer's failure case more clearly.
}
testHandler.SetTaskRunCallbackFunc(callback)
testHandler.SetTaskRunTimer(taskRun, dur)
<-time.After(100 * time.Millisecond)
Hmm, I always get a weird feeling when I see hardcoded sleeps in tests. What about another channel you wait on with a select (and a timeout like this) that you write to in the callback?
Nice, updated to a select {} that relies on a done channel and an expiration timer.
/lgtm
/lgtm
Looking good! I left some minor feedback and then my most major piece of feedback: could waitRun and setTimer share more code (i.e. become one function) and then be able to share tests as well?
/meow boxes
pkg/reconciler/timeout_handler.go
Outdated
b.count += 1
b.deadline = time.Now().Add(backoffDuration(b.count))
t.backoffs[runObj.GetRunKey()] = b
return b.count, b.deadline, true
this is super minor, but im curious why you wanted to return b.count and b.deadline as two separate values here, vs. returning b directly? (i.e. return b, true)
The backoff struct felt to me like an implementation detail of the internal state of the backoff-tracking mechanism and I didn't sense a strong reason to expose a new public type in the package's interface. I don't feel strongly about this and am happy to change it but am curious what might be the benefit of making it public?
but am curious what might be the benefit of making it public?
The only one I can think of is simplifying the signature of the function a bit - since you already have a structure!
It reminded me of a pattern I've often seen where a function takes too many parameters or returns too many things, and this is often a sign that another abstraction can be added that would simplify the interface (this website talks about it a bit: https://refactoring.guru/smells/long-parameter-list disclaimer I have never seen this website before haha) - and it seemed like what was happening here was doing the opposite
Just a super minor thing tho!
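To make the comparison concrete, here is a rough sketch of the two shapes being discussed. The names (tracker, trackerV2, getOrCreateBackoff) are placeholders rather than the actual Tekton types, and the backoff computation is elided behind a plain duration argument:

```go
package timeouts

import "time"

// Today's shape: the struct stays unexported and its fields are returned
// one by one (plus a bool, as in the snippet above).
type backoff struct {
	count    uint
	deadline time.Time
}

type tracker struct{ backoffs map[string]backoff }

func (t *tracker) getOrCreateBackoff(key string, d time.Duration) (uint, time.Time, bool) {
	b := t.backoffs[key]
	b.count++
	b.deadline = time.Now().Add(d)
	t.backoffs[key] = b
	return b.count, b.deadline, true
}

// Suggested shape: export the struct and return it directly, which shortens
// the signature at the cost of a new public type in the package's interface.
type Backoff struct {
	NumAttempts uint
	NextAttempt time.Time
}

type trackerV2 struct{ backoffs map[string]Backoff }

func (t *trackerV2) GetOrCreateBackoff(key string, d time.Duration) (Backoff, bool) {
	b := t.backoffs[key]
	b.NumAttempts++
	b.NextAttempt = time.Now().Add(d)
	t.backoffs[key] = b
	return b, true
}
```

Either way the map update is identical; the question is purely whether the returned shape is worth a new exported type.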
Reason: reason,
Message: fmt.Sprintf("%s: %v", msg, err),
})
c.Recorder.Eventf(tr, corev1.EventTypeWarning, "BuildCreationFailed", "Failed to create build pod %q: %v", tr.Name, err)
c.Logger.Errorf("Failed to create build pod for task %q :%v", err, tr.Name)
c.Logger.Errorf("Failed to create build pod for task %q: %v", tr.Name, err)
what do you think about moving the logic around determining the status of the Run out of this function and into a separate function that could have its own tests?
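A minimal sketch of the kind of extraction being suggested; statusForPodCreationError and the quota check below are hypothetical, and runStatus is a simplified stand-in for the real TaskRun condition type:

```go
package taskrun

import "strings"

// runStatus is a simplified stand-in for the condition the reconciler
// actually writes on the TaskRun.
type runStatus struct {
	Succeeded string // "Unknown" = still trying, "False" = failed
	Reason    string
	Message   string
}

// statusForPodCreationError decides how a pod-creation error should be
// reflected on the TaskRun, so the decision can be tested on its own.
// The quota check here is a placeholder, not the real detection logic.
func statusForPodCreationError(err error) runStatus {
	if strings.Contains(err.Error(), "exceeded quota") {
		// Recoverable: keep the run in Unknown and let the reconciler retry.
		return runStatus{Succeeded: "Unknown", Reason: "ExceededResourceQuota", Message: err.Error()}
	}
	// Everything else stays fatal for the run.
	return runStatus{Succeeded: "False", Reason: "BuildCreationFailed", Message: err.Error()}
}
```

A pure function like this can then be covered by a table-driven test without any fake clients.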
Looking great! My feedback is becoming more and more minor X'D
// NumAttempts reflects the number of times a given StatusKey has been delayed
NumAttempts uint
// NextAttempt is the point in time at which this backoff expires
NextAttempt time.Time
ah nice, i like the new names!
t.Run(tc.description, func(t *testing.T) {
receivedDuration := GetTimeout(tc.inputDuration)
if receivedDuration != tc.expectedDuration {
t.Errorf("expected %q received %q", tc.expectedDuration.String(), receivedDuration.String())
nice! i think the test is super clear this way :D
stopCh := make(chan struct{})
c, _ := test.SeedTestData(t, d)
observer, _ := observer.New(zap.InfoLevel)
testHandler := NewTimeoutHandler(c.Kube, c.Pipeline, stopCh, zap.New(observer).Sugar())
i just realized something but this is totally just scope creep so I totally understand if you want to ignore me XD
I looked at this test and I was like "why does it need a kube client and a pipeline client"? and after looking into it, it looks like those clients only exist in the timeout handler to support this one call to CheckTimeouts - which means that instead of passing the clients into the timeout handler, the CheckTimeouts function could take them as arguments :D which if I'm right would simplify the setup required for this test
anyway feel free to ignore this XD
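Roughly what that would look like in the test, reusing the fixtures shown above and assuming NewTimeoutHandler would then drop the client arguments (that resulting constructor signature is an assumption, not shown in this thread):

```go
// Handler construction no longer needs any clients...
stopCh := make(chan struct{})
observer, _ := observer.New(zap.InfoLevel)
testHandler := NewTimeoutHandler(stopCh, zap.New(observer).Sugar())

// ...and only the tests that actually exercise CheckTimeouts have to build
// the fake kube/pipeline clients.
c, _ := test.SeedTestData(t, d)
testHandler.CheckTimeouts(c.Kube, c.Pipeline)
```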
timerFailDeadline := 100 * time.Millisecond
doneCh := make(chan struct{})
callback := func(_ interface{}) {
close(doneCh)
you could also have this function set a boolean or something that you could use to assert that the function was called or not called as expected
I initially implemented it using a bool that the callback mutated but this felt a little icky due to the inline sleep:
done := false
testHandler.SetTaskRunCallbackFunc(func(_ interface{}) { done = true })
go testHandler.SetTaskRunTimer(taskRun, timerDuration)
<-time.After(timerFailDeadline) // ick?
if !done {
t.Errorf("timer did not execute task run callback func within expected time")
}
I kind of like how the select approach races the two channels against each other. Full credit to @dlorenc (#905 (comment)) for pointing this out.
I may have completely misinterpreted either or both approaches though so lmk if there's something I'm missing.
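For comparison, a sketch of the select-based version being described, reusing the same fixtures as the snippet above (testHandler, taskRun, timerDuration, timerFailDeadline):

```go
doneCh := make(chan struct{})
testHandler.SetTaskRunCallbackFunc(func(_ interface{}) { close(doneCh) })
go testHandler.SetTaskRunTimer(taskRun, timerDuration)

select {
case <-doneCh:
	// Callback fired before the deadline: the timer worked.
case <-time.After(timerFailDeadline):
	t.Errorf("timer did not execute task run callback func within expected time")
}
```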
packageJitterFunc := jitterFunc
for _, tc := range testcases {
t.Run(tc.description, func(t *testing.T) {
jitterFunc = tc.jitterFunc
super minor: you could do this assignment before the for loop (i.e. it doesn't need to happen for every test)
@bobcatfish I am guessing this is resolved as the jitterFunc comes from the test cases (so got to be packageJitterFunc)
few minor comments 👼
for _, tc := range testcases {
t.Run(tc.description, func(t *testing.T) {
jitterFunc = tc.jitterFunc
result := backoffDuration(tc.inputCount)
I wonder if we could pass the jitterFunc as an argument to backoffDuration instead. This would remove the need to set (or have) the jitterFunc global variable (just for the test) 👼
// in code
backoffDuration(b.NumAttempts, rand.Intn)
// in tests
backoffDuration(tc.inputCount, tc.jitterFunc)
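A minimal sketch of what that signature could look like; the exponential/jitter math and the constants below are illustrative, not the actual Tekton implementation:

```go
package timeouts

import (
	"math"
	"time"
)

const (
	maxBackoffExponent = 10
	maxBackoffSeconds  = 120
)

// backoffDuration computes the wait before the next attempt. The jitter
// function (rand.Intn in production code, a deterministic func in tests)
// is injected instead of living in a package-level variable.
func backoffDuration(count uint, jitterFunc func(int) int) time.Duration {
	exp := float64(count)
	if exp > maxBackoffExponent {
		exp = maxBackoffExponent // cap the exponent so the backoff stays bounded
	}
	seconds := int(math.Exp2(exp))
	jittered := 1 + jitterFunc(seconds)
	if jittered > maxBackoffSeconds {
		jittered = maxBackoffSeconds
	}
	return time.Duration(jittered) * time.Second
}
```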
@bobcatfish @vdemeester thanks for the feedback so far. I've updated
Looks good to me
I'll leave the final word to @bobcatfish
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbwsg, vdemeester

The full list of commands accepted by this bot can be found here. The pull request process is described here.
@@ -122,22 +181,22 @@ func (t *TimeoutSet) checkPipelineRunTimeouts(namespace string) {

// CheckTimeouts function iterates through all namespaces and calls corresponding
// taskrun/pipelinerun timeout functions
func (t *TimeoutSet) CheckTimeouts() {
namespaces, err := t.kubeclientset.CoreV1().Namespaces().List(metav1.ListOptions{})
func (t *TimeoutSet) CheckTimeouts(kubeclientset kubernetes.Interface, pipelineclientset clientset.Interface) {
niiiiice thanks for this cleanup with the clients!!! 😍 😍 😍
Looks awesome!! Thanks @sbwsg :D and thanks for the clean up too ^.^ /lgtm
ah @sbwsg, needs a rebase 🙇♂️
TaskRuns create pods. Pods created in a namespace with ResourceQuota can be rejected on the basis of their resource requirements. This rejection manifests in the TaskRun reconciler as an error from the createPod() function. Prior to this commit all errors from createPod() were considered fatal and the associated TaskRun would be marked as failed.

This commit introduces a process for reattempting pod creation in the face of ResourceQuota errors. When a ResourceQuota error is encountered, the TR reconciler now performs the following work:

- The TaskRun is marked as Succeeded/Unknown with reason ExceededResourceQuota and a message including the number of reattempts that have been made
- A reconcile is scheduled for the TaskRun using a backoff w/ jitter strategy

Pod creation will be continually reattempted until the ResourceQuota errors stop or the TaskRun times out.
🤞
/lgtm
Scott has been providing super detailed, helpful reviews for a while now (58 reviews as of https://tekton.devstats.cd.foundation/d/46/pr-reviews-by-contributor?orgId=1&var-period=d&var-repo_name=tektoncd%2Fpipeline&var-reviewers="sbwsg"). He has contributed useful and technically challenging features such as #905 (recreating pods in the face of ResourceQuota errors), drove completion of #936 (graceful sidecar support), and also #871 (enforcing default TaskRun timeouts).
Changes
Fixes #734
TLDR
TaskRuns used to fail immediately if their pods hit the ceiling of a ResourceQuota. Now they don't fail immediately and instead reattempt creating the pod until success or timeout.
Long Version:
TaskRuns create pods. Pods created in a namespace with ResourceQuota can be rejected on the basis of their resource requirements. This rejection manifests in the TaskRun reconciler as an error from the createPod() function.
Prior to this commit all errors from createPod() were considered fatal and the associated TaskRun would be marked as failed. This commit introduces a process for reattempting pod creation in the face of ResourceQuota errors.
When a ResourceQuota error is encountered, the TR reconciler now performs the following work:

- The TaskRun is marked as Succeeded/Unknown with reason ExceededResourceQuota and a message including the number of reattempts that have been made
- A reconcile is scheduled for the TaskRun using a backoff w/ jitter strategy

Pod creation will be continually reattempted until the ResourceQuota errors stop or the TaskRun times out.
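As a condensed sketch of that flow (the helpers isExceededResourceQuotaError, getOrCreateBackoff, markUnknown, markFailed, and enqueueAfter are hypothetical stand-ins for the reconciler's real plumbing, and createPod's signature is simplified):

```go
// Sketch only: how a pod-creation error is handled after this change.
err := createPod(tr)
if err != nil {
	if isExceededResourceQuotaError(err) {
		// Recoverable: leave the TaskRun in Succeeded/Unknown and try again later.
		attempts, nextAttempt := backoffs.getOrCreateBackoff(tr.GetRunKey())
		markUnknown(tr, "ExceededResourceQuota",
			fmt.Sprintf("%v, reattempted %d times", err, attempts))
		// Schedule another reconcile once the jittered backoff deadline passes.
		enqueueAfter(tr, time.Until(nextAttempt))
		return nil
	}
	// Any other pod-creation error is still fatal for the TaskRun.
	markFailed(tr, "BuildCreationFailed", err.Error())
	return err
}
```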
Submitter Checklist
These are the criteria that every PR should meet, please check them off as you review them:
Includes docs (if user facing)
Screenshots
Before
TaskRuns that hit a ResourceQuota ceiling immediately fail out:
After
TaskRuns that hit a ResourceQuota are placed into a Succeeded/Unknown state and retry their pod creation until success or timeout.
Release Notes