Reattempt pod creation in the face of ResourceQuota errors #905
Conversation
/lgtm
A few minor nits on tests.
c, _ := test.SeedTestData(t, d)
observer, _ := observer.New(zap.InfoLevel)
testHandler := NewTimeoutHandler(c.Kube, c.Pipeline, stopCh, zap.New(observer).Sugar())
dur := 50 * time.Millisecond
where does this duration come from? Is it just random?
Renamed variable to be more clear on its purpose here - it's the timer's duration. Added another variable beneath this one to document the timer's failure case more clearly.
}
testHandler.SetTaskRunCallbackFunc(callback)
testHandler.SetTaskRunTimer(taskRun, dur)
<-time.After(100 * time.Millisecond)
Hmm, I always get a weird feeling when I see hardcoded sleeps in tests. What about another channel you wait on with a select (and a timeout like this) that you write to in the callback?
Nice, updated to a select {} that relies on a done channel and an expiration timer.
/lgtm
/lgtm
Looking good! I left some minor feedback and then my most major piece of feedback: could waitRun and setTimer share more code (i.e. become one function) and then be able to share tests as well?
/meow boxes
pkg/reconciler/timeout_handler.go
Outdated
b.count += 1
b.deadline = time.Now().Add(backoffDuration(b.count))
t.backoffs[runObj.GetRunKey()] = b
return b.count, b.deadline, true
this is super minor, but im curious why you wanted to return b.count and b.deadline as two separate values here, vs. returning b directly? (i.e. return b, true)
The backoff struct felt to me like an implementation detail of the internal state of the backoff-tracking mechanism and I didn't sense a strong reason to expose a new public type in the package's interface. I don't feel strongly about this and am happy to change it but am curious what might be the benefit of making it public?
but am curious what might be the benefit of making it public?
The only one I can think of is simplifying the signature of the function a bit - since you already have a structure!
It reminded me of a pattern I've often seen where a function takes too many parameters or returns too many things, and this is often a sign that another abstraction can be added that would simplify the interface (this website talks about it a bit: https://refactoring.guru/smells/long-parameter-list disclaimer I have never seen this website before haha) - and it seemed like what was happening here was doing the opposite
Just a super minor thing tho!
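To make the comparison concrete, here is a rough sketch of the two shapes being discussed. The names (tracker, trackerV2, getOrCreateBackoff) are placeholders rather than the actual Tekton types, and the backoff computation is elided behind a plain duration argument:

```go
package timeouts

import "time"

// Today's shape: the struct stays unexported and its fields are returned
// one by one (plus a bool, as in the snippet above).
type backoff struct {
	count    uint
	deadline time.Time
}

type tracker struct{ backoffs map[string]backoff }

func (t *tracker) getOrCreateBackoff(key string, d time.Duration) (uint, time.Time, bool) {
	b := t.backoffs[key]
	b.count++
	b.deadline = time.Now().Add(d)
	t.backoffs[key] = b
	return b.count, b.deadline, true
}

// Suggested shape: export the struct and return it directly, which shortens
// the signature at the cost of a new public type in the package's interface.
type Backoff struct {
	NumAttempts uint
	NextAttempt time.Time
}

type trackerV2 struct{ backoffs map[string]Backoff }

func (t *trackerV2) GetOrCreateBackoff(key string, d time.Duration) (Backoff, bool) {
	b := t.backoffs[key]
	b.NumAttempts++
	b.NextAttempt = time.Now().Add(d)
	t.backoffs[key] = b
	return b, true
}
```

Either way the map update is identical; the question is purely whether the returned shape is worth a new exported type.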
Reason: reason,
Message: fmt.Sprintf("%s: %v", msg, err),
})
c.Recorder.Eventf(tr, corev1.EventTypeWarning, "BuildCreationFailed", "Failed to create build pod %q: %v", tr.Name, err)
c.Logger.Errorf("Failed to create build pod for task %q :%v", err, tr.Name)
c.Logger.Errorf("Failed to create build pod for task %q: %v", tr.Name, err)
what do you think about moving the logic around determining the status of the Run out of this function and into a separate function that could have its own tests?
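A minimal sketch of the kind of extraction being suggested; statusForPodCreationError and the quota check below are hypothetical, and runStatus is a simplified stand-in for the real TaskRun condition type:

```go
package taskrun

import "strings"

// runStatus is a simplified stand-in for the condition the reconciler
// actually writes on the TaskRun.
type runStatus struct {
	Succeeded string // "Unknown" = still trying, "False" = failed
	Reason    string
	Message   string
}

// statusForPodCreationError decides how a pod-creation error should be
// reflected on the TaskRun, so the decision can be tested on its own.
// The quota check here is a placeholder, not the real detection logic.
func statusForPodCreationError(err error) runStatus {
	if strings.Contains(err.Error(), "exceeded quota") {
		// Recoverable: keep the run in Unknown and let the reconciler retry.
		return runStatus{Succeeded: "Unknown", Reason: "ExceededResourceQuota", Message: err.Error()}
	}
	// Everything else stays fatal for the run.
	return runStatus{Succeeded: "False", Reason: "BuildCreationFailed", Message: err.Error()}
}
```

A pure function like this can then be covered by a table-driven test without any fake clients.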
Looking great! My feedback is becoming more and more minor X'D
// NumAttempts reflects the number of times a given StatusKey has been delayed
NumAttempts uint
// NextAttempt is the point in time at which this backoff expires
NextAttempt time.Time
ah nice, i like the new names!
t.Run(tc.description, func(t *testing.T) {
receivedDuration := GetTimeout(tc.inputDuration)
if receivedDuration != tc.expectedDuration {
t.Errorf("expected %q received %q", tc.expectedDuration.String(), receivedDuration.String())
nice! i think the test is super clear this way :D
stopCh := make(chan struct{})
c, _ := test.SeedTestData(t, d)
observer, _ := observer.New(zap.InfoLevel)
testHandler := NewTimeoutHandler(c.Kube, c.Pipeline, stopCh, zap.New(observer).Sugar())
i just realized something but this is totally just scope creep so I totally understand if you want to ignore me XD
I looked at this test and I was like "why does it need a kube client and a pipeline client"? and after looking into it, it looks like those clients only exist in the timeout handler to support this one call to CheckTimeouts - which means that instead of passing the clients into the timeout handler, the CheckTimeouts function could take them as arguments :D which if I'm right would simplify the setup required for this test
anyway feel free to ignore this XD
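Roughly what that would look like in the test, reusing the fixtures shown above and assuming NewTimeoutHandler would then drop the client arguments (that resulting constructor signature is an assumption, not shown in this thread):

```go
// Handler construction no longer needs any clients...
stopCh := make(chan struct{})
observer, _ := observer.New(zap.InfoLevel)
testHandler := NewTimeoutHandler(stopCh, zap.New(observer).Sugar())

// ...and only the tests that actually exercise CheckTimeouts have to build
// the fake kube/pipeline clients.
c, _ := test.SeedTestData(t, d)
testHandler.CheckTimeouts(c.Kube, c.Pipeline)
```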
timerFailDeadline := 100 * time.Millisecond
doneCh := make(chan struct{})
callback := func(_ interface{}) {
close(doneCh)
you could also have this function set a boolean or something that you could use to assert that the function was called or not called as expected
I initially implemented it using a bool that the callback mutated but this felt a little icky due to the inline sleep:
done := false
testHandler.SetTaskRunCallbackFunc(func(_ interface{}) { done = true })
go testHandler.SetTaskRunTimer(taskRun, timerDuration)
<-time.After(timerFailDeadline) // ick?
if !done {
t.Errorf("timer did not execute task run callback func within expected time")
}
I kind of like how the select approach races the two channels against each other. Full credit to @dlorenc (#905 (comment)) for pointing this out.
I may have completely misinterpreted either or both approaches though so lmk if there's something I'm missing.
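For comparison, a sketch of the select-based version being described, reusing the same fixtures as the snippet above (testHandler, taskRun, timerDuration, timerFailDeadline):

```go
doneCh := make(chan struct{})
testHandler.SetTaskRunCallbackFunc(func(_ interface{}) { close(doneCh) })
go testHandler.SetTaskRunTimer(taskRun, timerDuration)

select {
case <-doneCh:
	// Callback fired before the deadline: the timer worked.
case <-time.After(timerFailDeadline):
	t.Errorf("timer did not execute task run callback func within expected time")
}
```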
packageJitterFunc := jitterFunc
for _, tc := range testcases {
t.Run(tc.description, func(t *testing.T) {
jitterFunc = tc.jitterFunc
super minor: you could do this assignment before the for loop (i.e. it doesn't need to happen for every test)
@bobcatfish I am guessing this is resolved as the jitterFunc comes from the test cases (so got to be packageJitterFunc)
few minor comments 👼
for _, tc := range testcases {
t.Run(tc.description, func(t *testing.T) {
jitterFunc = tc.jitterFunc
result := backoffDuration(tc.inputCount)
I wonder if we could pass the jitterFunc as an argument to backoffDuration instead. This would remove the need to set (or have) the jitterFunc global variable (just for the test) 👼
// in code
backoffDuration(b.NumAttempts, rand.Intn)
// in tests
backoffDuration(tc.inputCount, tc.jitterFunc)
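A minimal sketch of what that signature could look like; the exponential/jitter math and the constants below are illustrative, not the actual Tekton implementation:

```go
package timeouts

import (
	"math"
	"time"
)

const (
	maxBackoffExponent = 10
	maxBackoffSeconds  = 120
)

// backoffDuration computes the wait before the next attempt. The jitter
// function (rand.Intn in production code, a deterministic func in tests)
// is injected instead of living in a package-level variable.
func backoffDuration(count uint, jitterFunc func(int) int) time.Duration {
	exp := float64(count)
	if exp > maxBackoffExponent {
		exp = maxBackoffExponent // cap the exponent so the backoff stays bounded
	}
	seconds := int(math.Exp2(exp))
	jittered := 1 + jitterFunc(seconds)
	if jittered > maxBackoffSeconds {
		jittered = maxBackoffSeconds
	}
	return time.Duration(jittered) * time.Second
}
```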
@bobcatfish @vdemeester thanks for the feedback so far. I've updated
Looks good to me
I'll leave the final word to @bobcatfish
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbwsg, vdemeester

The full list of commands accepted by this bot can be found here. The pull request process is described here.
@@ -122,22 +181,22 @@ func (t *TimeoutSet) checkPipelineRunTimeouts(namespace string) {

// CheckTimeouts function iterates through all namespaces and calls corresponding
// taskrun/pipelinerun timeout functions
func (t *TimeoutSet) CheckTimeouts() {
namespaces, err := t.kubeclientset.CoreV1().Namespaces().List(metav1.ListOptions{})
func (t *TimeoutSet) CheckTimeouts(kubeclientset kubernetes.Interface, pipelineclientset clientset.Interface) {
niiiiice thanks for this cleanup with the clients!!! 😍 😍 😍
Looks awesome!! Thanks @sbwsg :D and thanks for the clean up too ^.^ /lgtm
ah @sbwsg, needs a rebase 🙇♂️
TaskRuns create pods. Pods created in a namespace with ResourceQuota can be rejected on the basis of their resource requirements. This rejection manifests in the TaskRun reconciler as an error from the createPod() function. Prior to this commit all errors from createPod() were considered fatal and the associated TaskRun would be marked as failed.

This commit introduces a process for reattempting pod creation in the face of ResourceQuota errors. When a ResourceQuota error is encountered, the TR reconciler now performs the following work:

- The TaskRun is marked as Succeeded/Unknown with reason ExceededResourceQuota and a message including the number of reattempts that have been made
- A reconcile is scheduled for the TaskRun using a backoff w/ jitter strategy

Pod creation will be continually reattempted until the ResourceQuota errors stop or the TaskRun times out.
🤞
/lgtm
Scott has been providing super detailed, helpful reviews for a while now (58 reviews as of https://tekton.devstats.cd.foundation/d/46/pr-reviews-by-contributor?orgId=1&var-period=d&var-repo_name=tektoncd%2Fpipeline&var-reviewers="sbwsg"). He has contributed useful and technically challenging features such as #905 (recreating pods in the face of ResourceQuota errors), drove completion of #936 (graceful sidecar support), and also #871 (enforcing default TaskRun timeouts).
Changes
Fixes #734
TLDR
TaskRuns used to fail immediately if their pods hit the ceiling of a ResourceQuota. Now they don't fail immediately and instead reattempt creating the pod until success or timeout.
Long Version:
TaskRuns create pods. Pods created in a namespace with ResourceQuota can be rejected on the basis of their resource requirements. This rejection manifests in the TaskRun reconciler as an error from the createPod() function.
Prior to this commit all errors from createPod() were considered fatal and the associated TaskRun would be marked as failed. This commit introduces a process for reattempting pod creation in the face of ResourceQuota errors.
When a ResourceQuota error is encountered, the TR reconciler now performs the following work:

- The TaskRun is marked as Succeeded/Unknown with reason ExceededResourceQuota and a message including the number of reattempts that have been made
- A reconcile is scheduled for the TaskRun using a backoff w/ jitter strategy

Pod creation will be continually reattempted until the ResourceQuota errors stop or the TaskRun times out.
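As a condensed sketch of that flow (the helpers isExceededResourceQuotaError, getOrCreateBackoff, markUnknown, markFailed, and enqueueAfter are hypothetical stand-ins for the reconciler's real plumbing, and createPod's signature is simplified):

```go
// Sketch only: how a pod-creation error is handled after this change.
err := createPod(tr)
if err != nil {
	if isExceededResourceQuotaError(err) {
		// Recoverable: leave the TaskRun in Succeeded/Unknown and try again later.
		attempts, nextAttempt := backoffs.getOrCreateBackoff(tr.GetRunKey())
		markUnknown(tr, "ExceededResourceQuota",
			fmt.Sprintf("%v, reattempted %d times", err, attempts))
		// Schedule another reconcile once the jittered backoff deadline passes.
		enqueueAfter(tr, time.Until(nextAttempt))
		return nil
	}
	// Any other pod-creation error is still fatal for the TaskRun.
	markFailed(tr, "BuildCreationFailed", err.Error())
	return err
}
```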
Submitter Checklist
These are the criteria that every PR should meet, please check them off as you review them:
Includes docs (if user facing)
Screenshots
Before
TaskRuns that hit a ResourceQuota ceiling immediately fail out:
After
TaskRuns that hit a ResourceQuota are placed into a Succeeded/Unknown state and retry their pod creation until success or timeout.
Release Notes