
Retry initial backend configuration #3230

Merged: 2 commits into kubernetes:master on Oct 12, 2018

Conversation

coreypobrien (Contributor):

Initially, when nginx starts, it isn't configured. After loading the initial configuration, it starts listening for dynamic configuration. To avoid sending the initial list of backends before nginx is actually listening, the initial setup currently contains a static one-second sleep.

For larger or more complex configurations, the delay before nginx begins listening for backend configuration can exceed one second. In that case there is no retry, so the configuration is never loaded, nginx's health checks fail indefinitely, and it enters a crash loop.

This PR adds a retry with backoff during the initial configuration to handle these longer delays. For most deployments the behavior is identical: sleep one second, then succeed. For the longer ones, if the initial call to /configuration/backends fails, the subsequent retries should eventually apply the backend configuration.
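A minimal sketch of the retry described above (the helper name configureDynamically and the wiring are illustrative assumptions, not the exact merged code; the wait.Backoff values match the ones discussed in the review below):

package main

import (
	"time"

	"github.com/golang/glog"
	"k8s.io/apimachinery/pkg/util/wait"
)

// configureDynamically stands in for the call that POSTs the backend list
// to the local dynamic-configuration endpoint (/configuration/backends).
func configureDynamically() error {
	return nil // hypothetical
}

func initialSync() error {
	// Keep the static one-second sleep so the very first attempt usually
	// succeeds (see the review discussion below).
	time.Sleep(1 * time.Second)

	retry := wait.Backoff{
		Steps:    10,              // give up after 10 attempts
		Duration: 1 * time.Second, // first wait between attempts
		Factor:   1.5,             // grow the wait by 1.5x each step
		Jitter:   0.1,             // add up to 10% randomness
	}
	return wait.ExponentialBackoff(retry, func() (bool, error) {
		if err := configureDynamically(); err != nil {
			glog.Warningf("Dynamic reconfiguration failed: %v", err)
			return false, nil // not done yet; retry after backoff
		}
		return true, nil // backends applied
	})
}

func main() {
	if err := initialSync(); err != nil {
		glog.Errorf("Unexpected failure configuring NGINX: %v", err)
	}
}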

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 12, 2018
@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Oct 12, 2018
aledbf (Member) commented Oct 12, 2018:

@coreypobrien this makes sense, but please use this approach: https://github.com/kubernetes/ingress-nginx/blob/master/cmd/nginx/main.go#L205-L227

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 12, 2018
aledbf (Member) commented Oct 12, 2018:

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 12, 2018
aledbf (Member) commented Oct 12, 2018:

@coreypobrien thanks!

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 12, 2018
ElvinEfendi (Member):

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 12, 2018
// start listening on the configured port (default 18080)
// For large configurations it might take a while so we loop
// and back off
steps = 10
time.Sleep(1 * time.Second)
Member:

I was wishing we could get rid of this with this PR.

coreypobrien (Contributor, author):

We definitely could, but without the initial sleep there will always be at least one failure. To avoid always failing on the first attempt, I think we should leave the initial one-second sleep in.

Member:

@coreypobrien what do you think about getting rid of isFirstSync completely? We can unconditionally set steps to 10. IMO it's OK for configureDynamically to fail a few times during the first sync.

That way we can simplify this code and also make sure we have retry logic for subsequent reconfigurations as well.
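A sketch of that suggestion (the single-attempt default outside the first sync is an assumption; only the isFirstSync conditional itself is confirmed by this thread):

// Before (assumed shape): only the first sync retries.
steps := 1
if isFirstSync {
	steps = 10
}

// After (the suggestion): every reconfiguration retries the same way,
// so steps can be set unconditionally.
steps := 10
retry := wait.Backoff{
	Steps:    steps,
	Duration: 1 * time.Second,
	Factor:   1.5,
	Jitter:   0.1,
}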

}
glog.Warningf("Dynamic reconfiguration failed: %v", err)
return false, nil
})
Member:

When this still fails after all the steps, can you make sure we log an error? So that people can set up their alerts accordingly.
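For instance (a sketch, reusing the retry settings above, with the attempt closure named condition here for brevity; wait.ExponentialBackoff returns wait.ErrWaitTimeout once Steps is exhausted):

if err := wait.ExponentialBackoff(retry, condition); err != nil {
	// Error level rather than warning, so alerts can key off it.
	glog.Errorf("Unexpected failure reconfiguring NGINX: %v", err)
}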

// start listening on the configured port (default 18080)
// For large configurations it might take a while so we loop
// and back off
steps = 10
Member:

With 10 steps (and other retry configuration below), what's the longest time it can potentially take to finally give up and fail?

ElvinEfendi (Member), Oct 12, 2018:

try 1
wait < rand(1, 1.1) * 1.5 ≤ 1.65
try 2
wait < prev_wait + rand(1, 1.1) * 1.5 ≤ 1.65 * 2
try 3
wait ≤ 1.65 * 3
try 4
wait ≤ 1.65 * 4
...
try 9
wait ≤ 1.65 * 9
try 10 (last time)
exit

So by the time it's trying for the n'th time, it would have waited at most 1.65 * (1 + 2 + ... + (n - 1)) seconds. Setting n to 10, that is 1.65 * 45 ≈ 74 seconds before it fails.

Can you confirm it will work like above? Is this realistic? It seems to me setting steps to 10 is a bit too much. It can mask a real problem instead of failing fast.

aledbf (Member), Oct 12, 2018:

@ElvinEfendi here:

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	retry := wait.Backoff{
		Steps:    10,
		Duration: 1 * time.Second,
		Factor:   1.5,
		Jitter:   0.1,
	}

	start := time.Now()
	// Return (false, nil) on every attempt so all 10 steps run; print the
	// elapsed time at each attempt.
	wait.ExponentialBackoff(retry, func() (bool, error) {
		fmt.Printf("%s\n", time.Since(start).Round(time.Millisecond))
		return false, nil
	})

	fmt.Printf("Total time: %v\n", time.Since(start).Round(time.Millisecond))
}
go run main.go 
0s
1.061s
2.702s
5.102s
8.625s
13.902s
22.018s
33.483s
50.837s
1m16.715s
Total time: 1m16.715s
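(A sanity check of these numbers: wait.ExponentialBackoff sleeps the current Duration, adjusted upward by up to Jitter (10% here), then multiplies it by Factor before the next attempt. Ten steps therefore means nine sleeps of roughly 1.5^k seconds for k = 0..8, which sum to (1.5^9 - 1) / (1.5 - 1) ≈ 74.9s before jitter, consistent with the measured 1m16.7s.)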

I think five steps is more than enough

Member:

Or use a factor of 1.1

go run main.go 
0s
1.061s
2.264s
3.555s
4.944s
6.471s
8.192s
9.975s
11.955s
14.119s
Total time: 14.119s
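(The same check for a factor of 1.1: nine sleeps summing to (1.1^9 - 1) / (1.1 - 1) ≈ 13.6s before jitter, matching the measured 14.119s.)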

Member:

thanks @aledbf

ElvinEfendi (Member):

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 12, 2018
ElvinEfendi (Member):

We will address the above suggestions in a subsequent PR.

ElvinEfendi (Member):

/lgtm

k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aledbf, coreypobrien, ElvinEfendi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 9af9ef5 into kubernetes:master Oct 12, 2018
@coreypobrien coreypobrien deleted the backoff-dynamic-config branch October 12, 2018 18:45