
Retry initial backend configuration #3230

Merged: 2 commits into kubernetes:master on Oct 12, 2018

Conversation

coreypobrien (Contributor):

Initially, when nginx starts, it isn't configured. After loading the initial configuration, it starts listening for dynamic configuration. To avoid sending the initial list of backends before nginx is actually listening, the initial setup currently contains a static one-second sleep.

For larger or more complex configurations, the delay before nginx begins listening for backend configuration can exceed one second. In that case there is no retry, so the configuration is never loaded, nginx's health checks fail indefinitely, and it enters a crash loop.

This PR adds a retry with backoff during the initial configuration to handle these longer delays. For most deployments the behavior is identical: sleep one second, then succeed. For the longer ones, if the initial call to /configuration/backends fails, the subsequent retries should eventually apply the backend configuration.
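A minimal sketch of the retry described above (the helper name configureDynamically and the wiring are illustrative assumptions, not the exact merged code; the wait.Backoff values match the ones discussed in the review below):

package main

import (
	"time"

	"github.com/golang/glog"
	"k8s.io/apimachinery/pkg/util/wait"
)

// configureDynamically stands in for the call that POSTs the backend list
// to the local dynamic-configuration endpoint (/configuration/backends).
func configureDynamically() error {
	return nil // hypothetical
}

func initialSync() error {
	// Keep the static one-second sleep so the very first attempt usually
	// succeeds (see the review discussion below).
	time.Sleep(1 * time.Second)

	retry := wait.Backoff{
		Steps:    10,              // give up after 10 attempts
		Duration: 1 * time.Second, // first wait between attempts
		Factor:   1.5,             // grow the wait by 1.5x each step
		Jitter:   0.1,             // add up to 10% randomness
	}
	return wait.ExponentialBackoff(retry, func() (bool, error) {
		if err := configureDynamically(); err != nil {
			glog.Warningf("Dynamic reconfiguration failed: %v", err)
			return false, nil // not done yet; retry after backoff
		}
		return true, nil // backends applied
	})
}

func main() {
	if err := initialSync(); err != nil {
		glog.Errorf("Unexpected failure configuring NGINX: %v", err)
	}
}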

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 12, 2018
@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Oct 12, 2018
aledbf (Member) commented Oct 12, 2018:

@coreypobrien this makes sense, but please use this approach: https://github.com/kubernetes/ingress-nginx/blob/master/cmd/nginx/main.go#L205-L227

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 12, 2018
aledbf (Member) commented Oct 12, 2018:

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 12, 2018
aledbf (Member) commented Oct 12, 2018:

@coreypobrien thanks!

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 12, 2018
ElvinEfendi (Member):

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 12, 2018
// start listening on the configured port (default 18080)
// For large configurations it might take a while so we loop
// and back off
steps = 10
time.Sleep(1 * time.Second)
Member:

I was wishing we could get rid of this with this PR.

coreypobrien (Contributor, author):

We definitely could, but without the initial sleep there will always be at least one failure. To avoid always failing on the first attempt, I think we should leave the initial one-second sleep in.

Member:

@coreypobrien what do you think about getting rid of isFirstSync completely? We can unconditionally set steps to 10. IMO it's OK for configureDynamically to fail a few times during the first sync.

That way we can simplify this code and also make sure we have retry logic for subsequent reconfigurations as well.
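A sketch of that suggestion (the single-attempt default outside the first sync is an assumption; only the isFirstSync conditional itself is confirmed by this thread):

// Before (assumed shape): only the first sync retries.
steps := 1
if isFirstSync {
	steps = 10
}

// After (the suggestion): every reconfiguration retries the same way,
// so steps can be set unconditionally.
steps := 10
retry := wait.Backoff{
	Steps:    steps,
	Duration: 1 * time.Second,
	Factor:   1.5,
	Jitter:   0.1,
}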

}
glog.Warningf("Dynamic reconfiguration failed: %v", err)
return false, nil
})
Member:

When this still fails after all the steps, can you make sure we log an error? So that people can set up their alerts accordingly.
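For instance (a sketch, reusing the retry settings above, with the attempt closure named condition here for brevity; wait.ExponentialBackoff returns wait.ErrWaitTimeout once Steps is exhausted):

if err := wait.ExponentialBackoff(retry, condition); err != nil {
	// Error level rather than warning, so alerts can key off it.
	glog.Errorf("Unexpected failure reconfiguring NGINX: %v", err)
}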

// start listening on the configured port (default 18080)
// For large configurations it might take a while so we loop
// and back off
steps = 10
Member:

With 10 steps (and other retry configuration below), what's the longest time it can potentially take to finally give up and fail?

ElvinEfendi (Member), Oct 12, 2018:

try 1
wait < rand(1, 1.1) * 1.5 ≤ 1.65
try 2
wait < prev_wait + rand(1, 1.1) * 1.5 ≤ 1.65 * 2
try 3
wait ≤ 1.65 * 3
try 4
wait ≤ 1.65 * 4
...
try 9
wait ≤ 1.65 * 9
try 10 (last time)
exit

So by the time it's trying for the n'th time, it would have waited at most 1.65 * (1 + 2 + ... + (n - 1)) seconds. Setting n to 10, that is 1.65 * 45 ≈ 74 seconds before it fails.

Can you confirm it will work like above? Is this realistic? It seems to me setting steps to 10 is a bit too much. It can mask a real problem instead of failing fast.

aledbf (Member), Oct 12, 2018:

@ElvinEfendi here:

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	retry := wait.Backoff{
		Steps:    10,
		Duration: 1 * time.Second,
		Factor:   1.5,
		Jitter:   0.1,
	}

	start := time.Now()
	// Return (false, nil) on every attempt so all 10 steps run; print the
	// elapsed time at each attempt.
	wait.ExponentialBackoff(retry, func() (bool, error) {
		fmt.Printf("%s\n", time.Since(start).Round(time.Millisecond))
		return false, nil
	})

	fmt.Printf("Total time: %v\n", time.Since(start).Round(time.Millisecond))
}
go run main.go 
0s
1.061s
2.702s
5.102s
8.625s
13.902s
22.018s
33.483s
50.837s
1m16.715s
Total time: 1m16.715s
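(A sanity check of these numbers: wait.ExponentialBackoff sleeps the current Duration, adjusted upward by up to Jitter (10% here), then multiplies it by Factor before the next attempt. Ten steps therefore means nine sleeps of roughly 1.5^k seconds for k = 0..8, which sum to (1.5^9 - 1) / (1.5 - 1) ≈ 74.9s before jitter, consistent with the measured 1m16.7s.)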

I think five steps is more than enough

Member:

Or use a factor of 1.1

go run main.go 
0s
1.061s
2.264s
3.555s
4.944s
6.471s
8.192s
9.975s
11.955s
14.119s
Total time: 14.119s
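(The same check for a factor of 1.1: nine sleeps summing to (1.1^9 - 1) / (1.1 - 1) ≈ 13.6s before jitter, matching the measured 14.119s.)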

Member:

thanks @aledbf

ElvinEfendi (Member):

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 12, 2018
ElvinEfendi (Member):

We will address the above suggestions in a subsequent PR.

ElvinEfendi (Member):

/lgtm

k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aledbf, coreypobrien, ElvinEfendi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 9af9ef5 into kubernetes:master Oct 12, 2018
@coreypobrien coreypobrien deleted the backoff-dynamic-config branch October 12, 2018 18:45