stop etcd from retrying failures #6529
Conversation
adds logging for #6447
Force-pushed from 8e63b27 to 10d3b65
A few thoughts. I don't think "retry on 50x because leader election might be happening" makes sense on a single-node cluster... though I could be wrong; maybe even a single-node cluster re-elects itself regularly. For our single-node case, if we fail rather than retry, our tests will still fail, but with a surfaced 50x error instead of the failed retry. Easier to debug (maybe), but still a failure.
@liggitt do you think it makes sense to retry on failures in a multi-node cluster? Why not let the call fail explicitly?
I thought etcd had a "try again" mechanic for high-load scenarios. We'd need to understand what all of those are, as well as the changes coming from upstream with the new etcd client.
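The "try again" mechanic referred to here is, roughly, a bounded check-retry policy. The sketch below is illustrative only (the function name, attempt bound, and backoff are assumptions, not go-etcd's actual code), but it shows the shape: give up after a few attempts across the member list, otherwise back off briefly on 5xx responses and retry.

```go
package etcdutil

import (
	"fmt"
	"net/http"
	"time"

	etcdclient "github.com/coreos/go-etcd/etcd"
)

// boundedCheckRetry is an illustrative "try again" policy: give up once every
// member has been tried roughly twice, otherwise pause briefly on server
// errors (e.g. while a leader election settles) and signal the client to retry.
func boundedCheckRetry(cluster *etcdclient.Cluster, numReqs int, lastResp http.Response, err error) error {
	if numReqs > 2*len(cluster.Machines) {
		return fmt.Errorf("giving up after %d attempts across %v: %v", numReqs, cluster.Machines, err)
	}
	if lastResp.StatusCode >= http.StatusInternalServerError {
		time.Sleep(200 * time.Millisecond)
	}
	return nil // a nil return tells the go-etcd client to retry the request
}
```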
Figures. Wrong flake: UI https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin/8156/
re[test] @jwforres
@spadgett same error you saw, but the screenshot is different.
I think I saw that on the way through. It was just a retry check for transport-level problems. I can refactor the etcdclient to maintain that and avoid the generic retry.
Alright, I've got a bunch of these from the e2e run at https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin/8164
@liggitt when you were looking at this before, you added checks for which PIDs were running. Did you also add checks for how much CPU and IOPS they were using, and did that merge into master? The timeouts checked in https://github.com/openshift/origin/blob/master/Godeps/_workspace/src/github.com/coreos/etcd/etcdserver/server.go#L1007 are tight (https://github.com/openshift/origin/blob/master/Godeps/_workspace/src/github.com/coreos/etcd/etcdserver/server.go#L993 and https://github.com/openshift/origin/blob/master/Godeps/_workspace/src/github.com/coreos/etcd/etcdserver/server.go#L65 as examples), but not unreasonable. I'd like to rule out any weird starvation problem.
@ironcladlou this is the pull that disables the retry
Force-pushed from 10d3b65 to f785f05
@liggitt this now fails on the first error if the etcd cluster only has one member. I'd like to have this just to stop losing the original error.
// NeverRetryOnFailure is a retry function for the etcdClient. If there's only one machine, master election doesn't make much sense,
// so we don't bother to retry; we simply dump the failure and return the error directly.
func NeverRetryOnFailure(cluster *etcdclient.Cluster, numReqs int, lastResp http.Response, err error) error {
	if len(cluster.Machines) > 1 {
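For context, a minimal sketch of how a hook like this could be completed and installed, assuming go-etcd's exported Client.CheckRetry field and its DefaultCheckRetry fallback; the package name, log calls, and messages are illustrative rather than the actual change in this PR:

```go
package etcdutil

import (
	"fmt"
	"net/http"

	etcdclient "github.com/coreos/go-etcd/etcd"
	"github.com/golang/glog"
)

// NeverRetryOnFailure surfaces the first failure for single-member clusters,
// where waiting out a leader election is not a useful reason to retry.
// Multi-member clusters fall back to the library's default policy.
func NeverRetryOnFailure(cluster *etcdclient.Cluster, numReqs int, lastResp http.Response, err error) error {
	if len(cluster.Machines) > 1 {
		return etcdclient.DefaultCheckRetry(cluster, numReqs, lastResp, err)
	}
	glog.V(4).Infof("etcd request failed (status %d) after %d attempt(s): %v", lastResp.StatusCode, numReqs, err)
	if err != nil {
		return err
	}
	return fmt.Errorf("etcd request failed with status %d", lastResp.StatusCode)
}

// Installing the hook on a client (illustrative):
//   client := etcdclient.NewClient(machines)
//   client.CheckRetry = NeverRetryOnFailure
```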
I don't disagree, but this is going to make things worse before they get better... it means that GET retries that are currently succeeding will now fail
You'd think, but see the chain from #6529 (comment). The errors that we're currently getting are only possible on mutation requests. You're saying that you suspect we've gotten read errors before, but they've been different read errors that have always been recoverable? I'd claim that's thin, but I will admit it's possible.
Forgive me if this conclusion was already drawn, but it seems like just adding logging on failure serves the immediate need (revealing currently hidden errors) without risking a behavioral change by eliminating retries.
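For comparison, a log-only variant along those lines, which surfaces each failure but keeps the default retry behavior, might look roughly like this (a sketch; the function name and log message are hypothetical, assuming the same go-etcd hook as above):

```go
import (
	"net/http"

	etcdclient "github.com/coreos/go-etcd/etcd"
	"github.com/golang/glog"
)

// LogAndRetryOnFailure records each failed attempt before delegating to the
// library's default retry policy, so errors become visible without changing behavior.
func LogAndRetryOnFailure(cluster *etcdclient.Cluster, numReqs int, lastResp http.Response, err error) error {
	glog.Warningf("etcd request attempt %d failed (status %d): %v", numReqs, lastResp.StatusCode, err)
	return etcdclient.DefaultCheckRetry(cluster, numReqs, lastResp, err)
}
```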
Just organizing my thoughts and creating some links so I can backtrack more easily.
The latest failure looks like an ordinary build flake. We may have been hitting this for a while: https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin/8187/
re[test] @liggitt I still want this. Do you want it formatted differently?
re[test] to check jenkins job |
@danmcp Does this mean anything to you? https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin/8730/consoleText
@deads2k Yeah, it's harmless afaik. Jenkins is just updating the link to https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin/lastSuccessfulBuild/ and it's hitting a race condition. It might mean the link will be wrong for a while, but it should recover.
@liggitt we really need this to be able to track flakes. Comments if you've got 'em.
re[test] |
Evaluated for origin test up to f785f05
continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/130/)
IRC agreement from @liggitt. [merge]
continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_origin/4654/) (Image: devenv-rhel7_3178)
Evaluated for origin merge up to f785f05
Merged by openshift-bot
Stop the etcdclient from retrying on our behalf and add logging to it. We might actually want to do this, but for now it's for gathering info about flakes.
@liggitt @smarterclayton opinions? I guess I'll see how often this is actually happening for us, but I like the idea of trying once and failing instead of trying multiple times.
[test]