
stop etcd from retrying failures #6529

Merged 1 commit into openshift:master on Jan 19, 2016

Conversation

@deads2k (Contributor) commented Jan 4, 2016

Stop the etcdclient from retrying on our behalf and add logging to it. We might actually want to keep this behavior, but for now it's for gathering info about flakes.

@liggitt @smarterclayton opinions? I guess I'll see how often this is actually happening for us, but I like the idea of trying once and failing instead of trying multiple times.

[test]
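
For context, a minimal sketch of the wiring this describes, assuming go-etcd's assignable CheckRetry hook with the signature used later in this diff (the helper name and logging here are illustrative, not the exact code in this pull):

    package etcdutil

    import (
        "fmt"
        "log"
        "net/http"

        etcdclient "github.com/coreos/go-etcd/etcd"
    )

    // newNonRetryingClient returns a go-etcd client whose retry hook logs the
    // failed response and surfaces the error instead of trying other machines.
    func newNonRetryingClient(machines []string) *etcdclient.Client {
        client := etcdclient.NewClient(machines)
        client.CheckRetry = func(cluster *etcdclient.Cluster, numReqs int, lastResp http.Response, err error) error {
            if err != nil {
                log.Printf("etcd request error: %v", err)
                return err
            }
            log.Printf("etcd failure response: %s", lastResp.Status)
            return fmt.Errorf("etcd failure response: %s", lastResp.Status)
        }
        return client
    }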

@deads2k (Contributor, Author) commented Jan 4, 2016

adds logging for #6447

@liggitt (Contributor) commented Jan 4, 2016

A few thoughts. I don't think "retry on 50x because leader election might be happening" makes sense on a single-node cluster... though I could be wrong... maybe even a single-node cluster re-elects itself regularly.

For our single-node case, if we fail rather than retry, our tests will still fail, but with a surfaced 50x error instead of the failed retry. Easier to debug (maybe), but still a failure.

@deads2k (Contributor, Author) commented Jan 4, 2016

@liggitt do you think it makes sense to retry on failures in a multi-node cluster? Why not let the call fail explicitly?

@smarterclayton (Contributor) commented Jan 4, 2016 via email (quoted in the reply below)

@deads2k (Contributor, Author) commented Jan 4, 2016

figures. Wrong flake: UI https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin/8156/

1)  unauthenticated user should be able to log in
  - Error: Wait timed out after 3876ms
      at /data/src/github.com/openshift/origin/assets/node_modules/protractor/node_modules/selenium-webdriver/lib/webdriver/promise.js:1425:29
    at /data/src/github.com/openshift/origin/assets/node_modules/protractor/node_modules/selenium-webdriver/lib/goog/base.js:1582:15
    at [object Object].webdriver.promise.ControlFlow.runInNewFrame_ (/data/src/github.com/openshift/origin/assets/node_modules/protractor/node_modules/selenium-webdriver/lib/webdriver/promise.js:1654:20)
    at notify (/data/src/github.com/openshift/origin/assets/node_modules/protractor/node_modules/selenium-webdriver/lib/webdriver/promise.js:465:12)

re[test] @jwforres

@jwforres (Member) commented Jan 4, 2016

@spadgett same error you saw but the screenshot is different

@deads2k (Contributor, Author) commented Jan 5, 2016

    I thought etcd had a "try again" mechanic for high load scenarios. We'd need to understand what all of those are, as well as the changes coming from upstream with the new etcd client.

I think I saw that on the way through. It was just a retry check on transport-level problems. I can refactor the etcdclient to maintain that and avoid the generic retry.
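
A rough sketch of what that refactor could look like (illustrative, not the code that landed): keep retrying only when the request never produced a response at all, and surface anything the server actually answered with:

    package etcdutil

    import (
        "fmt"
        "net/http"

        etcdclient "github.com/coreos/go-etcd/etcd"
    )

    // retryOnlyOnTransportErrors keeps the "try the next machine" behavior for
    // connection-level failures, but fails fast as soon as the server actually
    // answers with an error status, so the original error isn't masked.
    func retryOnlyOnTransportErrors(cluster *etcdclient.Cluster, numReqs int, lastResp http.Response, err error) error {
        // Cap the retries once every machine has been tried.
        if numReqs > len(cluster.Machines) {
            return fmt.Errorf("etcd not reachable after %d attempts", numReqs)
        }
        // A zero StatusCode means no response came back at all (connection
        // refused, transport timeout, ...); trying another machine is reasonable.
        if lastResp.StatusCode == 0 {
            return nil // nil tells the client to retry
        }
        // The server responded with an error; return it instead of retrying.
        return fmt.Errorf("etcd failure response: %s", lastResp.Status)
    }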

@deads2k (Contributor, Author) commented Jan 5, 2016

Alright, I've got a bunch of these from the e2e run at https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin/8164

E0104 23:06:58.584194   24011 etcd.go:144] etcd failure response: HTTP/1.1 500 Internal Server Error
Content-Length: 100
Content-Type: application/json
Date: Mon, 04 Jan 2016 23:06:58 GMT
X-Etcd-Cluster-Id: 2a228130bae82796
X-Etcd-Index: 0

{"errorCode":300,"message":"Raft Internal Error","cause":"etcdserver: request timed out","index":0}

@deads2k (Contributor, Author) commented Jan 5, 2016

@liggitt when you were looking at this before, you added checks for which PIDs were running. Did you also add checks for how much CPU and IOPS they were using, and did that merge into master?

The timeouts checked in https://github.com/openshift/origin/blob/master/Godeps/_workspace/src/github.com/coreos/etcd/etcdserver/server.go#L1007 are tight (https://github.com/openshift/origin/blob/master/Godeps/_workspace/src/github.com/coreos/etcd/etcdserver/server.go#L993 and https://github.com/openshift/origin/blob/master/Godeps/_workspace/src/github.com/coreos/etcd/etcdserver/server.go#L65 as an example), but not unreasonable. I'd like to rule out any weird starvation problem.

@deads2k (Contributor, Author) commented Jan 5, 2016

@ironcladlou this is the pull that disables the retry

@deads2k (Contributor, Author) commented Jan 5, 2016

@liggitt this now fails on the first error if the etcd cluster only has one member. I'd like to have this just to stop losing the original error.

// NeverRetryOnFailure is a retry function for the etcdClient. If there's only one machine, master election doesn't make much sense,
// so we don't bother to retry, we simply dump the failure and return the error directly.
func NeverRetryOnFailure(cluster *etcdclient.Cluster, numReqs int, lastResp http.Response, err error) error {
if len(cluster.Machines) > 1 {

Review comment (Contributor) on the diff above:

I don't disagree, but this is going to make things worse before they get better... it means that GET retries that are currently succeeding will now fail

Reply from @deads2k (Contributor, Author):

    I don't disagree, but this is going to make things worse before they get better... it means that GET retries that are currently succeeding will now fail

You'd think, but see the chain from #6529 (comment). The errors that we're currently getting are only possible on mutation requests. You're saying that you suspect we've gotten read errors before, but that they've been different read errors that have always been recoverable? I'd claim that's thin, but I'll admit it's possible.

@ironcladlou (Contributor) commented

Forgive me if this conclusion was already drawn, but it seems like just adding logging on failure serves the immediate need (revealing currently hidden errors) without risking a behavioral change by eliminating retries.
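
If we went that route, the minimal version is probably a wrapper that logs and then defers to the stock policy; a sketch, assuming go-etcd's exported DefaultCheckRetry (the function name here is illustrative):

    package etcdutil

    import (
        "log"
        "net/http"

        etcdclient "github.com/coreos/go-etcd/etcd"
    )

    // logThenDefaultRetry surfaces the hidden failure in the logs and then falls
    // back to go-etcd's stock policy, so retry behavior itself is unchanged.
    func logThenDefaultRetry(cluster *etcdclient.Cluster, numReqs int, lastResp http.Response, err error) error {
        if err != nil {
            log.Printf("etcd request error (attempt %d): %v", numReqs, err)
        } else {
            log.Printf("etcd failure response (attempt %d): %s", numReqs, lastResp.Status)
        }
        return etcdclient.DefaultCheckRetry(cluster, numReqs, lastResp, err)
    }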

@deads2k (Contributor, Author) commented Jan 5, 2016

Just organizing my thoughts and creating some links so I can backtrack more easily.

  1. The name is weird, but I'm pretty sure that this https://github.com/openshift/origin/blob/master/Godeps/_workspace/src/github.com/coreos/etcd/etcdserver/etcdhttp/client.go#L139 is the method for handling http requests to etcd.
  2. On a mutation, you end up here: https://github.com/openshift/origin/blob/master/Godeps/_workspace/src/github.com/coreos/etcd/etcdserver/server.go#L541
  3. That call launches some async work (server side), but if something goes wrong, you end up here: https://github.com/openshift/origin/blob/master/Godeps/_workspace/src/github.com/coreos/etcd/etcdserver/server.go#L554 (still server-side).
  4. This is worthwhile, because this pull shows us (client-side), the actual error string "cause":"etcdserver: request timed out", which only originates from that parseProposeCtxErr here: https://github.com/openshift/origin/blob/master/Godeps/_workspace/src/github.com/coreos/etcd/etcdserver/server.go#L1017.
  5. It gets dropped onto a channel here: https://github.com/openshift/origin/blob/master/Godeps/_workspace/src/github.com/coreos/etcd/raft/node.go#L387
  6. It gets popped from the channel here: https://github.com/openshift/origin/blob/master/Godeps/_workspace/src/github.com/coreos/etcd/raft/node.go#L279-L281 and this part must succeed or no action would be performed.
  7. That eventually ends up here: https://github.com/openshift/origin/blob/master/Godeps/_workspace/src/github.com/coreos/etcd/raft/raft.go#L509-L522
  8. To end up in the case we've hit, that work in raft.go must have taken too long but eventually succeeded. Likely candidates are the disk write and the broadcast. I'll probably add some timing/trace points there to see if something goes bonkers; a rough sketch of such a trace point follows this list.
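
A rough sketch of the trace point mentioned in (8), wrapping a suspect step and warning when it blows past a budget (names and threshold are illustrative):

    package raftdebug

    import (
        "log"
        "time"
    )

    // timed runs fn and logs a warning when it takes longer than threshold, so a
    // slow disk write or broadcast shows up next to the "request timed out" error.
    func timed(name string, threshold time.Duration, fn func() error) error {
        start := time.Now()
        err := fn()
        if d := time.Since(start); d > threshold {
            log.Printf("W: %s took %v (budget %v), err: %v", name, d, threshold, err)
        }
        return err
    }

    // e.g. err := timed("wal write", 100*time.Millisecond, func() error { return saveEntries() })
    // where saveEntries stands in for the real disk write.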

@deads2k (Contributor, Author) commented Jan 5, 2016

The latest failure looks like an ordinary build flake. We may have been hitting this for a while: https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin/8187/

@deads2k (Contributor, Author) commented Jan 11, 2016

re[test]

@liggitt I still want this. Do you want it formatted differently?

@deads2k (Contributor, Author) commented Jan 18, 2016

re[test] to check jenkins job

@deads2k (Contributor, Author) commented Jan 18, 2016

@danmcp Does this mean anything to you? https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin/8730/consoleText

ln builds/lastSuccessfulBuild /var/lib/jenkins/jobs/test_pull_requests_origin/lastSuccessful failed
java.nio.file.FileAlreadyExistsException: /var/lib/jenkins/jobs/test_pull_requests_origin/lastSuccessful
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:88)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
    at sun.nio.fs.UnixFileSystemProvider.createSymbolicLink(UnixFileSystemProvider.java:457)
    at java.nio.file.Files.createSymbolicLink(Files.java:1043)
    at sun.reflect.GeneratedMethodAccessor1553.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at hudson.Util.createSymlinkJava7(Util.java:1227)
    at hudson.Util.createSymlink(Util.java:1143)
    at hudson.model.Run.createSymlink(Run.java:1845)
    at hudson.model.Run.updateSymlinks(Run.java:1826)
    at hudson.model.Run.execute(Run.java:1739)
    at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
    at hudson.model.ResourceController.execute(ResourceController.java:98)
    at hudson.model.Executor.run(Executor.java:408)

@danmcp commented Jan 18, 2016

@deads2k Yeah, it's harmless afaik. Jenkins is just updating the link to:

https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin/lastSuccessfulBuild/

And it's hitting a race condition. It might mean the link will be wrong for a while. But it should recover.

@deads2k (Contributor, Author) commented Jan 19, 2016

@liggitt we really need this to be able to track flakes. Comments if you've got 'em.

@deads2k (Contributor, Author) commented Jan 19, 2016

re[test]

@openshift-bot (Contributor)

Evaluated for origin test up to f785f05

@openshift-bot (Contributor)

continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/130/)

@deads2k (Contributor, Author) commented Jan 19, 2016

IRC agreement from @liggitt. [merge]

@openshift-bot (Contributor)

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_origin/4654/) (Image: devenv-rhel7_3178)

@openshift-bot (Contributor)

Evaluated for origin merge up to f785f05

openshift-bot pushed a commit that referenced this pull request Jan 19, 2016
@openshift-bot merged commit 07529cf into openshift:master on Jan 19, 2016
@deads2k deleted the log-etcd-failures branch on February 26, 2016