
stop etcd from retrying failures #6529

Merged 1 commit into openshift:master on Jan 19, 2016

Conversation

@deads2k (Contributor) commented Jan 4, 2016

Stop the etcdclient from retrying on our behalf and add logging to it. We might actually want to keep this behavior, but for now it's for gathering info about flakes.

@liggitt @smarterclayton opinions? I guess I'll see how often this is actually happening for us, but I like the idea of trying once and failing instead of trying multiple times.

[test]
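
For context, a minimal sketch of the wiring this describes, assuming go-etcd's assignable CheckRetry hook with the signature used later in this diff (the helper name and logging here are illustrative, not the exact code in this pull):

    package etcdutil

    import (
        "fmt"
        "log"
        "net/http"

        etcdclient "github.com/coreos/go-etcd/etcd"
    )

    // newNonRetryingClient returns a go-etcd client whose retry hook logs the
    // failed response and surfaces the error instead of trying other machines.
    func newNonRetryingClient(machines []string) *etcdclient.Client {
        client := etcdclient.NewClient(machines)
        client.CheckRetry = func(cluster *etcdclient.Cluster, numReqs int, lastResp http.Response, err error) error {
            if err != nil {
                log.Printf("etcd request error: %v", err)
                return err
            }
            log.Printf("etcd failure response: %s", lastResp.Status)
            return fmt.Errorf("etcd failure response: %s", lastResp.Status)
        }
        return client
    }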

@deads2k (Contributor, Author) commented Jan 4, 2016

adds logging for #6447

@liggitt (Contributor) commented Jan 4, 2016

A few thoughts. I don't think "retry on 50x because leader election might be happening" makes sense on a single-node cluster... though I could be wrong... maybe even a single-node cluster re-elects itself regularly.

For our single-node case, if we fail rather than retry, our tests will still fail, but with a surfaced 50x error instead of the failed retry. Easier to debug (maybe), but still a failure.

@deads2k (Contributor, Author) commented Jan 4, 2016

@liggitt do you think it makes sense to retry on failures in a multi-node cluster? Why not let the call fail explicitly?

@smarterclayton (Contributor) commented Jan 4, 2016 via email (quoted in the reply below)

@deads2k (Contributor, Author) commented Jan 4, 2016

figures. Wrong flake: UI https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin/8156/

1)  unauthenticated user should be able to log in
  - Error: Wait timed out after 3876ms
      at /data/src/github.com/openshift/origin/assets/node_modules/protractor/node_modules/selenium-webdriver/lib/webdriver/promise.js:1425:29
    at /data/src/github.com/openshift/origin/assets/node_modules/protractor/node_modules/selenium-webdriver/lib/goog/base.js:1582:15
    at [object Object].webdriver.promise.ControlFlow.runInNewFrame_ (/data/src/github.com/openshift/origin/assets/node_modules/protractor/node_modules/selenium-webdriver/lib/webdriver/promise.js:1654:20)
    at notify (/data/src/github.com/openshift/origin/assets/node_modules/protractor/node_modules/selenium-webdriver/lib/webdriver/promise.js:465:12)

re[test] @jwforres

@jwforres (Member) commented Jan 4, 2016

@spadgett same error you saw but the screenshot is different

@deads2k (Contributor, Author) commented Jan 5, 2016

    I thought etcd had a "try again" mechanic for high load scenarios. We'd need to understand what all of those are, as well as the changes coming from upstream with the new etcd client.

I think I saw that on the way through. It was just a retry check on transport-level problems. I can refactor the etcdclient to maintain that and avoid the generic retry.
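
A rough sketch of what that refactor could look like (illustrative, not the code that landed): keep retrying only when the request never produced a response at all, and surface anything the server actually answered with:

    package etcdutil

    import (
        "fmt"
        "net/http"

        etcdclient "github.com/coreos/go-etcd/etcd"
    )

    // retryOnlyOnTransportErrors keeps the "try the next machine" behavior for
    // connection-level failures, but fails fast as soon as the server actually
    // answers with an error status, so the original error isn't masked.
    func retryOnlyOnTransportErrors(cluster *etcdclient.Cluster, numReqs int, lastResp http.Response, err error) error {
        // Cap the retries once every machine has been tried.
        if numReqs > len(cluster.Machines) {
            return fmt.Errorf("etcd not reachable after %d attempts", numReqs)
        }
        // A zero StatusCode means no response came back at all (connection
        // refused, transport timeout, ...); trying another machine is reasonable.
        if lastResp.StatusCode == 0 {
            return nil // nil tells the client to retry
        }
        // The server responded with an error; return it instead of retrying.
        return fmt.Errorf("etcd failure response: %s", lastResp.Status)
    }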

@deads2k (Contributor, Author) commented Jan 5, 2016

Alright, I've got a bunch of these from the e2e run at https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin/8164

E0104 23:06:58.584194   24011 etcd.go:144] etcd failure response: HTTP/1.1 500 Internal Server Error
Content-Length: 100
Content-Type: application/json
Date: Mon, 04 Jan 2016 23:06:58 GMT
X-Etcd-Cluster-Id: 2a228130bae82796
X-Etcd-Index: 0

{"errorCode":300,"message":"Raft Internal Error","cause":"etcdserver: request timed out","index":0}

@deads2k (Contributor, Author) commented Jan 5, 2016

@liggitt when you were looking at this before, you added checks for which PIDs were running. Did you also add checks for how much CPU and IOPS they were using, and did that merge into master?

The timeouts checked in https://github.com/openshift/origin/blob/master/Godeps/_workspace/src/github.com/coreos/etcd/etcdserver/server.go#L1007 are tight (https://github.com/openshift/origin/blob/master/Godeps/_workspace/src/github.com/coreos/etcd/etcdserver/server.go#L993 and https://github.com/openshift/origin/blob/master/Godeps/_workspace/src/github.com/coreos/etcd/etcdserver/server.go#L65 as an example), but not unreasonable. I'd like to rule out any weird starvation problem.

@deads2k (Contributor, Author) commented Jan 5, 2016

@ironcladlou this is the pull that disables the retry

@deads2k (Contributor, Author) commented Jan 5, 2016

@liggitt this now fails on the first error if the etcd cluster only has one member. I'd like to have this just to stop losing the original error.

// NeverRetryOnFailure is a retry function for the etcdClient. If there's only one machine, master election doesn't make much sense,
// so we don't bother to retry, we simply dump the failure and return the error directly.
func NeverRetryOnFailure(cluster *etcdclient.Cluster, numReqs int, lastResp http.Response, err error) error {
if len(cluster.Machines) > 1 {

Review comment (Contributor) on the diff above:

I don't disagree, but this is going to make things worse before they get better... it means that GET retries that are currently succeeding will now fail

Reply from @deads2k (Contributor, Author):

    I don't disagree, but this is going to make things worse before they get better... it means that GET retries that are currently succeeding will now fail

You'd think, but see the chain from #6529 (comment). The errors that we're currently getting are only possible on mutation requests. You're saying that you suspect we've gotten read errors before, but that they've been different read errors that have always been recoverable? I'd claim that's thin, but I'll admit it's possible.

@ironcladlou (Contributor) commented

Forgive me if this conclusion was already drawn, but it seems like just adding logging on failure serves the immediate need (revealing currently hidden errors) without risking a behavioral change by eliminating retries.
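
If we went that route, the minimal version is probably a wrapper that logs and then defers to the stock policy; a sketch, assuming go-etcd's exported DefaultCheckRetry (the function name here is illustrative):

    package etcdutil

    import (
        "log"
        "net/http"

        etcdclient "github.com/coreos/go-etcd/etcd"
    )

    // logThenDefaultRetry surfaces the hidden failure in the logs and then falls
    // back to go-etcd's stock policy, so retry behavior itself is unchanged.
    func logThenDefaultRetry(cluster *etcdclient.Cluster, numReqs int, lastResp http.Response, err error) error {
        if err != nil {
            log.Printf("etcd request error (attempt %d): %v", numReqs, err)
        } else {
            log.Printf("etcd failure response (attempt %d): %s", numReqs, lastResp.Status)
        }
        return etcdclient.DefaultCheckRetry(cluster, numReqs, lastResp, err)
    }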

@deads2k (Contributor, Author) commented Jan 5, 2016

Just organizing my thoughts and creating some links so I can backtrack more easily.

  1. The name is weird, but I'm pretty sure that this https://github.com/openshift/origin/blob/master/Godeps/_workspace/src/github.com/coreos/etcd/etcdserver/etcdhttp/client.go#L139 is the method for handling http requests to etcd.
  2. On a mutation, you end up here: https://github.com/openshift/origin/blob/master/Godeps/_workspace/src/github.com/coreos/etcd/etcdserver/server.go#L541
  3. That call launches some async work (server side), but if something goes wrong, you end up here: https://github.com/openshift/origin/blob/master/Godeps/_workspace/src/github.com/coreos/etcd/etcdserver/server.go#L554 (still server-side).
  4. This is worthwhile, because this pull shows us (client-side), the actual error string "cause":"etcdserver: request timed out", which only originates from that parseProposeCtxErr here: https://github.com/openshift/origin/blob/master/Godeps/_workspace/src/github.com/coreos/etcd/etcdserver/server.go#L1017.
  5. It gets dropped onto a channel here: https://github.com/openshift/origin/blob/master/Godeps/_workspace/src/github.com/coreos/etcd/raft/node.go#L387
  6. It gets popped from the channel here: https://github.com/openshift/origin/blob/master/Godeps/_workspace/src/github.com/coreos/etcd/raft/node.go#L279-L281 and this part must succeed or no action would be performed.
  7. That eventually ends up here: https://github.com/openshift/origin/blob/master/Godeps/_workspace/src/github.com/coreos/etcd/raft/raft.go#L509-L522
  8. To end up in the case we've hit, that work in raft.go must have taken too long but eventually succeeded. Likely candidates are the disk write and the broadcast. I'll probably add some timing/trace points there to see if something goes bonkers; a rough sketch of such a trace point follows this list.
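
A rough sketch of the trace point mentioned in (8), wrapping a suspect step and warning when it blows past a budget (names and threshold are illustrative):

    package raftdebug

    import (
        "log"
        "time"
    )

    // timed runs fn and logs a warning when it takes longer than threshold, so a
    // slow disk write or broadcast shows up next to the "request timed out" error.
    func timed(name string, threshold time.Duration, fn func() error) error {
        start := time.Now()
        err := fn()
        if d := time.Since(start); d > threshold {
            log.Printf("W: %s took %v (budget %v), err: %v", name, d, threshold, err)
        }
        return err
    }

    // e.g. err := timed("wal write", 100*time.Millisecond, func() error { return saveEntries() })
    // where saveEntries stands in for the real disk write.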

@deads2k (Contributor, Author) commented Jan 5, 2016

The latest failure looks like an ordinary build flake. We may have been hitting this for a while: https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin/8187/

@deads2k (Contributor, Author) commented Jan 11, 2016

re[test]

@liggitt I still want this. Do you want it formatted differently?

@deads2k (Contributor, Author) commented Jan 18, 2016

re[test] to check jenkins job

@deads2k (Contributor, Author) commented Jan 18, 2016

@danmcp Does this mean anything to you? https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin/8730/consoleText

ln builds/lastSuccessfulBuild /var/lib/jenkins/jobs/test_pull_requests_origin/lastSuccessful failed
java.nio.file.FileAlreadyExistsException: /var/lib/jenkins/jobs/test_pull_requests_origin/lastSuccessful
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:88)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
    at sun.nio.fs.UnixFileSystemProvider.createSymbolicLink(UnixFileSystemProvider.java:457)
    at java.nio.file.Files.createSymbolicLink(Files.java:1043)
    at sun.reflect.GeneratedMethodAccessor1553.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at hudson.Util.createSymlinkJava7(Util.java:1227)
    at hudson.Util.createSymlink(Util.java:1143)
    at hudson.model.Run.createSymlink(Run.java:1845)
    at hudson.model.Run.updateSymlinks(Run.java:1826)
    at hudson.model.Run.execute(Run.java:1739)
    at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
    at hudson.model.ResourceController.execute(ResourceController.java:98)
    at hudson.model.Executor.run(Executor.java:408)

@danmcp commented Jan 18, 2016

@deads2k Yeah, it's harmless afaik. Jenkins is just updating the link to:

https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin/lastSuccessfulBuild/

And it's hitting a race condition. It might mean the link will be wrong for a while. But it should recover.

@deads2k (Contributor, Author) commented Jan 19, 2016

@liggitt we really need this to be able to track flakes. Comments if you've got 'em.

@deads2k (Contributor, Author) commented Jan 19, 2016

re[test]

@openshift-bot (Contributor)

Evaluated for origin test up to f785f05

@openshift-bot (Contributor)

continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/130/)

@deads2k (Contributor, Author) commented Jan 19, 2016

IRC agreement from @liggitt. [merge]

@openshift-bot (Contributor)

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_origin/4654/) (Image: devenv-rhel7_3178)

@openshift-bot (Contributor)

Evaluated for origin merge up to f785f05

openshift-bot pushed a commit that referenced this pull request Jan 19, 2016
@openshift-bot merged commit 07529cf into openshift:master on Jan 19, 2016
@deads2k deleted the log-etcd-failures branch on February 26, 2016