
"failed to propose on members [https://127.0.0.1:24001]" #6447

Closed
deads2k opened this issue Dec 21, 2015 · 11 comments


deads2k commented Dec 21, 2015

This bug happens because the etcd server can apply the write successfully while the fsync to its WAL is extremely slow. The request then times out and the server replies with a 500 even though the action was taken. The etcd client retries the call automatically, and the retry fails because the action has already been taken (see the sketch after the list below).

This can manifest as:

  1. unexpected error: namespaces "hammer-project" already exists
  2. etcdhttp: got unexpected response error (etcdserver: request timed out)
  3. Unable to initialize namespaces: unable to persist the updated namespace UID allocations: uidallocation "" cannot be updated: another caller has already initialized the resource
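
For illustration, here's a minimal Go sketch of that sequence against the etcd v2 client API. The endpoint mirrors the logs below, the key path is made up, and the explicit second Set call stands in for the retry the client library performs automatically, so treat this as an assumption-laden illustration rather than the real client code path:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/client"
)

func main() {
	c, err := client.New(client.Config{
		Endpoints:               []string{"https://127.0.0.1:24001"},
		HeaderTimeoutPerRequest: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	kapi := client.NewKeysAPI(c)

	// First attempt: the server can apply the create but stall on the WAL
	// fsync, so the request times out and comes back as a 500 even though
	// the key now exists.
	opts := &client.SetOptions{PrevExist: client.PrevNoExist}
	_, err = kapi.Set(context.Background(), "/namespaces/hammer-project", "{}", opts)
	if err != nil {
		// Stand-in for the client's automatic retry: the same create now
		// collides with the key the first attempt already wrote and fails
		// with "Key already exists" (errorCode 105).
		_, err = kapi.Set(context.Background(), "/namespaces/hammer-project", "{}", opts)
		if cErr, ok := err.(client.Error); ok && cErr.Code == client.ErrorCodeNodeExist {
			log.Printf("retry failed, action was already taken: %v", cErr)
		}
	}
}
```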

https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin/8038/consoleText

FAILURE after 12.429s: hack/../test/cmd/builds.sh:92: executing 'oc start-build --from-webhook=https://127.0.0.1:28443/oapi/v1/namespaces/cmd-builds/buildconfigs/ruby-sample-build/webhooks/secret101/generic' expecting success: the command returned the wrong error code
There was no output from the command.
Standard error from the command:
error: server rejected our request 500
remote: {
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "Internal error occurred: could not generate a build: 501: All the given peers are not reachable (failed to propose on members [https://127.0.0.1:24001] twice [last error: Unexpected HTTP status code]) [0]",
  "reason": "InternalError",
  "details": {
    "causes": [
      {
        "message": "could not generate a build: 501: All the given peers are not reachable (failed to propose on members [https://127.0.0.1:24001] twice [last error: Unexpected HTTP status code]) [0]"
      }
    ]
  },
  "code": 500
}
!!! Error in hack/../test/cmd/../../hack/cmd_util.sh:193
    'return 1' exited with status 1
Call stack:
    1: hack/../test/cmd/../../hack/cmd_util.sh:193 os::cmd::expect_success(...)
    2: hack/../test/cmd/builds.sh:92 main(...)
Exiting with status 1
!!! Error in hack/test-cmd.sh:286
    '${test}' exited with status 1
Call stack:
    1: hack/test-cmd.sh:286 main(...)
Exiting with status 1
[FAIL] !!!!! Test Failed !!!!

See #6065 for more details.

deads2k added the priority/P2 and kind/test-flake labels Dec 21, 2015

deads2k commented Dec 21, 2015

Another occurrence here:

https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin/8043/consoleText

In project test on server https://172.30.0.1:443

2 warnings identified, use 'oc status -v' to see details.
Error from server: 501: All the given peers are not reachable (failed to propose on members [https://172.18.6.21:4001] twice [last error: Unexpected HTTP status code]) [0]
!!! Error in hack/../test/end-to-end/core.sh:160
    'oc delete pod cli-with-token' exited with status 1
Call stack:
    1: hack/../test/end-to-end/core.sh:160 main(...)
Exiting with status 1

[FAIL] !!!!! Test Failed !!!!

stevekuznetsov commented

flake here:

FAILURE after 63.617s: hack/../test/cmd/builds.sh:108: executing 'oc process -f examples/sample-app/application-template-dockerbuild.json -l build=docker | oc create -f -' expecting success: the command returned the wrong error code
Standard output from the command:
imagestream "origin-ruby-sample" created
deploymentconfig "frontend" created
service "database" created
deploymentconfig "database" created
Standard error from the command:
Error from server: Timeout: request did not complete within allowed duration
Error from server: 501: All the given peers are not reachable (failed to propose on members [https://127.0.0.1:24001] twice [last error: Unexpected HTTP status code]) [0]
Error from server: imageStream "ruby-22-centos7" already exists
Error from server: buildconfig "ruby-sample-build" already exists
[FAIL] !!!!! Test Failed !!!!

0xmichalis commented

Another

Error from server: 501: All the given peers are not reachable (failed to propose on members [https://172.18.4.116:4001] twice [last error: Unexpected HTTP status code]) [0]
!!! Error in hack/../test/end-to-end/core.sh:164
    'oc delete pod cli-with-token-2' exited with status 1
Call stack:
    1: hack/../test/end-to-end/core.sh:164 main(...)
Exiting with status 1

https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_origin/4448/consoleFull

aveshagarwal commented

When I run ./hack/test-end-to-end.sh with the latest origin, I see a similar issue:

[INFO] Running a CLI command in a container using the service account
Waiting for pod test/cli-with-token to be running, status is Pending, pod ready: false
Waiting for pod test/cli-with-token to be running, status is Pending, pod ready: false
Error attaching, falling back to logs: error executing remote command: Error executing command in container: container not found ("cli-with-token")
F0108 20:00:54.584136 1 helpers.go:96] Error in configuration: default cluster has no server defined
!!! Error in ./hack/../test/end-to-end/core.sh:160
    '[ "$(cat ${LOG_DIR}/cli-with-token.log | grep 'Using in-cluster configuration')" ]' exited with status 1
Call stack:
    1: ./hack/../test/end-to-end/core.sh:160 main(...)
Exiting with status 1
!!! Error in ./hack/test-end-to-end.sh:51
    '${OS_ROOT}/test/end-to-end/core.sh' exited with status 1
Call stack:
    1: ./hack/test-end-to-end.sh:51 main(...)
Exiting with status 1

[FAIL] !!!!! Test Failed !!!!


deads2k commented Jan 11, 2016

It looks like this is being caused by sudden latency in disk IO. See #6542 (comment) for details.

Since this seems to be an environmental problem, we're currently working around it by using a ramdisk. That frees the merge and test queue for now.
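
As a rough check of the disk-latency theory, something like this Go sketch times the write-plus-fsync cycle the etcd WAL performs on every proposal; sustained spikes here would line up with the request timeouts. The probe path and entry size are assumptions for illustration, not part of the actual diagnosis:

```go
package main

import (
	"fmt"
	"log"
	"os"
	"time"
)

func main() {
	// Probe whatever filesystem backs etcd's data dir; /var/lib/etcd is an
	// assumed location and needs to exist and be writable.
	f, err := os.CreateTemp("/var/lib/etcd", "fsync-probe-*")
	if err != nil {
		log.Fatal(err)
	}
	defer os.Remove(f.Name())
	defer f.Close()

	entry := make([]byte, 4096) // roughly one small WAL record
	for i := 0; i < 10; i++ {
		start := time.Now()
		if _, err := f.Write(entry); err != nil {
			log.Fatal(err)
		}
		if err := f.Sync(); err != nil { // the fsync etcd blocks on
			log.Fatal(err)
		}
		fmt.Printf("write+fsync %2d: %v\n", i, time.Since(start))
	}
}
```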

smarterclayton commented

Something's back

I0130 13:13:02.433088    5645 decoder.go:141] decoding stream as JSON
2016-01-30 13:13:10.435005 E | etcdhttp: got unexpected response error (etcdserver: request timed out)
E0130 13:13:10.435279    5645 etcd.go:95] etcd failure response: HTTP/1.1 500 Internal Server Error
Content-Length: 100
Content-Type: application/json
Date: Sat, 30 Jan 2016 18:13:10 GMT
X-Etcd-Cluster-Id: 3a66c6e8db3c8d30
X-Etcd-Index: 0

{"errorCode":300,"message":"Raft Internal Error","cause":"etcdserver: request timed out","index":0}

deads2k assigned smarterclayton and unassigned deads2k Feb 1, 2016

deads2k commented Feb 1, 2016

@smarterclayton this happened after the CI flow changed. I wonder if it's exceeding the pre-allocated space now. Since I'm rebasing and @liggitt already tried, do you want a go?

smarterclayton commented

Jordan's looking at it, but it's very likely the builds were moved.


smarterclayton commented

I think this is resolved now.

ironcladlou added a commit to ironcladlou/openshift-ansible that referenced this issue Oct 7, 2016
Master startup can fail when EC2 transparently reallocates the block storage, causing etcd writes to temporarily fail. Retry failures blindly just once to allow time for this transient condition to resolve and for systemd to restart the master (which will eventually succeed).

etcd-io/etcd#3864
openshift/origin#6065
openshift/origin#6447
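
The actual openshift-ansible change lives in the playbook and service unit rather than in Go, but the "retry blindly just once" idea it describes reduces to something like this hypothetical helper (all names and the pause duration are assumptions for illustration):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retryOnce runs op and, on failure, retries exactly once after a pause,
// giving a transient condition (such as an EC2 block-storage reallocation)
// time to clear.
func retryOnce(op func() error, pause time.Duration) error {
	if err := op(); err == nil {
		return nil
	}
	time.Sleep(pause)
	return op()
}

func main() {
	attempts := 0
	err := retryOnce(func() error {
		attempts++
		if attempts == 1 {
			return errors.New("etcdserver: request timed out") // transient failure
		}
		return nil
	}, 100*time.Millisecond)
	fmt.Println("attempts:", attempts, "err:", err)
}
```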