
Jenkins flake on pinging /healthz #6031

Closed
stevekuznetsov opened this issue Nov 23, 2015 · 12 comments
Labels
component/kubernetes kind/test-flake Categorizes issue or PR as related to test flakes. priority/P2

Comments

@stevekuznetsov
Contributor

Seen here:

ERROR: gave up waiting for https://127.0.0.1:28443/healthz

!!! Error in hack/test-cmd.sh:362
    'return 1' exited with status 1
Call stack:
    1: hack/test-cmd.sh:362 main(...)
Exiting with status 1
@miminar

miminar commented Nov 27, 2015

Can also be seen in #5141, #5578, #5819, #5932, #5971, #5997, #6020, #6063.

@php-coder
Contributor

In my case the extended tests failed with the same error. I found the problem: another instance of openshift was already running:

[INFO] Scan of OpenShift related processes already up via ps -ef    | grep openshift : 
root      8392 16944  0 13:25 pts/0    00:00:00 sudo /data/src/github.com/openshift/origin/_output/local/bin/linux/amd64/openshift start --public-master=localhost --volume-dir=/opt/openshift
root      8396  8392 22 13:25 pts/0    00:00:08 /data/src/github.com/openshift/origin/_output/local/bin/linux/amd64/openshift start --public-master=localhost --volume-dir=/opt/openshift
vagrant   8668  8476  0 13:26 pts/0    00:00:00 grep openshift
[INFO] Starting OpenShift server
[INFO] OpenShift server start at: 
Tue Dec  1 13:26:19 UTC 2015
ERROR: gave up waiting for https://10.0.2.15:8443/healthz

!!! Error in ./test/extended/../../hack/util.sh:364
    'return 1' exited with status 1
Call stack:
    1: ./test/extended/../../hack/util.sh:364 start_os_server(...)
    2: ./test/extended/core.sh:118 main(...)
Exiting with status 1

HTH
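For the local scenario described above, the failure could be caught before the new server even starts its `/healthz` wait. A minimal sketch of such a pre-start guard (the function name and messages are illustrative, not part of Origin's actual `hack/` scripts); `pgrep -f` matches against the full command line, so a path that merely contains "openshift" is not enough to trigger it:

```shell
# Hypothetical pre-start guard: abort early if an `openshift start`
# process is already running, instead of letting the new server fail
# its /healthz wait minutes later.
check_no_openshift() {
  if pgrep -f 'openshift start' >/dev/null 2>&1; then
    echo "ERROR: an openshift server is already running" >&2
    return 1
  fi
  echo "no leftover openshift server found"
}
```

Note that this avoids the `ps -ef | grep openshift` pattern from the log above, which also matches the `grep` process itself.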

@stevekuznetsov
Contributor Author

This is a common problem when running locally, yes. I do not think that this is the issue on Jenkins, however.

@knobunc
Contributor

knobunc commented Dec 17, 2015

Hitting this too :-(

https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin/7931/consoleText

[INFO] Scan of OpenShift related processes already up via ps -ef | grep openshift :
ec2-user 26025 26024 0 15:40 ? 00:00:00 /bin/bash /data/src/github.com/openshift/origin/hack/update-generated-swagger-spec.sh _output/verify-generated-swagger-spec
ec2-user 26120 26025 0 15:40 ? 00:00:00 grep openshift
[INFO] Starting OpenShift server
[INFO] OpenShift server start at:
Thu Dec 17 15:40:37 EST 2015
ERROR: gave up waiting for https://127.0.0.1:38443/healthz

!!! Error in /data/src/github.com/openshift/origin/hack/../hack/util.sh:364
'return 1' exited with status 1
Call stack:
1: /data/src/github.com/openshift/origin/hack/../hack/util.sh:364 start_os_master(...)
2: /data/src/github.com/openshift/origin/hack/update-generated-swagger-spec.sh:55 main(...)
Exiting with status 1
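Worth noting about the log above: the "OpenShift related processes" the scan found are just the `update-generated-swagger-spec.sh` script itself (its path contains "openshift") and the `grep` process, so no stale server was actually running — the real failure is the `/healthz` wait timing out. A minimal sketch of that kind of wait loop (function name, defaults, and messages here are illustrative, not the real `hack/util.sh` code):

```shell
# Poll a /healthz URL until it answers or we run out of attempts.
# -k: the test server uses a self-signed cert; -f: treat HTTP errors
# as failures; -s: suppress progress output.
wait_for_healthz() {
  local url=$1 max_tries=${2:-30} delay=${3:-1}
  local i
  for ((i = 1; i <= max_tries; i++)); do
    if curl -k -s -f "$url" >/dev/null; then
      echo "healthz OK after ${i} tries"
      return 0
    fi
    sleep "$delay"
  done
  echo "ERROR: gave up waiting for ${url}" >&2
  return 1
}
```

A loop like this can only report "gave up waiting"; it cannot distinguish a slow-starting server from one that failed to bind, which is why the later comments push for more debugging output around server init.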

@liggitt
Contributor

liggitt commented Dec 19, 2015

@liggitt
Contributor

liggitt commented Jan 14, 2016

@stevekuznetsov
Contributor Author

@pweil- are you still looking into this? The LDAP extended run hits this pretty often, and there never seem to be any other OpenShift-related processes up at the time. It seems like a deeper failure of API server initialization.

@pweil-
Contributor

pweil- commented Mar 16, 2016

No, not at the moment. The timeout was updated and more debugging was added, so we closed my debug PR #6171 (comment).

@stevekuznetsov
Contributor Author

Hm. We're not seeing it terribly often on the PR job, since runs there often fail earlier in the verify/unit-test phase, but the issue still shows up. On the LDAP job there is very little that can flake before this point, so we see it more often. It doesn't seem as though the fixes you mention did much to alleviate the problem.

@stevekuznetsov
Contributor Author

Seen this three times in a row on the LDAP job.

@stevekuznetsov
Contributor Author

@pweil- continuing to see this -- the LDAP job is up to around 20% failure rate due to this issue alone.

@smarterclayton
Contributor

The LDAP job needs to get david's etcd fix thing.


8 participants