
flake: error: replication controller "docker-registry-1" has failed progressing #12806

Closed
sjenning opened this issue Feb 3, 2017 · 14 comments
Labels: component/apps, kind/test-flake, priority/P2

@sjenning
Contributor

sjenning commented Feb 3, 2017

I've tried to merge #12762 twice and both times got

error: replication controller "docker-registry-1" has failed progressing

Any insight?

@sjenning sjenning added the kind/test-flake Categorizes issue or PR as related to test flakes. label Feb 3, 2017
@sjenning sjenning changed the title error: replication controller "docker-registry-1" has failed progressing flake: error: replication controller "docker-registry-1" has failed progressing Feb 3, 2017
@ncdc
Contributor

ncdc commented Feb 7, 2017

I'm seeing this in your test's logs:

I0202 18:56:06.433174    9406 audit.go:125] 2017-02-02T18:56:06.433129762-05:00 AUDIT: id="1d42ef71-399c-4a7a-8d92-555b731ebcff" ip="172.18.5.114" method="POST" user="system:openshift-master" as="<self>" asgroups="<lookup>" namespace="default" uri="/oapi/v1/namespaces/default/deploymentconfigs/docker-registry/instantiate"
[snip]
I0202 18:56:06.486140    9406 audit.go:45] 2017-02-02T18:56:06.486124886-05:00 AUDIT: id="1d42ef71-399c-4a7a-8d92-555b731ebcff" response="204"
E0202 18:56:06.486346    9406 apiserver.go:487] apiserver was unable to write a JSON response: http: request method or response status code does not allow body
E0202 18:56:06.486367    9406 errors.go:63] apiserver received an error that is not an unversioned.Status: http: request method or response status code does not allow body
I0202 18:56:06.486391    9406 audit.go:45] 2017-02-02T18:56:06.486381375-05:00 AUDIT: id="1d42ef71-399c-4a7a-8d92-555b731ebcff" response="500"
I0202 18:56:06.486445    9406 audit.go:125] 2017-02-02T18:56:06.486403117-05:00 AUDIT: id="453b58dc-4c3b-4cf4-82bc-1b29ad6297d8" ip="172.18.5.114" method="PUT" user="system:openshift-master" as="<self>" asgroups="<lookup>" namespace="openshift-infra" uri="/api/v1/namespaces/openshift-infra/secrets/replication-controller-dockercfg-z4sdl"
I0202 18:56:06.486608    9406 panics.go:76] POST /oapi/v1/namespaces/default/deploymentconfigs/docker-registry/instantiate: (53.809838ms) 500
[logging stack trace snip]
logging error output: "k8s\x00\n\f\n\x02v1\x12\x06Status\x12P\n\x04\n\x00\x12\x00\x12\aFailure\x1a:deployment config \"docker-registry\" cannot be instantiated\"\x000\xcc\x01\x1a\x00\"\x00"

logging error output: "k8s\x00\n\f\n\x02v1\x12\x06Status\x12V\n\x04\n\x00\x12\x00\x12\aFailure\x1a@http: request method or response status code does not allow body\"\x000\xf4\x03\x1a\x00\"\x00"
 [[openshift/v1.5.0 (linux/amd64) openshift/3eed789] 172.18.5.114:41756]

Maybe that's why? cc @mfojtik @smarterclayton
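For reference, that "does not allow body" message is the stock net/http error a handler gets when it tries to write a body after a 204 has already been sent. A minimal sketch of just the standard-library behavior (not the Origin code path):

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

func main() {
	// Mimic the suspected sequence: the endpoint answers 204 No Content,
	// then the error path tries to attach a Status body to the same response.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusNoContent)

		// The second write is rejected by net/http.
		_, err := w.Write([]byte(`{"kind":"Status","status":"Failure"}`))
		fmt.Println(err == http.ErrBodyNotAllowed, err)
		// true http: request method or response status code does not allow body
	}))
	defer srv.Close()

	http.Post(srv.URL, "application/json", nil)
}
```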

I'm removing component/imageregistry for now as it doesn't seem pertinent.

@ncdc ncdc assigned mfojtik and unassigned legionus Feb 7, 2017
@ncdc
Contributor

ncdc commented Feb 7, 2017

I also see this:

I0202 18:56:09.535338    9406 factory.go:488] About to try and schedule pod docker-registry-1-deploy
I0202 18:56:09.535351    9406 scheduler.go:93] Attempting to schedule pod: default/docker-registry-1-deploy
I0202 18:56:09.535364    9406 factory.go:511] Ignoring node 172.18.5.114 with Ready condition status False
I0202 18:56:09.535372    9406 scheduler.go:97] Failed to schedule pod: default/docker-registry-1-deploy
I0202 18:56:09.535376    9406 factory.go:581] Unable to schedule default docker-registry-1-deploy: no nodes are registered to the cluster; waiting
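
For anyone reproducing this, a quick way to check the Ready condition the scheduler looks at is a small client-go program along these lines (a sketch only; it assumes a recent client-go where List takes a context, and the kubeconfig path is just an example):

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Example kubeconfig path; adjust for your environment.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/etc/origin/master/admin.kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	// The scheduler only considers nodes whose Ready condition is True;
	// in the failed run the single node was still reporting False.
	for _, n := range nodes.Items {
		for _, c := range n.Status.Conditions {
			if c.Type == corev1.NodeReady {
				fmt.Printf("%s Ready=%s\n", n.Name, c.Status)
			}
		}
	}
}
```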

@ncdc
Contributor

ncdc commented Feb 7, 2017

Yeah, it looks like the node never transitioned to ready.

@ncdc
Contributor

ncdc commented Feb 7, 2017

It looks like the openshift process was only alive for 12 or 13 seconds. I'm surprised the DC would transition to Failed that quickly, but I'm not up to speed with the DC inner workings. Here's the build console in question if anyone feels like looking in more detail: https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_conformance/10984/consoleFull

@mfojtik
Contributor

mfojtik commented Feb 10, 2017

@ncdc seems like a node problem (although the quick transition of the DC is worrisome)... mind bumping this to P2? (seems like a pretty rare flake).

@ncdc
Contributor

ncdc commented Feb 10, 2017

P2 works for me

@smarterclayton
Contributor

What "node problem"

@ncdc
Contributor

ncdc commented Feb 10, 2017

@mfojtik actually I think this is probably deployment related. It looks like, given a few more seconds, the node probably would have returned a ready condition of true.

@mfojtik
Contributor

mfojtik commented Feb 14, 2017

@ncdc @smarterclayton the timing from the log:

I0202 18:56:06.486608    9406 panics.go:76] POST /oapi/v1/namespaces/default/deploymentconfigs/docker-registry/instantiate: (53.809838ms) 500
[logging stack trace snip]
logging error output: "k8s\x00\n\f\n\x02v1\x12\x06Status\x12P\n\x04\n\x00\x12\x00\x12\aFailure\x1a:deployment config \"docker-registry\" cannot be instantiated\"\x000\xcc\x01\x1a\x00\"\x00"
I0202 18:56:09.535376    9406 factory.go:581] Unable to schedule default docker-registry-1-deploy: no nodes are registered to the cluster; waiting

(I would expect the DC controller to reconcile once a node is available for the deployer pod to run.) @Kargakis do you remember what the instantiate endpoint does in that case?

@0xmichalis
Contributor

The controllers have no notion of nodes being up or down. If this depends on instantiate, we can retry longer with a backoff.
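
A rough sketch of what such a retry could look like with the apimachinery backoff helpers (the instantiate call is stubbed out here for illustration, not the real client code):

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// instantiate stands in for the real instantiate request; it is a stub
// that always fails, for illustration only.
func instantiate() error {
	return fmt.Errorf("deployment config %q cannot be instantiated", "docker-registry")
}

func main() {
	backoff := wait.Backoff{
		Steps:    5,
		Duration: 500 * time.Millisecond,
		Factor:   2.0,
		Jitter:   0.1,
	}

	// Retry the instantiate call with exponential backoff instead of
	// failing the deployment on the first error.
	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		if err := instantiate(); err != nil {
			// Treat the error as retriable and try again on the next step.
			return false, nil
		}
		return true, nil
	})
	fmt.Println("final result:", err)
}
```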

@mfojtik
Contributor

mfojtik commented Jul 12, 2017

@Kargakis this hasn't recurred recently, but I guess we still want to increase the interval before we fail the deployment because no nodes are available?

@mfojtik
Contributor

mfojtik commented Oct 11, 2017

Closing due to age.

@mfojtik mfojtik closed this as completed Oct 11, 2017
@simo5 simo5 reopened this Apr 12, 2018
@simo5
Contributor

simo5 commented Apr 12, 2018

@simo5
Contributor

simo5 commented Apr 13, 2018

sounds like the job was using outdated images, closing again

@simo5 simo5 closed this as completed Apr 13, 2018