
flake: error: replication controller "docker-registry-1" has failed progressing #12806

Closed
sjenning opened this issue Feb 3, 2017 · 14 comments
Labels: component/apps, kind/test-flake, priority/P2

@sjenning
Contributor

sjenning commented Feb 3, 2017

I've tried to merge #12762 twice and both times got

error: replication controller "docker-registry-1" has failed progressing

Any insight?

@sjenning sjenning added the kind/test-flake Categorizes issue or PR as related to test flakes. label Feb 3, 2017
@sjenning sjenning changed the title error: replication controller "docker-registry-1" has failed progressing flake: error: replication controller "docker-registry-1" has failed progressing Feb 3, 2017
@ncdc
Contributor

ncdc commented Feb 7, 2017

I'm seeing this in your test's logs:

I0202 18:56:06.433174    9406 audit.go:125] 2017-02-02T18:56:06.433129762-05:00 AUDIT: id="1d42ef71-399c-4a7a-8d92-555b731ebcff" ip="172.18.5.114" method="POST" user="system:openshift-master" as="<self>" asgroups="<lookup>" namespace="default" uri="/oapi/v1/namespaces/default/deploymentconfigs/docker-registry/instantiate"
[snip]
I0202 18:56:06.486140    9406 audit.go:45] 2017-02-02T18:56:06.486124886-05:00 AUDIT: id="1d42ef71-399c-4a7a-8d92-555b731ebcff" response="204"
E0202 18:56:06.486346    9406 apiserver.go:487] apiserver was unable to write a JSON response: http: request method or response status code does not allow body
E0202 18:56:06.486367    9406 errors.go:63] apiserver received an error that is not an unversioned.Status: http: request method or response status code does not allow body
I0202 18:56:06.486391    9406 audit.go:45] 2017-02-02T18:56:06.486381375-05:00 AUDIT: id="1d42ef71-399c-4a7a-8d92-555b731ebcff" response="500"
I0202 18:56:06.486445    9406 audit.go:125] 2017-02-02T18:56:06.486403117-05:00 AUDIT: id="453b58dc-4c3b-4cf4-82bc-1b29ad6297d8" ip="172.18.5.114" method="PUT" user="system:openshift-master" as="<self>" asgroups="<lookup>" namespace="openshift-infra" uri="/api/v1/namespaces/openshift-infra/secrets/replication-controller-dockercfg-z4sdl"
I0202 18:56:06.486608    9406 panics.go:76] POST /oapi/v1/namespaces/default/deploymentconfigs/docker-registry/instantiate: (53.809838ms) 500
[logging stack trace snip]
logging error output: "k8s\x00\n\f\n\x02v1\x12\x06Status\x12P\n\x04\n\x00\x12\x00\x12\aFailure\x1a:deployment config \"docker-registry\" cannot be instantiated\"\x000\xcc\x01\x1a\x00\"\x00"

logging error output: "k8s\x00\n\f\n\x02v1\x12\x06Status\x12V\n\x04\n\x00\x12\x00\x12\aFailure\x1a@http: request method or response status code does not allow body\"\x000\xf4\x03\x1a\x00\"\x00"
 [[openshift/v1.5.0 (linux/amd64) openshift/3eed789] 172.18.5.114:41756]

Maybe that's why? cc @mfojtik @smarterclayton
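For reference, that "does not allow body" message is the stock net/http error a handler gets when it tries to write a body after a 204 has already been sent. A minimal sketch of just the standard-library behavior (not the Origin code path):

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

func main() {
	// Mimic the suspected sequence: the endpoint answers 204 No Content,
	// then the error path tries to attach a Status body to the same response.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusNoContent)

		// The second write is rejected by net/http.
		_, err := w.Write([]byte(`{"kind":"Status","status":"Failure"}`))
		fmt.Println(err == http.ErrBodyNotAllowed, err)
		// true http: request method or response status code does not allow body
	}))
	defer srv.Close()

	http.Post(srv.URL, "application/json", nil)
}
```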

I'm removing component/imageregistry for now as it doesn't seem pertinent.

@ncdc ncdc assigned mfojtik and unassigned legionus Feb 7, 2017
@ncdc
Contributor

ncdc commented Feb 7, 2017

I also see this:

I0202 18:56:09.535338    9406 factory.go:488] About to try and schedule pod docker-registry-1-deploy
I0202 18:56:09.535351    9406 scheduler.go:93] Attempting to schedule pod: default/docker-registry-1-deploy
I0202 18:56:09.535364    9406 factory.go:511] Ignoring node 172.18.5.114 with Ready condition status False
I0202 18:56:09.535372    9406 scheduler.go:97] Failed to schedule pod: default/docker-registry-1-deploy
I0202 18:56:09.535376    9406 factory.go:581] Unable to schedule default docker-registry-1-deploy: no nodes are registered to the cluster; waiting
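
For anyone reproducing this, a quick way to check the Ready condition the scheduler looks at is a small client-go program along these lines (a sketch only; it assumes a recent client-go where List takes a context, and the kubeconfig path is just an example):

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Example kubeconfig path; adjust for your environment.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/etc/origin/master/admin.kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	// The scheduler only considers nodes whose Ready condition is True;
	// in the failed run the single node was still reporting False.
	for _, n := range nodes.Items {
		for _, c := range n.Status.Conditions {
			if c.Type == corev1.NodeReady {
				fmt.Printf("%s Ready=%s\n", n.Name, c.Status)
			}
		}
	}
}
```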

@ncdc
Contributor

ncdc commented Feb 7, 2017

Yeah, it looks like the node never transitioned to ready.

@ncdc
Contributor

ncdc commented Feb 7, 2017

It looks like the openshift process was only alive for 12 or 13 seconds. I'm surprised the DC would transition to Failed that quickly, but I'm not up to speed with the DC inner workings. Here's the build console in question if anyone feels like looking in more detail: https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_conformance/10984/consoleFull

@mfojtik
Contributor

mfojtik commented Feb 10, 2017

@ncdc seems like a node problem (although the quick transition of the DC is worrisome)... mind bumping this to P2? (seems like a pretty rare flake).

@ncdc
Contributor

ncdc commented Feb 10, 2017

P2 works for me

@smarterclayton
Contributor

What "node problem"

@ncdc
Contributor

ncdc commented Feb 10, 2017

@mfojtik actually I think this is probably deployment related. It looks like, given a few more seconds, the node probably would have returned a ready condition of true.

@mfojtik
Contributor

mfojtik commented Feb 14, 2017

@ncdc @smarterclayton the timing from the log:

I0202 18:56:06.486608    9406 panics.go:76] POST /oapi/v1/namespaces/default/deploymentconfigs/docker-registry/instantiate: (53.809838ms) 500
[logging stack trace snip]
logging error output: "k8s\x00\n\f\n\x02v1\x12\x06Status\x12P\n\x04\n\x00\x12\x00\x12\aFailure\x1a:deployment config \"docker-registry\" cannot be instantiated\"\x000\xcc\x01\x1a\x00\"\x00"
I0202 18:56:09.535376    9406 factory.go:581] Unable to schedule default docker-registry-1-deploy: no nodes are registered to the cluster; waiting

(I would expect the DC controller to reconcile once a node is available for the deployer pod to run.) @Kargakis do you remember what the instantiate endpoint does in that case?

@0xmichalis
Contributor

The controllers have no notion of nodes being up or down. If this depends on instantiate, we can retry longer with a backoff.
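
A rough sketch of what such a retry could look like with the apimachinery backoff helpers (the instantiate call is stubbed out here for illustration, not the real client code):

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// instantiate stands in for the real instantiate request; it is a stub
// that always fails, for illustration only.
func instantiate() error {
	return fmt.Errorf("deployment config %q cannot be instantiated", "docker-registry")
}

func main() {
	backoff := wait.Backoff{
		Steps:    5,
		Duration: 500 * time.Millisecond,
		Factor:   2.0,
		Jitter:   0.1,
	}

	// Retry the instantiate call with exponential backoff instead of
	// failing the deployment on the first error.
	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		if err := instantiate(); err != nil {
			// Treat the error as retriable and try again on the next step.
			return false, nil
		}
		return true, nil
	})
	fmt.Println("final result:", err)
}
```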

@mfojtik
Contributor

mfojtik commented Jul 12, 2017

@Kargakis this hasn't recurred recently, but I guess we still want to increase the interval before we fail the deployment because no nodes are available?

@mfojtik
Contributor

mfojtik commented Oct 11, 2017

Closing due to age.

@mfojtik mfojtik closed this as completed Oct 11, 2017
@simo5 simo5 reopened this Apr 12, 2018
@simo5
Contributor

simo5 commented Apr 12, 2018

@simo5
Contributor

simo5 commented Apr 13, 2018

sounds like the job was using outdated images, closing again

@simo5 simo5 closed this as completed Apr 13, 2018