Confusing (and broken?) deployment behavior #18293

Closed
bparees opened this issue Jan 25, 2018 · 9 comments


bparees commented Jan 25, 2018

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/18278/test_pull_request_origin_extended_image_ecosystem/450/

specifically the `[image_ecosystem][ruby][Slow] hot deploy for openshift ruby image Rails example should work with hot deploy [Suite:openshift]` test (10m29s)

In this test we create a buildconfig and a DC that is triggered by that build's output image. The DC also has a config change trigger.

We expect exactly one deployment to occur (when the build completes), but we appear to be getting two. This did not happen previously.

Looking at the events, we see:

the build starts at 23:09:25

Jan 24 23:19:28.534: INFO: At 2018-01-24 23:09:25 +0000 UTC - event for rails-postgresql-example-1-build: {default-scheduler } Scheduled: Successfully assigned rails-postgresql-example-1-build to ip-172-18-11-23.ec2.internal
Jan 24 23:19:28.534: INFO: At 2018-01-24 23:09:36 +0000 UTC - event for rails-postgresql-example-1: {build-controller } BuildStarted: Build extended-test-s2i-ruby-llfdg-mr96q/rails-postgresql-example-1 is now running

Roughly simultaneously, the build completes and deployment 2 is created (not sure where the creation event for deployment 1 is?), which cancels deployment 1. Presumably deployment 2 is being triggered by the newly built/pushed image.

Jan 24 23:19:28.534: INFO: At 2018-01-24 23:14:47 +0000 UTC - event for rails-postgresql-example: {deploymentconfig-controller } DeploymentAwaitingCancellation: Deployment of version 2 awaiting cancellation of older running deployments
Jan 24 23:19:28.534: INFO: At 2018-01-24 23:14:47 +0000 UTC - event for rails-postgresql-example: {deploymentconfig-controller } DeploymentCancelled: Cancelled deployment "rails-postgresql-example-1" superceded by version 2
Jan 24 23:19:28.534: INFO: At 2018-01-24 23:14:47 +0000 UTC - event for rails-postgresql-example-1-deploy: {default-scheduler } Scheduled: Successfully assigned rails-postgresql-example-1-deploy to ip-172-18-11-23.ec2.internal
Jan 24 23:19:28.534: INFO: At 2018-01-24 23:14:48 +0000 UTC - event for rails-postgresql-example-1: {build-controller } BuildCompleted: Build extended-test-s2i-ruby-llfdg-mr96q/rails-postgresql-example-1 completed successfully
Jan 24 23:19:28.534: INFO: At 2018-01-24 23:14:53 +0000 UTC - event for rails-postgresql-example: {deployer-controller } RolloutCancelled: Rollout for "extended-test-s2i-ruby-llfdg-mr96q/rails-postgresql-example-1" cancelled
Jan 24 23:19:28.534: INFO: At 2018-01-24 23:14:53 +0000 UTC - event for rails-postgresql-example: {deploymentconfig-controller } DeploymentCreated: Created new replication controller "rails-postgresql-example-2" for version 2

So my questions are:

Assuming deployment 2 was caused by the build completion, what caused deployment 1? Was it the config change trigger (despite the imagestreamtag not existing yet)?

Should deployment 1 have been created? It didn't work that way before.

You can see the template with the DC in question here:
https://github.com/openshift/rails-ex/blob/6a59aa15bf863fde71e0bbfa43c5344290eed8f6/openshift/templates/rails-postgresql.json#L146
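
For reference, the two triggers on that DC look roughly like this when expressed as Go API types (a sketch reconstructed from the linked JSON; the container and tag names come from the template, the types from github.com/openshift/api/apps/v1):

```go
package sketch

import (
	appsv1 "github.com/openshift/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// The DC carries both trigger types, so either a config change or an
// imagestreamtag update can start a rollout.
var triggers = []appsv1.DeploymentTriggerPolicy{
	{Type: appsv1.DeploymentTriggerOnConfigChange},
	{
		Type: appsv1.DeploymentTriggerOnImageChange,
		ImageChangeParams: &appsv1.DeploymentTriggerImageChangeParams{
			Automatic:      true,
			ContainerNames: []string{"rails-postgresql-example"},
			From: corev1.ObjectReference{
				Kind: "ImageStreamTag",
				Name: "rails-postgresql-example:latest",
			},
		},
	},
}
```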

And note that this test just ran, so it should have included the fix for the stale informer cache.

My impression is that deployment 1 got created by the config change trigger and then hung waiting for the imagestreamtag to resolve. When the build updates the imagestreamtag, deployment 1 starts to proceed, but gets canceled because deployment 2 is triggered.
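
If that's right, the cancellation half of the story would look something like this sketch (illustrative only, not the origin controller code; the annotation keys are assumptions based on what origin stores on deployer RCs):

```go
package sketch

import (
	"strconv"

	corev1 "k8s.io/api/core/v1"
)

// Assumed annotation keys: origin records the DC version and a cancellation
// marker as annotations on the deployment's replication controller.
const (
	versionAnnotation   = "openshift.io/deployment-config.latest-version"
	cancelledAnnotation = "openshift.io/deployment.cancelled"
)

// cancelSuperseded marks any still-unfinished older deployment cancelled once
// a newer version exists -- the "DeploymentCancelled ... superceded by version 2"
// event above. Deployment 1, stuck waiting on the imagestreamtag, loses this race.
func cancelSuperseded(rcs []*corev1.ReplicationController, latestVersion int64) {
	for _, rc := range rcs {
		v, err := strconv.ParseInt(rc.Annotations[versionAnnotation], 10, 64)
		if err != nil || v >= latestVersion {
			continue
		}
		// Hypothetical marker; the deployer controller reacts to it and
		// emits the RolloutCancelled event seen in the logs.
		rc.Annotations[cancelledAnnotation] = "true"
	}
}
```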

@tnozicka @mfojtik @Kargakis

(Marking as a bug because it appears to be a change in behavior. If it's expected we can live with it, but I need to know it's intentional.)

@bparees bparees added kind/bug Categorizes issue or PR as related to a bug. priority/P1 component/apps labels Jan 25, 2018
@bparees bparees changed the title Confusing deployment behavior Confusing (and broken?) deployment behavior Jan 25, 2018
@tnozicka

> My impression is that deployment 1 got created by the config change trigger and then hung waiting for the imagestreamtag to resolve. When the build updates the imagestreamtag, deployment 1 starts to proceed, but gets canceled because deployment 2 is triggered.

It should wait for the image to be injected before doing the first deployment. In addition, we now have an explicit check not to deploy when the image is empty (" "). So if it was really caused by that, it's not intentional.
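
A minimal sketch of that guard (illustrative, not the actual deploymentconfig_controller code), matching the "Postponing rollout ... invalid or unresolved image" messages the controller logs:

```go
package sketch

import (
	"strings"

	appsv1 "github.com/openshift/api/apps/v1"
)

// imagesResolved postpones the rollout while any container image is still
// empty or the " " placeholder, so a config change trigger alone cannot
// deploy before the build has pushed an image and the imagestreamtag has
// been resolved.
func imagesResolved(dc *appsv1.DeploymentConfig) bool {
	for _, container := range dc.Spec.Template.Spec.Containers {
		if strings.TrimSpace(container.Image) == "" {
			return false
		}
	}
	return true
}
```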

I am out today, but when I am back I will check the master logs to see what caused those deployments.


bparees commented Jan 26, 2018

This happened again here:
https://ci.openshift.redhat.com/jenkins/job/test_branch_origin_extended_builds/353/consoleFull

The build starts:

Jan 26 13:38:53.564: INFO: At 2018-01-26 13:34:36 +0000 UTC - event for s2icontext-1: {build-controller } BuildStarted: Build extended-test-contextdir-6n2tf-tdxtf/s2icontext-1 is now running

The build completes and the frontend-1 deployment starts:

Jan 26 13:38:53.564: INFO: At 2018-01-26 13:36:14 +0000 UTC - event for frontend: {deploymentconfig-controller } DeploymentCreated: Created new replication controller "frontend-1" for version 1
Jan 26 13:38:53.564: INFO: At 2018-01-26 13:36:14 +0000 UTC - event for frontend: {deploymentconfig-controller } DeploymentCancelled: Cancelled deployment "frontend-1" superceded by version 2
Jan 26 13:38:53.564: INFO: At 2018-01-26 13:36:14 +0000 UTC - event for frontend: {deploymentconfig-controller } DeploymentAwaitingCancellation: Deployment of version 2 awaiting cancellation of older running deployments
Jan 26 13:38:53.564: INFO: At 2018-01-26 13:36:14 +0000 UTC - event for frontend-1-deploy: {kubelet ip-172-18-3-187.ec2.internal} SuccessfulMountVolume: MountVolume.SetUp succeeded for volume "deployer-token-b754m" 
Jan 26 13:38:53.564: INFO: At 2018-01-26 13:36:14 +0000 UTC - event for frontend-1-deploy: {default-scheduler } Scheduled: Successfully assigned frontend-1-deploy to ip-172-18-3-187.ec2.internal
Jan 26 13:38:53.564: INFO: At 2018-01-26 13:36:18 +0000 UTC - event for s2icontext-1: {build-controller } BuildCompleted: Build extended-test-contextdir-6n2tf-tdxtf/s2icontext-1 completed successfully
Jan 26 13:38:53.564: INFO: At 2018-01-26 13:36:20 +0000 UTC - event for frontend-1-deploy: {kubelet ip-172-18-3-187.ec2.internal} Pulled: Container image "openshift/origin-deployer:4f709b4" already present on machine
Jan 26 13:38:53.564: INFO: At 2018-01-26 13:36:20 +0000 UTC - event for frontend-1-deploy: {kubelet ip-172-18-3-187.ec2.internal} Failed: Error: cannot find volume "deployer-token-b754m" to mount into container "deployment"

Frontend deployment 2 starts a few seconds later and cancels deployment 1:

Jan 26 13:38:53.564: INFO: At 2018-01-26 13:36:21 +0000 UTC - event for frontend: {deploymentconfig-controller } DeploymentCreated: Created new replication controller "frontend-2" for version 2
Jan 26 13:38:53.564: INFO: At 2018-01-26 13:36:21 +0000 UTC - event for frontend: {deployer-controller } RolloutCancelled: Rollout for "extended-test-contextdir-6n2tf-tdxtf/frontend-1" cancelled

@tnozicka

Finally managed to download the logs from the first case (conference wifi sucks).

Both were caused by a config change, and happened at around the same time.

I0124 23:09:24.843274   19833 deploymentconfig_controller.go:146] Postponing rollout #0 for DeploymentConfig extended-test-s2i-ruby-llfdg-mr96q/rails-postgresql-example because of invalid or unresolved image for container #0 (name=rails-postgresql-example)
...
I0124 23:09:24.894827   19833 deploymentconfig_controller.go:146] Postponing rollout #0 for DeploymentConfig extended-test-s2i-ruby-llfdg-mr96q/rails-postgresql-example because of invalid or unresolved image for container #0 (name=rails-postgresql-example)
...
I0124 23:14:47.583118   19833 deploymentconfig_controller.go:563] Rolling out initial deployment for extended-test-s2i-ruby-llfdg-mr96q/rails-postgresql-example as it now have images available
...
I0124 23:14:47.680831   19833 deploymentconfig_controller.go:496] Error syncing deployment config extended-test-s2i-ruby-llfdg-mr96q/rails-postgresql-example: Operation cannot be fulfilled on deploymentconfigs.apps.openshift.io "rails-postgresql-example": the object has been modified; please apply your changes to the latest version and try again
...
I0124 23:14:47.680942   19833 deploymentconfig_controller.go:594] Rolling out #2 deployment for extended-test-s2i-ruby-llfdg-mr96q/rails-postgresql-example caused by image changes (docker-registry.default.svc:5000/extended-test-s2i-ruby-llfdg-mr96q/rails-postgresql-example@sha256:db5a145f7d886f56a2c92a95e16b588163587dc0c1a784a7c7db34fd4a484181)

@mfojtik this reminds me of the registry decorator case: once it injects the image with the registry IP, and the second time with the registry DNS name. WDYT?
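
If that's the cause, the mechanics would be the textual pull-spec comparison in the image change trigger: it fires whenever the resolved pull spec differs from the last one it recorded, so the same digest reached via the registry IP and via the registry DNS name counts as two changes, i.e. two rollouts. A sketch (illustrative, not the origin trigger code):

```go
package sketch

import appsv1 "github.com/openshift/api/apps/v1"

// imageChanged fires on any textual difference in the pull spec, so
//
//	172.30.57.92:5000/<ns>/rails-postgresql-example@sha256:db5a14...
//	docker-registry.default.svc:5000/<ns>/rails-postgresql-example@sha256:db5a14...
//
// would trigger two rollouts even though both name the same image digest.
func imageChanged(params *appsv1.DeploymentTriggerImageChangeParams, resolvedPullSpec string) bool {
	return resolvedPullSpec != "" && params.LastTriggeredImage != resolvedPullSpec
}
```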


bparees commented Jan 27, 2018

It does look like the registry URL is not configured:

time="2018-01-26T13:23:27.767041115Z" level=info msg="Using \"172.30.57.92:5000\" as Docker Registry URL" go.version=go1.9.2 instance.id=95089825-99f8-4880-b5db-7974ec030eac 

I'm not sure why that is; it seems like the extended test job is not deploying the registry properly (i.e. with a registry URL configured). That is odd, since I thought it used the ansible installer to set up the cluster, and the ansible installer definitely sets that value. I will dig into this some more from my side as well.


bparees commented Jan 27, 2018

@mfojtik @tnozicka incidentally, did something change with regard to this recently? Because I'm pretty sure we weren't seeing this before (certainly not as of two months ago, when these tests were all passing regularly).


bparees commented Jan 28, 2018

Though the ansible installer does appear to have set it:

TASK [openshift_hosted : Update registry environment variables when pushing via dns] ***
task path: /usr/share/ansible/openshift-ansible/roles/openshift_hosted/tasks/registry.yml:44
ok: [localhost] => {
    "ansible_facts": {
        "openshift_hosted_registry_env_vars": {
            "REGISTRY_OPENSHIFT_SERVER_ADDR": "docker-registry.default.svc:5000"
        }
    }, 
    "changed": false, 
    "generated_timestamp": "2018-01-26 13:08:54.974821"
}

It looks like there is an issue with the registry not respecting that value. I've opened a PR to temporarily revert a recent ansible change, as well as an issue against the registry, since it should be respecting it.

openshift/openshift-ansible#6913
openshift/image-registry#58

Assigning to myself until I see evidence that these multiple deploys still happen after the reversion. Thanks for the triage, @tnozicka.

@bparees bparees assigned bparees and unassigned mfojtik Jan 28, 2018

mfojtik commented Jan 28, 2018

@bparees double-check what version of the registry we run in CI; didn't you consolidate the variables recently?

We can perhaps fix the create call for the image stream to automatically rewrite the pull spec, similar to the decorator... If you change the registry DNS later you will be broken anyway, but we can fix the case of the double-triggered DC.
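
A sketch of that idea (names invented for illustration): normalize the pull-spec host on the write path, the way the decorator already does on reads, so triggers only ever see one canonical pull spec.

```go
package sketch

import "strings"

// rewriteRegistryHost swaps the repository host of an integrated-registry
// pull spec ("<host:port>/<namespace>/<name>@sha256:<digest>") for the
// registry address the master is configured with, leaving the rest intact.
func rewriteRegistryHost(pullSpec, canonicalRegistryAddr string) string {
	i := strings.Index(pullSpec, "/")
	if i < 0 {
		return pullSpec // not a repository-style pull spec; leave it alone
	}
	return canonicalRegistryAddr + pullSpec[i:]
}
```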


bparees commented Jan 28, 2018

> @bparees double-check what version of the registry we run in CI; didn't you consolidate the variables recently?

I fixed ansible to use what is supposed to be the "current" variable for setting the URL, but the registry doesn't appear to actually respect that variable (hence the issue I opened above, and the temporary revert of the ansible change).

> We can perhaps fix the create call for the image stream to automatically rewrite the pull spec, similar to the decorator

You mean rewrite it based on the hostname in the push request, like we discussed the other day? Yeah, that would be ideal, but for now I think the short-term solution is still to make sure the master and registry are aligned on what the URL is. At the moment I think the problem is that the registry isn't getting the right URL set, or isn't reading the variable correctly (per the issue I opened).


bparees commented Jan 29, 2018

This looks to be fixed by my reverting the ansible changes, so we'll chase it from there.
