LimitRange defaults flake #11094
@derekwaynecarr are you working on this? If not, I would like to take a stab at it.
Please do.
Bumping the priority, as this can be seen more and more often.
So far, from my debugging, I have found the following:
And yet the conformance test logs print:
This makes it look like the 5-minute timeout is sometimes not enough for the namespace to be deleted. I am thinking of increasing the namespace deletion timeout just for this e2e test case.
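For illustration only (not from the thread): a minimal sketch of how an e2e helper might wait for namespace deletion with a larger timeout, using client-go's polling utility. The function name, poll interval, and any concrete timeout value are assumptions, not the actual test code.

```go
package e2eutil

import (
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForNamespaceDeletion (hypothetical helper) polls until the namespace is
// gone or the timeout expires. Passing a timeout larger than the default
// 5 minutes is the kind of change discussed above.
func waitForNamespaceDeletion(c kubernetes.Interface, name string, timeout time.Duration) error {
	return wait.Poll(2*time.Second, timeout, func() (bool, error) {
		_, err := c.CoreV1().Namespaces().Get(name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return true, nil // namespace is fully deleted
		}
		if err != nil {
			return false, err // unexpected API error: stop polling
		}
		return false, nil // namespace still terminating: keep polling
	})
}
```

A call like `waitForNamespaceDeletion(client, ns.Name, 10*time.Minute)` would stand in for the default 5-minute wait in a test like this one.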
I have opened a PR in the upstream repo: kubernetes/kubernetes#34614
So I ran a comparison of the time taken to delete a namespace in OpenShift vs. k8s. For OpenShift:
For k8s:
This is running on the same hardware, btw. @liggitt what do you think? Can we merge the flake PR?
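The thread does not show how these numbers were collected. As a hedged sketch only, one way to measure namespace deletion time with client-go might look like this (the function name and structure are made up, not the benchmark actually used):

```go
package benchmark

import (
	"time"

	v1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// timeNamespaceDeletion creates a throwaway namespace, deletes it, and returns
// how long it took to disappear from the API.
func timeNamespaceDeletion(c kubernetes.Interface, name string) (time.Duration, error) {
	ns := &v1.Namespace{ObjectMeta: metav1.ObjectMeta{Name: name}}
	if _, err := c.CoreV1().Namespaces().Create(ns); err != nil {
		return 0, err
	}
	start := time.Now()
	if err := c.CoreV1().Namespaces().Delete(name, &metav1.DeleteOptions{}); err != nil {
		return 0, err
	}
	for {
		_, err := c.CoreV1().Namespaces().Get(name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return time.Since(start), nil // fully deleted
		}
		if err != nil {
			return 0, err
		}
		time.Sleep(2 * time.Second) // still terminating; poll again
	}
}
```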
Those times are in seconds? Is this test creating any OpenShift resources, or just kube resources?
The times are in seconds, yes. The tests are only creating kube resources. In fact, the e2e test file (limit_range.go) in kube and origin are identical. But when I did
@derekwaynecarr do you know why the origin namespace finalizer would add 1-2 minutes to clean up a namespace containing no origin resources? Seems like it would be a bunch of list calls returning nothing, then a single finalize call to remove the origin finalizer. Both controllers already retry the finalize call on conflict errors.
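A hedged aside on the retry-on-conflict point: the controllers' actual retry code is not shown in this thread, but the general pattern in client-go looks roughly like this (the wrapper name is made up):

```go
package finalizer

import (
	"k8s.io/client-go/util/retry"
)

// finalizeWithRetry re-runs the finalize call whenever it fails with a
// 409 Conflict (i.e. another writer updated the namespace concurrently);
// other errors are returned immediately without retrying.
func finalizeWithRetry(finalize func() error) error {
	return retry.RetryOnConflict(retry.DefaultRetry, finalize)
}
```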
Just to make it clearer: please take the above benchmarks with a pinch of salt. I only posted the highest values I saw. The time taken varies between 20s and 240s in the case of origin, and between 20s and 205s in the case of k8s. I am looking at the bits of code that get invoked when a namespace is deleted.
Right, but I'd expect 2-3 seconds of overhead for the origin namespace, max.
@liggitt @derekwaynecarr So I did some more debugging around this and found that the origin finalizer is actually running multiple times when a namespace is deleted. And it isn't running multiple times because a previous invocation of the finalizer failed or something; it runs even if the previous invocation deleted all the origin resources and was successful. So basically this code, https://github.com/openshift/origin/blob/master/pkg/project/admission/lifecycle/admission.go#L52, runs again and again (like 5-10 times) while the namespace is being deleted, and it adds back the previously deleted OpenShift finalizer. I tried printing the admission attributes when this happens.
Caused by the project lifecycle admission plugin re-adding the origin finalizer when admitting creations to a terminating namespace. The plugin used to check if the namespace was terminating and reject out of hand, but that meant a rapid (delete ns, create ns, create resource) sequence would either fail (if the plugin rejected when it thought the ns was terminating) or incorrectly not add the origin finalizer (if the plugin skipped adding the finalizer when it thought the ns was terminating). Must-fix for 1.4.
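To make the trade-off concrete, here is a minimal sketch (not the actual origin plugin, and not by itself the fix) of the "skip when terminating" option described above; as the comment notes, with a stale view of the namespace this can incorrectly skip the finalizer on a rapidly re-created namespace. The finalizer constant and function name are assumptions.

```go
package lifecycle

import (
	v1 "k8s.io/api/core/v1"
)

// originFinalizer stands in for the origin finalizer the plugin manages
// (assumed value for illustration).
const originFinalizer v1.FinalizerName = "openshift.io/origin"

// ensureOriginFinalizer adds the origin finalizer unless the namespace is
// already terminating. Unconditionally re-adding it during termination is what
// keeps deletion from completing and causes the flake, while skipping based on
// a stale phase can miss a namespace that was just deleted and re-created.
func ensureOriginFinalizer(ns *v1.Namespace) {
	if ns.Status.Phase == v1.NamespaceTerminating {
		return // do not re-add the finalizer to a namespace being deleted
	}
	for _, f := range ns.Spec.Finalizers {
		if f == originFinalizer {
			return // already present
		}
	}
	ns.Spec.Finalizers = append(ns.Spec.Finalizers, originFinalizer)
}
```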
Debugged back to before the rebase onto kube 1.4 and still encountered the issue, as seen in:
https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_conformance/6466/consoleFull