LimitRange defaults flake #11094

Closed
bparees opened this issue Sep 26, 2016 · 18 comments
Labels: kind/bug, kind/test-flake, priority/P1

Comments

@bparees
Contributor

bparees commented Sep 26, 2016

• Failure in Spec Teardown (AfterEach) [300.360 seconds]
[k8s.io] LimitRange
/data/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:793
  should create a LimitRange with defaults and ensure pod has those defaults applied. [AfterEach]
  /data/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/limit_range.go:102

  Sep 26 03:54:52.649: Couldn't delete ns: "e2e-tests-limitrange-c4yb1": namespace e2e-tests-limitrange-c4yb1 was not deleted with limit: timed out waiting for the condition, namespace is empty but is not yet removed (&errors.errorString{s:"namespace e2e-tests-limitrange-c4yb1 was not deleted with limit: timed out waiting for the condition, namespace is empty but is not yet removed"})

  /data/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:338

as seen in:
https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_conformance/6466/consoleFull

@gnufied
Member

gnufied commented Oct 5, 2016

@derekwaynecarr are you working on this? If not, I would like to take a stab at it.

@derekwaynecarr
Member

Please do.


@mfojtik
Contributor

mfojtik commented Oct 10, 2016

Bumping the priority, as this is being seen more and more often.

@gnufied
Member

gnufied commented Oct 11, 2016

So far, from my debugging, I have found the following:

# request to delete namespace received
110427:I1010 13:28:50.104496   13879 audit.go:97] AUDIT: id="db017281-5cda-4eab-81f7-4cc4cc031cc4" ip="xxxxx.xxxx" method="DELETE" user="system:serviceaccount:openshift-infra:namespace-controller" as="<self>" asgroups="" namespace="e2e-tests-limitrange-9t4ap" uri="/api/v1/namespaces/e2e-tests-limitrange-9t4ap"
# response for delete request
110460:I1010 13:28:50.145487   13879 audit.go:25] AUDIT: id="db017281-5cda-4eab-81f7-4cc4cc031cc4" response="200"
# check if namespace was truly deleted
110468:I1010 13:28:50.171589   13879 audit.go:97] AUDIT: id="3c66e5dc-eb8f-41cf-a8c4-3effe4d68262" ip="xxxxxx.xxxx.xxxx" method="GET" user="system:serviceaccount:openshift-infra:namespace-controller" as="<self>" asgroups="" namespace="e2e-tests-limitrange-9t4ap" uri="/api/v1/namespaces/e2e-tests-limitrange-9t4ap"
110469-I1010 13:28:50.173091   13879 audit.go:25] AUDIT: id="3c66e5dc-eb8f-41cf-a8c4-3effe4d68262" response="404"

And yet the conformance test logs print:

Oct 10 13:22:56.858: INFO: Waiting up to 1m0s for all nodes to be ready
STEP: Destroying namespace "e2e-tests-limitrange-9t4ap" for this suite.
Oct 10 13:27:56.903: INFO: Couldn't delete ns: "e2e-tests-limitrange-9t4ap": namespace

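This makes it look like the 5-minute timeout is sometimes not enough for the namespace to be deleted: the test gave up at 13:27:56, while the namespace controller's DELETE only went through at 13:28:50. I am thinking of increasing the namespace deletion timeout just for this e2e test case.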

@gnufied
Member

gnufied commented Oct 12, 2016

I have opened a PR in the upstream repo - kubernetes/kubernetes#34614

@gnufied
Member

gnufied commented Oct 20, 2016

I ran a comparison of the time taken to delete a namespace in OpenShift vs k8s:

For OpenShift

Time taken for deleting namespace 240.018105771
Time taken for deleting namespace 240.005871789

For k8s:

Time is took to delete a namespace is 125.002902668
Time is took to delete a namespace is 205.002347593
Time is took to delete a namespace is 125.002965305

This is running on the same hardware, btw. @liggitt What do you think? Can we merge the flake PR?

@liggitt
Contributor

liggitt commented Oct 20, 2016

those times are in seconds? is this test creating any openshift resources or just kube resources?

@gnufied
Member

gnufied commented Oct 20, 2016

The times are in seconds, yes. The tests only create kube resources. In fact, the e2e test file (limit_range.go) is identical in kube and origin. But when I ran oc get namespace foo -o yaml, there was an additional finalizer listed for namespaces created in OpenShift.

@liggitt
Contributor

liggitt commented Oct 20, 2016

@derekwaynecarr do you know why the origin namespace finalizer would add 1-2 minutes to clean up a namespace containing no origin resources? Seems like it would be a bunch of list calls returning nothing, then a single finalize call to remove the origin finalizer. Both controllers already retry the finalize call on conflict errors.

@gnufied
Member

gnufied commented Oct 21, 2016

Just to be clear: please take the above benchmarks with a pinch of salt. I only posted the highest values I saw. The time taken varies from 20s to 240s for origin and from 20s to 205s for k8s.

I am looking at the bits of code that get invoked when a namespace is deleted.

@liggitt
Contributor

liggitt commented Oct 21, 2016

Right, but I'd expect 2-3 seconds of overhead for the origin namespace, max

@gnufied
Member

gnufied commented Oct 26, 2016

@liggitt @derekwaynecarr I did some more debugging around this and found that the origin finalizer actually runs multiple times when a namespace is deleted. It isn't rerunning because a previous invocation failed; it runs again even when the previous invocation succeeded and deleted all the origin resources.

Basically, the code at https://github.com/openshift/origin/blob/master/pkg/project/admission/lifecycle/admission.go#L52 runs again and again (5-10 times) while the namespace is being deleted, and each time it adds back the previously removed OpenShift finalizer.

I printed the admission attributes when this happens and got something like:

I1026 10:14:38.653781   20125 admission.go:53] dspace lotr ~~~ calling Admin for - :e2e-tests-limitrange-puldo:events::CREATE
I1026 10:14:38.703963   20125 admission.go:53] dspace lotr ~~~ calling Admin for - :e2e-tests-limitrange-puldo:events::CREATE
I1026 10:14:38.804018   20125 admission.go:53] dspace lotr ~~~ calling Admin for - :e2e-tests-limitrange-puldo:events::CREATE
I1026 10:14:38.856005   20125 admission.go:53] dspace lotr ~~~ calling Admin for - :e2e-tests-limitrange-puldo:events::CREATE

@liggitt
Contributor

liggitt commented Oct 26, 2016

caused by the project lifecycle admission plugin re-adding the origin finalizer when admitting creations to a terminating namespace

the plugin used to check whether the namespace was terminating and reject out of hand, but that meant a rapid (delete ns, create ns, create resource) sequence would either fail (if the plugin rejected the request because it thought the ns was still terminating), or incorrectly skip the origin finalizer (if the plugin declined to add the finalizer because it thought the ns was terminating)

mustfix for 1.4
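The interaction described in this comment can be modeled with a small toy simulation. It is a simplified sketch, not the plugin's code: the `namespace` type, its methods, and the finalizer name `openshift.io/origin` are assumptions made for illustration.

```go
package main

import "fmt"

// Assumed name of the origin finalizer, for illustration only.
const originFinalizer = "openshift.io/origin"

// namespace is a toy model of the fields that matter here.
type namespace struct {
	terminating bool
	finalizers  map[string]bool
}

// finalize models the origin controller finishing cleanup and
// removing its finalizer.
func (ns *namespace) finalize() {
	delete(ns.finalizers, originFinalizer)
}

// admitCreate models the lifecycle admission plugin as described above:
// on a CREATE in the namespace it ensures the origin finalizer is present,
// whether or not the namespace is terminating. It reports whether the
// finalizer had to be re-added.
func (ns *namespace) admitCreate() bool {
	if ns.finalizers[originFinalizer] {
		return false
	}
	ns.finalizers[originFinalizer] = true
	return true
}

// removable reports whether a terminating namespace has no finalizers left.
func (ns *namespace) removable() bool {
	return ns.terminating && len(ns.finalizers) == 0
}

func main() {
	ns := &namespace{
		terminating: true,
		finalizers:  map[string]bool{originFinalizer: true},
	}
	readded := 0
	// Four event CREATEs arrive during termination, as in the log excerpt.
	for i := 0; i < 4; i++ {
		ns.finalize()
		if ns.admitCreate() {
			readded++
		}
	}
	fmt.Println("finalizer re-added:", readded, "times; removable:", ns.removable())

	// Once creates stop arriving, a single finalize is enough.
	ns.finalize()
	fmt.Println("after a quiet finalize, removable:", ns.removable())
}
```

In this model every late CREATE undoes a successful finalize, so the namespace only becomes removable once events stop arriving, which matches the repeated finalizer runs observed during debugging.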

@liggitt liggitt added the kind/bug Categorizes issue or PR as related to a bug. label Oct 26, 2016
@liggitt liggitt added this to the 1.4.0 milestone Oct 26, 2016
@liggitt
Contributor

liggitt commented Oct 27, 2016

debugged back to before the rebase onto kube 1.4 and still encountered the issue
