Stop eating panics #28800

lavalamp · 2016-07-11T21:40:24Z

After this change, by default Kubernetes components will stop handling panics and actually crash. All Kubernetes components should be run by something that actively restarts them. This is true of the default setups, but those with custom environments may need to double-check.

If necessary, one can restore the previous behavior by setting pkg/util/runtime's `ReallyCrash` to false.
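For readers who want to see the toggle in isolation, here is a minimal sketch of the idea, not the actual pkg/util/runtime code. The lowercase `handleCrash` and the `runWorker` helper are hypothetical names made up for illustration; only `ReallyCrash` and the "run handlers, then re-panic" shape come from this PR.

```go
package main

import "fmt"

// ReallyCrash mirrors the toggle discussed in this PR: when true, a recovered
// panic is re-raised after the handlers have run.
var ReallyCrash = true

// handleCrash is a simplified, hypothetical stand-in for HandleCrash.
// It must be deferred directly so that recover() can see the panic.
func handleCrash(additionalHandlers ...func(interface{})) {
	if r := recover(); r != nil {
		for _, fn := range additionalHandlers {
			fn(r)
		}
		if ReallyCrash {
			// Actually proceed to panic.
			panic(r)
		}
	}
}

// runWorker (hypothetical) runs a panicking task under handleCrash and reports
// whether the panic escaped. The outer recover exists only so this demo can
// observe the result; a real component would let the process die.
func runWorker() (escaped bool) {
	defer func() {
		if r := recover(); r != nil {
			escaped = true
		}
	}()
	func() {
		defer handleCrash(func(r interface{}) {
			fmt.Printf("observed panic: %v\n", r)
		})
		panic("boom")
	}()
	return false
}

func main() {
	ReallyCrash = true
	fmt.Println("panic escaped:", runWorker())

	ReallyCrash = false
	fmt.Println("panic escaped:", runWorker())
}
```

With `ReallyCrash` set to true the panic propagates past the handler, which in a real component would terminate the process; with it set to false the panic is swallowed after the handlers run, which is the old behavior.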

Fixes #28365

lavalamp · 2016-07-11T21:57:45Z

looks like some tests need attention.

roberthbailey · 2016-07-11T23:30:54Z

lgtm other than fixing the unit/integration tests.

lavalamp · 2016-07-12T00:56:20Z

Hm, I'm not sure if we want to turn these off for kubelet. @yujuhong, would anything bad happen if kubelet started actually crashing when it panics?

yujuhong · 2016-07-12T01:07:42Z

> Hm, I'm not sure if we want to turn these off for kubelet. @yujuhong, would anything bad happen if kubelet started actually crashing when it panics?

Uh... you could say that in most cluster setups there will be some babysitter daemon restarting kubelet. I am not sure that's true for all setups, though. On the other hand, I've seen kubelet panic and then the panic get handled by the util, which led to multiple instances of the same components being started. It may be safer and easier to debug to let kubelet crash.

lavalamp · 2016-07-12T01:09:52Z

OK, this should make the tests pass.

lavalamp · 2016-07-12T01:11:00Z

Thanks, @yujuhong -- I think I'll just disable it everywhere like I originally planned then.

fyi @smarterclayton in case this affects OpenShift

yujuhong · 2016-07-12T01:18:45Z

/cc @kubernetes/sig-node

After this PR is merged, if kubelet panics, it will crash.

smarterclayton · 2016-07-12T05:03:54Z

pkg/util/runtime/runtime.go

@@ -42,6 +50,8 @@ func HandleCrash(additionalHandlers ...func(interface{})) {
 		for _, fn := range additionalHandlers {
 			fn(r)
 		}
+		// Actually proceed to panic.
+		panic(r)


That's terrifying.

smarterclayton · 2016-07-12T05:07:35Z

Somewhat of a big change. Doesn't this mean any panic someone can reproduce can be a denial of service attack against any component of the infra? I do like making these more visible, but for API servers exposed to multi-tenant users this is riskier.

Can we flip the default so that someone can disable this (get the old behavior) if they disagree?

lavalamp · 2016-07-13T01:25:53Z

Updated. ReallyCrash boolean works again.

smarterclayton · 2016-07-13T01:52:19Z

Thanks, looks good to me aside from whatever the failures are in mesos tests.

lavalamp · 2016-07-13T06:32:54Z

I'm pretty sure the remaining mesos test error is actually a legit test error that was being papered over by us eating this panic. @kubernetes/kube-mesos @karlkfi Shall I disable the test until you have time to look at it? Or can you suggest a fix?

jdef · 2016-07-13T12:17:38Z

@lavalamp please disable the test for now - thanks

lavalamp · 2016-07-13T16:37:19Z

Thanks @jdef. Test disabled.

k8s-cherrypick-bot · 2016-07-13T18:40:53Z

Removing label cherrypick-candidate because no release milestone was set. This is an invalid state and thus this PR is not being considered for cherry-pick to any release branch. Please add an appropriate release milestone and then re-add the label.

k8s-bot · 2016-07-13T19:18:57Z

GCE e2e build/test passed for commit 78c02cd.

lavalamp · 2016-07-13T19:48:24Z

all the tests pass, I'm going to interpret Robert & Clayton's comments as a collective LGTM.

k8s-github-robot · 2016-07-13T20:28:59Z

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

smarterclayton · 2016-07-13T20:57:54Z

Yes, it was LGTM

k8s-bot · 2016-07-13T21:05:38Z

GCE e2e build/test passed for commit 78c02cd.

k8s-github-robot · 2016-07-13T21:05:41Z

Automatic merge from submit-queue

thockin · 2016-07-17T02:07:19Z

Did you want to offer a cherry-pick for next cut?

fabioy · 2016-07-19T16:02:49Z

@lavalamp Ping. Could you create a cherrypick PR?

lavalamp · 2016-07-19T22:03:05Z

I'm... concerned about pushing this into prod, since apparently controller manager deletes and recreates a bunch of things it ought not when it restarts.

roberthbailey · 2016-07-19T22:33:29Z

I'm ok waiting for this to land in the 1.3 branch after 1.3.3 to give it a longer bake time (both at head and in the release branch).

fabioy · 2016-09-08T15:47:49Z

Since 1.4 is close to release, I guess we won't be taking this on the 1.3 branch, so removing the cherrypick labels.

spiffxp · 2016-09-21T22:46:15Z

@lavalamp @roberthbailey this PR has a release-note-action-required label on it, but I can't tell that the action is something an end-user of kubernetes can do, outside of kubelet's --really-crash-for-testing flag

can you clarify whether this is worth calling out as an action required in the 1.4.0 release notes?

lavalamp · 2016-09-22T00:06:24Z

@spiffxp the change message says:

After this change, by default Kubernetes components will stop handling panics and actually crash. All Kubernetes components should be run by something that actively restarts them. This is true of the default setups, but those with custom environments may need to double-check.

This is the action that is needed.

lavalamp · 2016-09-22T00:07:23Z

The action is targeted at cluster admins. Most action-required items are, I think. End users of kubernetes should not need to take many actions.

spiffxp · 2016-09-22T00:12:00Z

@lavalamp agree on the cluster ops audience (that's what I meant when I said end-user, doh), hence my confusion. I skipped over that first line, thanks for pointing me in the right direction

googlebot added the cla: yes label Jul 11, 2016

k8s-github-robot assigned thockin Jul 11, 2016

k8s-github-robot added the size/S and release-note-label-needed labels Jul 11, 2016

lavalamp added the release-note-none label and removed the release-note-label-needed label Jul 11, 2016

lavalamp assigned roberthbailey and unassigned thockin Jul 11, 2016

lavalamp force-pushed the reallypanic branch from e8376f4 to 7b9af2f Compare July 12, 2016 01:09

k8s-github-robot added the size/M label and removed the size/S label Jul 12, 2016

smarterclayton reviewed Jul 12, 2016

lavalamp force-pushed the reallypanic branch 2 times, most recently from 8d249c5 to c8724f5 Compare July 13, 2016 01:25

lavalamp force-pushed the reallypanic branch from c8724f5 to c245793 Compare July 13, 2016 06:30

lavalamp force-pushed the reallypanic branch from c245793 to 15a381c Compare July 13, 2016 16:33

k8s-cherrypick-bot removed the cherrypick-candidate label Jul 13, 2016

roberthbailey added this to the v1.3 milestone Jul 13, 2016

roberthbailey added the cherrypick-candidate label Jul 13, 2016

lavalamp added the lgtm label Jul 13, 2016

k8s-github-robot merged commit aebc35a into kubernetes:master Jul 13, 2016

zmerlynn added the cherry-pick-approved label Jul 18, 2016

This was referenced Aug 9, 2016

apiserver: fix timeout handler #29594

Merged

apiserver still eats panic #30305

Closed

eparis mentioned this pull request Aug 18, 2016

1.3 upstream picks openshift/origin#9790

Closed

fabioy removed the cherry-pick-approved and cherrypick-candidate labels Sep 8, 2016

liggitt mentioned this pull request Nov 2, 2017

Don't panic on unexpected group version for mutated selector #54966

Merged

roycaihw mentioned this pull request Mar 11, 2020

Panic in kubelet Run does not exit, creates multiple parallel Run goroutines #88779

Closed

liggitt mentioned this pull request Feb 7, 2023

apiserver: panic from goroutine spun up by request handler should not crash the apiserver #115565

Open