
Stop eating panics #28800

Merged: 1 commit merged into kubernetes:master on Jul 13, 2016

Conversation

@lavalamp (Member) commented Jul 11, 2016

After this change, by default Kubernetes components will stop handling panics and actually crash. All Kubernetes components should be run by something that actively restarts them. This is true of the default setups, but those with custom environments may need to double-check.

If necessary, the previous behavior can be restored by setting pkg/util/runtime's `ReallyPanic` to false.

Fixes #28365


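For orientation, here is a minimal sketch of the mechanism this PR describes. It is illustrative only, not the exact merged code; the logging and handler-registration details in pkg/util/runtime differ, and the toggle name below follows the later discussion in this thread.

```go
package runtime // sketch of pkg/util/runtime; the real code may differ in detail

import "log"

// ReallyCrash controls whether HandleCrash re-panics after running its
// handlers. The new default is true; setting it to false restores the old
// swallow-the-panic behavior. (The description above calls it ReallyPanic;
// the merged code settled on ReallyCrash, per the discussion below.)
var ReallyCrash = true

// HandleCrash is meant to be deferred at the top of long-running goroutines.
// It logs the recovered value, runs any additional handlers, and then either
// re-panics (the new default) or suppresses the panic.
func HandleCrash(additionalHandlers ...func(interface{})) {
	if r := recover(); r != nil {
		log.Printf("recovered from panic: %v", r)
		for _, fn := range additionalHandlers {
			fn(r)
		}
		if ReallyCrash {
			// Actually proceed to panic, so the process dies and whatever
			// supervises it (systemd, a babysitter daemon, etc.) restarts it.
			panic(r)
		}
	}
}
```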

@k8s-github-robot k8s-github-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. release-note-label-needed labels Jul 11, 2016
@lavalamp lavalamp added release-note-none Denotes a PR that doesn't merit a release note. and removed release-note-label-needed labels Jul 11, 2016
@lavalamp lavalamp assigned roberthbailey and unassigned thockin Jul 11, 2016
@lavalamp (Member, Author)

Looks like some tests need attention.

@roberthbailey (Contributor)

LGTM other than fixing the unit/integration tests.

@lavalamp (Member, Author)

Hm, I'm not sure if we want to turn these off for kubelet. @yujuhong, would anything bad happen if kubelet started actually crashing when it panics?

@yujuhong (Contributor)

> Hm, I'm not sure if we want to turn these off for kubelet. @yujuhong, would anything bad happen if kubelet started actually crashing when it panics?

Uh... in most cluster setups there will be some babysitter daemon restarting the kubelet, though I'm not sure that's true for every setup. On the other hand, I've seen the kubelet panic, have the panic handled by the util, and end up starting multiple instances of the same components. Letting the kubelet crash may be safer and easier to debug.

@lavalamp (Member, Author)

OK, this should make the tests pass.

@lavalamp (Member, Author)

Thanks, @yujuhong -- I think I'll just disable it everywhere like I originally planned then.

FYI @smarterclayton, in case this affects OpenShift.

@k8s-github-robot k8s-github-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jul 12, 2016
@yujuhong (Contributor)

/cc @kubernetes/sig-node

After this PR is merged, if kubelet panics, it will crash.
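
Concretely, components defer the crash handler at the top of their goroutines, so with the new default a panic anywhere in a worker brings down the whole process and its supervisor restarts it. A self-contained sketch, with the import path assumed to match pkg/util/runtime as of this change and the worker and its panic purely illustrative:

```go
package main

import (
	"fmt"
	"time"

	utilruntime "k8s.io/kubernetes/pkg/util/runtime" // path assumed for this era of the tree
)

func main() {
	go func() {
		// With ReallyCrash at its new default (true), HandleCrash logs the
		// panic below, re-raises it, and the whole process exits.
		defer utilruntime.HandleCrash()
		panic("simulated kubelet worker panic")
	}()

	time.Sleep(time.Second)
	// Only reached if the old behavior is restored (ReallyCrash = false).
	fmt.Println("panic was swallowed; process kept running")
}
```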

@@ -42,6 +50,8 @@ func HandleCrash(additionalHandlers ...func(interface{})) {
 		for _, fn := range additionalHandlers {
 			fn(r)
 		}
+		// Actually proceed to panic.
+		panic(r)

Contributor (inline review comment):

That's terrifying.

@smarterclayton (Contributor) commented Jul 12, 2016

Somewhat of a big change. Doesn't this mean that any panic someone can reproduce becomes a denial-of-service attack against any component of the infrastructure? I do like making these more visible, but for API servers exposed to multi-tenant users this is riskier.

Can we flip the default so that someone can disable this (get the old behavior) if they disagree?

@lavalamp lavalamp force-pushed the reallypanic branch 2 times, most recently from 8d249c5 to c8724f5 Compare July 13, 2016 01:25
@lavalamp (Member, Author)

Updated. The ReallyCrash boolean works again.
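
So an embedder such as OpenShift that wants the old behavior can opt out at startup. A hedged sketch, assuming the same import path as above and that the assignment happens before any goroutines start:

```go
package main

import (
	utilruntime "k8s.io/kubernetes/pkg/util/runtime" // path assumed
)

func init() {
	// Opt out of the new crash-on-panic default and restore the old
	// log-and-swallow behavior for this binary.
	utilruntime.ReallyCrash = false
}

func main() {
	// ...start the component as usual...
}
```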

@smarterclayton (Contributor)

Thanks, looks good to me aside from whatever the failures are in the Mesos tests.

@lavalamp (Member, Author)

I'm pretty sure the remaining Mesos test error is actually a legitimate test failure that was being papered over by us eating this panic. @kubernetes/kube-mesos @karlkfi Shall I disable the test until you have time to look at it, or can you suggest a fix?

@jdef (Contributor) commented Jul 13, 2016

@lavalamp please disable the test for now - thanks


@lavalamp (Member, Author)

Thanks @jdef. Test disabled.

@k8s-cherrypick-bot

Removing label cherrypick-candidate because no release milestone was set. This is an invalid state and thus this PR is not being considered for cherry-pick to any release branch. Please add an appropriate release milestone and then re-add the label.

@k8s-bot commented Jul 13, 2016

GCE e2e build/test passed for commit 78c02cd.

@lavalamp lavalamp added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 13, 2016
@lavalamp (Member, Author)

All the tests pass; I'm going to interpret Robert and Clayton's comments as a collective LGTM.

@k8s-github-robot

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@smarterclayton (Contributor)

Yes, it was LGTM


@k8s-bot commented Jul 13, 2016

GCE e2e build/test passed for commit 78c02cd.

@k8s-github-robot

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit aebc35a into kubernetes:master Jul 13, 2016
@thockin (Member) commented Jul 17, 2016

Did you want to offer a cherry-pick for the next cut?

@zmerlynn zmerlynn added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Jul 18, 2016
@fabioy (Contributor) commented Jul 19, 2016

@lavalamp Ping. Could you create a cherry-pick PR?

@lavalamp (Member, Author)

I'm... concerned about pushing this into prod, since apparently the controller manager deletes and recreates a bunch of things it ought not to when it restarts.

@roberthbailey (Contributor)

I'm OK with waiting for this to land in the 1.3 branch after 1.3.3, to give it a longer bake time (both at head and in the release branch).

@fabioy (Contributor) commented Sep 8, 2016

Since 1.4 is close to release, I guess we won't be taking this into the 1.3 branch, so I'm removing the cherry-pick labels.

@fabioy fabioy removed cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. cherrypick-candidate labels Sep 8, 2016
@spiffxp (Member) commented Sep 21, 2016

@lavalamp @roberthbailey This PR has a release-note-action-required label on it, but I can't tell that the action is something an end user of Kubernetes can take, outside of the kubelet's --really-crash-for-testing flag.

Can you clarify whether this is worth calling out as action required in the 1.4.0 release notes?

@lavalamp (Member, Author)

@spiffxp the change message says:

> After this change, by default Kubernetes components will stop handling panics and actually crash. All Kubernetes components should be run by something that actively restarts them. This is true of the default setups, but those with custom environments may need to double-check.

This is the action that is needed.

@lavalamp (Member, Author)

The action is targeted at cluster admins; most action-required items are, I think. End users of Kubernetes should not need to take many actions.

@spiffxp (Member) commented Sep 22, 2016

@lavalamp Agreed on the cluster-ops audience (that's what I meant when I said end user, doh), hence my confusion; I skipped over that first line. Thanks for pointing me in the right direction.

Labels:
lgtm ("Looks good to me", indicates that a PR is ready to be merged)
release-note-action-required (denotes a PR that introduces potentially breaking changes that require user action)
size/M (denotes a PR that changes 30-99 lines, ignoring generated files)