
DaemonSet controller actively kills failed pods (to recreate them) #40330

Merged
merged 4 commits into kubernetes:master from kill-failed-daemon-pods
Jan 27, 2017

Conversation

janetkuo
Member

@janetkuo janetkuo commented Jan 23, 2017

Ref #36482, @erictune @yujuhong @mikedanese @Kargakis @lukaszo @piosz @kubernetes/sig-apps-bugs

This also helps with DaemonSet update

@janetkuo janetkuo added area/workload-api/daemonset release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/apps Categorizes an issue or PR as relevant to SIG Apps. labels Jan 23, 2017
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 23, 2017
@k8s-reviewable

This change is Reviewable

@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jan 23, 2017
case shouldContinueRunning && len(daemonPods) > 1:
case shouldContinueRunning:
// If a daemon pod failed, delete it
// TODO: handle the case when the daemon pods fail consistently and causes kill-recreate hot loop
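The hunk above adds a case that deletes failed daemon pods so the controller can recreate them. As a minimal sketch of that idea, using simplified hypothetical types in place of the real v1.Pod (the actual controller tracks pods per node and issues API deletes):

```go
package main

import "fmt"

// Pod is a toy stand-in for the Kubernetes API pod object.
type Pod struct {
	Name   string
	Failed bool
}

// podsToKeepAndDelete mirrors the idea in the diff: when the daemon pod
// should continue running on a node, any Failed pod is selected for
// deletion so the controller can recreate it on the next sync.
func podsToKeepAndDelete(daemonPods []Pod) (keep, del []string) {
	for _, p := range daemonPods {
		if p.Failed {
			del = append(del, p.Name)
		} else {
			keep = append(keep, p.Name)
		}
	}
	return keep, del
}

func main() {
	keep, del := podsToKeepAndDelete([]Pod{
		{Name: "ds-pod-a", Failed: false},
		{Name: "ds-pod-b", Failed: true},
	})
	fmt.Println(keep, del)
}
```

The TODO about a kill-recreate hot loop is what the rest of this review thread discusses.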
Contributor

How often does the controller sync?

Contributor

Shouldn't be that often. @janetkuo we probably want to cap at a maximum number of retries and then drop daemon sets out of the queue so we won't end up hot-looping.

Member Author

@janetkuo janetkuo Jan 24, 2017

How about returning errors (at the end) whenever there's a failed daemon pod? We use a rate limiter when syncHandler returns an error, so this can prevent the hot loop.

@spxtr
Contributor

spxtr commented Jan 24, 2017

Why not add or update a unit test for the new behavior?

@0xmichalis
Contributor

Why not add or update a unit test for the new behavior?

There is a new extended test added

@0xmichalis
Contributor

Just a comment about the hotloop, lgtm otherwise.

@spxtr
Contributor

spxtr commented Jan 24, 2017

There is a new extended test added

There is an e2e test, but I would much rather see a unit test.

@janetkuo janetkuo force-pushed the kill-failed-daemon-pods branch 2 times, most recently from f599875 to 33cf0c9 Compare January 24, 2017 22:50
@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 24, 2017
@janetkuo
Member Author

Added a unit test and addressed hotloop issue; PTAL @spxtr @Kargakis

@spxtr
Contributor

spxtr commented Jan 25, 2017

Thanks. It looks reasonable overall, but I don't have much context. I'll let @Kargakis LGTM.

@@ -547,6 +563,10 @@ func (dsc *DaemonSetsController) manage(ds *extensions.DaemonSet) error {
for err := range errCh {
errors = append(errors, err)
}
if failedPodsObserved > 0 {
Contributor

I am not sure I understand this - why do you need to return the error here? Won't the daemon set be resynced because of the deleted pod event anyway?

Contributor

Ah, you want to use the rate limiter, OK. Although for perma-failed daemon sets we probably want to stop retrying them after a while.

Member Author

We don't support perma-failed daemon sets yet. Normally the DS controller checks whether a daemon pod can be scheduled on a node before creating it, so it's unlikely to create pods that are doomed to fail. However, there can be a race condition where the kubelet uses its own node object to admit pods and then rejects them (the pods become Failed).

Let's deal with this in a follow up PR?

Member Author

@janetkuo janetkuo Jan 25, 2017

Moved the comment to before the if statement to make it clearer.
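The hunk under discussion appends an extra error when failed pods were observed, so the sync result is non-nil and the queue's rate limiter backs off. A stdlib-only sketch of that error-aggregation pattern (Kubernetes uses utilerrors.NewAggregate for this; errors.Join stands in here to keep the example self-contained):

```go
package main

import (
	"errors"
	"fmt"
)

// syncResult collects the errors gathered from the controller's
// create/delete goroutines and additionally surfaces observed failed
// pods as an error, so the caller requeues the daemon set with backoff
// instead of hot-looping on kill-recreate.
func syncResult(errs []error, failedPodsObserved int) error {
	if failedPodsObserved > 0 {
		errs = append(errs, fmt.Errorf("deleted %d failed pods of DaemonSet", failedPodsObserved))
	}
	// errors.Join returns nil when every element is nil (or the slice is empty).
	return errors.Join(errs...)
}

func main() {
	fmt.Println(syncResult(nil, 0) == nil)
	fmt.Println(syncResult(nil, 2))
}
```

The exact error message and helper names above are illustrative, not copied from the PR.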

@@ -653,6 +661,31 @@ func TestObservedGeneration(t *testing.T) {
}
}

// DaemonSet controller should kill all failed pods and recreate at most 1 failed pod.
Contributor

"at most 1 pod on every node"

Member Author

Fixed

@janetkuo janetkuo force-pushed the kill-failed-daemon-pods branch from 24f2926 to e502f78 Compare January 25, 2017 18:29
@janetkuo janetkuo force-pushed the kill-failed-daemon-pods branch from e502f78 to 81c1e0c Compare January 25, 2017 18:31
@janetkuo
Member Author

@Kargakis ptal

@mikedanese
Member

/approve

@janetkuo janetkuo added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 27, 2017
@k8s-github-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

The following people have approved this PR: mikedanese

Needs approval from an approver in each of these OWNERS Files:

We suggest the following people:
cc @fejta
You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@0xmichalis
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 27, 2017
@k8s-github-robot

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit 62c8022 into kubernetes:master Jan 27, 2017
k8s-github-robot pushed a commit that referenced this pull request Feb 2, 2017
Automatic merge from submit-queue (batch tested with PRs 40556, 40720)

Emit events on 'Failed' daemon pods

Follow up #40330 @erictune @mikedanese @Kargakis @lukaszo @kubernetes/sig-apps-bugs
@jethrogb

jethrogb commented Jun 6, 2019

Is there a reason to do this in the controller instead of just letting kubelet do it?
