
Investigate restarting pods in-place as a mechanism for faster failure recovery #467

Open
Tracked by #523 ...
danielvegamyhre opened this issue Mar 22, 2024 · 9 comments
Assignees

Comments

@danielvegamyhre
Contributor

What would you like to be added:
A faster restart/failure recovery mechanism that doesn't involve recreating all the child jobs, rescheduling all of the pods, etc.

Why is this needed:
Faster failure recovery reduces downtime for batch/training workloads when they hit an error.

@danielvegamyhre danielvegamyhre self-assigned this Mar 22, 2024
@kannon92
Contributor

Reading this, why wouldn't one just want to use BackOffLimit in the job template?

@danielvegamyhre
Contributor Author

Reading this, why wouldn't one just want to use BackOffLimit in the job template?

backoffLimit only applies to that particular Job, but here a pod failure in any child Job needs to trigger pod restarts across all Jobs in the JobSet.
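For context, here's a minimal sketch of the distinction, assuming the JobSet v1alpha2 API; the names and values are illustrative, not taken from this issue. backoffLimit only retries pods inside one child Job, while the JobSet-level failurePolicy.maxRestarts recreates every child Job when any of them fails; that full recreate is exactly the restart path this issue wants to make cheaper.

```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: training             # illustrative name
spec:
  failurePolicy:
    maxRestarts: 3           # a failure in ANY child Job recreates ALL child Jobs, up to 3 times
  replicatedJobs:
  - name: workers
    replicas: 4
    template:                # batch/v1 Job template
      spec:
        backoffLimit: 0      # per-Job pod retries only; does not affect the other child Jobs
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: trainer
              image: example.com/trainer:latest   # illustrative image
```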

@googs1025
Member

Sorry, I just joined this project and am still trying to understand many things, so I'm not sure if I'm thinking about this correctly. I want to confirm whether the intention is to restart the entire ReplicatedJob when any Job fails, or to restart the entire JobSet.

@danielvegamyhre
Contributor Author

Sorry, I just joined this project and am still trying to understand many things, so I'm not sure if I'm thinking about this correctly. I want to confirm whether the intention is to restart the entire ReplicatedJob when any Job fails, or to restart the entire JobSet.

Right now it means the entire JobSet, but once #381 is implemented it will depend on whether (and how) a failure policy is configured.
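To make that concrete, here is a rough, hypothetical sketch of where this could go once #381 lands: a failure policy rule that decides whether a matching Job failure restarts the whole JobSet. The rules field and its shape are assumptions about the proposed API, not the API as it existed when this comment was written.

```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
spec:
  failurePolicy:
    maxRestarts: 3
    rules:                      # hypothetical shape of the configurable policy proposed in #381
    - action: RestartJobSet     # a matching Job failure restarts all child Jobs
      targetReplicatedJobs:
      - workers                 # only failures in this ReplicatedJob match the rule
  # replicatedJobs omitted for brevity
```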

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 30, 2024
@ahg-g
Contributor

ahg-g commented Jun 30, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 30, 2024
@danielvegamyhre
Contributor Author

Update: After prototyping and investigating this further, I've identified 2 upstream changes that will be needed for approaches based on "kill container process to force Kubelet to recreate it" logic:

  1. In the Job API, if a podFailurePolicy is defined, pods must use restartPolicy: Never. That rules out combining in-place container restarts with a podFailurePolicy that ignores the container exit code produced by "killall5" while still counting real container failures (see the sketch after this comment). We need to relax this validation, but it is more complicated than it sounds; there are various edge cases to handle. I opened a GitHub issue to track this and will try to drive a resolution there: Job API: Relax validation enforcing Pod Failure Policy is only compatible with pod restart policy of "Never" (kubernetes/kubernetes#125677)

  2. The Kubelet backoff policy for container restarts cannot be tuned/configured, so after a few restarts it becomes more efficient to simply delete and recreate all pods rather than restart containers in place. There is already an upstream feature request for this: Make CrashLoopBackoff timing tuneable, or add mechanism to exempt some exits (kubernetes/kubernetes#57291). I discussed it with tallclair@ and they are targeting alpha in v1.32.

I'll continue pursuing these upstream while I investigate alternatives for the short term.
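To illustrate item 1, here is a minimal sketch of the combination the in-place approach needs but that current Job API validation rejects: restartPolicy: OnFailure (so the Kubelet restarts the container in place) together with a podFailurePolicy that ignores the intentional kill. The name, image, and exit code are assumptions; in particular, 143 assumes the main process is terminated with SIGTERM by killall5.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: in-place-restart-demo   # illustrative name
spec:
  backoffLimit: 0
  podFailurePolicy:             # currently only allowed with restartPolicy: Never,
    rules:                      # which is the validation kubernetes/kubernetes#125677 asks to relax
    - action: Ignore            # don't count the intentional kill as a Job failure
      onExitCodes:
        containerName: worker
        operator: In
        values: [143]           # assumed exit code when killall5 SIGTERMs the main process
  template:
    spec:
      restartPolicy: OnFailure  # let the Kubelet restart the container in place
      containers:
      - name: worker
        image: example.com/trainer:latest   # illustrative image
```

Item 2 has no JobSet-side workaround: the Kubelet's CrashLoopBackOff starts at 10s and doubles up to a 5-minute cap, so after a handful of in-place restarts it is already cheaper to delete and recreate the pods, at least until the upstream work to make that backoff tunable lands.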

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 13, 2024
@danielvegamyhre
Contributor Author

danielvegamyhre commented Oct 13, 2024

/remove-lifecycle stale

Actually, I have another idea for how to do this; I may try prototyping it.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 13, 2024