Unable to get logs, etc., from failed parallel pods in job #110464
Comments
@jmgate: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the `triage/accepted` label and provide further guidance.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/sig architecture
Another data point: For
Another data point: We're sending logs to ElasticSearch, but if, e.g., I'm using |
Something I tried today was removing the
Pros:
Cons:
Workaround

Another possible solution, and the one we'll likely go with: rather than have the container script exit with a code indicating the success or failure of testing, have it always return 0. In this way, we trick k8s into thinking all is well. In that case, all the parallel pods finish and are left on the cluster in the `Completed` state.
Pros:
Cons:
I suspect what we'll wind up doing is specifying that any test container scripts must have the last line of their output be something like a pass/fail marker string that we can then parse the logs for.
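A minimal sketch of what such a wrapper could look like, assuming a shell-based test container; the test command and marker strings below are hypothetical, not the ones actually settled on:

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: run the real tests, but never let their exit
# code reach Kubernetes, so every pod finishes in the Completed state.
npx jest --ci "$@"
status=$?

# Emit a well-known marker as the last line of output; the CI pipeline
# later parses each pod's logs for this string to recover the real result.
if [ "$status" -eq 0 ]; then
  echo "TESTS PASSED"
else
  echo "TESTS FAILED (exit code $status)"
fi

# Always report success so the Job controller leaves the pod (and its
# logs) on the cluster.
exit 0
```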
I would like to take it.
Hi, @jmgate
Having researched the sources, I've found out that the controller deletes active pods (those having neither succeeded nor failed status); see kubernetes/pkg/controller/job/job_controller.go, lines 521 to 523 at 4a89df5. This happens when the backoff limit is reached; see kubernetes/pkg/controller/job/job_controller.go, lines 506 to 519 at 4a89df5.

If the issue is still relevant (please reply), I guess we can initiate a discussion with the community about preventing already running pods (those having status `Running`) from being deleted when the backoff limit, e.g. a backoff limit of 6, is reached.

I believe we can't avoid running such 'not needed' pods within a job that has a backoff limit, because the idea is that a job's failed pods get relaunched as long as the backoff limit allows.
Thanks for looking into this, @r-erema. We hacked around the problem using the workaround above plus parsing the logs for a particular passed/failed string. This has been working for us for some time, but it doesn't seem like the right solution. Perhaps I'm wrong there, because it seems like the k8s concept of a job is something that must pass, but since we're using it to execute tests, that doesn't quite fit. Given what you found in job_controller.go, perhaps the solution is to add some sort of flag to the job specification to tell it to allow running pods to complete on failure. 🤷‍♂️
Thanks @r-erema. I'll look over the documentation and talk it over with my team to see if we want to invest the time in developing and submitting a KEP. I appreciate your engagement here.
@r-erema, I finally had time to talk this through with my team, and we decided not to invest the time in writing and submitting a KEP, because even if it were implemented, it would take far too long to get into a release that our team is allowed to use (we're many versions behind, due to contractual limitations). If you'd like to close out this issue, feel free. Thanks for your assistance.
@jmgate Thanks for the feedback. I think you can close the issue.
What happened?
Motivation
In order to run integration- and system-level tests against our application suite in CI, we first stand up a temporary instance of our application suite, and then apply a number of test jobs to the system. Currently a test job creates a single pod and runs a single container, which then runs the actual testing via tools like Jest, Cypress, etc. When testing is complete, we tear down the temporary instance of our system, but before we do, we grab all the logs from all the containers in all the pods, so that we have them available for debugging if there were any testing failures.
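That log-grab step amounts to roughly the following, shown here only as a hedged sketch; the namespace, job name, and output directory are hypothetical, not taken from the original report:

```bash
#!/usr/bin/env bash
# Collect logs from every pod belonging to a test job before the
# temporary environment is torn down.
NAMESPACE=test-env            # hypothetical
JOB_NAME=integration-tests    # hypothetical

mkdir -p logs
for pod in $(kubectl get pods -n "$NAMESPACE" -l job-name="$JOB_NAME" -o name); do
  # --all-containers dumps every container in the pod; pods that the Job
  # controller has already deleted simply no longer show up here, which is
  # the gap described below.
  kubectl logs -n "$NAMESPACE" --all-containers "$pod" > "logs/${pod#pod/}.log"
done
```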
We are trying to parallelize our testing such that each test job creates `N` pods, all of which will run the same container. The script run by the container will communicate with a test orchestration service, so though the script run in the container is identical across all the parallel pods, they will be doing different bits of work. We attempted to make this happen by setting `completions` and `parallelism` appropriately in the job spec as per this document. When all testing passes, this appears to work just fine.

The Problem
When testing fails, though, we are unable to get all the logs from all the pods because most of the pods are terminated before we can grab them. Generally successful pods are left in a `Completed` state, and one failed pod is left in an `Error` state, though other failed pods are terminated immediately.

What did you expect to happen?
I would expect all the completed pods for the job to still exist in whatever their final state was (`Completed`, `Error`, etc.), such that I could inspect what happened within each pod. If my expectation here is simply wrong, and Kubernetes is behaving as designed, could you please point me to the rationale for the current behavior and then suggest an alternative for retrieving the pod logs before they are terminated?

How can we reproduce it (as minimally and precisely as possible)?
I've included what I think are the relevant parts of the job spec below.
I created a dummy container to be flaky by having it run:
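As a stand-in sketch (not the original spec or command; every name, image, and value here is hypothetical, with backoffLimit 6 being the Kubernetes default), a parallel Job with a deliberately flaky container might look like:

```yaml
# Hypothetical reproduction sketch, not the reporter's original spec.
apiVersion: batch/v1
kind: Job
metadata:
  name: flaky-parallel-tests
  namespace: your-namespace
spec:
  completions: 5     # the Job needs 5 successful pods
  parallelism: 5     # run all 5 at once
  backoffLimit: 6    # the Kubernetes default
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: test
          image: busybox
          # Deliberately flaky: draw one random byte and fail when it is
          # even, so some pods complete while others error out, which
          # triggers the cleanup of the remaining active pods.
          command:
            - sh
            - -c
            - |
              sleep 30
              if [ $(( $(head -c1 /dev/urandom | od -An -tu1) % 2 )) -eq 0 ]; then
                echo "simulated test failure"
                exit 1
              fi
              echo "simulated test success"
```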
I'd recommend doing a `watch -d kubectl get pods -n your-namespace` while you apply the job so you can watch the pods fail and terminate.

Anything else we need to know?
No response
Kubernetes version
Cloud provider
OS version
I'm not sure where you want me to run these commands. If they're relevant, please let me know.
Install tools
N/A
Container runtime (CRI) and version (if applicable)
N/A
Related plugins (CNI, CSI, ...) and versions (if applicable)
N/A