
Unable to get logs, etc., from failed parallel pods in job #110464

Closed
jmgate opened this issue Jun 8, 2022 · 13 comments
Assignees: r-erema
Labels: kind/bug, needs-triage, sig/architecture

Comments


jmgate commented Jun 8, 2022

What happened?

Motivation

In order to run integration- and system-level tests against our application suite in CI, we first stand up a temporary instance of our application suite, and then apply a number of test jobs to the system. Currently a test job creates a single pod and runs a single container, which then runs the actual testing via tools like Jest, Cypress, etc. When testing is complete, we tear down the temporary instance of our system, but before we do we grab all the logs from all the containers in all the pods, such that we have those available for debugging if there were any testing failures.
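Concretely, the log-grabbing step boils down to a loop like the following sketch (the namespace and output file names are placeholders, not our exact script):

for pod in $(kubectl get pods -n test-namespace -o jsonpath='{.items[*].metadata.name}'); do
  for container in $(kubectl get pod "$pod" -n test-namespace -o jsonpath='{.spec.containers[*].name}'); do
    # Save each container's logs before the temporary instance is torn down.
    kubectl logs "$pod" -c "$container" -n test-namespace > "logs-${pod}-${container}.txt"
  done
done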

We are trying to parallelize our testing such that each test job creates N pods, all of which run the same container. The script run by the container communicates with a test orchestration service, so although the script is identical across all the parallel pods, each pod does a different piece of the work. We attempted to make this happen by setting completions and parallelism appropriately in the job spec, as per this document. When all testing passes, this appears to work just fine.

The Problem

When testing fails, though, we are unable to get all the logs from all the pods, because most of the pods are terminated before we can grab them. Generally the successful pods are left in a Completed state and one failed pod is left in an Error state, while the other failed pods are terminated immediately.

Note: One time I also saw the successful pods get cleaned up before the script orchestrating everything could grab their logs, but I haven't been able to reproduce that yet and don't know if it's relevant.

What did you expect to happen?

I would expect all the completed pods for the job to still exist in whatever their final state was (Completed, Error, etc.), such that I could inspect what happened within each pod. If my expectation here is simply wrong, and Kubernetes is behaving as designed, could you please point me to the rationale for the current behavior and then suggest an alternative for retrieving the pod logs before they are terminated?

How can we reproduce it (as minimally and precisely as possible)?

I've included what I think are the relevant parts of the job spec below.

apiVersion: batch/v1
kind: Job
spec:
  backoffLimit: 0
  completions: 9
  parallelism: 9
  ttlSecondsAfterFinished: 120
  template:
    spec:
      restartPolicy: "Never"

Note: I've also tried different values for backoffLimit, and have tried restartPolicy: "OnFailure", but neither helped.

I created a dummy container to be flaky by having it run:

# Fail for roughly half of the pods, based on the parity of a hash of the pod's hostname.
if [[ $((0x$(echo -n "$(cat /etc/hostname)" | shasum | awk '{print $1}') % 2)) -eq 0 ]]; then
  echo "Failing test"
  exit 1
else
  echo "Passing test"
fi

I'd recommend doing a `watch -d kubectl get pods -n your-namespace` while you apply the job so you can watch the pods fail and terminate.

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:11:29Z", GoVersion:"go1.16.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.8", GitCommit:"5575935422cc1cf5169dfc8847cb587aa47bac5a", GitTreeState:"clean", BuildDate:"2021-06-16T12:53:07Z", GoVersion:"go1.15.13", Compiler:"gc", Platform:"linux/amd64"}

Note: It is difficult for us to update our Kubernetes version due to customer requirements. However, if a version update would fix our problem, that's good information for us to take to the customer to try to expedite an update.

Cloud provider

Internal corporate cloud managed with [Rancher](https://rancher.com/).

OS version

I'm not sure where you want me to run these commands. If they're relevant, please let me know.

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

N/A

Container runtime (CRI) and version (if applicable)

N/A

Related plugins (CNI, CSI, ...) and versions (if applicable)

N/A

jmgate added the kind/bug label on Jun 8, 2022
k8s-ci-robot added the needs-sig label on Jun 8, 2022
k8s-ci-robot commented Jun 8, 2022

@jmgate: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot added the needs-triage label on Jun 8, 2022

jmgate commented Jun 8, 2022

/sig architecture

k8s-ci-robot added the sig/architecture label and removed the needs-sig label on Jun 8, 2022

jmgate commented Jun 9, 2022

Another data point: For completions and parallelism set to 9, I just observed 8 pods pass and 1 pod fail, but a kubectl get pods showed 2 pods in the Completed state, 1 pod in the Error state, and the rest were terminated.


jmgate commented Jun 9, 2022

Another data point: We're sending logs to ElasticSearch, but if, e.g., I'm using completions: 9, whether logs from 9 different pods make it into ElasticSearch is non-deterministic; often the number of pods represented in the logs is lower. It's almost like Kubernetes is killing pods before they even start generating logs.


jmgate commented Jun 21, 2022

Something I tried today was removing the backoffLimit from the test job specs. It was set to 0, but without specifying it, it defaults to 6. Testing this change with completions: 5 and parallelism: 5, here's what I see:

  1. The job spins up the parallel pods as appropriate, with one container each.
  2. The containers talk to the test orchestration service to get their tasking and report results.
  3. Each of the five containers starts work on a different test.
  4. When a container finishes a test, it'll go grab another one, if there are any.
  5. When there are no more tests to grab, it'll exit with success or failure based on how the testing went. (Since so many of our tests are regularly failing, this'll likely be a failure.)
  6. When a container exits with a failure, k8s will leave that pod in an Error state, and then spin up a replacement, trying to reach the specified number of completions (i.e., pod successes).
  7. When the replacement pod spins up, it goes to the test orchestrator looking for work, finds none, and then exits successfully, and k8s will leave it in a Completed state.
  8. When all testing is done, we're left with (in this particular case) 5 pods in an Error state and another 5 in a Completed state.
  9. The script we have controlling everything at the top level will then grab the return code for the job, which is the max of the return codes of the pods (see the status-check sketch at the end of this comment). Since the final version of all the pods returned 0, the return code for the job is 0, and the script thinks the test job passed, even though there were failures.
  10. After the job completes, the top-level script is able to grab all the logs and descriptions of all 10 pods that ran.

Pros:

  • At least everything runs to completion and communicates correctly with the test orchestration service.

Cons:

  • We're running twice as many pods as necessary, and the extra pods that don't do any work just confuse things when looking at the logs.
  • We don't have a way currently to determine success or failure to report back to the user (or GitLab CI).

Note: I don't know that this potential solution helps in the general case of an arbitrary script running without a test orchestration service. For instance, if the script wasn't communicating with some sort of work queue, if it failed it would just keep failing until the default backoffLimit: 6 was reached.
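For context, one way a top-level script can ask whether a Job as a whole failed is to look at its status conditions, along the lines of the sketch below (the job name and namespace are placeholders, not our real ones):

# Placeholder names; our real script differs in the details.
failed=$(kubectl get job integration-tests -n test-namespace \
  -o jsonpath='{.status.conditions[?(@.type=="Failed")].status}')
if [[ "$failed" == "True" ]]; then
  echo "Job reported failure"             # e.g., BackoffLimitExceeded
else
  echo "Job reported success (or no Failed condition yet)"
fi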


jmgate commented Jun 21, 2022

Workaround

Another possible solution, and the one we'll likely go with: Rather than have the container script exit with a code indicating the success or failure of testing, have it always return 0. In this way, we trick k8s into thinking all is well. In that case, all the parallel pods finish and are left on the cluster in the Completed state, and the job completes successfully. When testing with the example mentioned above, here's what I see:

  1. The behavior is as above through step 4.
  2. When there are no more tests to grab, it'll exit 0 regardless of how the testing went (see the wrapper sketch after this list).
  3. K8s will leave that pod on the cluster in a Completed state.
  4. When all testing is done, we're left with (in this particular case) 5 pods in a Completed state.
  5. The top-level script will then grab the return code for the job, which is the max of the return codes for the pods. Since all the pods returned with 0, the return code for the job is 0, and the script thinks the test augmentation passed, even though there were failures.
  6. After the job completes, the top-level script is able to grab all the logs and descriptions of all 5 pods that ran.
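The entrypoint change is tiny; it amounts to a wrapper along these lines (run-tests stands in for our real test command):

# Hypothetical wrapper: "run-tests" stands in for the real test command.
run-tests
echo "Real test exit code: $?"   # keep the real outcome visible in the logs
exit 0                           # always report success so k8s leaves the pod in Completed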

Pros:

  • At least everything runs to completion and communicates correctly with the test orchestration service.
  • We're only running as many pods as we expect.
  • There's no confusion between pods that did work and pods that didn't.

Cons:

  • We don't have a way currently to determine success or failure to report back to the user (or GitLab CI).

I suspect what we'll wind up doing is specifying that any test container script must have the last line of its output be something like TEST RESULT: SUCCESS or TEST RESULT: FAILURE. Then we can modify our top-level script so that instead of getting the job's exit code, it looks at the last line of output from each pod and aggregates the results into a final pass/fail value.
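Roughly, that aggregation would look something like the sketch below (the namespace, job name, and exact result strings are placeholders for whatever we settle on; the job-name label is the one the Job controller puts on its pods):

overall=SUCCESS
for pod in $(kubectl get pods -n test-namespace -l job-name=integration-tests \
    -o jsonpath='{.items[*].metadata.name}'); do
  # Convention: each pod's last log line is "TEST RESULT: SUCCESS" or "TEST RESULT: FAILURE".
  if [[ "$(kubectl logs "$pod" -n test-namespace | tail -n 1)" == "TEST RESULT: FAILURE" ]]; then
    overall=FAILURE
  fi
done
echo "Aggregated test result: $overall"
[[ "$overall" == "SUCCESS" ]]   # exit non-zero if any pod reported a failure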

Note: This workaround should work in the case of an arbitrary test script that doesn't use a work queue to dole things out.


r-erema commented Jul 26, 2022

I would like to take it
/assign @r-erema


r-erema commented Sep 8, 2022

Hi, @jmgate

> we are unable to get all the logs from all the pods because most of the pods are terminated before we can grab them

> It's almost like Kubernetes is killing pods before they even start generating logs.

Having researched the sources, I found that the controller deletes active pods (those with neither Succeeded nor Failed status):

if jobFailed {
    errCh := make(chan error, active)
    jm.deleteJobPods(&job, activePods, errCh)

when the backoff limit is reached:
exceedsBackoffLimit := jobHaveNewFailure && (active != *job.Spec.Parallelism) &&
    (failed > *job.Spec.BackoffLimit)

if exceedsBackoffLimit || pastBackoffLimitOnFailure(&job, pods) {
    // check if the number of pod restart exceeds backoff (for restart OnFailure only)
    // OR if the number of failed jobs increased since the last syncJob
    jobFailed = true
    failureReason = "BackoffLimitExceeded"
    failureMessage = "Job has reached the specified backoff limit"
} else if pastActiveDeadline(&job) {
    jobFailed = true
    failureReason = "DeadlineExceeded"
    failureMessage = "Job was active longer than specified deadline"
}

If the issue is still relevant (please reply), I guess we can start a discussion with the community about preventing the deletion of already-running pods (those with status Running), instead letting each such pod reach either a Succeeded or Failed status. I guess that could address this:

> We don't have a way currently to determine success or failure to report back to the user (or GitLab CI).


> Something I tried today was removing the backoffLimit from the test job specs

i.e., the backoff limit then defaulted to 6.

> We're running twice as many pods as necessary, and the extra pods that don't do any work just confuse things when looking at the logs.

I believe we can't avoid running these "not needed" pods when the job has a backoff limit, because the whole point of the backoff limit is that a job's failed pods get relaunched until it is exhausted.


jmgate commented Sep 8, 2022

Thanks for looking into this, @r-erema. We hacked around the problem using the workaround above plus parsing the logs for a particular passed/failed string. This has been working for us for some time, but it doesn't seem like the right solution. Perhaps I'm wrong there, because it seems like the k8s concept of a job is something that must pass, but since we're using it to execute tests, that doesn't quite fit.

Given what you found in job_controller.go, I suspect what was happening when I saw successful pods getting terminated is that, although I could see the script had passed, there's some (probably very small) lag between that and the pod being marked Completed, and in that window deleteJobPods() was terminating it. Whether this should happen for running pods is probably a much larger discussion. For instance, I would guess a common use case is that if a job fails, we want it to fail fast, so the current behavior makes sense. If we were to change it to allow running pods to complete first, that has the potential to introduce a good deal of lag between when k8s knows the job has failed and when it's done cleaning itself up, and I'd imagine that would often be undesirable.

Perhaps the solution is to add some sort of flag to the job specification to tell it to allow running pods to complete on failure. 🤷‍♂️


r-erema commented Sep 9, 2022

@jmgate Thanks for the reply.

> Perhaps the solution is to add some sort of flag to the job specification to tell it to allow running pods to complete on failure.

It seems you'd need to raise a KEP to proceed with the implementation of such a feature.


jmgate commented Sep 12, 2022

Thanks @r-erema. I'll look over the documentation and talk it over with my team to see if we want to invest the time in developing and submitting a KEP. I appreciate your engagement here.


jmgate commented Oct 3, 2022

@r-erema, I finally had time to talk this through with my team, and we decided not to invest the time in writing and submitting a KEP, because even if it were implemented, it would take far too long to get into a release that our team is allowed to use (we're many versions behind, due to contractual limitations). If you'd like to close out this issue, feel free. Thanks for your assistance.


r-erema commented Oct 3, 2022

@jmgate Thanks for the feedback. I think you can close the issue.

jmgate closed this as not planned on Oct 3, 2022