
Fix error where communication failures to k8s can lead to stuck tasks #17431

Merged
merged 5 commits into apache:master on Nov 5, 2024

Conversation

georgew5656
Contributor

Bugfix for an issue with K8s-based ingestion.

Description

If there is a communication error between the overlord and K8s during the KubernetesPeonLifecycle.join method (for example, when an overlord comes up and makes a lot of calls to K8s), the overlord may attempt to create a log watcher for a still-running task and stream this watcher to a file. This will hang indefinitely for as long as the task is running, since new logs are constantly being generated.
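
To make the failure mode concrete, here is a minimal sketch of what streaming a log watcher to a file looks like for a pod that is still running. This is plain fabric8 client usage, not the actual Druid code; the namespace and pod name are made up. The copy never finishes because the watch keeps producing output for as long as the pod keeps logging.

```java
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.dsl.LogWatch;

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class LogWatchHangSketch
{
  public static void main(String[] args) throws Exception
  {
    try (KubernetesClient client = new KubernetesClientBuilder().build();
         LogWatch watch = client.pods()
                                .inNamespace("druid")          // hypothetical namespace
                                .withName("peon-pod-example")  // hypothetical pod name
                                .watchLog()) {
      try (InputStream logs = watch.getOutput()) {
        // Blocks until the watch stops producing bytes, i.e. effectively forever
        // for a long-running task that is still writing logs.
        Files.copy(logs, Path.of("/tmp/task.log"), StandardCopyOption.REPLACE_EXISTING);
      }
    }
  }
}
```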

Fixed the bug ...

Renamed the class ...

Added a forbidden-apis entry ...

In the above situation (where the overlord can't get the location of a task on startup), the overlord should just kill the task instead of hanging forever. Currently, when saving task logs, we always try to create a log watcher for the pod and then use this log watcher to load the task logs.

This makes sense to do in shutdown() before we actually delete the running k8s job because we want to get the logs up until the point the pod is killed (hence the log watcher).

This is not actually necessary in join(), since saveLogs is called in join() after the task lifecycle has already completed. Instead, we should just get a stream of the logs at the time we call saveLogs and upload all of the logs at that point.

This fixes the bug described above because the overlord will just upload the logs for the pod at the time of the initial communication failure and then mark the task as failed.
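
As a rough sketch of the fixed behavior (assumed names and standalone fabric8 calls, not the actual KubernetesPeonLifecycle code): once the task has finished, a one-shot snapshot of the pod's logs is sufficient and cannot hang the way a streaming log watcher can.

```java
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

import java.nio.file.Files;
import java.nio.file.Path;

public class SaveFinishedTaskLogsSketch
{
  public static void main(String[] args) throws Exception
  {
    try (KubernetesClient client = new KubernetesClientBuilder().build()) {
      // Snapshot of whatever the pod has logged so far; returns immediately
      // instead of following the log stream.
      String logs = client.pods()
                          .inNamespace("druid")          // hypothetical namespace
                          .withName("peon-pod-example")  // hypothetical pod name
                          .getLog();
      Files.writeString(Path.of("/tmp/task.log"), logs);
    }
  }
}
```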

The other bug here is that we don't actually log the reason for the communication failure between the overlord and K8s. I saw this happen in one of my clusters and it just logged an unhelpful "job not found" message, so I updated the exception to include info on the actual underlying exception. Because of this issue I am not actually sure whether the fabric8 client retried requests to Kubernetes, even though according to the config it should have. If we are planning on still aggressively loading task locations on overlord startup (as in https://github.com/apache/druid/pull/17419/files#diff-bb902bcc2fa097a13509038cd5ae6987b355c2bcf50f7a558bf9c1a3f5d521db), it may make sense to add an extra retry block around the getPeonPodWithRetries method that catches the generic fabric8 Kubernetes client exception.
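
For reference, one possible shape of such a retry block. This is purely a sketch with a hypothetical helper name, not code from this PR: it catches the generic io.fabric8.kubernetes.client.KubernetesClientException, logs its cause instead of a generic message, and retries a bounded number of times.

```java
import io.fabric8.kubernetes.client.KubernetesClientException;

import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

public final class K8sRetrySketch
{
  private K8sRetrySketch() {}

  public static <T> T withRetries(Callable<T> call, int maxTries, long backoffMillis) throws Exception
  {
    KubernetesClientException lastFailure = null;
    for (int attempt = 1; attempt <= maxTries; attempt++) {
      try {
        return call.call();
      }
      catch (KubernetesClientException e) {
        lastFailure = e;
        // Surface the underlying cause rather than an unhelpful "job not found" message.
        System.err.printf("K8s call failed (attempt %d/%d): %s%n", attempt, maxTries, e.getMessage());
        TimeUnit.MILLISECONDS.sleep(backoffMillis * attempt);
      }
    }
    throw lastFailure;
  }
}
```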

Release note

Fix some bugs with Kubernetes tasks.

Key changed/added classes in this PR
  • KubernetesPeonLifecycle
  • KubernetesPeonClient

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Contributor

@kfaraz left a comment


Thanks for the changes, @georgew5656 !
I am not very well-versed with this code. Had a couple of questions.
After your patch, we would only ever initialize the logWatch in shutdown(). When do we actually use this logWatch?

Is the flow something like this?

  • shutdown() is called which initializes logWatch
  • join() gets interrupted due to the shutdown
  • the finally block of join() saves the task logs to a file (blocking until it finishes)

@georgew5656
Contributor Author

Thanks for the changes, @georgew5656 ! I am not very well-versed with this code. Had a couple of questions. After your patch, we would only ever initialize the logWatch in shutdown(). When do we actually use this logWatch?

Is the flow something like this?

  • shutdown() is called which initializes logWatch
  • join() gets interrupted due to the shutdown
  • the finally block of join() saves the task logs to a file (blocking until it finishes)

Yeah, that's basically correct. It's a way to (most of the time) save the logs when shutting down tasks.
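
For readers less familiar with this code, here is a rough structural sketch of that flow. Only shutdown(), join(), saveLogs(), and startWatchingLogs() are real method names from the discussion; the rest are stand-ins, and this is not the actual KubernetesPeonLifecycle implementation.

```java
import io.fabric8.kubernetes.client.dsl.LogWatch;

class PeonLifecycleFlowSketch
{
  private volatile LogWatch logWatch;   // after this patch, initialized only in shutdown()

  void shutdown()
  {
    logWatch = startWatchingLogs();     // capture logs up to the moment the pod is killed
    deleteKubernetesJob();              // unblocks the wait inside join()
  }

  void join()
  {
    try {
      waitForJobCompletion();           // returns once shutdown() deletes the job
    }
    finally {
      saveLogs();                       // uses logWatch if present, otherwise a one-shot log snapshot
    }
  }

  // Stand-ins for the real lifecycle plumbing.
  LogWatch startWatchingLogs() { return null; }
  void deleteKubernetesJob() {}
  void waitForJobCompletion() {}
  void saveLogs() {}
}
```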

Commit: …a/org/apache/druid/k8s/overlord/common/KubernetesPeonClient.java
Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
Contributor

@kfaraz left a comment


Thanks, @georgew5656 !
In a follow-up PR, maybe we could add a short Javadoc for startWatchingLogs() and saveLogs(), briefly explaining how and when these two methods are used.

@georgew5656 merged commit 8850023 into apache:master on Nov 5, 2024
56 checks passed