Fix error where communication failures to k8s can lead to stuck tasks #17431
Bugfix for an issue with k8s-based ingestion.
Description
If there is a communication error between the overlord and k8s during the KubernetesPeonLifecycle.join method (for example, when an overlord comes up and makes a lot of calls to k8s), the overlord may attempt to create a log watcher for a still-running task and stream this watcher's output to a file. This hangs indefinitely for as long as the task is running, since new logs are constantly being generated.
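A minimal sketch of the hanging pattern (not the actual Druid code; it only assumes the fabric8 client, and every other name is illustrative): copying a LogWatch output stream to a file returns only once the stream hits EOF, which for a still-running pod never happens.

```java
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.dsl.LogWatch;

import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

public class LogWatcherHang
{
  // Streams a pod's log watcher to a local file. For a pod that is still
  // running, the read loop below never sees EOF, so the caller blocks forever.
  public static void saveLogs(KubernetesClient client, String namespace, String podName, String file)
      throws Exception
  {
    try (LogWatch watch = client.pods().inNamespace(namespace).withName(podName).watchLog();
         InputStream in = watch.getOutput();
         OutputStream out = new FileOutputStream(file)) {
      byte[] buf = new byte[8192];
      int n;
      while ((n = in.read(buf)) != -1) {   // blocks waiting for new log lines
        out.write(buf, 0, n);
      }
    }
  }
}
```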
Fixed the bug ...
Renamed the class ...
Added a forbidden-apis entry ...
In the above situation (where the overlord can't get the location of a task on startup), the overlord should just kill the task instead of hanging forever. Currently, when saving task logs, we always create a log watcher for the pod and then use this watcher to load the task logs.
This makes sense to do in shutdown() before we actually delete the running k8s job because we want to get the logs up until the point the pod is killed (hence the log watcher).
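As a hedged illustration of that shutdown-time pattern (the names below are illustrative, not the actual KubernetesPeonLifecycle code): attach the watcher before deleting the Job, so that everything the pod logs up to the moment it is killed lands in the local file. A production version would also wait for the pod to finish before closing the watcher.

```java
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.dsl.LogWatch;

import java.io.FileOutputStream;
import java.nio.file.Path;

public class ShutdownLogCapture
{
  public static void captureLogsThenDelete(
      KubernetesClient client, String namespace, String jobName, String podName, Path logFile)
      throws Exception
  {
    try (FileOutputStream out = new FileOutputStream(logFile.toFile());
         // watchLog(OutputStream) copies log lines into `out` in the background
         LogWatch ignored = client.pods().inNamespace(namespace).withName(podName).watchLog(out)) {
      // Deleting the Job terminates the pod; the watcher keeps draining lines
      // while the pod shuts down, so logs up to the kill point are captured.
      client.batch().v1().jobs().inNamespace(namespace).withName(jobName).delete();
    }
  }
}
```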
This is not actually necessary in join(), since saveLogs is called in join() after the task lifecycle has already completed. Instead, we should just get a stream of the logs at the time we call saveLogs and upload all the logs at that point.
This fixes the bug described above because the overlord will just upload whatever logs the pod has produced by the time of the initial communication failure and then mark the task as failed.
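A minimal sketch of the join()-time alternative, again assuming only the fabric8 client (everything else is illustrative): fetch whatever the pod has logged so far with a one-shot getLog() call, write it out, and return, so the upload can never block on a stream that only ends when the pod dies.

```java
import io.fabric8.kubernetes.client.KubernetesClient;

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class JoinTimeLogSave
{
  public static void saveCurrentLogs(KubernetesClient client, String namespace, String podName, Path logFile)
      throws Exception
  {
    // getLog() returns whatever the pod has logged so far and then returns,
    // even if the pod is still running, so this call always terminates.
    String logs = client.pods().inNamespace(namespace).withName(podName).getLog();
    Files.write(logFile, logs.getBytes(StandardCharsets.UTF_8));
    // The caller can then push logFile to deep storage and mark the task failed.
  }
}
```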
The other bug here is that we don't actually log the reason for the communication failure between the overlord and k8s. I saw this happen in one of my clusters and it just logged an unhelpful "job not found" message, so I updated the exception to include info on the underlying exception. Because of this issue I am not actually sure whether the fabric8 client retried requests to Kubernetes, even though according to the config it should have. If we are planning on still aggressively loading task locations on overlord startup (as in https://github.com/apache/druid/pull/17419/files#diff-bb902bcc2fa097a13509038cd5ae6987b355c2bcf50f7a558bf9c1a3f5d521db), it may make sense to add an extra retry block around the getPeonPodWithRetries method that catches the generic fabric8 Kubernetes client exception.
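A hedged sketch of both follow-ups, assuming nothing about the actual KubernetesPeonClient API beyond the fabric8 exception type (the method and parameter names here are illustrative): retry the pod lookup on the generic KubernetesClientException, and chain the last failure as the cause so the log shows the real communication error rather than just that the pod or job could not be found.

```java
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClientException;

import java.util.concurrent.Callable;

public class PeonPodLookup
{
  public static Pod lookupPeonPodWithRetries(Callable<Pod> lookup, int maxTries) throws Exception
  {
    KubernetesClientException last = null;
    for (int attempt = 1; attempt <= maxTries; attempt++) {
      try {
        return lookup.call();
      }
      catch (KubernetesClientException e) {
        last = e;
        Thread.sleep(Math.min(1000L * attempt, 5000L));  // simple linear backoff
      }
    }
    // Include the underlying exception as the cause so the log shows the real
    // communication error, not just a generic "not found" style message.
    throw new RuntimeException("Failed to find pod for job after " + maxTries + " tries", last);
  }
}
```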
Release note
Fix some bugs with Kubernetes tasks
Key changed/added classes in this PR
KubernetesPeonLifecycle
KubernetesPeonClient
This PR has: