Fix error where communication failures to k8s can lead to stuck tasks #17431
Bugfix for an issue with k8s-based ingestion.
Description
If there is a communication error between the overlord and k8s during the KubernetesPeonLifecycle.join method (for example, when an overlord comes up and makes a lot of calls to k8s), the overlord may attempt to create a log watcher for a still-running task and stream this watcher's output to a file. This hangs indefinitely for as long as the task is running, since new logs are constantly being generated.
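A minimal sketch of the hanging pattern (not the actual Druid code; it only assumes the fabric8 client, and every other name is illustrative): copying a LogWatch output stream to a file returns only once the stream hits EOF, which for a still-running pod never happens.

```java
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.dsl.LogWatch;

import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

public class LogWatcherHang
{
  // Streams a pod's log watcher to a local file. For a pod that is still
  // running, the read loop below never sees EOF, so the caller blocks forever.
  public static void saveLogs(KubernetesClient client, String namespace, String podName, String file)
      throws Exception
  {
    try (LogWatch watch = client.pods().inNamespace(namespace).withName(podName).watchLog();
         InputStream in = watch.getOutput();
         OutputStream out = new FileOutputStream(file)) {
      byte[] buf = new byte[8192];
      int n;
      while ((n = in.read(buf)) != -1) {   // blocks waiting for new log lines
        out.write(buf, 0, n);
      }
    }
  }
}
```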
Fixed the bug ...
Renamed the class ...
Added a forbidden-apis entry ...
In the above situation (where the overlord can't get the location of a task on startup), the overlord should just kill the task instead of hanging forever. Currently, when saving task logs, we always create a log watcher for the pod and then use this watcher to load the task logs.
This makes sense to do in shutdown() before we actually delete the running k8s job because we want to get the logs up until the point the pod is killed (hence the log watcher).
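As a hedged illustration of that shutdown-time pattern (the names below are illustrative, not the actual KubernetesPeonLifecycle code): attach the watcher before deleting the Job, so that everything the pod logs up to the moment it is killed lands in the local file. A production version would also wait for the pod to finish before closing the watcher.

```java
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.dsl.LogWatch;

import java.io.FileOutputStream;
import java.nio.file.Path;

public class ShutdownLogCapture
{
  public static void captureLogsThenDelete(
      KubernetesClient client, String namespace, String jobName, String podName, Path logFile)
      throws Exception
  {
    try (FileOutputStream out = new FileOutputStream(logFile.toFile());
         // watchLog(OutputStream) copies log lines into `out` in the background
         LogWatch ignored = client.pods().inNamespace(namespace).withName(podName).watchLog(out)) {
      // Deleting the Job terminates the pod; the watcher keeps draining lines
      // while the pod shuts down, so logs up to the kill point are captured.
      client.batch().v1().jobs().inNamespace(namespace).withName(jobName).delete();
    }
  }
}
```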
This is not actually necessary in join(), since saveLogs is called in join() after the task lifecycle has already completed. Instead, we should just get a stream of the logs at the time we call saveLogs and upload all the logs at that point.
This fixes the bug described above because the overlord will just upload whatever logs the pod has produced by the time of the initial communication failure and then mark the task as failed.
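A minimal sketch of the join()-time alternative, again assuming only the fabric8 client (everything else is illustrative): fetch whatever the pod has logged so far with a one-shot getLog() call, write it out, and return, so the upload can never block on a stream that only ends when the pod dies.

```java
import io.fabric8.kubernetes.client.KubernetesClient;

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class JoinTimeLogSave
{
  public static void saveCurrentLogs(KubernetesClient client, String namespace, String podName, Path logFile)
      throws Exception
  {
    // getLog() returns whatever the pod has logged so far and then returns,
    // even if the pod is still running, so this call always terminates.
    String logs = client.pods().inNamespace(namespace).withName(podName).getLog();
    Files.write(logFile, logs.getBytes(StandardCharsets.UTF_8));
    // The caller can then push logFile to deep storage and mark the task failed.
  }
}
```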
The other bug here is that we don't actually log the reason for the communication failure between the overlord and k8s. I saw this happen in one of my clusters and it just logged an unhelpful "job not found" message, so I updated the exception to include info on the underlying exception. Because of this issue I am not actually sure whether the fabric8 client retried requests to Kubernetes, even though according to the config it should have. If we are planning on still aggressively loading task locations on overlord startup (as in https://github.com/apache/druid/pull/17419/files#diff-bb902bcc2fa097a13509038cd5ae6987b355c2bcf50f7a558bf9c1a3f5d521db), it may make sense to add an extra retry block around the getPeonPodWithRetries method that catches the generic fabric8 Kubernetes client exception.
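A hedged sketch of both follow-ups, assuming nothing about the actual KubernetesPeonClient API beyond the fabric8 exception type (the method and parameter names here are illustrative): retry the pod lookup on the generic KubernetesClientException, and chain the last failure as the cause so the log shows the real communication error rather than just that the pod or job could not be found.

```java
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClientException;

import java.util.concurrent.Callable;

public class PeonPodLookup
{
  public static Pod lookupPeonPodWithRetries(Callable<Pod> lookup, int maxTries) throws Exception
  {
    KubernetesClientException last = null;
    for (int attempt = 1; attempt <= maxTries; attempt++) {
      try {
        return lookup.call();
      }
      catch (KubernetesClientException e) {
        last = e;
        Thread.sleep(Math.min(1000L * attempt, 5000L));  // simple linear backoff
      }
    }
    // Include the underlying exception as the cause so the log shows the real
    // communication error, not just a generic "not found" style message.
    throw new RuntimeException("Failed to find pod for job after " + maxTries + " tries", last);
  }
}
```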
Release note
Fix some bugs with Kubernetes tasks
Key changed/added classes in this PR
KubernetesPeonLifecycle
KubernetesPeonClient
This PR has: