KubernetesTaskRunner: Wait in start() for tasks to be located. #17419
Changes from all commits
@@ -23,12 +23,15 @@
 import com.google.common.annotations.VisibleForTesting;
 import com.google.common.base.Optional;
 import com.google.common.base.Preconditions;
+import com.google.common.util.concurrent.ListenableFuture;
+import com.google.common.util.concurrent.SettableFuture;
 import io.fabric8.kubernetes.api.model.Pod;
 import io.fabric8.kubernetes.api.model.PodStatus;
 import io.fabric8.kubernetes.api.model.batch.v1.Job;
 import io.fabric8.kubernetes.client.dsl.LogWatch;
 import org.apache.commons.io.FileUtils;
 import org.apache.commons.io.IOUtils;
+import org.apache.druid.common.guava.FutureUtils;
 import org.apache.druid.indexer.TaskLocation;
 import org.apache.druid.indexer.TaskStatus;
 import org.apache.druid.indexing.common.task.Task;
@@ -94,7 +97,7 @@ protected enum State
   @MonotonicNonNull
   private LogWatch logWatch;

-  private TaskLocation taskLocation;
+  private final SettableFuture<TaskLocation> taskLocation = SettableFuture.create();

   protected KubernetesPeonLifecycle(
       Task task,
@@ -130,8 +133,6 @@ protected synchronized TaskStatus run(Job job, long launchTimeout, long timeout,
       writeTaskPayload(task);
     }

-    // In case something bad happens and run is called twice on this KubernetesPeonLifecycle, reset taskLocation.
-    taskLocation = null;
     kubernetesClient.launchPeonJobAndWaitForStart(
         job,
         task,
@@ -183,8 +184,8 @@ protected synchronized TaskStatus join(long timeout) throws IllegalStateException
       since Druid doesn't support retrying tasks from a external system (K8s). We can explore adding a fabric8 watcher
       if we decide we need to change this later.
       **/
-      taskLocation = getTaskLocationFromK8s();
       updateState(new State[]{State.NOT_STARTED, State.PENDING}, State.RUNNING);
+      taskLocation.set(getTaskLocationFromK8s());

       JobResponse jobResponse = kubernetesClient.waitForPeonJobCompletion(
           taskId,
@@ -260,11 +261,19 @@ protected TaskLocation getTaskLocation()
       since Druid doesn't support retrying tasks from a external system (K8s). We can explore adding a fabric8 watcher
       if we decide we need to change this later.
       **/
-    if (taskLocation == null) {
+    if (!taskLocation.isDone()) {
       log.warn("Unknown task location for [%s]", taskId);
       return TaskLocation.unknown();
     }

-    return taskLocation;
+    return FutureUtils.getUncheckedImmediately(taskLocation);
   }

+  /**
+   * Get a future that resolves to the task location, when available.
+   */
+  public ListenableFuture<TaskLocation> getTaskLocationAsync()
+  {
+    return taskLocation;
+  }

Review thread on the getTaskLocation() fallback:

Reviewer: I'm a little unsure about this logic because I have seen instances where the K8s calls can temporarily fail, and I'm also not sure whether fabric8 actually retries these calls; see the initial trigger for #17431 and #17417. I would feel better if we still fell back to querying K8s for the pod location in this block for now (returning an immediate future). Another option would be to wrap getTaskLocationFromK8s in more retries; I can put up a separate PR if you don't want to do it in this one.

Author: What change are you suggesting here? I can make it, but I'm not sure what you mean exactly.

Reviewer: Basically, I think for now we should add a check for the future having completed with an exception, and retry it.
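A minimal sketch of the reviewer's suggestion, under two assumptions: that FutureUtils.getUncheckedImmediately rethrows an exceptional completion as an unchecked exception, and that falling back to getTaskLocationFromK8s() (the pre-PR behavior) is an acceptable retry. This is illustrative, not part of the PR.

protected TaskLocation getTaskLocation()
{
  if (!taskLocation.isDone()) {
    log.warn("Unknown task location for [%s]", taskId);
    return TaskLocation.unknown();
  }

  try {
    return FutureUtils.getUncheckedImmediately(taskLocation);
  }
  catch (RuntimeException e) {
    // Assumed failure mode: the future completed exceptionally, e.g. a transient
    // K8s API error while resolving the pod. Re-query K8s directly as a fallback;
    // a fuller version might also replace the failed future so later calls succeed.
    log.warn(e, "Task location future failed for [%s]; re-querying K8s", taskId);
    return getTaskLocationFromK8s();
  }
}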
Review thread on the ordering of updateState(...) and taskLocation.set(...) in join(); the join() hunk above already reflects the swap, since the diff shows changes from all commits:

Reviewer: I think we should reverse the order of these two lines. Technically, if updateState -> RUNNING is not called, the runner will still not know where the task is even if the getTaskLocationAsync future returns (because getTaskLocation does a check on state first). In any case, we should always mark a task as running immediately after we join it (I looked through all the references to state).

Author: I switched them. I'm not sure why the order was this way originally, but it does make more sense to me the way you're suggesting.
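Finally, a hypothetical illustration of how a caller such as KubernetesTaskRunner.start() might use the new getTaskLocationAsync() future to wait until restored tasks are located, per the PR title. Only getTaskLocationAsync() comes from this diff; the class, method, and parameter names below are assumptions.

import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;
import org.apache.druid.indexer.TaskLocation;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class StartupLocationWait
{
  // Hypothetical helper: block until every restored task has a known location,
  // or fail once the timeout elapses.
  public static void waitForTaskLocations(
      List<KubernetesPeonLifecycle> restoredLifecycles, // import path for this class assumed
      long startTimeoutMillis
  ) throws Exception
  {
    List<ListenableFuture<TaskLocation>> locationFutures = new ArrayList<>();
    for (KubernetesPeonLifecycle lifecycle : restoredLifecycles) {
      locationFutures.add(lifecycle.getTaskLocationAsync());
    }

    // allAsList resolves once all inputs resolve, and fails fast if any input fails.
    Futures.allAsList(locationFutures).get(startTimeoutMillis, TimeUnit.MILLISECONDS);
  }
}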