[JENKINS-49707] Allow node blocks from deleted pods to be retried (full version) #1083

Closed
wants to merge 45 commits into from
Changes from 19 commits
Commits
d1a7c49
Pick up https://github.com/jenkinsci/workflow-durable-task-step-plugi…
jglick Dec 6, 2021
8fa0aab
Some `@CheckForNull`s
jglick Dec 6, 2021
36554ae
`ContainerExecDecorator.ws` no longer unused, but now actively harmfu…
jglick Dec 6, 2021
b34d34a
`Reaper` was activated only by `onOnline`, making it useless for clea…
jglick Dec 6, 2021
178a4d9
`RestartPipelineTest.terminatedPodAfterRestart` improvements: logging…
jglick Dec 6, 2021
748f4ea
Removing comment about `Reaper` rendered incorrect by #714
jglick Dec 6, 2021
9651786
`RestartPipelineTest.terminatedPodAfterRestart` overriding `terminati…
jglick Dec 6, 2021
9395b2f
Implementing `ExecutorStepRetryEligibility`
jglick Dec 6, 2021
6066949
Merge branch 'deps' into retry-JENKINS-49707
jglick Dec 7, 2021
0bbb646
Merge branch 'deps' into retry-JENKINS-49707
jglick Dec 7, 2021
c2e7014
Merge branch 'KubernetesPipelineTest.cascadingDelete' into retry-JENK…
jglick Dec 7, 2021
1433875
`KubernetesPipelineTest.terminatedPod` is analogous to `RestartPipeli…
jglick Dec 7, 2021
5d14a7e
Merge branch 'master' of https://github.com/jenkinsci/kubernetes-plug…
jglick Dec 7, 2021
e83d18c
Merge branch 'master' of https://github.com/jenkinsci/kubernetes-plug…
jglick Dec 8, 2021
dff6ce4
Pick up https://github.com/jenkinsci/workflow-step-api-plugin/pull/73
jglick Dec 8, 2021
91c0b4b
`KubernetesRetryEligibility` makes more sense in the `pipeline` subpa…
jglick Dec 8, 2021
7884be6
Trying to fix `KubernetesPipelineTest.containerTerminated` by skippin…
jglick Dec 8, 2021
0f00e23
Delaying `Reaper.activate` seems to help? https://github.com/jenkinsc…
jglick Dec 8, 2021
d802e29
Making `KubernetesPipelineTest.podDeadlineExceeded` pass
jglick Dec 8, 2021
d54a047
Typo in `IGNORED_CONTAINER_TERMINATION_REASONS`
jglick Dec 8, 2021
f45d69f
https://github.com/jenkinsci/workflow-step-api-plugin/pull/73 released
jglick Dec 8, 2021
0d391e7
`RestartPipelineTest.terminatedPodAfterRestart` requires https://gith…
jglick Dec 8, 2021
fefb808
Merge branch 'master' of https://github.com/jenkinsci/kubernetes-plug…
jglick Dec 9, 2021
494b726
Picking up https://github.com/jenkinsci/workflow-durable-task-step-pl…
jglick Dec 10, 2021
38e6f5d
Merge branch 'master' of https://github.com/jenkinsci/kubernetes-plug…
jglick Dec 10, 2021
2c88527
Adapting to https://github.com/jenkinsci/workflow-durable-task-step-p…
jglick Dec 14, 2021
f7a71a2
Merge branch 'master' of https://github.com/jenkinsci/kubernetes-plug…
jglick Dec 14, 2021
0a0bb28
Merge branch 'master' of https://github.com/jenkinsci/kubernetes-plug…
jglick Jan 10, 2022
2c08b80
Merge branch 'master' of https://github.com/jenkinsci/kubernetes-plug…
jglick Apr 28, 2022
2141a03
Initial work with `KubernetesAgentErrorCondition`
jglick May 2, 2022
71510c6
Pick up incremental builds
jglick May 3, 2022
b9174f9
Merge branch 'master' of https://github.com/jenkinsci/kubernetes-plug…
jglick May 12, 2022
52c55ff
Comment
jglick May 12, 2022
417f590
Updating deps
jglick May 12, 2022
2f5c297
Expiring `terminationReasons` entries after a day https://github.com/…
jglick May 12, 2022
d918d8b
SpotBugs
jglick May 12, 2022
13853bc
Merge branch 'gitHubRepo' into retry-JENKINS-49707
jglick May 13, 2022
5f53f7f
Merge branch 'master' of https://github.com/jenkinsci/kubernetes-plug…
jglick May 18, 2022
5d2e4d8
`errorConditions` → `conditions`
jglick May 19, 2022
e912f1c
Pick up https://github.com/jenkinsci/pipeline-model-definition-plugin…
jglick May 25, 2022
da10da2
Got an incremental deployment of https://github.com/jenkinsci/pipelin…
jglick May 26, 2022
3347882
Merge branch 'master' of https://github.com/jenkinsci/kubernetes-plug…
jglick Jun 3, 2022
6172a1b
Merge branch 'retry-JENKINS-49707-base' into retry-JENKINS-49707
jglick Jun 7, 2022
c09a686
Merge branch 'retry-JENKINS-49707-base' into retry-JENKINS-49707
jglick Jun 10, 2022
46b9e7c
Merge branch 'master' of https://github.com/jenkinsci/kubernetes-plug…
jglick Jul 15, 2022
11 changes: 6 additions & 5 deletions pom.xml
Original file line number Diff line number Diff line change
Expand Up @@ -101,6 +101,7 @@
<dependency>
<groupId>org.jenkins-ci.plugins.workflow</groupId>
<artifactId>workflow-step-api</artifactId>
<version>2.25-rc586.6e93334f902d</version> <!-- TODO https://github.com/jenkinsci/workflow-step-api-plugin/pull/73 -->
</dependency>
<dependency>
<groupId>org.jenkins-ci.plugins.workflow</groupId>
Expand Down Expand Up @@ -135,6 +136,11 @@
<groupId>org.jenkins-ci.plugins</groupId>
<artifactId>credentials-binding</artifactId>
</dependency>
<dependency>
<groupId>org.jenkins-ci.plugins.workflow</groupId>
<artifactId>workflow-durable-task-step</artifactId>
<version>1122.v2d03a46f6f49</version> <!-- TODO https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/180 -->
</dependency>

<!-- for testing -->
<dependency>
Expand All @@ -147,11 +153,6 @@
<artifactId>workflow-basic-steps</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.jenkins-ci.plugins.workflow</groupId>
<artifactId>workflow-durable-task-step</artifactId>
<scope>test</scope>
</dependency>
<dependency> <!-- SemaphoreStep -->
<groupId>org.jenkins-ci.plugins.workflow</groupId>
<artifactId>workflow-support</artifactId>
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -571,6 +571,7 @@ public boolean canProvision(@NonNull Cloud.CloudState state) {
* @param label label to look for in templates
* @return the template
*/
@CheckForNull
public PodTemplate getTemplate(@CheckForNull Label label) {
return PodTemplateUtils.getTemplateByLabel(label, getAllTemplates());
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -531,6 +531,7 @@ static PodTemplate unwrap(PodTemplate template, Collection<PodTemplate> allTempl
* @param templates The list of all templates.
* @return The first pod template from the collection that has a matching label.
*/
@CheckForNull
public static PodTemplate getTemplateByLabel(@CheckForNull Label label, Collection<PodTemplate> templates) {
for (PodTemplate t : templates) {
if ((label == null && t.getNodeUsageMode() == Node.Mode.NORMAL) || (label != null && label.matches(t.getLabelSet()))) {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -104,9 +104,6 @@ public class ContainerExecDecorator extends LauncherDecorator implements Seriali
private String containerName;
private EnvironmentExpander environmentExpander;
private EnvVars globalVars;
/** @deprecated no longer used */
@Deprecated
private FilePath ws;
private EnvVars rcEnvVars;
private String shell;
private KubernetesNodeContext nodeContext;
Expand All @@ -118,7 +115,6 @@ public ContainerExecDecorator() {
public ContainerExecDecorator(KubernetesClient client, String podName, String containerName, String namespace, EnvironmentExpander environmentExpander, FilePath ws) {
this.containerName = containerName;
this.environmentExpander = environmentExpander;
this.ws = ws;
}

@Deprecated
Expand Down Expand Up @@ -222,16 +218,6 @@ public EnvVars getRunContextEnvVars() {
return this.rcEnvVars;
}

/** @deprecated unused */
@Deprecated
public FilePath getWs() {
return ws;
}

public void setWs(FilePath ws) {
this.ws = ws;
}

public void setShell(String shell) {
this.shell = shell;
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -77,7 +77,6 @@ public boolean start() throws Exception {
decorator.setNodeContext(nodeContext);
decorator.setContainerName(containerName);
decorator.setEnvironmentExpander(env);
decorator.setWs(getContext().get(FilePath.class));
decorator.setGlobalVars(globalVars);
decorator.setRunContextEnvVars(rcEnvVars);
decorator.setShell(shell);
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,84 @@
/*
* Copyright 2021 CloudBees, Inc.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.csanchez.jenkins.plugins.kubernetes.pipeline;

import hudson.Extension;
import hudson.ExtensionList;
import hudson.model.Label;
import hudson.model.Node;
import hudson.model.TaskListener;
import hudson.slaves.Cloud;
import java.util.HashSet;
import java.util.Set;
import java.util.logging.Logger;
import jenkins.model.Jenkins;
import org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud;
import org.csanchez.jenkins.plugins.kubernetes.KubernetesSlave;
import org.csanchez.jenkins.plugins.kubernetes.pod.retention.Reaper;
import org.jenkinsci.plugins.workflow.support.steps.ExecutorStepRetryEligibility;

/**
* Qualifies {@code node} blocks associated with {@link KubernetesSlave} to be retried if the node was deleted.
*/
@Extension
public class KubernetesRetryEligibility implements ExecutorStepRetryEligibility {

private static final Logger LOGGER = Logger.getLogger(KubernetesRetryEligibility.class.getName());

private static final Set<String> IGNORED_CONTAINER_TERMINATION_REASONS = new HashSet<String>();
static {
IGNORED_CONTAINER_TERMINATION_REASONS.add("OOMKilled");
IGNORED_CONTAINER_TERMINATION_REASONS.add("Completed");
IGNORED_CONTAINER_TERMINATION_REASONS.add("DeadlineExceeded");
}

@Override
public boolean shouldRetry(Throwable t, String node, String label, TaskListener listener) {
if (!ExecutorStepRetryEligibility.isRemovedNode(t)) {
LOGGER.fine(() -> "Not a RemovedNode failure: " + t);
return false;
}
if (!isKubernetesAgent(node, label)) {
LOGGER.fine(() -> node + " was not a K8s agent");
return false;
}
Set<String> terminationReasons = ExtensionList.lookupSingleton(Reaper.class).terminationReasons(node);
if (terminationReasons.stream().anyMatch(r -> IGNORED_CONTAINER_TERMINATION_REASONS.contains(r))) {
LOGGER.fine(() -> "ignored termination reason(s) for " + node + ": " + terminationReasons);
return false;
}
LOGGER.fine(() -> "active on " + node);
listener.getLogger().println("Will retry failed node block from deleted pod " + node);
return true;
}

private static boolean isKubernetesAgent(String node, String label) {
Node current = Jenkins.get().getNode(node);
if (current instanceof KubernetesSlave) {
return true;
} else if (current == null) {
Label l = Label.get(label);
for (Cloud c : Jenkins.get().clouds) {
if (c instanceof KubernetesCloud && ((KubernetesCloud) c).getTemplate(l) != null) {
return true;
}
}
}
return false;
}

}
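The decision order in `shouldRetry` can be read in isolation from Jenkins. The following is a minimal standalone sketch, not the plugin's class: `RetryDecisionSketch` and its boolean parameters are hypothetical stand-ins for the `isRemovedNode` and `isKubernetesAgent` checks, and the ignored reasons are assumed to be the three listed in this file, with `OOMKilled` spelled as in the test further down.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical standalone model of the retry decision; not part of the plugin. */
final class RetryDecisionSketch {

    // Reasons for which a retry is assumed to be pointless.
    // A HashSet is used so that contains(null) is safe if a reason is missing.
    static final Set<String> IGNORED_REASONS = Collections.unmodifiableSet(
            new HashSet<>(Arrays.asList("OOMKilled", "Completed", "DeadlineExceeded")));

    /**
     * @param nodeWasRemoved stand-in for the "failure was caused by a removed node" check
     * @param isKubernetesAgent stand-in for the "node belonged to a Kubernetes cloud" check
     * @param terminationReasons reasons recorded for the node, possibly empty
     */
    static boolean shouldRetry(boolean nodeWasRemoved, boolean isKubernetesAgent, Set<String> terminationReasons) {
        if (!nodeWasRemoved) {
            return false; // unrelated failure
        }
        if (!isKubernetesAgent) {
            return false; // some other kind of agent
        }
        for (String reason : terminationReasons) {
            if (IGNORED_REASONS.contains(reason)) {
                return false; // a recorded reason vetoes the retry
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(shouldRetry(true, true, Collections.<String>emptySet())); // true
        System.out.println(shouldRetry(true, true, Collections.singleton("OOMKilled"))); // false
    }
}
```

A single ignored reason is enough to veto the retry even when other reasons are present, which matches the `anyMatch` call in the diff.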
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,7 @@
import io.fabric8.kubernetes.api.model.ContainerStateWaiting;
import io.fabric8.kubernetes.api.model.ContainerStatus;
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.api.model.PodStatus;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.Watch;
import io.fabric8.kubernetes.client.Watcher;
Expand All @@ -46,7 +47,14 @@
import java.util.logging.Logger;

import io.fabric8.kubernetes.client.WatcherException;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.TimeUnit;
import jenkins.model.Jenkins;
import jenkins.util.Timer;
import org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud;
import org.csanchez.jenkins.plugins.kubernetes.KubernetesComputer;
import org.csanchez.jenkins.plugins.kubernetes.KubernetesSlave;
Expand Down Expand Up @@ -90,10 +98,12 @@ public static Reaper getInstance() {

private Watch watch;

private final Map<String, Set<String>> terminationReasons = new HashMap<>();

@Override
public void onOnline(Computer c, TaskListener listener) throws IOException, InterruptedException {
public void preLaunch(Computer c, TaskListener taskListener) throws IOException, InterruptedException {
if (c instanceof KubernetesComputer && activated.compareAndSet(false, true)) {
activate();
Timer.get().schedule(this::activate, 10, TimeUnit.SECONDS);
}
}
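The `preLaunch` hunk above combines two ideas: `compareAndSet(false, true)` lets only the first caller through, and the activation itself is deferred through a timer. A minimal sketch of that pattern using only the JDK follows. `DelayedActivationSketch` is a hypothetical name, a plain `ScheduledExecutorService` stands in for `jenkins.util.Timer`, and the 50 ms delay is chosen for the demo rather than the 10 seconds in the diff.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical model of one-shot, delayed activation; not the plugin's Reaper. */
final class DelayedActivationSketch {

    private final AtomicBoolean activated = new AtomicBoolean();
    private final AtomicInteger activations = new AtomicInteger();
    private final ScheduledExecutorService timer;
    private final long delayMillis;

    DelayedActivationSketch(ScheduledExecutorService timer, long delayMillis) {
        this.timer = timer;
        this.delayMillis = delayMillis;
    }

    /** Schedules the activation on the first call only; returns whether this call scheduled it. */
    boolean preLaunch() {
        if (activated.compareAndSet(false, true)) {
            Runnable task = this::activate;
            timer.schedule(task, delayMillis, TimeUnit.MILLISECONDS);
            return true;
        }
        return false;
    }

    private void activate() {
        activations.incrementAndGet(); // the real code would start the pod watch here
    }

    int activationCount() {
        return activations.get();
    }

    /** Calls preLaunch three times and reports how often activate actually ran. */
    static int demo() {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        DelayedActivationSketch sketch = new DelayedActivationSketch(timer, 50);
        sketch.preLaunch();
        sketch.preLaunch();
        sketch.preLaunch();
        timer.shutdown(); // by default, already scheduled delayed tasks still run
        try {
            timer.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return sketch.activationCount();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // 1
    }
}
```

One consequence of setting the flag before the task runs: if the scheduled activation fails, later `preLaunch` calls will not retry it unless the flag is reset elsewhere.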

Expand Down Expand Up @@ -154,7 +164,7 @@ public void eventReceived(Watcher.Action action, Pod pod) {
}
ExtensionList.lookup(Listener.class).forEach(listener -> {
try {
listener.onEvent(action, optionalNode.get(), pod);
listener.onEvent(action, optionalNode.get(), pod, terminationReasons.computeIfAbsent(optionalNode.get().getNodeName(), k -> new HashSet<>()));
} catch (Exception x) {
LOGGER.log(Level.WARNING, "Listener " + listener + " failed for " + ns + "/" + name, x);
}
Expand Down Expand Up @@ -184,6 +194,19 @@ private void closeWatch() {
}
}

/**
* Get any reason(s) why a node was terminated by a listener.
* @param node a {@link Node#getNodeName}
* @return a possibly empty set of {@link ContainerStateTerminated#getReason} or {@link PodStatus#getReason}
*/
@NonNull
public Set<String> terminationReasons(@NonNull String node) {
synchronized (terminationReasons) {
Set<String> reasons = terminationReasons.get(node);
return reasons == null ? Collections.emptySet() : new HashSet<>(reasons);
}
}

/**
* Listener called when a Kubernetes event related to a Kubernetes agent happens.
*/
Expand All @@ -194,13 +217,13 @@ public interface Listener extends ExtensionPoint {
* @param node The affected node
* @param pod The affected pod
*/
void onEvent(@NonNull Watcher.Action action, @NonNull KubernetesSlave node, @NonNull Pod pod) throws IOException, InterruptedException;
void onEvent(@NonNull Watcher.Action action, @NonNull KubernetesSlave node, @NonNull Pod pod, @NonNull Set<String> terminationReasons) throws IOException, InterruptedException;
}

@Extension
public static class RemoveAgentOnPodDeleted implements Listener {
@Override
public void onEvent(@NonNull Watcher.Action action, @NonNull KubernetesSlave node, @NonNull Pod pod) throws IOException {
public void onEvent(@NonNull Watcher.Action action, @NonNull KubernetesSlave node, @NonNull Pod pod, @NonNull Set<String> terminationReasons) throws IOException {
if (action != Action.DELETED) {
return;
}
Expand All @@ -215,8 +238,9 @@ public void onEvent(@NonNull Watcher.Action action, @NonNull KubernetesSlave nod

@Extension
public static class TerminateAgentOnContainerTerminated implements Listener {

@Override
public void onEvent(@NonNull Action action, @NonNull KubernetesSlave node, @NonNull Pod pod) throws IOException, InterruptedException {
public void onEvent(@NonNull Action action, @NonNull KubernetesSlave node, @NonNull Pod pod, @NonNull Set<String> terminationReasons) throws IOException, InterruptedException {
if (action != Action.MODIFIED) {
return;
}
Expand All @@ -229,6 +253,7 @@ public void onEvent(@NonNull Action action, @NonNull KubernetesSlave node, @NonN
ContainerStateTerminated t = c.getState().getTerminated();
LOGGER.info(() -> ns + "/" + name + " Container " + c.getName() + " was just terminated, so removing the corresponding Jenkins agent");
runListener.getLogger().printf("%s/%s Container %s was terminated (Exit Code: %d, Reason: %s)%n", ns, name, c.getName(), t.getExitCode(), t.getReason());
terminationReasons.add(t.getReason());
});
node.terminate();
}
Expand All @@ -238,7 +263,7 @@ public void onEvent(@NonNull Action action, @NonNull KubernetesSlave node, @NonN
@Extension
public static class TerminateAgentOnPodFailed implements Listener {
@Override
public void onEvent(@NonNull Action action, @NonNull KubernetesSlave node, @NonNull Pod pod) throws IOException, InterruptedException {
public void onEvent(@NonNull Action action, @NonNull KubernetesSlave node, @NonNull Pod pod, @NonNull Set<String> terminationReasons) throws IOException, InterruptedException {
if (action != Action.MODIFIED) {
return;
}
Expand All @@ -248,6 +273,7 @@ public void onEvent(@NonNull Action action, @NonNull KubernetesSlave node, @NonN
TaskListener runListener = node.getTemplate().getListener();
LOGGER.info(() -> ns + "/" + name + " Pod just failed. Removing the corresponding Jenkins agent. Reason: " + pod.getStatus().getReason() + ", Message: " + pod.getStatus().getMessage());
runListener.getLogger().printf("%s/%s Pod just failed (Reason: %s, Message: %s)%n", ns, name, pod.getStatus().getReason(), pod.getStatus().getMessage());
terminationReasons.add(pod.getStatus().getReason());
try {
String lines = PodUtils.logLastLines(pod, node.getKubernetesCloud().connect());
if (lines != null) {
Expand All @@ -266,7 +292,7 @@ public void onEvent(@NonNull Action action, @NonNull KubernetesSlave node, @NonN
public static class TerminateAgentOnImagePullBackOff implements Listener {

@Override
public void onEvent(@NonNull Action action, @NonNull KubernetesSlave node, @NonNull Pod pod) throws IOException, InterruptedException {
public void onEvent(@NonNull Action action, @NonNull KubernetesSlave node, @NonNull Pod pod, @NonNull Set<String> terminationReasons) throws IOException, InterruptedException {
List<ContainerStatus> backOffContainers = PodUtils.getContainers(pod, cs -> {
ContainerStateWaiting waiting = cs.getState().getWaiting();
return waiting != null && waiting.getMessage() != null && waiting.getMessage().contains("Back-off pulling image");
Expand All @@ -278,6 +304,7 @@ public void onEvent(@NonNull Action action, @NonNull KubernetesSlave node, @NonN
TaskListener runListener = node.getTemplate().getListener();
runListener.error("Unable to pull Docker image \""+cs.getImage()+"\". Check if image tag name is spelled correctly.");
});
terminationReasons.add("ImagePullBackOff");
try (ACLContext _ = ACL.as(ACL.SYSTEM)) {
PodUtils.cancelQueueItemFor(pod, "ImagePullBackOff");
}
Expand Down
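In the hunks shown for this file, `terminationReasons(String)` reads the map under `synchronized` and returns a defensive copy, while the `computeIfAbsent` call and the listeners' `add` calls appear without a visible lock. Whether that is safe depends on code outside the hunks shown. As a point of comparison only, here is a minimal sketch of a per-node registry in which every access goes through the same monitor; `TerminationReasonsSketch`, `record`, and `reasonsFor` are hypothetical names.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical per-node registry of termination reasons; not the plugin's Reaper. */
final class TerminationReasonsSketch {

    private final Map<String, Set<String>> reasonsByNode = new HashMap<>();

    /** Records one reason for a node, creating the node's set on first use. */
    void record(String node, String reason) {
        synchronized (reasonsByNode) {
            reasonsByNode.computeIfAbsent(node, k -> new HashSet<>()).add(reason);
        }
    }

    /** Returns a copy, so callers never observe or cause later mutation of the stored set. */
    Set<String> reasonsFor(String node) {
        synchronized (reasonsByNode) {
            Set<String> reasons = reasonsByNode.get(node);
            return reasons == null ? Collections.emptySet() : new HashSet<>(reasons);
        }
    }

    public static void main(String[] args) {
        TerminationReasonsSketch registry = new TerminationReasonsSketch();
        registry.record("pod-a", "OOMKilled");
        registry.record("pod-a", "ImagePullBackOff");
        System.out.println(registry.reasonsFor("pod-a").size()); // 2
        System.out.println(registry.reasonsFor("pod-b").isEmpty()); // true
    }
}
```

The sketch never removes entries. The commit list mentions expiring `terminationReasons` entries after a day, which this example does not attempt to model.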
Original file line number Diff line number Diff line change
Expand Up @@ -127,7 +127,7 @@ public static void setupHost() throws Exception {
System.err.println("Calling home to address: " + hostAddress);
URL nonLocalhostUrl = new URL(url.getProtocol(), hostAddress, url.getPort(),
url.getFile());
// TODO better to set KUBERNETES_JENKINS_URL
// TODO better to set KUBERNETES_JENKINS_URL, or better yet KubernetesCloud.setJenkinsUrl
JenkinsLocationConfiguration.get().setUrl(nonLocalhostUrl.toString());

Integer slaveAgentPort = Integer.getInteger("slaveAgentPort");
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -76,7 +76,8 @@ public abstract class AbstractKubernetesPipelineTest {
public LoggerRule logs = new LoggerRule()
.recordPackage(KubernetesCloud.class, Level.FINE)
.recordPackage(NoDelayProvisionerStrategy.class, Level.FINE)
.record(NodeProvisioner.class, Level.FINE);
.record(NodeProvisioner.class, Level.FINE)
.record(KubernetesRetryEligibility.class, Level.FINE);

@BeforeClass
public static void isKubernetesConfigured() throws Exception {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -72,7 +72,6 @@
import org.hamcrest.MatcherAssert;
import org.hamcrest.Matchers;
import org.jenkinsci.plugins.workflow.job.WorkflowRun;
import org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution;
import org.jenkinsci.plugins.workflow.test.steps.SemaphoreStep;
import org.junit.After;
import org.junit.Before;
Expand Down Expand Up @@ -477,18 +476,21 @@ public void runInPodWithRetention() throws Exception {
public void terminatedPod() throws Exception {
r.waitForMessage("+ sleep", b);
deletePods(cloud.connect(), getLabels(this, name), false);
r.assertBuildStatus(Result.ABORTED, r.waitForCompletion(b));
r.waitForMessage(new ExecutorStepExecution.RemovedNodeCause().getShortDescription(), b);
r.waitForMessage("busybox --", b);
r.waitForMessage("jnlp --", b);
r.waitForMessage("was deleted; cancelling node body", b);
r.waitForMessage("Will retry failed node block from deleted pod", b);
r.assertBuildStatusSuccess(r.waitForCompletion(b));
}

@Issue("JENKINS-59340")
@Test
public void containerTerminated() throws Exception {
assertBuildStatus(r.waitForCompletion(b), Result.FAILURE, Result.ABORTED);
r.waitForMessage("Container stress-ng was terminated", b);
/* TODO sometimes instead get: Container stress-ng was terminated (Exit Code: 0, Reason: Completed)
r.waitForMessage("Reason: OOMKilled", b);
*/
}

@Test
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -45,9 +45,10 @@
import org.csanchez.jenkins.plugins.kubernetes.model.KeyValueEnvVar;
import org.csanchez.jenkins.plugins.kubernetes.model.SecretEnvVar;
import org.csanchez.jenkins.plugins.kubernetes.model.TemplateEnvVar;
import org.csanchez.jenkins.plugins.kubernetes.pod.retention.Reaper;
import org.jenkinsci.plugins.workflow.job.WorkflowJob;
import org.jenkinsci.plugins.workflow.job.WorkflowRun;
import org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution;
import org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep;
import org.junit.BeforeClass;
import org.junit.ClassRule;
import org.junit.Rule;
Expand Down Expand Up @@ -245,18 +246,15 @@ public void terminatedPodAfterRestart() throws Exception {
projectName.set(b.getParent().getFullName());
r.waitForMessage("+ sleep", b);
});
logs.record(DurableTaskStep.class, Level.FINE).record(Reaper.class, Level.FINE);
story.then(r -> {
setupHost(); // otherwise JenkinsLocationConfiguration will be clobbered
WorkflowRun b = r.jenkins.getItemByFullName(projectName.get(), WorkflowJob.class).getBuildByNumber(1);
r.waitForMessage("Ready to run", b);
// Note that the test is cheating here slightly.
// The watch in Reaper is still running across the in-JVM restarts,
// whereas in production it would have been cancelled during the shutdown.
// But it does not matter since we are waiting for the agent to come back online after the restart,
// which is sufficient trigger to reactivate the reaper.
// Indeed we get two Reaper instances running, which independently remove the node.
deletePods(cloud.connect(), getLabels(this, name), false);
r.assertBuildStatus(Result.ABORTED, r.waitForCompletion(b));
r.waitForMessage(new ExecutorStepExecution.RemovedNodeCause().getShortDescription(), b);
r.waitForMessage("assuming it is not coming back", b);
r.waitForMessage("Will retry failed node block from deleted pod", b);
r.assertBuildStatusSuccess(r.waitForCompletion(b));
});
}

Expand Down