Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JENKINS-49707] Agent missing after controller restart to fail resumption of node step, not kill whole build #180

Merged
merged 92 commits into from
Oct 17, 2022
Merged
Show file tree
Hide file tree
Changes from 91 commits
Commits
Show all changes
92 commits
Select commit Hold shift + click to select a range
fec8b95
Sketch of non-`Pickle`-based resumption of `ExecutorStepExecution`
jglick Nov 16, 2021
5bb6378
Amending test from #34
jglick Nov 16, 2021
aeb67be
Fixing `Callback`
jglick Nov 16, 2021
aa00f8d
Merge branch 'master' of https://github.com/jenkinsci/workflow-durabl…
jglick Dec 1, 2021
57e1ed0
Exploring where to detect a failure eligible for retry
jglick Dec 1, 2021
b43f084
Making `normalNodeDisappearance` pass under somewhat altered circumst…
jglick Dec 1, 2021
8796113
First successful `node` block retry
jglick Dec 2, 2021
802accd
`FilePathDynamicContext` must take precedence over `ExecutorStepDynam…
jglick Dec 2, 2021
4f0878c
`ExecutorStepTest.unloadableExecutorPickle` as currently conceived no…
jglick Dec 2, 2021
bfac529
Fixing `ExecutorStepTest.nodeDisconnectMissingContextVariableExceptio…
jglick Dec 3, 2021
d03e043
Fixing `ExecutorStepTest.contextualizeFreshFilePathAfterAgentReconnec…
jglick Dec 3, 2021
98fe3e8
Marking `ExecutorStepRetryEligibility` beta
jglick Dec 3, 2021
f149cdb
Javadoc imports
jglick Dec 3, 2021
f798bfa
SpotBugs: does not make sense to allow `ExecutorStepDynamicContext.no…
jglick Dec 3, 2021
f8bcbe6
Reasonable behavior of `ExecutorPickleTest.canceledQueueItem`
jglick Dec 6, 2021
3a6bad4
User-visible logging about 5m timeout now that `ExecutorPickle.printW…
jglick Dec 6, 2021
2c0f41b
`ExecutorPickleTest` → `ExecutorStepDynamicContextTest`
jglick Dec 6, 2021
1cc20e6
Move `TIMEOUT_WAITING_FOR_NODE_MILLIS` from `ExecutorPickle` to `Exec…
jglick Dec 6, 2021
65a5749
Some user-visible logging when `ExecutorStepRetryEligibility` is _not…
jglick Dec 6, 2021
2d03a46
Merge branch 'master' of https://github.com/jenkinsci/workflow-durabl…
jglick Dec 6, 2021
b65ff1c
Do not block on `FutureImpl`.
jglick Dec 8, 2021
1edef32
A `sh` step could fail to find a workspace at the moment it starts.
jglick Dec 10, 2021
6cd1cf2
Merge branch 'master' of https://github.com/jenkinsci/workflow-durabl…
jglick Dec 10, 2021
2455149
Also handling `ClosedChannelException` as commonly thrown by `SimpleB…
jglick Dec 10, 2021
64c123a
Also retry after `MissingContextVariableException` on `FilePath`
jglick Dec 14, 2021
a54afe2
Remove `super.onResume` left over in #46; see https://github.com/jenk…
jglick Dec 14, 2021
8a49c9e
Documenting need for JENKINS-30383
jglick Dec 14, 2021
a7f21b5
https://github.com/jenkinsci/workflow-step-api-plugin/pull/77 allows …
jglick Dec 15, 2021
0c554d3
Merge branch 'master' of https://github.com/jenkinsci/workflow-durabl…
jglick Dec 16, 2021
5d83e63
https://github.com/jenkinsci/workflow-step-api-plugin/pull/77 released
jglick Dec 16, 2021
e138f6f
Expanding `MissingContextVariableException` type list after noticing …
jglick Dec 16, 2021
8beb608
Merge branch 'master' of https://github.com/jenkinsci/workflow-durabl…
jglick Dec 17, 2021
8800d55
Never log a message with only `PlaceholderTask.cookie`; also need `.r…
jglick Dec 17, 2021
9c71521
Suppressing uninteresting log message
jglick Dec 17, 2021
b5880ca
Maybe need to call `StepContext.saveState`?
jglick Dec 17, 2021
b78373e
Merge branch 'master' of https://github.com/jenkinsci/workflow-durabl…
jglick Jan 5, 2022
e1b4bb7
Merge branch 'master' of https://github.com/jenkinsci/workflow-durabl…
jglick Jan 5, 2022
3f0e140
Deleting comment; for now it seems clearest to keep `ExecutorStepDyna…
jglick Jan 5, 2022
dbc3387
Resolve longstanding tech debt from #104 & #117 about `BodyExecution`.
jglick Jan 5, 2022
3223668
Merge branch 'master' of https://github.com/jenkinsci/workflow-durabl…
jglick Jan 6, 2022
a968adf
Sketch of updating `WorkspaceStepExecution`: https://github.com/jenki…
jglick Jan 7, 2022
6a9854c
Merge branch 'master' of https://github.com/jenkinsci/workflow-durabl…
jglick Mar 14, 2022
75b2375
Merge branch 'master' of https://github.com/jenkinsci/workflow-durabl…
jglick Mar 14, 2022
79d4ff0
Merge branch 'master' of https://github.com/jenkinsci/workflow-durabl…
jglick Apr 28, 2022
8b08398
Allowing `FilePathDynamicContext` to block waiting for an agent to re…
jglick Apr 28, 2022
46821ed
Adding a comment about 8b08398
jglick Apr 28, 2022
db3d043
8b08398b19b3f5da7ab3915dc62ebe9871f95b84 seems to allow a968adfd12835…
jglick Apr 29, 2022
f4660d6
Refining 8b08398b19b3f5da7ab3915dc62ebe9871f95b84 to avoid tail call
jglick Apr 29, 2022
242f829
`org.jenkinsci.plugins.workflow.support.concurrent.Futures` deprecated
jglick Apr 29, 2022
8e25c8f
SpotBugs reminds me that `WorkspaceStepExecution.Callback` needs an S…
jglick Apr 29, 2022
1c42f39
Adding `NodeTranslator`
jglick Apr 29, 2022
5abe7a4
Beginning rewrite to `AgentErrorCondition`
jglick May 2, 2022
39b0bd3
Merge branch 'master' of https://github.com/jenkinsci/workflow-durabl…
jglick May 3, 2022
6b82cc9
Extracted `RetryExecutorStepTest` from `ExecutorStepTest`
jglick May 3, 2022
56e8777
Merge branch 'master' of https://github.com/jenkinsci/workflow-durabl…
jglick May 11, 2022
3035961
Merge branch 'master' of https://github.com/jenkinsci/workflow-durabl…
jglick May 11, 2022
02fbe3a
Rely on blocking `ExecutorStepExecution.onResume`
jglick May 11, 2022
7231de1
Merge branch 'stale-placeholdertask' into retry-JENKINS-49707
jglick May 11, 2022
c663479
Removing bogus `waitForMessage` (and unnecessary `interrupt`) from `u…
jglick May 13, 2022
7cd4bdd
`PlaceholderTask` may not hold a reference to `ExecutorStepExecution`
jglick May 17, 2022
c134cc4
Equivalent of #184 for `ExecutorStepDynamicContext`
jglick May 17, 2022
2a40b11
Assert https://github.com/jenkinsci/workflow-api-plugin/commit/b1778a…
jglick May 19, 2022
f46dd04
Merge branch 'master' of https://github.com/jenkinsci/workflow-durabl…
jglick Jun 3, 2022
f76de40
Merge branch 'retry-JENKINS-49707-base' into retry-JENKINS-49707
jglick Jun 3, 2022
974a71e
https://github.com/jenkinsci/workflow-cps-plugin/pull/534 & https://g…
jglick Jun 3, 2022
67f2180
Merge branch 'retry-JENKINS-49707-base' into retry-JENKINS-49707
jglick Jun 6, 2022
0a995a3
Windows tests https://github.com/jenkinsci/workflow-basic-steps-plugi…
jglick Jun 6, 2022
9a63dd9
Avoiding `powershell` https://github.com/jenkinsci/workflow-basic-ste…
jglick Jun 7, 2022
59ce004
Better usage of Hamcrest; got a flake https://github.com/jenkinsci/wo…
jglick Jun 7, 2022
7b49dcb
In fact the flake was in a different test (`canceledQueueItem`)
jglick Jun 7, 2022
0a5c299
Forgot to make `canceledQueueItem` not use `sh`, though it should not…
jglick Jun 7, 2022
b9618c5
Merge branch 'retry-JENKINS-49707-base' into retry-JENKINS-49707
jglick Jul 7, 2022
05c77e0
Merge branch 'master' of https://github.com/jenkinsci/workflow-durabl…
jglick Jul 7, 2022
48950fd
Test flake: https://github.com/jenkinsci/workflow-durable-task-step-p…
jglick Jul 7, 2022
b2bf524
Merge branch 'retry-JENKINS-49707-conditions' into retry-JENKINS-49707
jglick Jul 9, 2022
3ebbf94
Use `AgentOfflineException` from `ExecutorStepDynamicContext` as well
jglick Jul 9, 2022
4a8e223
Can now run all of `AgentErrorConditionTest`
jglick Jul 9, 2022
2699896
Responding to `CancellationException` in `ExecutorStepDynamicContext`…
jglick Jul 9, 2022
17476f7
Merge branch 'master' of https://github.com/jenkinsci/workflow-durabl…
jglick Jul 11, 2022
9865461
Merge branch 'master' of https://github.com/jenkinsci/workflow-durabl…
jglick Jul 19, 2022
a437c24
Merge branch 'master' of https://github.com/jenkinsci/workflow-durabl…
jglick Jul 21, 2022
26175b3
Permit `canceledQueueItem` to pass on `RemovedNodeCause`
jglick Jul 22, 2022
8d39d15
Normalizing indentation
jglick Jul 22, 2022
2e95e3a
Merge branch 'master' of https://github.com/jenkinsci/workflow-durabl…
jglick Jul 22, 2022
9eaca4a
Merge branch 'master' into retry-JENKINS-49707
jglick Sep 7, 2022
efdb6ee
https://github.com/jenkinsci/workflow-step-api-plugin/pull/124 deprec…
jglick Sep 8, 2022
97365d4
Refined `ExecutorStepDynamicContext` behavior in case of `quietingDow…
jglick Sep 8, 2022
edbd309
Merge branch 'master' of https://github.com/jenkinsci/workflow-durabl…
jglick Oct 10, 2022
e022b6b
Add an explicit `serialVersionUID` to `ExecutorStepExecution.Placehol…
jglick Oct 10, 2022
4b48432
Optimizing `withExecution` to only consider steps in the current buil…
jglick Oct 10, 2022
84036f7
Pick up https://github.com/jenkinsci/workflow-api-plugin/pull/256 htt…
jglick Oct 10, 2022
fcea63a
https://github.com/jenkinsci/workflow-api-plugin/pull/256 released
jglick Oct 17, 2022
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions pom.xml
Original file line number Diff line number Diff line change
Expand Up @@ -94,6 +94,7 @@
<dependency>
<groupId>org.jenkins-ci.plugins.workflow</groupId>
<artifactId>workflow-api</artifactId>
<version>1199.v597ffe744637</version> <!-- TODO https://github.com/jenkinsci/workflow-api-plugin/pull/256 -->
</dependency>
<dependency>
<groupId>org.jenkins-ci.plugins.workflow</groupId>
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -81,7 +81,6 @@
import org.jenkinsci.plugins.workflow.steps.StepExecution;
import org.jenkinsci.plugins.workflow.support.concurrent.Timeout;
import org.jenkinsci.plugins.workflow.support.concurrent.WithThreadName;
import org.jenkinsci.plugins.workflow.support.pickles.ExecutorPickle;
import org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution;
import org.kohsuke.accmod.Restricted;
import org.kohsuke.accmod.restrictions.DoNotUse;
Expand Down Expand Up @@ -306,6 +305,9 @@ static final class Execution extends AbstractStepExecutionImpl implements Runnab
returnStatus = step.returnStatus;
StepContext context = getContext();
ws = context.get(FilePath.class);
if (ws == null) {
throw new AbortException("No workspace currently accessible");
}
node = FilePathUtils.getNodeName(ws);
DurableTask durableTask = step.task();
if (returnStdout) {
Expand Down Expand Up @@ -356,12 +358,13 @@ static final class Execution extends AbstractStepExecutionImpl implements Runnab
LOGGER.fine(() -> "discovered that " + node + " has been removed");
removedNodeDiscovered = System.nanoTime();
return null;
} else if (System.nanoTime() - removedNodeDiscovered < TimeUnit.MILLISECONDS.toNanos(ExecutorPickle.TIMEOUT_WAITING_FOR_NODE_MILLIS)) {
} else if (System.nanoTime() - removedNodeDiscovered < TimeUnit.MILLISECONDS.toNanos(ExecutorStepExecution.TIMEOUT_WAITING_FOR_NODE_MILLIS)) {
LOGGER.fine(() -> "rediscovering that " + node + " has been removed");
return null;
} else {
LOGGER.fine(() -> node + " has been removed for a while, assuming it is not coming back");
throw new FlowInterruptedException(Result.ABORTED, new ExecutorStepExecution.RemovedNodeCause());
LOGGER.fine(() -> "rediscovering that " + node + " has been removed and timeout has expired");
listener().getLogger().println(node + " has been removed for " + Util.getTimeSpanString(ExecutorStepExecution.TIMEOUT_WAITING_FOR_NODE_MILLIS) + ", assuming it is not coming back");
throw new FlowInterruptedException(Result.ABORTED, /* TODO false probably more appropriate */true, new ExecutorStepExecution.RemovedNodeCause());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that

if (t instanceof FlowInterruptedException && ((FlowInterruptedException) t).getCauses().stream().anyMatch(
c -> c instanceof ExecutorStepExecution.RemovedNodeCause || c instanceof ExecutorStepExecution.QueueTaskCancelled)) {
return true;
bypasses the actualInterruption check in https://github.com/jenkinsci/workflow-basic-steps-plugin/blob/d57e3ca46d24bb91b8410467284176b0f319c64e/src/main/java/org/jenkinsci/plugins/workflow/steps/RetryStepExecution.java#L93 so the distinction would only matter for other catching steps like https://github.com/jenkinsci/workflow-basic-steps-plugin/blob/d57e3ca46d24bb91b8410467284176b0f319c64e/src/main/java/org/jenkinsci/plugins/workflow/steps/CatchErrorStep.java#L229

}
}
removedNodeDiscovered = 0; // something else; reset
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -31,12 +31,14 @@
import hudson.model.Computer;
import jenkins.model.Jenkins;
import org.jenkinsci.plugins.workflow.flow.FlowExecutionOwner;
import org.jenkinsci.plugins.workflow.support.steps.ExecutorStepDynamicContext;

/**
* Reference to {@link Computer}
*
* @author Kohsuke Kawaguchi
* @deprecated Normally now done via {@link ExecutorStepDynamicContext}.
*/
@Deprecated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the factories for these pickles be deleted or replaced with an implementation that just throws an exception if it encounters an object of the specified type? Or do you want to keep them around in case some other step or wild Groovy code is relying on them?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I plan to retain them for compatibility. Weird Pipeline script is one possibility. They could also be used in other Pipeline steps; I know PushdStep binds FilePathDynamicContext, so that is not affected, but there may be others. Seems harmless enough to leave them here.

public class ComputerPickle extends Pickle {
private final String slave;

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -26,9 +26,7 @@

import com.google.common.util.concurrent.ListenableFuture;
import edu.umd.cs.findbugs.annotations.NonNull;
import edu.umd.cs.findbugs.annotations.SuppressFBWarnings;
import hudson.Extension;
import hudson.Main;
import hudson.Util;
import hudson.init.InitMilestone;
import hudson.model.Executor;
Expand All @@ -41,35 +39,30 @@
import hudson.model.queue.SubTask;

import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.logging.Level;
import java.util.logging.Logger;

import jenkins.model.Jenkins;
import org.jenkinsci.plugins.workflow.flow.FlowExecutionOwner;
import org.jenkinsci.plugins.workflow.pickles.Pickle;
import org.jenkinsci.plugins.workflow.steps.FlowInterruptedException;
import org.jenkinsci.plugins.workflow.steps.durable_task.Messages;
import org.jenkinsci.plugins.workflow.support.steps.ExecutorStepDynamicContext;
import org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution;
import org.kohsuke.accmod.Restricted;
import org.kohsuke.accmod.restrictions.NoExternalUse;

/**
* Persists an {@link Executor} as the {@link hudson.model.Queue.Task} it was running.
* That task can in turn have some way of producing a display name, a special {@link hudson.model.Queue.Executable} with a custom {@code executorCell.jelly}, and so on.
* When rehydrated, the task is rescheduled, and when it starts executing the owning executor is produced.
* Typically the {@link SubTask#getAssignedLabel} should be a {@link Node#getSelfLabel} so that the rehydrated executor is in fact on the same node.
* @deprecated Normally now done via {@link ExecutorStepDynamicContext}.
*/
@Deprecated
public class ExecutorPickle extends Pickle {

private static final Logger LOGGER = Logger.getLogger(ExecutorPickle.class.getName());

private final Queue.Task task;

@SuppressFBWarnings(value = "MS_SHOULD_BE_FINAL", justification = "deliberately mutable")
@Restricted(NoExternalUse.class)
public static long TIMEOUT_WAITING_FOR_NODE_MILLIS = Main.isUnitTest ? /* fail faster */ TimeUnit.SECONDS.toMillis(15) : Long.getLong(ExecutorPickle.class.getName()+".timeoutForNodeMillis", TimeUnit.MINUTES.toMillis(5));

private ExecutorPickle(Executor executor) {
if (executor instanceof OneOffExecutor) {
throw new IllegalArgumentException("OneOffExecutor not currently supported");
Expand Down Expand Up @@ -105,7 +98,7 @@ protected Executor tryResolve() throws Exception {
throw new IllegalStateException("queue refused " + task);
}
itemID = item.getId();
endTimeNanos = System.nanoTime() + TIMEOUT_WAITING_FOR_NODE_MILLIS*1000000;
endTimeNanos = System.nanoTime() + ExecutorStepExecution.TIMEOUT_WAITING_FOR_NODE_MILLIS*1000000;
LOGGER.log(Level.FINE, "{0} scheduled {1}", new Object[] {ExecutorPickle.this, item});
} else {
item = Queue.getInstance().getItem(itemID);
Expand All @@ -129,7 +122,7 @@ protected Executor tryResolve() throws Exception {
if (System.nanoTime() > endTimeNanos) {
Queue.getInstance().cancel(item);
owner.getListener().getLogger().printf("Killed %s after waiting for %s because we assume unknown agent %s is never going to appear%n",
item.task.getDisplayName(), Util.getTimeSpanString(TIMEOUT_WAITING_FOR_NODE_MILLIS), placeholder.getAssignedLabel());
item.task.getDisplayName(), Util.getTimeSpanString(ExecutorStepExecution.TIMEOUT_WAITING_FOR_NODE_MILLIS), placeholder.getAssignedLabel());
throw new FlowInterruptedException(Result.ABORTED, new ExecutorStepExecution.RemovedNodeCause());
}
}
Expand Down Expand Up @@ -159,7 +152,7 @@ protected Executor tryResolve() throws Exception {
}
@Override protected void printWaitingMessage(@NonNull TaskListener listener) {
Queue.Item item = Queue.getInstance().getItem(itemID);
String message = Messages.ExecutorPickle_waiting_to_resume(task.getFullDisplayName());
String message = "Waiting to resume " + task.getFullDisplayName();
if (item == null) { // ???
listener.getLogger().println(message);
return;
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -31,10 +31,14 @@
import org.jenkinsci.plugins.workflow.FilePathUtils;
import org.jenkinsci.plugins.workflow.flow.FlowExecutionOwner;
import org.jenkinsci.plugins.workflow.pickles.Pickle;
import org.jenkinsci.plugins.workflow.support.steps.ExecutorStepDynamicContext;
import org.jenkinsci.plugins.workflow.support.steps.FilePathDynamicContext;

/**
* @author Kohsuke Kawaguchi
* @deprecated Normally now done via {@link ExecutorStepDynamicContext} or {@link FilePathDynamicContext}.
*/
@Deprecated
public class FilePathPickle extends Pickle {

private final String slave;
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -35,7 +35,12 @@
import org.jenkinsci.plugins.workflow.FilePathUtils;
import org.jenkinsci.plugins.workflow.flow.FlowExecutionOwner;
import org.jenkinsci.plugins.workflow.pickles.Pickle;
import org.jenkinsci.plugins.workflow.support.steps.ExecutorStepDynamicContext;

/**
* @deprecated Normally now done via {@link ExecutorStepDynamicContext}.
*/
@Deprecated
public class WorkspaceListLeasePickle extends Pickle {

// Could perhaps just store the FilePath directly (thus using FilePathPickle implicitly), but we need the Computer anyway for its WorkspaceList:
Expand Down
Loading