-
Notifications
You must be signed in to change notification settings - Fork 195
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[JENKINS-34637] Failure to kill bodies from timeout #76
Conversation
…ure from its own body. Happens when stop() is implemented to call getContext().onFailure(cause), and something incorrectly calls stop on a non-innermost execution. Also providing much better diagnostics in other cases where a step seems to be completing twice.
…ameter, since in most cases we want to test sandboxed behavior.
…t the innermost execution, and block-scoped StepExecution.stop does not generally kill its body (JENKINS-26148). getCurrentExecutions was also in direct violation of its Javadoc, though it does not appear to have ever been called, much less tested.
@@ -172,29 +179,63 @@ private Continuable createContinuable(CpsThread currentThread, CpsCallableInvoca | |||
} | |||
|
|||
@Override | |||
public synchronized Collection<StepExecution> getCurrentExecutions() { | |||
public Collection<StepExecution> getCurrentExecutions() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method was not involved in the original bug, but it was wrong too, in the same way.
for (FlowNode node : scanner) { | ||
if (node.getId().equals(startNodeId)) { | ||
t.stop(stopped); | ||
break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the main fix.
/** | ||
* @deprecated use {@link #CpsFlowDefinition(String, boolean)} instead | ||
*/ | ||
@Deprecated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not related, just on my to-do list for ages…
for (CauseOfInterruption cause : ((FlowInterruptedException) failure).getCauses()) { | ||
if (cause instanceof BodyFailed) { | ||
LOGGER.log(Level.FINE, "already completed " + this + " and now received body failure", failure); | ||
// Predictable that the error would be thrown up here; quietly ignore it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise we get an ugly stack trace when timing out.
@@ -231,7 +231,7 @@ void fireCompletionHandlers(Outcome o) { | |||
/** | |||
* Finds the next younger {@link CpsThread} that shares the same {@link FlowHead}. | |||
* | |||
* Can be {@code this.} | |||
* Cannot be {@code this}. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did not wind up using this function; another comment, at its sole call site, notes that it is probably wrong to use it there, too (but leaving that for another day).
@@ -126,7 +126,10 @@ public void onFailure(StepContext context, Throwable t) { | |||
if (handler.originalFailure == null) { | |||
handler.originalFailure = new SimpleEntry<String, Throwable>(name, t); | |||
} else { | |||
handler.originalFailure.getValue().addSuppressed(t); | |||
Throwable originalT = handler.originalFailure.getValue(); | |||
if (t != originalT) { // could be the same abort being delivered across branches |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise blows up when using parallel
inside timeout
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment helped a lot for understanding here. ++
This pull request originates from a CloudBees employee. At CloudBees, we require that all pull requests be reviewed by other CloudBees employees before we seek to have the change accepted. If you want to learn more about our process please see this explanation. |
Collection<StepExecution> currentExecutions2 = execs[2].body.getCurrentExecutions(); | ||
assertThat(currentExecutions2, Matchers.<StepExecution>iterableWithSize(1)); | ||
assertEquals(semaphores, Sets.union(Sets.newLinkedHashSet(currentExecutions1), Sets.newLinkedHashSet(currentExecutions2))); | ||
assertEquals(semaphores, Sets.newLinkedHashSet(execs[0].body.getCurrentExecutions())); // the top-level one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Failed before fix: returned a singleton set of ParallelStepExecution
.
assertEquals(semaphores, Sets.newLinkedHashSet(execs[0].body.getCurrentExecutions())); // the top-level one | ||
execs[0].body.cancel(); | ||
SemaphoreStep.success("c/1", null); | ||
jenkins.assertBuildStatus(Result.ABORTED, jenkins.waitForCompletion(b)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Failed before fix: called ParallelStepExecution.stop
, which delegated to RetainsBodyStep.Execution.stop
, rather than going straight to SemaphoreStep.Execution.stop
.
// cf. trick in CpsFlowExecution.getCurrentExecutions(true) | ||
Map<FlowHead, CpsThread> m = new LinkedHashMap<>(); | ||
// TODO access to CpsThreadGroup.threads should be restricted to the CPS VM thread, but the API signature does not allow us to return a promise or throw InterruptedException | ||
for (CpsThread t : thread.group.threads.values()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't you need to guard against concurrent modifications?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, yes—see the TODO
comment right above. Unfortunately when @kohsuke defined this API method, besides implementing it incorrectly here, he did not allow it to throw any exceptions—and to implement it correctly (to run in the CPS VM thread) we would need to either throw exceptions, or return a ListenableFuture
.
At any rate it appears this method was never actually called, so I am not sure it matters. Perhaps the API method should be deprecated and a better replacement introduced if and when there is a need.
t.stop(stopped); | ||
// Similar to getCurrentExecutions but we want the raw CpsThread, not a StepExecution; cf. CpsFlowExecution.interrupt | ||
Map<FlowHead, CpsThread> m = new LinkedHashMap<>(); | ||
for (CpsThread t : thread.group.threads.values()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you need to guard against concurrent modifications?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not here, because we are in the single CPS VM thread.
🐝 |
…dell rightly pointed out that it ought to be made thread-safe.
} | ||
for (CpsThread t : m.values()) { | ||
// TODO seems cumbersome to have to go through the flow graph to find out whether a head is a descendant of ours, yet FlowHead does not seem to retain a parent field | ||
LinearBlockHoppingScanner scanner = new LinearBlockHoppingScanner(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can simply this with FlowScanningUtils.fetchEnclosingBlocks -- it will just take the t.head.get() and return an iterator over enclosing BlockStartNodes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw that utility method, but it does not simplify much, I think:
for (FlowNode node : FlowScanningUtils.fetchEnclosingBlocks(t.head.get())) {
if (node.getId().equals(startNodeId)) {
// …as before
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a little more terse and does a prefilter to only return block start nodes by instanceof check (maybe slightly faster). Not a big deal though, there's a half dozen ways to do this, I just want to put it on your radar since this is exactly what it's for.
m.put(t.head, t); | ||
} | ||
for (CpsThread t : m.values()) { | ||
// TODO seems cumbersome to have to go through the flow graph to find out whether a head is a descendant of ours, yet FlowHead does not seem to retain a parent field |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consideration: would it be worth adding an enclosingBlockStartNodeId field to FlowNode, and setting it with our block-generating steps?
More data to persist and read, but simplifies many things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adding an
enclosingBlockStartNodeId
field toFlowNode
Well it could be transient
since it recomputable based on existing information. Anyway that would not help here I think, because we are starting with thread information, which does not completely align with flow graph information.
(For getCurrentExecutions
—which no one besides my new test ever calls—it would probably suffice: you could filter FlowExecution.currentHeads
to those enclosed by this body. But for cancel
we need the actual CpsThread
s, since these have special behavior for stop
even when not inside a Step
.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So what I would really want here is a way to get from a FlowHead
to an Iterable<CpsThread>
representing the parts of the call stack that interact with step executions, or conversely to go from a CpsThread
to any descendants (getNextInner
does not suffice). For example in
timeout(…) {/* #1 */
parallel a: {
sleep 9999 /* #2 */
}, b: {
while (true) {/* #3 */}
}
}
we would have a straightforward way of going from the CpsBodyExecution
marked at point 1 to the CpsThread
s representing points 2 and 3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It simplifies the LinearBlockHoppingScanner use anyway (and might make it mostly redundant), but the catch is you have to have provided it at execution time or do a full scan of the flow to generate the data (persisting as you go).
Meh, there are bigger optimizations. Flow scanning will get faster and faster over time anyway.
m.put(t.head, t); | ||
} | ||
for (CpsThread t : Iterators.reverse(ImmutableList.copyOf(m.values()))) { | ||
LinearBlockHoppingScanner scanner = new LinearBlockHoppingScanner(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same simplification as before applies (FlowScanner.getEnclosingBlocks)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, scratch that: reuse the scanner though. You can create an instance and do setup on it repeatedly since it is stateful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could. This code is not performance-critical however.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's a micro-optimization here. Stuff that's on the hot path should do that though wherever possible.
|
🐝 AIUI with a suggested small improvement. Smells potentially risky though due to complexity. Would suggest deploying on an instance with a few jobs using different structures to test (I've got one I can toss it on if you don't) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re-approving with a 🐝
Well, the changes should only affect calls to |
JENKINS-34637
Downstream of jenkinsci/workflow-support-plugin#20.
@reviewbybees esp. @kohsuke