[JENKINS-34637] Failure to kill bodies from timeout #76

Merged
merged 10 commits into from
Oct 20, 2016

Conversation

@jglick jglick commented Oct 18, 2016

JENKINS-34637

Downstream of jenkinsci/workflow-support-plugin#20.

@reviewbybees esp. @kohsuke

…ure from its own body.

Happens when stop() is implemented to call getContext().onFailure(cause), and something incorrectly calls stop on a non-innermost execution.
Also providing much better diagnostics in other cases where a step seems to be completing twice.
…ameter, since in most cases we want to test sandboxed behavior.
…t the innermost execution, and block-scoped StepExecution.stop does not generally kill its body (JENKINS-26148).

getCurrentExecutions was also in direct violation of its Javadoc, though it does not appear to have ever been called, much less tested.
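The double-completion failure mode described in the commit messages above can be sketched in a few lines of Java. `OneShotContext` and its methods are hypothetical names invented for this illustration, not the plugin's `StepContext`. The sketch assumes only that a step may be completed once, and that a second completion should be ignored and recorded rather than surfacing as an error.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical one-shot completion holder; NOT the plugin's StepContext. */
final class OneShotContext {
    private final String name;
    private Throwable outcome; // first failure delivered, if any
    private final List<String> diagnostics = new ArrayList<>();

    OneShotContext(String name) {
        this.name = name;
    }

    /** Records the first failure; a later one is ignored and noted. Returns true if this call completed the step. */
    synchronized boolean onFailure(Throwable t) {
        if (outcome == null) {
            outcome = t;
            return true;
        }
        diagnostics.add("already completed " + name + " with " + outcome + "; ignoring " + t);
        return false;
    }

    synchronized Throwable outcome() {
        return outcome;
    }

    synchronized List<String> diagnostics() {
        return new ArrayList<>(diagnostics);
    }

    public static void main(String[] args) {
        OneShotContext timeout = new OneShotContext("timeout");
        // stop() is called on the enclosing (non-innermost) step, which fails its own context...
        timeout.onFailure(new InterruptedException("stopped"));
        // ...and later the body fails too, and that failure reaches the same context.
        timeout.onFailure(new RuntimeException("body failed"));
        System.out.println(timeout.diagnostics());
    }
}
```

The first failure wins and the second is kept only as a diagnostic, which is roughly the "quietly ignore it" behavior the later review comments describe for body failures.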
@@ -172,29 +179,63 @@ private Continuable createContinuable(CpsThread currentThread, CpsCallableInvoca
}

@Override
-public synchronized Collection<StepExecution> getCurrentExecutions() {
+public Collection<StepExecution> getCurrentExecutions() {
Member Author

This method was not involved in the original bug, but it was wrong too, in the same way.

for (FlowNode node : scanner) {
if (node.getId().equals(startNodeId)) {
t.stop(stopped);
break;
Member Author

This is the main fix.

/**
* @deprecated use {@link #CpsFlowDefinition(String, boolean)} instead
*/
@Deprecated
Member Author

Not related, just on my to-do list for ages…

for (CauseOfInterruption cause : ((FlowInterruptedException) failure).getCauses()) {
if (cause instanceof BodyFailed) {
LOGGER.log(Level.FINE, "already completed " + this + " and now received body failure", failure);
// Predictable that the error would be thrown up here; quietly ignore it.
Member Author

Otherwise we get an ugly stack trace when timing out.

@@ -231,7 +231,7 @@ void fireCompletionHandlers(Outcome o) {
/**
* Finds the next younger {@link CpsThread} that shares the same {@link FlowHead}.
*
-* Can be {@code this.}
+* Cannot be {@code this}.
Member Author

Did not wind up using this function; another comment, at its sole call site, notes that it is probably wrong to use it there, too (but leaving that for another day).

@@ -126,7 +126,10 @@ public void onFailure(StepContext context, Throwable t) {
if (handler.originalFailure == null) {
handler.originalFailure = new SimpleEntry<String, Throwable>(name, t);
} else {
-handler.originalFailure.getValue().addSuppressed(t);
+Throwable originalT = handler.originalFailure.getValue();
+if (t != originalT) { // could be the same abort being delivered across branches
Member Author

Otherwise blows up when using parallel inside timeout.

Member

This comment helped a lot for understanding here. ++
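For readers wondering why the identity check matters: `Throwable.addSuppressed` throws `IllegalArgumentException` when a throwable is passed to itself, so delivering the same abort to two branches would fail without the guard. The sketch below uses a hypothetical `FailureCollector` class (not the plugin's handler) to show both the JDK behavior and the guarded version.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

/** Hypothetical stand-in for the failure bookkeeping in the diff above; not the plugin's class. */
final class FailureCollector {
    private Map.Entry<String, Throwable> originalFailure;

    /** Keeps the first failure; later, distinct failures are attached to it as suppressed. */
    void onFailure(String branch, Throwable t) {
        if (originalFailure == null) {
            originalFailure = new SimpleEntry<>(branch, t);
        } else {
            Throwable original = originalFailure.getValue();
            if (t != original) { // the same abort may be delivered once per branch
                original.addSuppressed(t);
            }
        }
    }

    Throwable original() {
        return originalFailure == null ? null : originalFailure.getValue();
    }

    /** Returns true if the JDK refuses to add a throwable to its own suppressed list. */
    static boolean selfSuppressionRejected() {
        Throwable t = new RuntimeException("abort");
        try {
            t.addSuppressed(t);
            return false;
        } catch (IllegalArgumentException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        FailureCollector collector = new FailureCollector();
        Throwable abort = new RuntimeException("timeout");
        collector.onFailure("a", abort);
        collector.onFailure("b", abort); // same instance: skipped by the guard
        System.out.println("suppressed count: " + abort.getSuppressed().length);
        System.out.println("self-suppression rejected: " + selfSuppressionRejected());
    }
}
```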

ghost commented Oct 18, 2016

This pull request originates from a CloudBees employee. At CloudBees, we require that all pull requests be reviewed by other CloudBees employees before we seek to have the change accepted. If you want to learn more about our process please see this explanation.

Collection<StepExecution> currentExecutions2 = execs[2].body.getCurrentExecutions();
assertThat(currentExecutions2, Matchers.<StepExecution>iterableWithSize(1));
assertEquals(semaphores, Sets.union(Sets.newLinkedHashSet(currentExecutions1), Sets.newLinkedHashSet(currentExecutions2)));
assertEquals(semaphores, Sets.newLinkedHashSet(execs[0].body.getCurrentExecutions())); // the top-level one
Member Author

Failed before fix: returned a singleton set of ParallelStepExecution.

assertEquals(semaphores, Sets.newLinkedHashSet(execs[0].body.getCurrentExecutions())); // the top-level one
execs[0].body.cancel();
SemaphoreStep.success("c/1", null);
jenkins.assertBuildStatus(Result.ABORTED, jenkins.waitForCompletion(b));
Member Author

Failed before fix: called ParallelStepExecution.stop, which delegated to RetainsBodyStep.Execution.stop, rather than going straight to SemaphoreStep.Execution.stop.

// cf. trick in CpsFlowExecution.getCurrentExecutions(true)
Map<FlowHead, CpsThread> m = new LinkedHashMap<>();
// TODO access to CpsThreadGroup.threads should be restricted to the CPS VM thread, but the API signature does not allow us to return a promise or throw InterruptedException
for (CpsThread t : thread.group.threads.values()) {
Member

don't you need to guard against concurrent modifications?

Member Author

Well, yes—see the TODO comment right above. Unfortunately when @kohsuke defined this API method, besides implementing it incorrectly here, he did not allow it to throw any exceptions—and to implement it correctly (to run in the CPS VM thread) we would need to either throw exceptions, or return a ListenableFuture.

At any rate it appears this method was never actually called, so I am not sure it matters. Perhaps the API method should be deprecated and a better replacement introduced if and when there is a need.
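As a rough illustration of the "return a future" alternative mentioned above, the sketch below confines a map to one dedicated thread and hands callers a snapshot through a future. Everything here is an assumption made for the example: `ConfinedRegistry` is a hypothetical class, the single-thread executor merely stands in for the CPS VM thread, and `CompletableFuture` from the JDK is used in place of Guava's `ListenableFuture`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical registry whose map is only ever touched on one dedicated thread. */
final class ConfinedRegistry implements AutoCloseable {
    // Stand-in for the single VM thread; all reads and writes of `threads` run here.
    private final ExecutorService vm = Executors.newSingleThreadExecutor();
    private final Map<Integer, String> threads = new LinkedHashMap<>();

    /** Schedules a write on the owning thread. */
    CompletableFuture<Void> register(int id, String name) {
        return CompletableFuture.runAsync(() -> threads.put(id, name), vm);
    }

    /** Copies the values on the owning thread, so iteration cannot race with a writer. */
    CompletableFuture<List<String>> snapshot() {
        return CompletableFuture.supplyAsync(() -> new ArrayList<>(threads.values()), vm);
    }

    @Override
    public void close() {
        vm.shutdown();
    }

    public static void main(String[] args) {
        try (ConfinedRegistry registry = new ConfinedRegistry()) {
            registry.register(1, "a");
            registry.register(2, "b");
            System.out.println(registry.snapshot().join());
        }
    }
}
```

The trade-off is the one raised in the comment: a caller can no longer get the collection synchronously without blocking on the future, which is why the existing method signature cannot express this.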

t.stop(stopped);
// Similar to getCurrentExecutions but we want the raw CpsThread, not a StepExecution; cf. CpsFlowExecution.interrupt
Map<FlowHead, CpsThread> m = new LinkedHashMap<>();
for (CpsThread t : thread.group.threads.values()) {
Member

do you need to guard against concurrent modifications?

Member Author

Not here, because we are in the single CPS VM thread.

@rsandell
Member

🐝

}
for (CpsThread t : m.values()) {
// TODO seems cumbersome to have to go through the flow graph to find out whether a head is a descendant of ours, yet FlowHead does not seem to retain a parent field
LinearBlockHoppingScanner scanner = new LinearBlockHoppingScanner();
Member

You can simplify this with FlowScanningUtils.fetchEnclosingBlocks -- it will just take the t.head.get() and return an iterator over enclosing BlockStartNodes.

Member Author

I saw that utility method, but it does not simplify much, I think:

for (FlowNode node : FlowScanningUtils.fetchEnclosingBlocks(t.head.get())) {
    if (node.getId().equals(startNodeId)) {
        // …as before
    }
}

Member

It's a little more terse and does a prefilter to only return block start nodes by instanceof check (maybe slightly faster). Not a big deal though, there's a half dozen ways to do this, I just want to put it on your radar since this is exactly what it's for.
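Both variants discussed in this thread answer the same question: is a given head somewhere inside the block that started at `startNodeId`? A self-contained sketch of that check follows. `Node` and its `enclosingStart` link are hypothetical simplifications for illustration. The real flow graph is walked with a scanner and does not, per the TODO in the diff, expose such a parent pointer directly.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of an "is this head inside that block?" check over a hypothetical node type. */
final class EnclosingBlocks {
    /** Hypothetical minimal node: an id plus the start node of its enclosing block (null at top level). */
    static final class Node {
        final String id;
        final Node enclosingStart;

        Node(String id, Node enclosingStart) {
            this.id = id;
            this.enclosingStart = enclosingStart;
        }
    }

    /** Ids of the block start nodes enclosing {@code head}, innermost first. */
    static List<String> enclosingIds(Node head) {
        List<String> ids = new ArrayList<>();
        for (Node n = head.enclosingStart; n != null; n = n.enclosingStart) {
            ids.add(n.id);
        }
        return ids;
    }

    /** True if {@code head} lies inside the block whose start node has the given id. */
    static boolean isInside(Node head, String startNodeId) {
        return enclosingIds(head).contains(startNodeId);
    }

    public static void main(String[] args) {
        Node timeout = new Node("3", null);
        Node parallel = new Node("4", timeout);
        Node branchA = new Node("5", parallel);
        Node sleep = new Node("6", branchA);
        System.out.println(enclosingIds(sleep));
        System.out.println(isInside(sleep, "3"));
    }
}
```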

m.put(t.head, t);
}
for (CpsThread t : m.values()) {
// TODO seems cumbersome to have to go through the flow graph to find out whether a head is a descendant of ours, yet FlowHead does not seem to retain a parent field
Member

Consideration: would it be worth adding an enclosingBlockStartNodeId field to FlowNode, and setting it with our block-generating steps?

More data to persist and read, but simplifies many things.

Member Author

adding an enclosingBlockStartNodeId field to FlowNode

Well it could be transient since it is recomputable based on existing information. Anyway that would not help here I think, because we are starting with thread information, which does not completely align with flow graph information.

(For getCurrentExecutions—which no one besides my new test ever calls—it would probably suffice: you could filter FlowExecution.currentHeads to those enclosed by this body. But for cancel we need the actual CpsThreads, since these have special behavior for stop even when not inside a Step.)

Member Author

So what I would really want here is a way to get from a FlowHead to an Iterable<CpsThread> representing the parts of the call stack that interact with step executions, or conversely to go from a CpsThread to any descendants (getNextInner does not suffice). For example in

timeout(…) {/* #1 */
  parallel a: {
    sleep 9999 /* #2 */
  }, b: {
    while (true) {/* #3 */}
  }
}

we would have a straightforward way of going from the CpsBodyExecution marked at point 1 to the CpsThreads representing points 2 and 3.

Member

It simplifies the LinearBlockHoppingScanner use anyway (and might make it mostly redundant), but the catch is you have to have provided it at execution time or do a full scan of the flow to generate the data (persisting as you go).

Meh, there are bigger optimizations. Flow scanning will get faster and faster over time anyway.
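The selection logic under discussion (one thread per head, filtered to those inside the body, visited in reverse) might be sketched as follows. This is a simplified model, not the plugin's code: `Worker` is a hypothetical record standing in for `CpsThread`, heads are plain strings, and the enclosing block ids are supplied up front rather than found by scanning. It also assumes threads are listed oldest first, so that the last thread seen for a head is the most recent one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of choosing which threads to stop when a block body is cancelled. */
final class CancelSelection {
    /** Hypothetical thread record: a name, the head it runs on, and the ids of blocks enclosing that head. */
    static final class Worker {
        final String name;
        final String head;
        final List<String> enclosingBlockIds;

        Worker(String name, String head, String... enclosingBlockIds) {
            this.name = name;
            this.head = head;
            this.enclosingBlockIds = Arrays.asList(enclosingBlockIds);
        }
    }

    /** One worker per head (the last listed wins), restricted to the given block, in reverse order. */
    static List<String> toStop(List<Worker> workersOldestFirst, String startNodeId) {
        Map<String, Worker> byHead = new LinkedHashMap<>();
        for (Worker w : workersOldestFirst) {
            byHead.put(w.head, w); // a later worker on the same head replaces the earlier one
        }
        List<Worker> candidates = new ArrayList<>(byHead.values());
        Collections.reverse(candidates);
        List<String> names = new ArrayList<>();
        for (Worker w : candidates) {
            if (w.enclosingBlockIds.contains(startNodeId)) {
                names.add(w.name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<Worker> workers = Arrays.asList(
                new Worker("main", "h0"),
                new Worker("a-outer", "hA", "1"),
                new Worker("a-inner", "hA", "1"),
                new Worker("b", "hB", "1"));
        System.out.println(toStop(workers, "1"));
    }
}
```

A direct link from a body to its descendant threads, as wished for above, would remove the need for both the per-head map and the enclosure filter.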

m.put(t.head, t);
}
for (CpsThread t : Iterators.reverse(ImmutableList.copyOf(m.values()))) {
LinearBlockHoppingScanner scanner = new LinearBlockHoppingScanner();
Member

Same simplification as before applies (FlowScanningUtils.fetchEnclosingBlocks)

Member

Actually, scratch that: reuse the scanner though. You can create an instance and do setup on it repeatedly since it is stateful.

Member Author

I could. This code is not performance-critical however.

Member

Yeah, it's a micro-optimization here. Stuff that's on the hot path should do that though wherever possible.

jglick commented Oct 19, 2016

CpsThreadDump test failures are trivial, will fix. Need to look at the ParallelStepTest.suspend failure closely.

svanoort commented Oct 19, 2016

🐝 AIUI with a suggested small improvement. Smells potentially risky though due to complexity. Would suggest deploying on an instance with a few jobs using different structures to test (I've got one I can toss it on if you don't)

@svanoort svanoort left a comment

re-approving with a 🐝

jglick commented Oct 19, 2016

potentially risky

Well, the changes should only affect calls to BodyExecution.cancel…which we already know was badly broken. So I am comfortable with the functional test coverage here. Was hoping for some guidance from @kohsuke on his original intention w.r.t. StepExecution.stop, but my understanding of his more recent code in CpsFlowExecution.interrupt is that it is expected that we deliver interrupts directly to the innermost running code—which may then propagate throwables up the call stack, log them and continue, process finally blocks, etc.

@jglick jglick merged commit bee2879 into jenkinsci:master Oct 20, 2016
@jglick jglick deleted the timeout-block-JENKINS-34637 branch October 20, 2016 15:12
3 participants