Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recover gracefully when a PlaceholderTask is in the queue but the associated build is complete #185

Merged
merged 6 commits into from
Dec 6, 2021

Conversation

dwnusbaum
Copy link
Member

While testing Jenkins in various backup and restore scenarios, we ran into a case where the Jenkins queue contained a ExecutorStepExecution.PlaceholderTask whose associated build was already complete, causing the task to sit in the queue forever. We are not sure how this happened or how to reproduce it, but this PR adds a test that reproduces the same symptoms that may be useful for future investigation.

I did some experimentation to see if the issue could be fixed, and something like this does fix it but does not seem like the best approach:

diff --git a/src/main/java/org/jenkinsci/plugins/workflow/support/steps/ExecutorStepExecution.java b/src/main/java/org/jenkinsci/plugins/workflow/support/steps/ExecutorStepExecution.java
index c81bf0c..64ca128 100644
--- a/src/main/java/org/jenkinsci/plugins/workflow/support/steps/ExecutorStepExecution.java
+++ b/src/main/java/org/jenkinsci/plugins/workflow/support/steps/ExecutorStepExecution.java
@@ -360,6 +360,10 @@ public class ExecutorStepExecution extends AbstractStepExecutionImpl {
         }
 
         private Object readResolve() {
+            Run<?, ?> run = runForDisplay();
+            if (run != null && !run.isLogUpdated()) {
+                return null;
+            }
             if (cookie != null) {
                 synchronized (runningTasks) {
                     runningTasks.put(cookie, new RunningTask());

A patch like this would cause Pipeline builds to be loaded and resumed from inside of Queue.load while the queue lock is held during Jenkins startup, which seems concerning. Accessing any context variables via StepContext.get would have the same problem because CpsStepContext.doGet transitively calls getThreadGroupSynchronously which has to load the WorkflowRun and resume the Pipeline. Perhaps we could expose CpsStepContext.isComplete as a new method on StepContext and check that instead, but I'm not sure. Queue.load does explicitly handle the case where a task is null, although maybe there is a better way to remove the task from the queue.

Given @jglick is working on stuff related to resumption of the node step in #180, it seems best not to try to make a speculative fix right now.

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests - that demonstrates feature works or fixes the issue

@dwnusbaum dwnusbaum requested a review from jglick December 1, 2021 21:01
@jglick
Copy link
Member

jglick commented Dec 1, 2021

I need to think about the interaction with #180.

@jglick jglick requested a review from dduportal December 1, 2021 21:18
Co-authored-by: Jesse Glick <jglick@cloudbees.com>
@jglick jglick requested a review from car-roll December 1, 2021 21:23
@jglick jglick added the tests label Dec 1, 2021
@jglick jglick changed the title Add ignored test where a PlaceholderTask is in the queue but the associated build is complete Add ignored test where a PlaceholderTask is in the queue but the associated build is complete Dec 1, 2021
@dwnusbaum dwnusbaum marked this pull request as draft December 2, 2021 20:39
@dwnusbaum
Copy link
Member Author

Moving back to draft to wait for a release of jenkinsci/jenkins-test-harness#353 and to investigate jenkinsci/workflow-cps-plugin#490 (comment).

Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A QueueDecisionHandler might fix the real bug, but not the problem path tested here. You could add a QueueTaskDispatcher whose canRun cancels the item if it matches this condition, which avoids the problems with readResolve but on the other hand adds overhead to a critical code path for what I guess is a rare case. Maybe you could implement a fix in PlaceholderTask.getAssignedLabel or getAffinityKey or some similar method Queue.maintain is guaranteed to call, cancelling itself (perhaps asynchronously via Timer to avoid weird reëntrancy issues)?

pom.xml Outdated Show resolved Hide resolved
@dwnusbaum
Copy link
Member Author

FWIW I looked into exposing CpsStepContext.isCompleted as an API on StepContext to be able to check if the step completed without having to load the build in PlaceholderTask.readResolve, but then I realized that kind of fix would be useless for my test case since the serialized CpsStepContext would be outdated and so isCompleted would return false.

Maybe you could implement a fix in PlaceholderTask.getAssignedLabel or getAffinityKey or some similar method Queue.maintain is guaranteed to call, cancelling itself (perhaps asynchronously via Timer to avoid weird reëntrancy issues)?

Yeah I was thinking about something like this. I will spend a bit of time looking into it.

@@ -422,6 +422,22 @@ public String getCookie() {
}

@Override public CauseOfBlockage getCauseOfBlockage() {
Run<?, ?> run = runForDisplay();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is safe. If we get here for a step that is just starting or resuming, then the run is already loaded and so this should complete quickly. The only time this should be slow is if this is after a Jenkins restart and the build has already completed so we end up here without the build having been loaded via some other route and we trigger the cancellation path.

Comment on lines +434 to +439
return new CauseOfBlockage() {
@Override
public String getShortDescription() {
return "Stopping " + getDisplayName();
}
};
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if an anonymous class or hard-coded text is ok here. My thought was that this cause should not usually be around long enough for anyone to see it, but I guess we should set up localization just in case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An anonymous class should be fine here. As you say, it ought not appear in the GUI for more than a moment if at all.

@dwnusbaum dwnusbaum requested a review from jglick December 3, 2021 20:58
@dwnusbaum dwnusbaum marked this pull request as ready for review December 3, 2021 20:58
@dwnusbaum
Copy link
Member Author

@jglick let me know what you think of 9d8eeca. If you think there are some problems with that approach then I'll just revert that commit and leave the test ignored for now.

Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable to me.

Comment on lines +434 to +439
return new CauseOfBlockage() {
@Override
public String getShortDescription() {
return "Stopping " + getDisplayName();
}
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An anonymous class should be fine here. As you say, it ought not appear in the GUI for more than a moment if at all.

@jglick jglick added bug and removed tests labels Dec 3, 2021
@jglick jglick changed the title Add ignored test where a PlaceholderTask is in the queue but the associated build is complete Recover gracefully when a PlaceholderTask is in the queue but the associated build is complete Dec 3, 2021
Co-authored-by: Jesse Glick <jglick@cloudbees.com>
@dduportal
Copy link
Contributor

The issue described (and fixed) by this PR happens quite often on restarts of ci.jenkins.io (but not on other instances owned by the infra team).
Would you want us to try this change on ci.jenkins in advance? I'm asking because I'm not sure to really get all the impacts that it could have, but since ci.jenkins.io is basically a giant infinite retry loop for its jobs, that should not cause harm while allowing the "rare" bug to be caught and tried. Let us know?

@jglick
Copy link
Member

jglick commented Dec 6, 2021

@dduportal sure, if you want to install an incremental build (assuming all dependencies are met) and be prepared to roll back at the first sign of trouble, that would be great.

Somehow https://repo.jenkins-ci.org/incrementals/org/jenkins-ci/plugins/workflow/workflow-durable-task-step/1107.ve0b5abf0ca7f/ is dead. Not sure why; https://ci.jenkins.io/job/Plugins/job/workflow-durable-task-step-plugin/job/PR-185/5/console claims to have deployed as the status check confirms. @daniel-beck any clue?

@jglick
Copy link
Member

jglick commented Dec 6, 2021

CD is also set up here, so if @car-roll merges, then you should have an official release version within the hour.

@car-roll
Copy link
Collaborator

car-roll commented Dec 6, 2021

I can go ahead and merge. Was just waiting for @dduportal 's comments.

@car-roll car-roll merged commit 9c8d2f4 into jenkinsci:master Dec 6, 2021
@jglick
Copy link
Member

jglick commented Dec 6, 2021

assertThat(logging.getMessages(), hasItem(startsWith("Refusing to build ExecutorStepExecution.PlaceholderTask{runId=p#")));
});
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW the reason there was no newline here before is that this step was a @TestExtension of the test formerly above it, now with the new test intervening.

@dduportal
Copy link
Contributor

And there we go @dduportal: https://github.com/jenkinsci/workflow-durable-task-step-plugin/releases/tag/1102.v9c8d2f466adb (should be on UC shortly)

Installed, tested and approved: last restart did not triggered "ghosh" builds in the queue. Many thanks for this \o/

@dwnusbaum dwnusbaum deleted the stale-placeholdertask branch December 22, 2021 15:53
@dwnusbaum
Copy link
Member Author

@dduportal do you see messages of the form "Refusing to build <...> and cancelling it because associated build is complete" in the Jenkins system logs? What about "Resuming <...>, which is missing from FlowExecutionList (<...>), so registering it now"?

@dduportal
Copy link
Contributor

@dduportal do you see messages of the form "Refusing to build <...> and cancelling it because associated build is complete" in the Jenkins system logs? What about "Resuming <...>, which is missing from FlowExecutionList (<...>), so registering it now"?

Yes, just retried a restart and got the following message:

Dec 22, 2021 4:21:02 PM WARNING org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask getCauseOfBlockage

Refusing to build ExecutorStepExecution.PlaceholderTask{runId=Plugins/declarative-pipeline-migration-assistant-plugin/PR-117#1,label=maven-windows,context=CpsStepContext[13:null]:Owner[Plugins/declarative-pipeline-migration-assistant-plugin/PR-117/1:Plugins/declarative-pipeline-migration-assistant-plugin/PR-117 #1],cookie=null,auth=null} and cancelling it because associated build is complete

@dwnusbaum
Copy link
Member Author

Hmm, it's somewhat concerning to me that you are seeing those messages just on a regular restart. Thanks for confirming though!

dwnusbaum added a commit to dwnusbaum/workflow-durable-task-step-plugin that referenced this pull request Jan 14, 2022
…ut the associated build is complete (jenkinsci#185)"

This reverts commit 9c8d2f4.
dwnusbaum added a commit to dwnusbaum/workflow-durable-task-step-plugin that referenced this pull request Jan 14, 2022
…ut the associated build is complete (jenkinsci#185)"

This reverts commit 9c8d2f4.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants