Recover gracefully when a PlaceholderTask is in the queue but the associated build is complete #185
I need to think about the interaction with #180.
Co-authored-by: Jesse Glick <jglick@cloudbees.com>
Moving back to draft to wait for a release of jenkinsci/jenkins-test-harness#353 and to investigate jenkinsci/workflow-cps-plugin#490 (comment).
A QueueDecisionHandler might fix the real bug, but not the problem path tested here. You could add a QueueTaskDispatcher whose canRun cancels the item if it matches this condition, which avoids the problems with readResolve but on the other hand adds overhead to a critical code path for what I guess is a rare case. Maybe you could implement a fix in PlaceholderTask.getAssignedLabel or getAffinityKey or some similar method Queue.maintain is guaranteed to call, cancelling itself (perhaps asynchronously via Timer to avoid weird reëntrancy issues)?
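The self-cancelling idea in the comment above can be illustrated with a small plain-Java model. This is only a sketch under stated assumptions: ModelTask, ModelQueue, and ModelPlaceholder are hypothetical stand-ins invented for illustration, not the Jenkins Queue or PlaceholderTask API, and a ScheduledExecutorService stands in for the Timer mentioned in the comment. It shows a task that reports a cause of blockage during a maintenance pass and hands its own removal to another thread, rather than mutating the queue while the queue is iterating over it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a queued task; NOT the Jenkins Queue.Task API. */
interface ModelTask {
    /** Returns a reason the task cannot run, or null if it may run. */
    String causeOfBlockage(ModelQueue queue);
}

/** Hypothetical stand-in for the queue; models only a maintenance pass and cancellation. */
class ModelQueue {
    private final List<ModelTask> items = new CopyOnWriteArrayList<>();
    /** Assumption: plays the role of the shared timer mentioned in the review. */
    final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    void add(ModelTask task) { items.add(task); }
    boolean cancel(ModelTask task) { return items.remove(task); }
    int size() { return items.size(); }

    /** One maintenance pass: asks every task whether it is blocked. */
    List<ModelTask> maintain() {
        List<ModelTask> runnable = new ArrayList<>();
        for (ModelTask task : items) {
            if (task.causeOfBlockage(this) == null) {
                runnable.add(task);
            }
        }
        return runnable;
    }

    /** Waits for any scheduled cancellations to finish, then stops the timer. */
    void awaitCancellations() {
        timer.shutdown();
        try {
            timer.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

/** Models a placeholder whose owning build may already have finished. */
class ModelPlaceholder implements ModelTask {
    private final String name;
    private final boolean buildComplete;

    ModelPlaceholder(String name, boolean buildComplete) {
        this.name = name;
        this.buildComplete = buildComplete;
    }

    @Override
    public String causeOfBlockage(ModelQueue queue) {
        if (!buildComplete) {
            return null; // the build is still running, so the task may be scheduled
        }
        // Do not remove the item from inside the maintenance pass itself;
        // schedule the cancellation on the timer thread instead.
        queue.timer.schedule(() -> { queue.cancel(this); }, 0, TimeUnit.MILLISECONDS);
        return "Stopping " + name;
    }
}

class GhostTaskDemo {
    public static void main(String[] args) {
        ModelQueue queue = new ModelQueue();
        queue.add(new ModelPlaceholder("part of p #1", true));  // build already complete
        queue.add(new ModelPlaceholder("part of p #2", false)); // build still running
        System.out.println("runnable now: " + queue.maintain().size());
        queue.awaitCancellations();
        System.out.println("left in queue: " + queue.size());
    }
}
```

The blocked task never becomes runnable, and the cancellation is idempotent, so repeated maintenance passes before the timer fires do no harm in this model.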
FWIW I looked into exposing
Yeah I was thinking about something like this. I will spend a bit of time looking into it.
…if so, block execution and cancel task
@@ -422,6 +422,22 @@ public String getCookie() {
        }

        @Override public CauseOfBlockage getCauseOfBlockage() {
            Run<?, ?> run = runForDisplay();
I think this is safe. If we get here for a step that is just starting or resuming, then the run is already loaded and so this should complete quickly. The only time this should be slow is if this is after a Jenkins restart and the build has already completed so we end up here without the build having been loaded via some other route and we trigger the cancellation path.
                return new CauseOfBlockage() {
                    @Override
                    public String getShortDescription() {
                        return "Stopping " + getDisplayName();
                    }
                };
Not sure if an anonymous class or hard-coded text is ok here. My thought was that this cause should not usually be around long enough for anyone to see it, but I guess we should set up localization just in case.
An anonymous class should be fine here. As you say, it ought not appear in the GUI for more than a moment if at all.
Looks reasonable to me.
The issue described (and fixed) by this PR happens quite often on restarts of ci.jenkins.io (but not on other instances owned by the infra team).
@dduportal sure, if you want to install an incremental build (assuming all dependencies are met) and be prepared to roll back at the first sign of trouble, that would be great. Somehow https://repo.jenkins-ci.org/incrementals/org/jenkins-ci/plugins/workflow/workflow-durable-task-step/1107.ve0b5abf0ca7f/ is dead. Not sure why; https://ci.jenkins.io/job/Plugins/job/workflow-durable-task-step-plugin/job/PR-185/5/console claims to have deployed as the status check confirms. @daniel-beck any clue?
CD is also set up here, so if @car-roll merges, then you should have an official release version within the hour.
I can go ahead and merge. Was just waiting for @dduportal's comments.
And there we go @dduportal: https://github.com/jenkinsci/workflow-durable-task-step-plugin/releases/tag/1102.v9c8d2f466adb (should be on UC shortly)
            assertThat(logging.getMessages(), hasItem(startsWith("Refusing to build ExecutorStepExecution.PlaceholderTask{runId=p#")));
        });
    }

BTW the reason there was no newline here before is that this step was a @TestExtension of the test formerly above it, now with the new test intervening.
Installed, tested and approved: the last restart did not trigger "ghost" builds in the queue. Many thanks for this \o/
@dduportal do you see messages of the form "Refusing to build <...> and cancelling it because associated build is complete" in the Jenkins system logs? What about "Resuming <...>, which is missing from FlowExecutionList (<...>), so registering it now"?
Yes, just retried a restart and got the following message:
Hmm, it's somewhat concerning to me that you are seeing those messages just on a regular restart. Thanks for confirming though!
…ut the associated build is complete (jenkinsci#185)" This reverts commit 9c8d2f4.
While testing Jenkins in various backup and restore scenarios, we ran into a case where the Jenkins queue contained an ExecutorStepExecution.PlaceholderTask whose associated build was already complete, causing the task to sit in the queue forever. We are not sure how this happened or how to reproduce it, but this PR adds a test that reproduces the same symptoms and may be useful for future investigation.

I did some experimentation to see if the issue could be fixed, and something like this does fix it but does not seem like the best approach:

A patch like this would cause Pipeline builds to be loaded and resumed from inside of Queue.load while the queue lock is held during Jenkins startup, which seems concerning. Accessing any context variables via StepContext.get would have the same problem because CpsStepContext.doGet transitively calls getThreadGroupSynchronously, which has to load the WorkflowRun and resume the Pipeline. Perhaps we could expose CpsStepContext.isComplete as a new method on StepContext and check that instead, but I'm not sure. Queue.load does explicitly handle the case where a task is null, although maybe there is a better way to remove the task from the queue.

Given @jglick is working on stuff related to resumption of the node step in #180, it seems best not to try to make a speculative fix right now.
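The trade-off described above, between a lookup that forces the build to load and a cheap completion check, can be sketched in plain Java. Everything here is an assumption made for illustration: ModelStepContext and ModelQueueLoader are hypothetical classes, not the real StepContext, CpsStepContext, or Queue.load, and the counter merely stands in for the cost of loading a WorkflowRun and resuming the Pipeline.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical model of a step context; NOT the real StepContext API. */
class ModelStepContext {
    /** Counts how often the expensive path ran (stands in for loading a build). */
    static final AtomicInteger buildLoads = new AtomicInteger();

    private final boolean complete;

    ModelStepContext(boolean complete) {
        this.complete = complete;
    }

    /** Cheap check: reads state stored with the context and loads nothing. */
    boolean isComplete() {
        return complete;
    }

    /** Expensive lookup: stands in for loading the build and resuming the Pipeline. */
    String get() {
        buildLoads.incrementAndGet();
        return complete ? "no live context" : "live context";
    }
}

/** Hypothetical model of restoring queue items at startup. */
class ModelQueueLoader {
    /** Keeps only items whose step is still running, using the cheap check alone. */
    static List<ModelStepContext> load(List<ModelStepContext> persisted) {
        List<ModelStepContext> kept = new ArrayList<>();
        for (ModelStepContext context : persisted) {
            if (!context.isComplete()) {
                kept.add(context);
            }
        }
        return kept;
    }
}
```

In this model the loader drops completed items without ever calling get(), so no build is loaded while the queue is being restored. Whether such a method could be exposed on the real API is exactly the open question raised in the description.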