-
Notifications
You must be signed in to change notification settings - Fork 98
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[JENKINS-49707] Agent missing after controller restart to fail resumption of node
step, not kill whole build
#180
Conversation
*/ | ||
@Deprecated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the factories for these pickles be deleted or replaced with an implementation that just throws an exception if it encounters an object of the specified type? Or do you want to keep them around in case some other step or wild Groovy code is relying on them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I plan to retain them for compatibility. Weird Pipeline script is one possibility. They could also be used in other Pipeline steps; I know PushdStep
binds FilePathDynamicContext
, so that is not affected, but there may be others. Seems harmless enough to leave them here.
…e-task-step-plugin into retry-JENKINS-49707
…icContext.FilePathTranslator`
… longer makes sense
…n` by copying some diagnostic code from `FilePathDynamicContext`
…aitingMessage` is unused
…utorStepExecution`
…e-task-step-plugin into retry-JENKINS-49707
src/main/java/org/jenkinsci/plugins/workflow/support/steps/ExecutorStepExecution.java
Outdated
Show resolved
Hide resolved
src/main/java/org/jenkinsci/plugins/workflow/support/steps/ExecutorStepExecution.java
Outdated
Show resolved
Hide resolved
After jenkinsci/workflow-step-api-plugin#73 this caused `SecretsMasker` to block on a `KubernetesComputer` which in turn blocked provisioning of a `TaskListener`, causing `RestartPipelineTest.terminatedPodAfterRestart` to fail to find a message: ``` FINE o.j.p.w.s.d.DurableTaskStep$Execution#getWorkspace: rediscovering that terminated-pod-after-restart-1-8fb2m-j206k-8x125 has been removed and timeout has expired FINE o.j.p.w.s.d.DurableTaskStep$Execution#_listener: JENKINS-34021: could not get TaskListener in CpsStepContext[9:sh]:Owner[terminated Pod After Restart/1:terminated Pod After Restart jenkinsci#1] java.util.concurrent.TimeoutException at hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:102) at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepDynamicContext$Translator.get(ExecutorStepDynamicContext.java:115) Caused: java.io.IOException at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepDynamicContext$Translator.get(ExecutorStepDynamicContext.java:118) at org.jenkinsci.plugins.workflow.steps.DynamicContext$Typed.get(DynamicContext.java:97) at org.jenkinsci.plugins.workflow.cps.ContextVariableSet.get(ContextVariableSet.java:139) at org.jenkinsci.plugins.workflow.cps.ContextVariableSet$1Delegate.doGet(ContextVariableSet.java:98) at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:75) at org.csanchez.jenkins.plugins.kubernetes.pipeline.SecretsMasker$Factory.get(SecretsMasker.java:85) at org.csanchez.jenkins.plugins.kubernetes.pipeline.SecretsMasker$Factory.get(SecretsMasker.java:73) at org.jenkinsci.plugins.workflow.steps.DynamicContext$Typed.get(DynamicContext.java:95) at org.jenkinsci.plugins.workflow.cps.ContextVariableSet.get(ContextVariableSet.java:139) at org.jenkinsci.plugins.workflow.cps.CpsThread.getContextVariable(CpsThread.java:135) at org.jenkinsci.plugins.workflow.cps.CpsStepContext.doGet(CpsStepContext.java:297) at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:75) at org.jenkinsci.plugins.workflow.support.DefaultStepContext.getListener(DefaultStepContext.java:127) at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:87) at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution._listener(DurableTaskStep.java:421) at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.listener(DurableTaskStep.java:412) at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.getWorkspace(DurableTaskStep.java:363) at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:570) at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:553) at … FINE o.j.p.w.s.d.DurableTaskStep$Execution$NewlineSafeTaskListener#getLogger: creating filtered stream FINE o.j.p.w.s.d.DurableTaskStep$Execution#_listener: terminated-pod-after-restart-1-8fb2m-j206k-8x125 has been removed for 15 sec, assuming it is not coming back ```
throw new FlowInterruptedException(Result.ABORTED, new ExecutorStepExecution.RemovedNodeCause()); | ||
LOGGER.fine(() -> "rediscovering that " + node + " has been removed and timeout has expired"); | ||
listener().getLogger().println(node + " has been removed for " + Util.getTimeSpanString(ExecutorStepExecution.TIMEOUT_WAITING_FOR_NODE_MILLIS) + ", assuming it is not coming back"); | ||
throw new FlowInterruptedException(Result.ABORTED, /* TODO false probably more appropriate */true, new ExecutorStepExecution.RemovedNodeCause()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that
Lines 60 to 62 in 97365d4
if (t instanceof FlowInterruptedException && ((FlowInterruptedException) t).getCauses().stream().anyMatch( | |
c -> c instanceof ExecutorStepExecution.RemovedNodeCause || c instanceof ExecutorStepExecution.QueueTaskCancelled)) { | |
return true; |
actualInterruption
check in https://github.com/jenkinsci/workflow-basic-steps-plugin/blob/d57e3ca46d24bb91b8410467284176b0f319c64e/src/main/java/org/jenkinsci/plugins/workflow/steps/RetryStepExecution.java#L93 so the distinction would only matter for other catching steps like https://github.com/jenkinsci/workflow-basic-steps-plugin/blob/d57e3ca46d24bb91b8410467284176b0f319c64e/src/main/java/org/jenkinsci/plugins/workflow/steps/CatchErrorStep.java#L229
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a brief look at the PR and added some questions, but in general it looks fine. I am planning on reviewing it more thoroughly tomorrow.
@@ -45,7 +45,7 @@ | |||
* Allows a step body to save a representation of a workspace | |||
* without forcing a particular {@link FilePath#getChannel} to be used the whole time. | |||
*/ | |||
@Extension public final class FilePathDynamicContext extends DynamicContext.Typed<FilePath> { | |||
@Extension(ordinal = 100) public final class FilePathDynamicContext extends DynamicContext.Typed<FilePath> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is ordinal = 100
needed for PushdStep
to keep working, or something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes exactly that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does not suffice: https://issues.jenkins.io/browse/JENKINS-69936
src/main/java/org/jenkinsci/plugins/workflow/support/steps/ExecutorStepExecution.java
Show resolved
Hide resolved
src/main/java/org/jenkinsci/plugins/workflow/support/steps/ExecutorStepExecution.java
Outdated
Show resolved
Hide resolved
src/main/java/org/jenkinsci/plugins/workflow/support/steps/ExecutorStepExecution.java
Show resolved
Hide resolved
src/main/java/org/jenkinsci/plugins/workflow/support/steps/ExecutorStepExecution.java
Show resolved
Hide resolved
src/main/java/org/jenkinsci/plugins/workflow/support/steps/ExecutorStepDynamicContext.java
Outdated
Show resolved
Hide resolved
src/main/java/org/jenkinsci/plugins/workflow/support/steps/ExecutorStepExecution.java
Show resolved
Hide resolved
This comment was marked as resolved.
This comment was marked as resolved.
…e-task-step-plugin into retry-JENKINS-49707
Breaks |
JENKINS-49707
Upstream of
jenkinsci/workflow-basic-steps-plugin#195 andjenkinsci/kubernetes-plugin#1211.Subsumes #231. Subsumes #249.Relevant prior work: #104, jenkinsci/kubernetes-plugin#461, #47, #48, #115, #101, jenkinsci/workflow-basic-steps-plugin#86, #175, #176, #179.
Plan of record:
ExecutorStepDynamicContext
and associated machinery as a replacement forExecutorPickle
and other pickles (see below).retry
step to be given configurable behavior.kubernetes
which takes into account pod status.agent
typesany
andlabel
(i.e., generic) andkubernetes
node
but outsidesh
.ExecutorPickle
should not reschedule anything until afterQueue.load
#184 to the new code, as Warn inQueue.init
if items scheduled during startup will be clobbered jenkins#5934 can be reproduced again.queue.xml
: [JENKINS-49707] Agent missing after controller restart to fail resumption ofnode
step, not kill whole build #180 (comment)AgentErrorCondition
improvements #249 toExecutorStepDynamicContext
The idea is that users updating plugins would see a somewhat altered behavior for resumption inside
node
blocks by default, which I hope would go mostly unnoticed; and could adjust pipelines to look something likeor
The key is to avoid relying on
Pickle
s forExecutorStepExecution
to resume properly. (More precisely, pickles usingTryRepeatedly
;SecretPickle
and the like are harmless in this context.) Instead we take explicit control over how the step is resumed and what happens when. The problem withPickle
here is that it is too magical; we would like to be able to detect after resumption that an agent is gone and then throw an exception at the Groovy level, which is not possible if the build is aborted at the CPS VM level due to an inability to rehydrate.I had planned to offer a system property to revert to
ExecutorPickle
in case of catastrophic problems (along with a couple sanity tests of that mode) but after reviewing the changes toExecutorStepExecution
this does not seem straightforward: the major simplifications and fixes to it rely on moving away fromExecutorPickle
.