
fix(Jenkins Pipelines) do not allocate agent for "parent" pipelines and retry the packaging process a 2nd time #283

Merged 3 commits into jenkins-infra:master from helpdesk-2925 on Sep 8, 2022

Conversation

@dduportal (Contributor)

Closes jenkins-infra/helpdesk#2925

This PR introduces two main changes (a rough sketch of both is shown right after this list):

  • Stop allocating a node for the "top-level" job: it will run in a "lightweight executor" (e.g. a JVM thread) on the controller, which is expected to survive a controller restart. Fewer moving pieces, less code, less resource usage.
  • Enable the "retry" top-level directive thanks to @jglick's recent work. This directive ensures that in most usual cases (such as a controller restart), the pipeline should resume. Tested manually: if the sh step was started, then the pipeline continues it when resuming.
    • There are still some edge cases where the pipeline does not resume: if the pipeline instruction is not durable.
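
For illustration, a rough sketch of the shape of each change in a Declarative Jenkinsfile (not the exact diff; the `agent none` form and the placeholder stage/job names are assumptions, and the retry conditions shown are the ones initially proposed here):

// 1. top-level "parent" pipeline: no node allocated, only controller-side steps such as `build`
pipeline {
  agent none
  stages {
    stage('Trigger') {
      steps {
        build job: 'a-downstream-packaging-job' // placeholder job name
      }
    }
  }
}

// 2. downstream pipelines: retry option as initially proposed in this PR
options {
  disableConcurrentBuilds()
  retry(conditions: [agent(), kubernetesAgent(handleNonKubernetes: true), nonresumable()], count: 2)
}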

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
@dduportal requested a review from a team as a code owner on September 8, 2022 10:22
@dduportal (Contributor Author)

If this PR gets reviewed, and if it makes sense and is approved, then we should be able to test it during the next weekly release (the 13th of September 2022).

@dduportal (Contributor Author)

Ping @timja @MarkEWaite @lemeurherve @NotMyFault: I would like multiple reviews on this one, for the sake of knowledge sharing and to feel safer :)

For info, I've tested with manual pipeline jobs on release.ci, but only with a sh 'sleep 120' step.
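
(For reference, the manual test job was essentially of this shape; a reconstruction, not the exact job, with a placeholder inline pod definition and the retry option from this PR:)

pipeline {
  agent {
    kubernetes {
      // any pod-based agent works for this test; inline YAML used here as a placeholder
      yaml '''
        apiVersion: v1
        kind: Pod
        spec:
          containers:
          - name: jnlp
            image: jenkins/inbound-agent:latest
      '''
    }
  }
  options {
    retry(conditions: [agent(), kubernetesAgent(handleNonKubernetes: true), nonresumable()], count: 2)
  }
  stages {
    stage('Sleep') {
      steps {
        // long-running durable step: restart the controller while it runs and check that the build resumes
        sh 'sleep 120'
      }
    }
  }
}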

@jglick (Contributor) left a comment

Incorrect retry syntax.

@@ -66,6 +66,7 @@ pipeline {
   options {
     disableConcurrentBuilds()
     buildDiscarder logRotator(numToKeepStr: '15') // Retain only last 15 builds to reduce space requirements
+    retry(conditions: [agent(), kubernetesAgent(handleNonKubernetes: true), nonresumable()], count: 2)

Suggested change
-    retry(conditions: [agent(), kubernetesAgent(handleNonKubernetes: true), nonresumable()], count: 2)
+    retry(conditions: [kubernetesAgent(handleNonKubernetes: true), nonresumable()], count: 2)

(The whole point of handleNonKubernetes: true is that you do not also specify agent.)

Anyway I am not sure if this syntax even works—I have not tried it—and you are working too hard. The tested syntax for Declarative is

agent {
  kubernetes {
    // …as before
    retries 2
  }
}

applied above to

agent {
  kubernetes {
    yamlFile 'PodTemplates.d/package-linux.yaml'
    workingDir '/home/jenkins/agent'
  }
}
(GH suggestions do not allow this)
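
Putting the two together, the resulting agent block would look something like this (a sketch of the suggestion, not a tested change):

agent {
  kubernetes {
    yamlFile 'PodTemplates.d/package-linux.yaml'
    workingDir '/home/jenkins/agent'
    // re-run the enclosing node block up to 2 times if the pod agent is lost
    retries 2
  }
}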

@@ -45,6 +45,7 @@ pipeline {

   options {
     disableConcurrentBuilds()
+    retry(conditions: [agent(), kubernetesAgent(handleNonKubernetes: true), nonresumable()], count: 2)

as above

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
@jglick (Contributor) commented Sep 8, 2022

it will run in a "lightweight executor" (e.g. a JVM thread) on the controller

Actually, no, Pipeline CPS code does not consume a JVM thread permanently. (Temporarily uses a JVM thread while running Groovy, then exits it when switching to a step like build.)

which is expected to survive controller restart

This is sort of mangled. The build should survive a controller restart even if you had a superfluous agent. That is a core aspect of Pipeline from its initial design, and applies in particular to K8s agents.

This directive ensures that in most usual cases (such as a controller restart), the pipeline should resume.

Again this is confused. retry is not necessary to ensure that, and does not contribute to that at all. Rather it ensures that if the agent dies for some reason, the build can restart the whole node block rather than aborting. (And that does not work across controller restarts without jenkinsci/workflow-durable-task-step-plugin#180. Until that PR ships, retry here will help only in cases where the agent dies while the controller is up and running.)
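
(Concretely, in Scripted terms the conditional retry wraps the node block, roughly like this; an untested sketch with a placeholder label and step:)

retry(count: 2, conditions: [kubernetesAgent(handleNonKubernetes: true), nonresumable()]) {
  node('container-agent') { // placeholder label
    sh './package.sh'       // placeholder for whatever the stage actually runs
  }
}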

So while adding retry here may be a good idea—if the agent block is idempotent (which is not at all obvious in the case of Jenkinsfile.d/core/release, especially seeing as maven-release-plugin is notoriously not idempotent)—it is not addressing the supposed issue mentioned in the PR title. Similarly, removing an unused agent/executor from a build which only runs controller-based steps such as build is certainly a good idea to avoid wasting money on cloud costs, but it is not necessary to address the issue mentioned in the PR title.

if the pipeline instruction is not durable

Is there a specific example you have in mind? The nonresumable condition (for Declarative, implied in using the retries option) covers non-durable steps such as checkout.

@jglick (Contributor) left a comment

Fine except for the PR title.

@dduportal (Contributor Author)

Thanks @jglick for the review and the pointers. I've updated the PR: does it match what you suggested?

Side note: the options { retry {} } syntax that I initially used came from the pipeline snippet generator. This seems important to mention: if it is not expected, then we should block it from being generated (otherwise other users could do the same).

@dduportal (Contributor Author)

The build should survive a controller restart even if you had a superfluous agent

Honestly, it's rarely the case. I have no idea why, what, or how, and it feels really complicated (hence my confusion).

What we see on this instance (and also on infra.ci), which mainly use Kubernetes pods, is that when the controller restarts, the builds are stuck and never resume (hence the initial issue).

Most of the time, the pod agent is gone (no idea which system removes it: is it Jenkins, through the kubernetes plugin? A timeout of PID 1? Something else? I have no idea how to track this).

That's why I felt adding the "retry" here could help to restart the build. But this PR might be dangerous, as you mentioned that the maven deployment should not be retried. Maybe we should accept build failures and restart them manually.

What do you think? What direction should we head to?

@jglick (Contributor) commented Sep 8, 2022

the options { retry {} } syntax that I initially used came from the pipeline snippet generator

I think it simply offers any block-scoped step as an option of the same name. Potentially that could be useful in the case of retry, though I am not sure offhand where the retry step would be inserted relative to the node step; part of my discomfort with Declarative (especially the agent directive) is that it does not make it clear from the source code what the build is actually doing when.

At any rate, I checked and retries is offered in the Declarative directive generator for applicable agent types including kubernetes. Probably needs some high-level documentation on jenkins.io.

@jglick (Contributor) commented Sep 8, 2022

the builds are stuck and never resume

As in, stay in progress indefinitely? Or abort after the 5m timeout applied to a missing agent after controller restart?

the pod agent is gone (No idea which system removes it: is it Jenkins - through the kubernetes plugin?

No, it should not. There is of course test coverage in kubernetes-plugin demonstrating the controller restarting while the agent pod continues to run uninterrupted.

adding the "retry" here could help to restart the build

Again, pending jenkinsci/workflow-durable-task-step-plugin#180, only if a controller restart is not involved; and as mentioned, restarting the build (more specifically the node block for the active stage) may or may not be appropriate depending on the workload: generally fine for pure CI, but potentially problematic for deployments.

What direction should we head to?

Well the removal of the superfluous agent for build-only jobs is clearly desirable and you can go ahead with that. As to the rest, you should start by diagnosing what your actual problem is. Is the situation a general tidy controller restart, such as to apply a plugin update, that happens to take place while some long-running build is using an agent? That should just work out of the box. If you are talking about a cluster upgrade which might be destroying agent pods, jenkinsci/workflow-durable-task-step-plugin#180 (and an idempotent stage) is the only option for automated recovery.

@dduportal (Contributor Author)


Interesting. Thanks for your patience and the explanation (and the work involved). Given what you described, I'll update the PR (and its title) to only remove the agent of the parent pipeline and add a "retry" to the packaging pipeline (which is idempotent), but no retry on the release pipeline (as it is NOT idempotent and it does not feel safe to automate resuming it).

I'll update the associated helpdesk issue to ensure that we willingly restart the controller (e.g. by deleting the pod, which is expected to send a STOP signal to the Jenkins process) during the next weekly release, to see what happens exactly (and extract logs).

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
@dduportal changed the title from "chore(pipelines) resume builds when facing a controller restart" to "fix(Jenkins Pipelines) do not allocate agent for "parent" pipelines and retry the packaging process a 2nd time" on Sep 8, 2022
@dduportal dduportal requested review from jglick, timja and lemeurherve and removed request for jglick September 8, 2022 15:45
@dduportal (Contributor Author)

  • retry removed from the (non-idempotent) release pipeline
  • Updated the PR title

WDYT?

@jglick (Contributor) commented Sep 8, 2022

deleting the pod which is expected to send a STOP signal to Jenkins process

Well, TERM but yes.

@dduportal merged commit 0670a76 into jenkins-infra:master on Sep 8, 2022
@dduportal deleted the helpdesk-2925 branch on September 8, 2022 19:21
Successfully merging this pull request may close these issues.

Weekly release build does not resume
4 participants