
fix(Jenkins Pipelines) do not allocate agent for "parent" pipelines and retry the packaging process a 2nd time #283

Merged 3 commits into jenkins-infra:master from helpdesk-2925 on Sep 8, 2022

Conversation

@dduportal (Contributor)

Closes jenkins-infra/helpdesk#2925

This PR introduces two main changes (a rough sketch of both is shown right after this list):

  • Stop allocating a node for the "top-level" job: it will run in a "lightweight executor" (e.g. a JVM thread) on the controller, which is expected to survive a controller restart. Fewer moving pieces, less code, less resource usage.
  • Enable the "retry" top-level directive thanks to @jglick's recent work. This directive ensures that in most usual cases (such as a controller restart), the pipeline should resume. Tested manually: if the sh step was started, then the pipeline continues it when resuming.
    • There are still some edge cases where the pipeline does not resume: if the pipeline instruction is not durable.
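
For illustration, a rough sketch of the shape of each change in a Declarative Jenkinsfile (not the exact diff; the `agent none` form and the placeholder stage/job names are assumptions, and the retry conditions shown are the ones initially proposed here):

// 1. top-level "parent" pipeline: no node allocated, only controller-side steps such as `build`
pipeline {
  agent none
  stages {
    stage('Trigger') {
      steps {
        build job: 'a-downstream-packaging-job' // placeholder job name
      }
    }
  }
}

// 2. downstream pipelines: retry option as initially proposed in this PR
options {
  disableConcurrentBuilds()
  retry(conditions: [agent(), kubernetesAgent(handleNonKubernetes: true), nonresumable()], count: 2)
}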

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
@dduportal requested a review from a team as a code owner on September 8, 2022 10:22
@dduportal (Contributor Author)

If this PR gets reviewed, and if it makes sense and is approved, then we should be able to test it during the next weekly release (the 13th of September 2022).

@dduportal (Contributor Author)

Ping @timja @MarkEWaite @lemeurherve @NotMyFault: I would like multiple reviews on this one, for the sake of knowledge sharing and to feel safer :)

For info, I've tested with manual pipeline jobs on release.ci, but only with a sh 'sleep 120' step.
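
(For reference, the manual test job was essentially of this shape; a reconstruction, not the exact job, with a placeholder inline pod definition and the retry option from this PR:)

pipeline {
  agent {
    kubernetes {
      // any pod-based agent works for this test; inline YAML used here as a placeholder
      yaml '''
        apiVersion: v1
        kind: Pod
        spec:
          containers:
          - name: jnlp
            image: jenkins/inbound-agent:latest
      '''
    }
  }
  options {
    retry(conditions: [agent(), kubernetesAgent(handleNonKubernetes: true), nonresumable()], count: 2)
  }
  stages {
    stage('Sleep') {
      steps {
        // long-running durable step: restart the controller while it runs and check that the build resumes
        sh 'sleep 120'
      }
    }
  }
}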

@jglick (Contributor) left a comment

Incorrect retry syntax.

@@ -66,6 +66,7 @@ pipeline {
   options {
     disableConcurrentBuilds()
     buildDiscarder logRotator(numToKeepStr: '15') // Retain only last 15 builds to reduce space requirements
+    retry(conditions: [agent(), kubernetesAgent(handleNonKubernetes: true), nonresumable()], count: 2)

Suggested change
-    retry(conditions: [agent(), kubernetesAgent(handleNonKubernetes: true), nonresumable()], count: 2)
+    retry(conditions: [kubernetesAgent(handleNonKubernetes: true), nonresumable()], count: 2)

(The whole point of handleNonKubernetes: true is that you do not also specify agent.)

Anyway I am not sure if this syntax even works—I have not tried it—and you are working too hard. The tested syntax for Declarative is

agent {
  kubernetes {
    // …as before
    retries 2
  }
}

applied above to

agent {
  kubernetes {
    yamlFile 'PodTemplates.d/package-linux.yaml'
    workingDir '/home/jenkins/agent'
  }
}
(GH suggestions do not allow this)
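
Putting the two together, the resulting agent block would look something like this (a sketch of the suggestion, not a tested change):

agent {
  kubernetes {
    yamlFile 'PodTemplates.d/package-linux.yaml'
    workingDir '/home/jenkins/agent'
    // re-run the enclosing node block up to 2 times if the pod agent is lost
    retries 2
  }
}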

@@ -45,6 +45,7 @@ pipeline {

   options {
     disableConcurrentBuilds()
+    retry(conditions: [agent(), kubernetesAgent(handleNonKubernetes: true), nonresumable()], count: 2)

as above

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
@jglick (Contributor) commented Sep 8, 2022

it will run in a "lightweight executor" (e.g. a JVM thread) on the controller

Actually, no, Pipeline CPS code does not consume a JVM thread permanently. (Temporarily uses a JVM thread while running Groovy, then exits it when switching to a step like build.)

which is expected to survive controller restart

This is sort of mangled. The build should survive a controller restart even if you had a superfluous agent. That is a core aspect of Pipeline from its initial design, and applies in particular to K8s agents.

This directive ensures that in most usual cases (such as a controller restart), the pipeline should resume.

Again this is confused. retry is not necessary to ensure that, and does not contribute to that at all. Rather it ensures that if the agent dies for some reason, the build can restart the whole node block rather than aborting. (And that does not work across controller restarts without jenkinsci/workflow-durable-task-step-plugin#180. Until that PR ships, retry here will help only in cases where the agent dies while the controller is up and running.)
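
(Concretely, in Scripted terms the conditional retry wraps the node block, roughly like this; an untested sketch with a placeholder label and step:)

retry(count: 2, conditions: [kubernetesAgent(handleNonKubernetes: true), nonresumable()]) {
  node('container-agent') { // placeholder label
    sh './package.sh'       // placeholder for whatever the stage actually runs
  }
}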

So while adding retry here may be a good idea—if the agent block is idempotent (which is not at all obvious in the case of Jenkinsfile.d/core/release, especially seeing as maven-release-plugin is notoriously not idempotent)—it is not addressing the supposed issue mentioned in the PR title. Similarly, removing an unused agent/executor from a build which only runs controller-based steps such as build is certainly a good idea to avoid wasting money on cloud costs, but it is not necessary to address the issue mentioned in the PR title.

if the pipeline instruction is not durable

Is there a specific example you have in mind? The nonresumable condition (for Declarative, implied in using the retries option) covers non-durable steps such as checkout.

@jglick (Contributor) left a comment

Fine except for the PR title.

@dduportal (Contributor Author)

Thanks @jglick for the review and the pointers. I've updated the PR: does it match what you suggested?

Side note: the options { retry {} } syntax that I initially used came from the pipeline snippet generator. This seems important to mention: if it is not expected, then we should block it from being generated (otherwise other users could do the same).

@dduportal (Contributor Author)

The build should survive a controller restart even if you had a superfluous agent

Honestly, it's rarely the case. I have no idea why, what, or how, and it feels really complicated (hence my confusion).

What we see on this instance (and also on infra.ci), which mainly use Kubernetes pods, is that when the controller restarts, the builds are stuck and never resume (hence the initial issue).

Most of the time, the pod agent is gone (no idea which system removes it: is it Jenkins, through the kubernetes plugin? A timeout of PID 1? Something else? I have no idea how to track this).

That's why I felt adding the "retry" here could help to restart the build. But this PR might be dangerous, as you mentioned that the maven deployment should not be retried. Maybe we should accept build failures and restart them manually.

What do you think? What direction should we head to?

@jglick (Contributor) commented Sep 8, 2022

the options { retry {} } syntax that I initially used came from the pipeline snippet generator

I think it simply offers any block-scoped step as an option of the same name. Potentially that could be useful in the case of retry, though I am not sure offhand where the retry step would be inserted relative to the node step; part of my discomfort with Declarative (especially the agent directive) is that it does not make it clear from the source code what the build is actually doing when.

At any rate, I checked and retries is offered in the Declarative directive generator for applicable agent types including kubernetes. Probably needs some high-level documentation on jenkins.io.

@jglick (Contributor) commented Sep 8, 2022

the builds are stuck and never resume

As in, stay in progress indefinitely? Or abort after the 5m timeout applied to a missing agent after controller restart?

the pod agent is gone (No idea which system removes it: is it Jenkins - through the kubernetes plugin?

No, it should not. There is of course test coverage in kubernetes-plugin demonstrating the controller restarting while the agent pod continues to run uninterrupted.

adding the "retry" here could help to restart the build

Again, pending jenkinsci/workflow-durable-task-step-plugin#180, only if a controller restart is not involved; and as mentioned, restarting the build (more specifically the node block for the active stage) may or may not be appropriate depending on the workload: generally fine for pure CI, but potentially problematic for deployments.

What direction should we head to?

Well the removal of the superfluous agent for build-only jobs is clearly desirable and you can go ahead with that. As to the rest, you should start by diagnosing what your actual problem is. Is the situation a general tidy controller restart, such as to apply a plugin update, that happens to take place while some long-running build is using an agent? That should just work out of the box. If you are talking about a cluster upgrade which might be destroying agent pods, jenkinsci/workflow-durable-task-step-plugin#180 (and an idempotent stage) is the only option for automated recovery.

@dduportal (Contributor Author)


Interesting. Thanks for your patience and the explanation (and the work involved). Given what you described, I'll update the PR (and its title) to only remove the agent of the parent pipeline and add a "retry" to the packaging pipeline (which is idempotent), but no retry on the release pipeline (as it is NOT idempotent and it does not feel safe to automate resuming it).

I'll update the associated helpdesk issue to ensure that we willingly restart the controller (e.g. by deleting the pod, which is expected to send a STOP signal to the Jenkins process) during the next weekly release, to see what happens exactly (and extract logs).

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
@dduportal changed the title from "chore(pipelines) resume builds when facing a controller restart" to "fix(Jenkins Pipelines) do not allocate agent for "parent" pipelines and retry the packaging process a 2nd time" on Sep 8, 2022
@dduportal dduportal requested review from jglick, timja and lemeurherve and removed request for jglick September 8, 2022 15:45
@dduportal (Contributor Author)

  • retry removed from the (non-idempotent) release pipeline
  • Updated the PR title

WDYT?

@jglick (Contributor) commented Sep 8, 2022

deleting the pod which is expected to send a STOP signal to Jenkins process

Well, TERM but yes.

@dduportal merged commit 0670a76 into jenkins-infra:master on Sep 8, 2022
@dduportal deleted the helpdesk-2925 branch on September 8, 2022 19:21
Successfully merging this pull request may close these issues.

Weekly release build does not resume
4 participants