Terminate builds stuck in queue on infra.ci #3378

Closed
NotMyFault opened this issue Feb 7, 2023 · 8 comments
Assignees
Labels
bug Something isn't working infra.ci.jenkins.io

Comments

@NotMyFault
Member

NotMyFault commented Feb 7, 2023

Service(s)

infra.ci.jenkins.io

Summary

Can someone stop the following couple of jobs, please?

As configured, new builds don't start while one is already in progress, and I can't cancel the running builds myself.

Thanks!

Reproduction steps

No response
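
For context, the "new builds don't start while one is already in progress" behavior typically comes from the disableConcurrentBuilds() job option. A minimal Jenkinsfile sketch (hypothetical job, not the actual configuration of the affected jobs):

// Hypothetical Jenkinsfile illustrating the option that serializes builds;
// the affected infra.ci jobs are assumed to use something similar.
pipeline {
    agent any
    options {
        // Only one build of this job may run at a time, so a build stuck in
        // the queue blocks every later build until it is terminated.
        disableConcurrentBuilds()
    }
    stages {
        stage('Report') {
            steps {
                echo 'placeholder build step'
            }
        }
    }
}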

@NotMyFault NotMyFault added the triage Incoming issues that need review label Feb 7, 2023
@NotMyFault NotMyFault changed the title Terminate stuck builds on infra.ci Terminate builds stuck in queue on infra.ci Feb 7, 2023
@dduportal dduportal added this to the infra-team-sync-2023-02-14 milestone Feb 7, 2023
@dduportal dduportal self-assigned this Feb 7, 2023
@dduportal dduportal added bug Something isn't working and removed triage Incoming issues that need review labels Feb 7, 2023
@dduportal
Contributor

Done, new builds have kicked in.

For the record, we saw logs like the following on almost all stuck builds (not only these two):

18:13:23  Resuming build at Fri Feb 03 17:13:23 UTC 2023 after Jenkins restart
18:13:23  Waiting for reconnection of reports-jira-users-report-main-2948-jzgzg-0fmft-1w8rd before proceeding with build
18:16:15  Resuming build at Fri Feb 03 17:16:15 UTC 2023 after Jenkins restart
18:16:17  Waiting for reconnection of reports-jira-users-report-main-2948-jzgzg-0fmft-1w8rd before proceeding with build
18:21:17  reports-jira-users-report-main-2948-jzgzg-0fmft-1w8rd has been removed for 5 min 0 sec, assuming it is not coming back
18:21:17  Timeout set to expire in 1 hr 41 min
18:21:17  Could not connect to reports-jira-users-report-main-2948-jzgzg-0fmft-1w8rd to send interrupt signal to process
18:21:17  Still paused
20:02:48  Cancelling nested steps due to timeout
20:03:48  Body did not finish within grace period; terminating with extreme prejudice

@NotMyFault
Member Author

It's happening again, same two jobs.

@dduportal dduportal reopened this Feb 12, 2023
@dduportal
Contributor

Status for the weekend (i.e. short-term unblocks):

  • Builds force-terminated; I've captured a few logs just in case
  • infra.ci has been restarted to clean up its memory, "just in case"
  • Root cause identified: this behavior happens each time the controller restarts
  • Checked the pipeline durability setup: it is set globally to "MAX_SURVIVABILITY" and none of the jobs override it, so there is definitely an issue with the pipeline durability behavior.
  • All of the pipelines on this controller suffer from this behavior (a controller restart always ends up with stuck jobs showing the error Body did not finish within grace period; terminating with extreme prejudice). infra.ci is the only controller on the weekly line AND with real-life builds, so we have to check whether ci.jenkins.io shows the same behavior (LTS line, pods defined by admins, and a lot of jobs with a correct retry behavior)

=> Things to check now:

  • Do we have the same behavior on builds using a VM agent (instead of a pod agent)? (hunch)
  • Do we need maximum survivability for all these jobs, given that most of them are stateless (and run regularly)?
  • Do we need the global build timeout at 24h as a safety net (and if we add it, we must check whether it works after a controller upgrade + restart, or whether it behaves like the current pipeline timeouts)?
  • Do we need "retry" instructions with the correct setup to let the controller decide if a new pod is needed? (see the sketch after this list)
  • Check the agent setups: do we use agent.jar directly with a timeout longer than the measured controller restart time?
  • Report the behavior in the Jenkins JIRA
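
As a rough illustration of the "retry", durability, and timeout points above, a Jenkinsfile could look like the sketch below (assumptions: declarative pipeline with the Kubernetes plugin's kubernetesAgent() retry condition available; the pod template file, stage name, and values are illustrative, not settings deployed on infra.ci):

// Illustrative sketch only, not the configuration applied on infra.ci.
pipeline {
    agent none
    options {
        // Relax durability for stateless, regularly scheduled jobs instead of
        // relying on the global MAX_SURVIVABILITY setting.
        durabilityHint('PERFORMANCE_OPTIMIZED')
        // Safety net so a stuck build eventually terminates on its own.
        timeout(time: 4, unit: 'HOURS')
    }
    stages {
        stage('Run') {
            agent {
                kubernetes {
                    yamlFile 'PodTemplate.yaml' // hypothetical pod template in the repository
                }
            }
            options {
                // Let the controller retry the stage on a fresh pod when the
                // original agent is gone (e.g. after a controller restart).
                retry(count: 2, conditions: [kubernetesAgent(), nonresumable()])
            }
            steps {
                sh './run-report.sh' // placeholder step
            }
        }
    }
}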

@smerle33
Contributor

I tried a PAUSE/RESUME when the job was stuck after a controller restart: https://infra.ci.jenkins.io/job/kubernetes-jobs/job/kubernetes-management/job/main/12285/console
and it did resume: 09:21:14 Resuming

@dduportal
Contributor

Update: infra.ci was stuck last Thursday and we had to rollback/re-apply its deployment in Kubernetes.
We observed the following while looking at the various metrics:

(metrics screenshot)

  • The CPU was throttled. It's hard to tell whether it was due to I/O wait, "normal" CPU usage, or both. Let's see if the new disk is OK, or if we have to:
    • Increase to 4 vCPUs
    • Remove the CPU limits/resources for the pod to avoid throttling

@dduportal
Contributor

Today's deployment of 2.392 shows that:

  • IOPS are clearly better
  • CPU is used at 85-100% during startup, but not all the time
  • Memory should be increased and tuned, as the logs mention the following error (before the JVM shuts down and is restarted, which happens 2-3 times before the restart finally succeeds):
java.lang.IllegalArgumentException: committed = 1076887552 should be < max = 1073741824
        at java.management/java.lang.management.MemoryUsage.<init>(MemoryUsage.java:166)
        at java.management/sun.management.MemoryImpl.getMemoryUsage0(Native Method)
        at java.management/sun.management.MemoryImpl.getHeapMemoryUsage(MemoryImpl.java:71)
        at com.codahale.metrics.jvm.MemoryUsageGaugeSet.lambda$getMetrics$7(MemoryUsageGaugeSet.java:58)
        at jenkins.metrics.util.AutoSamplingHistogram.update(AutoSamplingHistogram.java:78)
        at jenkins.metrics.util.AutoSamplingHistogram$PeriodicWorkImpl.doRun(AutoSamplingHistogram.java:128)
        at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:92)
        at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:67)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
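
For reference, the exception above is thrown by java.lang.management while the Metrics plugin samples the heap and finds a committed size larger than the configured maximum. A quick way to look at those values from the Jenkins script console (a diagnostic sketch only, not the fix that was applied):

// Diagnostic sketch for the Jenkins script console: print the heap figures
// that the metrics sampling above is reading.
import java.lang.management.ManagementFactory

def heap = ManagementFactory.memoryMXBean.heapMemoryUsage
println "init=${heap.init} used=${heap.used} committed=${heap.committed} max=${heap.max}"
// The stack trace above corresponds to heap.committed being larger than heap.max.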

@dduportal
Contributor

Opened jenkins-infra/kubernetes-management#3618 to fix this memory management issue.

@dduportal
Contributor

With the Kubernetes 1.24 update in #3387, the controller seems to have restarted properly and jobs continue to work.

Closing (unless we reproduce the issue again).
