Terminate builds stuck in queue on infra.ci #3378

Closed
NotMyFault opened this issue Feb 7, 2023 · 8 comments
Assignees
Labels
bug Something isn't working infra.ci.jenkins.io

Comments

@NotMyFault
Member

NotMyFault commented Feb 7, 2023

Service(s)

infra.ci.jenkins.io

Summary

Can someone stop the following couple of jobs, please?

As configured, new builds don't start while one is already in progress, and I can't cancel the running builds myself.

Thanks!

Reproduction steps

No response
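
For context, the "new builds don't start while one is already in progress" behavior typically comes from the disableConcurrentBuilds() job option. A minimal Jenkinsfile sketch (hypothetical job, not the actual configuration of the affected jobs):

// Hypothetical Jenkinsfile illustrating the option that serializes builds;
// the affected infra.ci jobs are assumed to use something similar.
pipeline {
    agent any
    options {
        // Only one build of this job may run at a time, so a build stuck in
        // the queue blocks every later build until it is terminated.
        disableConcurrentBuilds()
    }
    stages {
        stage('Report') {
            steps {
                echo 'placeholder build step'
            }
        }
    }
}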

@NotMyFault NotMyFault added the triage Incoming issues that need review label Feb 7, 2023
@NotMyFault NotMyFault changed the title Terminate stuck builds on infra.ci Terminate builds stuck in queue on infra.ci Feb 7, 2023
@dduportal dduportal added this to the infra-team-sync-2023-02-14 milestone Feb 7, 2023
@dduportal dduportal self-assigned this Feb 7, 2023
@dduportal dduportal added bug Something isn't working and removed triage Incoming issues that need review labels Feb 7, 2023
@dduportal
Contributor

Done, new builds have kicked in.

For the record, we saw logs like the following on almost all stuck builds (not only these two):

18:13:23  Resuming build at Fri Feb 03 17:13:23 UTC 2023 after Jenkins restart
18:13:23  Waiting for reconnection of reports-jira-users-report-main-2948-jzgzg-0fmft-1w8rd before proceeding with build
18:16:15  Resuming build at Fri Feb 03 17:16:15 UTC 2023 after Jenkins restart
18:16:17  Waiting for reconnection of reports-jira-users-report-main-2948-jzgzg-0fmft-1w8rd before proceeding with build
18:21:17  reports-jira-users-report-main-2948-jzgzg-0fmft-1w8rd has been removed for 5 min 0 sec, assuming it is not coming back
18:21:17  Timeout set to expire in 1 hr 41 min
18:21:17  Could not connect to reports-jira-users-report-main-2948-jzgzg-0fmft-1w8rd to send interrupt signal to process
18:21:17  Still paused
20:02:48  Cancelling nested steps due to timeout
20:03:48  Body did not finish within grace period; terminating with extreme prejudice

@NotMyFault
Member Author

It's happening again, same two jobs.

@dduportal dduportal reopened this Feb 12, 2023
@dduportal
Contributor

Status for the weekend (i.e. short-term unblocks):

  • Builds force-terminated; I've captured a few logs just in case
  • infra.ci has been restarted to clean up its memory, "just in case"
  • Root cause identified: this behavior happens each time the controller restarts
  • Checked the pipeline durability setup: it is set globally to "MAX_SURVIVABILITY" and none of the jobs override it, so there is definitely an issue with the pipeline durability behavior.
  • All of the pipelines on this controller suffer from this behavior (a controller restart always ends up with stuck jobs showing the error Body did not finish within grace period; terminating with extreme prejudice). infra.ci is the only controller on the weekly line AND with real-life builds, so we have to check whether ci.jenkins.io shows the same behavior (LTS line, pods defined by admins, and a lot of jobs with a correct retry behavior)

=> Things to check now:

  • Do we have the same behavior on builds using a VM agent (instead of a pod agent)? (hunch)
  • Do we need maximum survivability for all these jobs, given that most of them are stateless (and run regularly)?
  • Do we need the global build timeout at 24h as a safety net (and if we add it, we must check whether it works after a controller upgrade + restart, or whether it behaves like the current pipeline timeouts)?
  • Do we need "retry" instructions with the correct setup to let the controller decide if a new pod is needed? (see the sketch after this list)
  • Check the agent setups: do we use agent.jar directly with a timeout longer than the measured controller restart time?
  • Report the behavior in the Jenkins JIRA
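
As a rough illustration of the "retry", durability, and timeout points above, a Jenkinsfile could look like the sketch below (assumptions: declarative pipeline with the Kubernetes plugin's kubernetesAgent() retry condition available; the pod template file, stage name, and values are illustrative, not settings deployed on infra.ci):

// Illustrative sketch only, not the configuration applied on infra.ci.
pipeline {
    agent none
    options {
        // Relax durability for stateless, regularly scheduled jobs instead of
        // relying on the global MAX_SURVIVABILITY setting.
        durabilityHint('PERFORMANCE_OPTIMIZED')
        // Safety net so a stuck build eventually terminates on its own.
        timeout(time: 4, unit: 'HOURS')
    }
    stages {
        stage('Run') {
            agent {
                kubernetes {
                    yamlFile 'PodTemplate.yaml' // hypothetical pod template in the repository
                }
            }
            options {
                // Let the controller retry the stage on a fresh pod when the
                // original agent is gone (e.g. after a controller restart).
                retry(count: 2, conditions: [kubernetesAgent(), nonresumable()])
            }
            steps {
                sh './run-report.sh' // placeholder step
            }
        }
    }
}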

@smerle33
Contributor

I tried a PAUSE/RESUME when the job was stuck after a controller restart: https://infra.ci.jenkins.io/job/kubernetes-jobs/job/kubernetes-management/job/main/12285/console
and it did resume: 09:21:14 Resuming

@dduportal
Contributor

Update: infra.ci was stuck last Thursday and we had to rollback/re-apply its deployment in Kubernetes.
We observed the following while looking at the various metrics:

(metrics screenshot)

  • The CPU was throttled. It's hard to tell whether it was due to I/O wait, "normal" CPU usage, or both. Let's see if the new disk is OK, or if we have to:
    • Increase to 4 vCPUs
    • Remove the CPU limits/resources for the pod to avoid throttling

@dduportal
Contributor

Today's deployment of 2.392 shows that:

  • IOPS are clearly better
  • CPU is used at 85-100% during startup, but not all the time
  • Memory should be increased and tuned, as the logs mention the following error (before the JVM shuts down and is restarted, which happens 2-3 times before the restart finally succeeds):
java.lang.IllegalArgumentException: committed = 1076887552 should be < max = 1073741824
        at java.management/java.lang.management.MemoryUsage.<init>(MemoryUsage.java:166)
        at java.management/sun.management.MemoryImpl.getMemoryUsage0(Native Method)
        at java.management/sun.management.MemoryImpl.getHeapMemoryUsage(MemoryImpl.java:71)
        at com.codahale.metrics.jvm.MemoryUsageGaugeSet.lambda$getMetrics$7(MemoryUsageGaugeSet.java:58)
        at jenkins.metrics.util.AutoSamplingHistogram.update(AutoSamplingHistogram.java:78)
        at jenkins.metrics.util.AutoSamplingHistogram$PeriodicWorkImpl.doRun(AutoSamplingHistogram.java:128)
        at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:92)
        at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:67)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
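
For reference, the exception above is thrown by java.lang.management while the Metrics plugin samples the heap and finds a committed size larger than the configured maximum. A quick way to look at those values from the Jenkins script console (a diagnostic sketch only, not the fix that was applied):

// Diagnostic sketch for the Jenkins script console: print the heap figures
// that the metrics sampling above is reading.
import java.lang.management.ManagementFactory

def heap = ManagementFactory.memoryMXBean.heapMemoryUsage
println "init=${heap.init} used=${heap.used} committed=${heap.committed} max=${heap.max}"
// The stack trace above corresponds to heap.committed being larger than heap.max.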

@dduportal
Contributor

Opened jenkins-infra/kubernetes-management#3618 to fix this memory management issue.

@dduportal
Contributor

With the Kubernetes 1.24 update in #3387, the controller seems to have restarted properly and jobs continue to work.

Closing (unless we reproduce the issue again).
