
ubuntu-latest: jobs fail with error code 143 #6680

Closed · 2 of 11 tasks · nicktrav opened this issue Dec 2, 2022 · 16 comments
Assignee: mikhailkoliada
Labels: investigate (Collect additional information, like space on disk, other tool incompatibilities etc.), OS: Ubuntu

nicktrav (Author) commented Dec 2, 2022

Description

We recently started seeing a high rate of failure in runs of a job that runs Go tests with the race detector on Linux runners (ubuntu-latest). We see the following in the logs, but lack the context to say why the process receives the SIGTERM:

2022-12-01T17:39:21.3663768Z make: *** [Makefile:22: test] Terminated
2022-12-01T17:39:21.5137635Z ##[error]Process completed with exit code 143.
2022-12-01T17:39:21.5192954Z ##[error]The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.
2022-12-01T17:39:21.7086950Z Cleaning up orphan processes

This seems to be less of an issue with the codebase itself (the same set of tests passes under stress on dedicated Linux workstations and cloud VMs) and more with the Actions runner VMs. That said, the failure rate seems to have increased markedly after a recent change to the codebase.

We're speculating that we are hitting some kind of resource limit due to the recent code change, though it's hard to say definitively.

More context in cockroachdb/pebble#2159.

Platforms affected

  • Azure DevOps
  • GitHub Actions - Standard Runners
  • GitHub Actions - Larger Runners

Runner images affected

  • Ubuntu 18.04
  • Ubuntu 20.04
  • Ubuntu 22.04
  • macOS 10.15
  • macOS 11
  • macOS 12
  • Windows Server 2019
  • Windows Server 2022

Image version and build link

  Image: ubuntu-22.04
  Version: 20221119.2

Is it regression?

No - we've seen the same job passing with the same image.

Expected behavior

The job should complete without error.

Actual behavior

Job fails with exit code 143.

Repro steps

Run the linux-race job in the Pebble repo (e.g., via a PR). NOTE: we've since temporarily disabled that job until we resolve this particular issue.

We used cockroachdb/pebble#2158 to bisect down to the code change that increased the failure rate, though it's not clear why it's failing with error code 143.
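
If it does turn out to be a resource limit on the standard 2-core runners, one mitigation would be to cap how much of the machine the race tests try to use at once. A minimal sketch of a workflow step (the step, flags, and Go version below are illustrative, not what the Pebble Makefile currently does):

jobs:
  linux-race:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-go@v3
        with:
          go-version: '1.19'
      # Illustrative step: -p caps how many packages are tested concurrently,
      # -parallel caps parallel tests within a package, keeping peak CPU and
      # memory closer to what a 2-core hosted runner can absorb.
      - name: Run race tests with reduced parallelism
        run: go test -race -p 2 -parallel 2 -timeout 30m ./...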

mikhailkoliada (Contributor) commented:

@nicktrav Hello! My initial suspicion is that you are hitting the runner's limits and we cannot do anything here; you would need to either reduce resource usage or switch to larger runners.
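
(For reference, once larger runners have been set up in the organization settings, switching a job over is just a runs-on change. The label below is hypothetical; each organization names its own larger runners.)

jobs:
  linux-race:
    # Hypothetical label: use whatever name the larger runner was given
    # when it was created in the organization settings.
    runs-on: ubuntu-22.04-8core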

Could you post a link to the failed build, please? (The link does not have to be world-readable; we do not need the content of the job, just the link itself.)

mikhailkoliada (Contributor) commented:

@nicktrav Never mind, I found what I was looking for.

mikhailkoliada self-assigned this Dec 2, 2022
mikhailkoliada added the OS: Ubuntu and investigate labels and removed the bug report and needs triage labels Dec 2, 2022
nicktrav (Author) commented Dec 2, 2022

hitting the runner's limits and we cannot do anything here; you would need to either reduce resource usage or switch to larger runners

Is there any way to tell whether we're hitting that limit? Are there graphs or metrics somewhere we can look?

mikhailkoliada (Contributor) commented:

@nicktrav Unfortunately we do not publish any data like this, but I found that the runner went down due to a high CPU usage rate. Sadly, we cannot do anything on our side, as I said above. The only alternatives are self-hosted runners or larger runners.

If you have questions, feel free to reach out to us again!
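
In the absence of published runner-side metrics, one way to get some visibility is to log memory and load from inside the job itself. A sketch, assuming a bash shell and a make test entry point like the one in the logs above:

      # Illustrative diagnostic step: samples free memory and the load average
      # every 15 seconds while the tests run, so resource pressure is visible
      # in the job log even if the runner is terminated mid-run.
      - name: Run tests with resource logging
        shell: bash
        run: |
          ( while true; do date -u; free -m; uptime; sleep 15; done ) &
          monitor_pid=$!
          make test && status=0 || status=$?
          kill "$monitor_pid" || true
          exit "$status"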

svenjacobs commented Dec 5, 2022

We've been facing the same problem for a few days. Builds are cancelled at various stages and execution times. Did the behaviour change for the Ubuntu 22.04 standard runners, which have been the default since December 1, 2022? We never had this kind of problem with Ubuntu 20.04.

mikhailkoliada (Contributor) commented:
@svenjacobs Hi! No, the runners are the same.

JakubMosakowski commented:
The same problem started to occur in our repo. Is there any way to check whether the runner hit its limits?

Our builds end with:

2022-12-06T15:40:04.3131651Z AAPT2 aapt2-7.3.0-8691043-linux Daemon #0: shutdown
2022-12-06T15:40:04.3150422Z ##[error]The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.
2022-12-06T15:40:04.4473398Z ##[error]The operation was canceled.

at random stages of the build.

nicktrav (Author) commented Dec 6, 2022

Unfortunately we do not publish any data like this, but I found that the runner went down due to a high CPU usage rate

Thanks for digging into it @mikhailkoliada.

If I might provide some feedback on the product: it would be nice to be able to tell why a runner is failing in these situations. The "The runner has received a shutdown signal" message is very vague, and it required us to open an issue here just to confirm that we're seeing high CPU on the runner.

JakubMosakowski commented Dec 7, 2022

What fixed the issue for me was reducing the amount of memory used by the Gradle JVM (replacing org.gradle.jvmargs=-Xmx6g with org.gradle.jvmargs=-Xmx4g).

https://docs.gradle.org/current/userguide/build_environment.html#sec:configuring_jvm_memory

EDIT: It doesn't work anymore. Maybe it was just a fluke that the build started passing after changing those options.
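
For anyone else experimenting with this, a sketch of how such a cap can be applied per job without editing gradle.properties (the 4g value and the build task are placeholders, sized against the roughly 7 GB of RAM a standard hosted Linux runner provides):

      - name: Build
        env:
          # Placeholder cap: keep the Gradle daemon's heap well below the
          # memory available on a standard hosted Linux runner.
          GRADLE_OPTS: -Dorg.gradle.jvmargs=-Xmx4g
        run: ./gradlew build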

jlucktay added commits to ovotech/go-sync that referenced this issue on Sep 11 and Sep 12, 2023
PGMacDesign2 commented:
This is still an issue for us, and we have no fix other than scheduling an elaborate retry loop to re-run certain builds.
Is this still being actively worked on? Are there any workarounds or solutions that might address this issue?
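
Since a retry loop inside the job cannot survive the runner itself being shut down, the retry has to happen at the run level. One sketch of that, using a companion workflow and the preinstalled GitHub CLI (the "CI" workflow name and the single-retry cap are assumptions):

# Hypothetical companion workflow: when the main "CI" workflow fails,
# re-run only its failed jobs, at most one extra attempt.
name: retry-failed-ci
on:
  workflow_run:
    workflows: ["CI"]        # assumed name of the workflow to retry
    types: [completed]

jobs:
  rerun:
    if: >
      github.event.workflow_run.conclusion == 'failure' &&
      github.event.workflow_run.run_attempt < 2
    runs-on: ubuntu-latest
    permissions:
      actions: write         # required to re-run workflow runs
    steps:
      - name: Re-run failed jobs of the triggering run
        env:
          GH_TOKEN: ${{ github.token }}
          GH_REPO: ${{ github.repository }}
        run: gh run rerun ${{ github.event.workflow_run.id }} --failed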

joeflack4 commented:
@mikhailkoliada @nicktrav How are you able to ascertain whether it is a CPU issue? I can't find the cause of my failure, and I think many others are having the same problem.

Does anyone know where, if anywhere, GitHub has published its CPU limits?
It was very hard to find this article, which lists the memory and disk limits. It mentions what kind of CPU the runner uses, but says nothing about a limit.

KarthikAyikkathil100 commented:
Is there any update on this?

shojaeix commented Nov 5, 2024

This happened in a pipeline of mine that was applying a Terraform plan to AWS. The pipeline ran for 6 minutes and terminated with error 143.

Then I restarted it; it took 12 minutes and finished successfully! Very weird.
