macOS builders randomly stop working on GitHub Actions #71988
Comments
Two new ones in a row from #72083:
This broke the 2nd rollup in a row: #72187. There's no error, just a red cross next to "upload artifacts to S3".
We have a call with GitHub today. If there is no news, we'll roll back the gate on GHA.
GitHub is going to continue looking into this ❤️ In the meantime I'll remove the double-gate, to let the queue run more smoothly.
Removed the gate on GHA.
@pietroalbini thanks!
It should.
Possibly related: https://github.com/rust-lang-ci/rust/runs/685622726
Just got another failure related to this:
Might be related to rust-lang/rust#71988
I am seeing similar behavior in another Rust project, so I'm happy to see GitHub is looking into this issue.
Thanks to everyone for their patience on this issue. I work on the GitHub Actions compute platform that powers our runners and wanted to share an update. We're aware of the issue and have multiple members of our team across our entire hardware stack actively working on it. We've been able to identify a specific data center that may be the primary failure point. We're rapidly building out additional telemetry and alerting to help us find the root cause. Stay tuned; we want to fix this as soon as possible.
@thejoebourneidentity Thanks for the insight into the bug investigation and for reiterating that you and the team are working on it.
@pietroalbini We're continuing to look at this, and I wanted to confirm whether you are still seeing the exact same disconnect symptoms. We've been able to curb the cases where the runner disconnects and are now focusing more on perf issues that might be leading to timeouts. I browsed a few workflows and noticed timeouts such as this. Is this a good illustration of the current problem, or are you still seeing disconnects at a high rate? Thanks!
@alepauly thanks for getting back to us! Indeed, it seems like the runner disconnects are not appearing anymore! I'll closely monitor the build results in the coming week to see if they reappear.
Regarding the cancelled builds, like the one you mentioned, that's a mistake on our end 😅. Since our build matrix is so big, we built an action to cancel builds once a new commit is pushed to the branch. Normally this is not a problem, since our queue bot only pushes a new commit to the test branch when the previous one finishes. However, we're not gating our builds on the macOS builders yet due to this issue, so the queue bot isn't waiting for them to finish before pushing the new commit, which causes the action to cancel the build. I opened a PR to disable the build cancellation for the macOS jobs, so the cancellations should disappear as soon as the PR lands.
If we don't see any other spurious failures for a while, we'll try to gate our builds on macOS again! Thanks again for looking into this issue ❤️
…r=Mark-Simulacrum
Disable cancel-outdated-builds for auto-fallible
`cancel-outdated-builds` doesn't need to be enabled on fallible jobs, and it's actually making it harder for us to see if rust-lang#71988 is fixed. This adds some temporary code to prevent `auto-fallible` jobs from being cancelled by our tooling.
r? @Mark-Simulacrum
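For readers unfamiliar with the setup, the exemption described above can be sketched roughly as follows. This is only an illustration in Python, not the actual code of the `cancel-outdated-builds` action: the `Run` record, the `should_cancel` helper, and the idea of matching exempt jobs by a name prefix are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Run:
    """Hypothetical record describing one in-progress CI job."""
    job_name: str
    branch: str
    commit: str


def should_cancel(
    run: Run,
    latest_commit: Dict[str, str],
    exempt_prefixes: Tuple[str, ...] = ("auto-fallible",),
) -> bool:
    """Return True if the run is outdated and not exempt from cancellation.

    latest_commit maps a branch name to the newest commit pushed to it.
    Jobs whose name starts with an exempt prefix are never cancelled
    (assumed matching rule, for illustration only).
    """
    if run.job_name.startswith(exempt_prefixes):
        return False
    newest = latest_commit.get(run.branch)
    # Cancel only when a different, newer commit is known for this branch.
    return newest is not None and newest != run.commit
```

With this shape, a gated job running on an old commit is cancelled as soon as the branch moves, while a job marked as fallible keeps running to completion, so its failures stay visible.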
It seems a runner disappeared only one time across the whole week, which is definitely an improvement! During the next weekly infrastructure team meeting I'll propose trying to double-gate macOS on both Azure and GitHub Actions, to see what the impact would be on our queue size.
We'll look at it on our end. We expect this to be better but not fully solved yet. Thanks for the report.
…crum
Gate macOS on both Azure and GHA
As discussed in the previous infrastructure team meeting, this PR gates macOS builds on both GHA and Azure. Once this is merged we'll wait a week or two to see if there is a troublesome rate of spurious failures, and if not we'll remove the builds on the Azure side.
r? `@Mark-Simulacrum`
cc rust-lang#71988
We haven't seen this for months, so I'm closing this. Thanks to everyone involved in fixing this! ❤️
We started noticing that occasionally one of our macOS builders stops working, marking the job as either failed or canceled and providing no logs for the build. Examples of such failed builds are: