
macOS builders randomly stop working on GitHub Actions #71988

Closed
pietroalbini opened this issue May 7, 2020 · 18 comments
Labels
A-github-actions Area: GitHub Actions (GHA) A-spurious Area: Spurious failures in builds (spuriously == for no apparent reason) O-macos Operating system: macOS T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue.

Comments

@pietroalbini pietroalbini added A-spurious Area: Spurious failures in builds (spuriously == for no apparent reason) T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue. A-github-actions Area: GitHub Actions (GHA) labels May 7, 2020
RalfJung (Member) commented May 14, 2020

This broke the 2nd rollup in a row: #72187
Makes it kind of hard to land anything, really.^^

There's no error, just a red cross next to "upload artifacts to S3".

@pietroalbini (Member Author)

We have a call with GitHub today; if there is no news, we'll roll back the gate on GHA.

@pietroalbini (Member Author)

GitHub is going to continue looking into this ❤️

In the meantime I'll remove the double-gate, to let the queue run more smoothly.

@pietroalbini (Member Author)

Removed the gate on GHA.

@RalfJung (Member)

@pietroalbini thanks!
Does this take effect on the already running #72202 (comment)?

@pietroalbini (Member Author)

It should.


pietroalbini (Member Author) commented Jul 3, 2020

Just got another failure related to this:

@youknowone (Contributor)

I am having a similar experience in another Rust project, so I'm happy to see that GitHub is looking into this issue.

thejoebourneidentity commented Jul 29, 2020

Thanks to everyone for their patience on this issue. I work on the GitHub Actions compute platform that powers our runners and wanted to share an update.

We're aware of the issue and have multiple members of our team across our entire hardware stack actively working on it. We've been able to identify a specific data center that may be the primary failure point. We're rapidly building out additional telemetry and alerting to help us find the root cause. Stay tuned, we want to fix this as soon as possible.

@matthewmccullough

@thejoebourneidentity Thanks for the insight into the bug investigation and for reiterating that you and the team are working on it.

alepauly commented Aug 27, 2020

@pietroalbini We are continuing to look at this, and I wanted to confirm you are still seeing the exact same disconnect symptoms. We've been able to curb the cases where the runner disconnects and are now focusing more on perf issues that might be leading to timeouts. I browsed a few workflows and noticed timeouts such as this. Is this a good illustration of the current problem, or are you still seeing disconnects at a high rate? Thanks!

@pietroalbini (Member Author)

@alepauly thanks for getting back to us! Indeed, it seems like the runner disconnects are not appearing anymore! I'll closely monitor the build results in the coming week to see if they reappear.

Regarding the cancelled builds, like the one you mentioned, that's a mistake on our end 😅. Since our build matrix is so big, we built an action that cancels a build once a new commit is pushed to the branch. Normally this is not a problem, because our queue bot only pushes a new commit to the test branch when the previous one finishes. Due to this issue, though, we're not yet gating our builds on the macOS builders, so the queue bot isn't waiting for them to finish before pushing the new commit, which causes the action to cancel the build.

I opened a PR to disable the build cancellation for the macOS jobs, so the cancellations should disappear as soon as the PR lands. If we don't see any other spurious failure for a while we'll try to gate our builds on macOS again!

Thanks again for looking into this issue ❤️
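
For readers outside the project: the cancellation tooling mentioned above was a custom `cancel-outdated-builds` action in rust-lang's CI at the time. As a minimal sketch of the same idea, and assuming a present-day workflow rather than the actual rust-lang setup, GitHub Actions' built-in `concurrency` setting cancels the still-running workflow for a branch whenever a new commit is pushed to it:

```yaml
# Illustrative sketch only, not rust-lang's actual CI configuration.
name: ci

on:
  push:
    branches: [auto]          # stand-in for the branch the queue bot pushes to

# Any newer push to the same ref cancels the in-progress run,
# mirroring what the custom cancel-outdated-builds action did.
concurrency:
  group: ci-${{ github.ref }}
  cancel-in-progress: true

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: run the test suite
        run: ./x.py test      # placeholder build/test command
```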

bors added a commit to rust-lang-ci/rust that referenced this issue Aug 27, 2020
…r=Mark-Simulacrum

Disable cancel-outdated-builds for auto-fallible

`cancel-outdated-builds` doesn't need to be enabled on fallible jobs, and it's actually making it harder for us to see if rust-lang#71988 is fixed. This adds some temporary code to prevent `auto-fallible` jobs from being cancelled by our tooling.

r? @Mark-Simulacrum
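
For context on what a fallible job means in GHA terms, here is a rough sketch; the job name, guard, and commands are assumptions, not the actual change, which lives in rust-lang's CI tooling. The job is marked `continue-on-error` so its failures don't fail the run, and the hypothetical cancellation step is guarded so it is skipped on fallible jobs:

```yaml
# Illustrative sketch only; the real change was made in rust-lang's CI scripts.
jobs:
  auto-fallible:
    runs-on: macos-latest
    # Failures of this job show up in the UI but do not fail the workflow run,
    # so it does not gate the merge.
    continue-on-error: true
    steps:
      - uses: actions/checkout@v4
      - name: cancel outdated builds
        # Hypothetical guard: skip the cancellation tooling on fallible jobs.
        if: ${{ !contains(github.job, 'fallible') }}
        run: echo "cancel-outdated-builds would run here"
      - name: build and test
        run: ./x.py test      # placeholder command
```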
@pietroalbini (Member Author)

It seems a runner disappeared only one time across the whole week, which is definitely an improvement!

I'll propose at the next weekly infrastructure team meeting that we try to double-gate macOS on both Azure and GitHub Actions, to see what the impact on our queue size would be.

alepauly commented Sep 3, 2020

> It seems a runner disappeared only one time across the whole week, which is definitely an improvement!

We'll look at it on our end; we expect this to be better, but not fully solved yet. Thanks for the report.

bors added a commit to rust-lang-ci/rust that referenced this issue Sep 15, 2020
…crum

Gate macOS on both Azure and GHA

As discussed in the previous infrastructure team meeting, this PR gates macOS builds on both GHA and Azure. Once this is merged we'll wait a week or two to see if there is a troublesome rate of spurious failures, and if not we'll remove the builds on the Azure side.

r? `@Mark-Simulacrum`
cc rust-lang#71988
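
To picture the GHA half of that double-gating, here is an assumed minimal matrix, not the generated rust-lang workflow: the macOS jobs simply run on hosted `macos-latest` runners, while the gating decision itself is made by the bors queue, which requires both the GHA and the Azure Pipelines builds to pass before merging.

```yaml
# Illustrative sketch only; target list and entry-point script are examples.
jobs:
  macos:
    strategy:
      matrix:
        target: [x86_64-apple-darwin, aarch64-apple-ios]
    runs-on: macos-latest
    steps:
      - uses: actions/checkout@v4
      - name: build and test ${{ matrix.target }}
        run: ./ci/build-and-test.sh "${{ matrix.target }}"   # hypothetical script
```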
pietroalbini (Member Author) commented Nov 16, 2020

We haven't seen this anymore for months, so I'm closing the issue. Thanks to everyone involved in fixing it! ❤️
