
GitHub Actions - sometimes... our 'host is unreachable' #2890

Closed
1 of 7 tasks
trel opened this issue Mar 9, 2021 · 23 comments

Comments

@trel

trel commented Mar 9, 2021

Description

We are using GitHub Actions to install packages from our own apt/yum repository, hosted on a public VM as raw directories served by Apache. We are not seeing any problems from anywhere else... however...

Sometimes - roughly 25% of the time - the GitHub Action cannot reach https://unstable.irods.org.

DNS resolves correctly; the IP address is right.

Area for Triage:

Containers

Question, Bug, or Feature?:

Bug

Virtual environments affected

  • Ubuntu 16.04
  • Ubuntu 18.04
  • Ubuntu 20.04
  • macOS 10.15
  • macOS 11.0
  • Windows Server 2016 R2
  • Windows Server 2019

Image version

Version: 20210302.0

Expected behavior

Expected to be able to see/use our server.

Actual behavior

From:

https://github.com/irods/irods/blob/master/.github/workflows/build-irods.yml#L31

We see this in the Action logs:

W: Failed to fetch https://unstable.irods.org/apt/dists/bionic/InRelease  Could not connect to unstable.irods.org:443 (152.54.5.173), connection timed out
W: Some index files failed to download. They have been ignored, or old ones used instead.
E: Unable to locate package irods-externals-*
E: Couldn't find any package by glob 'irods-externals-*'
E: Couldn't find any package by regex 'irods-externals-*'

Repro steps

Commits to https://github.com/irods/irods trigger builds - sometimes they fail. Manual retries eventually connect and complete their work.

This has the feel of a firewall somewhere between GitHub and our VM that is throttling connections, perhaps IP-based. Is there any way to detect / determine this?
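For reference, a rough diagnostic sketch that could be dropped into the workflow as a throwaway step; the timeout and probe counts are arbitrary, and mtr-tiny is assumed to be installable on the Ubuntu runner:

# time each phase of a single HTTPS connection to the repository host
curl -sS -o /dev/null --connect-timeout 30 \
     -w 'dns=%{time_namelookup}s connect=%{time_connect}s total=%{time_total}s http=%{http_code}\n' \
     https://unstable.irods.org/
# trace the path to see where packets stop when the connect hangs
sudo apt-get install -y mtr-tiny
mtr -rwz -c 20 unstable.irods.org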

@AlenaSviridenko
Contributor

Hi @trel,
we host our images in Azure. Please create a VM there, and if the issue appears on that VM as well, please contact Azure support.

@trel
Author

trel commented Mar 10, 2021

I have reproduced the failure with fewer moving parts - just a curl call.

https://github.com/trel/irods/pull/9/checks?check_run_id=2079836230

I'll look into trying a similar curl in vanilla Azure.
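A sketch of the kind of loop that can quantify the failure rate from inside a runner (the attempt count, timeout, and sleep are arbitrary, not values from the linked check):

fail=0
for i in $(seq 1 20); do
  # same target the workflow hits; count attempts that cannot connect in time
  curl -sS -o /dev/null --connect-timeout 10 https://unstable.irods.org/ || fail=$((fail+1))
  sleep 5
done
echo "failed ${fail} of 20 attempts"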

@trel
Author

trel commented Mar 15, 2021

In some further testing, curl and wget always work fine. Only when hitting these servers with apt update do they fail to connect.

Could this be an "apt and https" issue? As in, is the proxy not configured correctly for apt over https within some of the Azure containers?
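If a misconfigured proxy were the cause, it should show up in apt's own configuration; a minimal check, assuming stock apt paths inside the runner or container:

# any Acquire::http(s)::Proxy setting apt would actually use
apt-config dump | grep -iE 'proxy' || true
grep -riE 'proxy' /etc/apt/apt.conf /etc/apt/apt.conf.d/ 2>/dev/null || true
# proxy environment variables honored by apt and curl
env | grep -iE '^(http_proxy|https_proxy|no_proxy)=' || true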

@trel
Author

trel commented Mar 15, 2021

#2919 possibly related?

@Darleev
Contributor

Darleev commented Mar 15, 2021

Hello @trel,
Issue #2919 is related to a specific openSUSE repository problem.
I tried to find information about the error in your workflow:

W: Some index files failed to download. They have been ignored, or old ones used instead.

Everything I was able to find points to a temporary error with an external apt mirror, i.e. some problem with the https://unstable.irods.org/apt/ repository itself, so I believe it makes sense to check this from that side as well.

@trel
Author

trel commented Mar 15, 2021

That side is my side :)

We've checked and see no connectivity issues from anywhere else in the world. It's also happening on https://packages.irods.org, a different VM with a similar setup.

We're still investigating, but at this time it does feel like a firewall/proxy issue in the container itself (most of the time it works cleanly, and the same commands worked for some time via Travis before we moved to GitHub Actions).

@Darleev
Contributor

Darleev commented Mar 15, 2021

Hello @trel,
Thank you for the information. We are checking the issue from our side as well; I'll try to deploy clean local/Azure machines and reproduce the issue one more time.
I'll keep you informed.

@trel
Author

trel commented Mar 16, 2021

Just saw the same issue via yum on a CentOS7 container, rather than with apt on Ubuntu...

https://unstable.irods.org/yum/pool/centos7/x86_64/irods-externals-avro1.7.7-0-1.0-1.x86_64.rpm: [Errno 12] Timeout on https://unstable.irods.org/yum/pool/centos7/x86_64/irods-externals-avro1.7.7-0-1.0-1.x86_64.rpm: (28, 'Connection timed out after 30001 milliseconds')

In addition...

I am noticing more failures in the middle of my day (UTC-0500).
Mornings rarely see these timeouts, and evenings succeed more often than the workday does.
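One possible mitigation for the CentOS 7 container (not something from this thread) is to give yum a longer timeout and more retries; a sketch, with the package name taken from the failing RPM in the log above:

# per-invocation override, no edit to /etc/yum.conf needed
yum --setopt=timeout=60 --setopt=retries=10 install -y irods-externals-avro1.7.7-0
# or persist the settings for later yum calls in the same job
printf 'timeout=60\nretries=10\n' >> /etc/yum.conf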

@Darleev
Contributor

Darleev commented Mar 16, 2021

Hello @trel,
I'm trying to reproduce the issue locally and on self-hosted agents, but I'm hitting this error:

E: The repository 'https://unstable.irods.org/apt focal Release' does not have a Release file.
N: Updating from such a repository can't be done securely and is therefore disabled by default.

There are no changes on my side; I just followed the Ubuntu instructions from the official site: https://unstable.irods.org. Could you please check it? I need to reproduce the issue one more time to check for possible network problems from our side.
We look forward to hearing from you.

@trel
Author

trel commented Mar 16, 2021

Hi,

We don't have an Ubuntu 20.04 release out yet - please try a bionic (ubuntu:18.04) VM/container.
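For reference, a rough bionic repro sketch; the signing-key filename and the repo line are assumptions modeled on the packages.irods.org instructions, so substitute whatever https://unstable.irods.org actually documents:

docker run --rm ubuntu:18.04 bash -c '
  apt-get update && apt-get install -y wget gnupg apt-transport-https ca-certificates
  # the key URL below is a placeholder for the documented signing key
  wget -qO - https://unstable.irods.org/irods-signing-key.asc | apt-key add -
  echo "deb [arch=amd64] https://unstable.irods.org/apt/ bionic main" \
    > /etc/apt/sources.list.d/irods-unstable.list
  apt-get update
'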

@Darleev
Contributor

Darleev commented Mar 16, 2021

@trel Thank you, I assumed as much, but Ubuntu 20.04 was checked in the initial request, which is why I asked.
I'll keep you posted.

@trel
Author

trel commented Mar 16, 2021

I will uncheck that box - I had added that check mark when it was 'just a curl call'.

And then forgot to uncheck it when I learned it was the apt/yum calls instead.

Thanks.

@Darleev
Contributor

Darleev commented Mar 22, 2021

@trel I've reproduced the issue several times, but could not find any correlation with region or specific machine configuration. We've filed an internal issue with the Azure network engineering team for further investigation.
I'll keep you updated.

@trel
Author

trel commented Mar 22, 2021

Excellent - thank you.

@lbruun

lbruun commented Mar 23, 2021

I've seen something similar, i.e. connection problems when making outbound connections to the public internet from within GitHub Actions. The problem seems to have escalated over the past month(s). Like @trel, my suspicion is the same: there seems to be a "first touch penalty" for creating outbound connections (perhaps the penalty is paid on a per-destination basis, I don't know).

Therefore, one piece of advice is to check your connect timeout. Our case was simple HTTP downloads. We were using a 5-second connect timeout; after increasing it to 30 seconds the problem went away, or at least could no longer be reproduced. For yum, however, the default is already 30 seconds as far as I can tell. I thought I would share the findings anyway: it feels as if the runner needs some kind of network warmup before outbound connections are stable.
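A minimal sketch of that workaround for plain downloads, assuming curl (the 30-second value is the one that worked in the case above; the retry options are extra):

curl -fsSL --connect-timeout 30 --retry 5 --retry-delay 10 \
     -o /dev/null https://unstable.irods.org/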

@lukepighetti

Not sure if this is related but just today we started getting "This check failed" errors on our two linux checks without any guidance as to what the issue might be. No logs or anything is running.

@OmgImAlexis

OmgImAlexis commented Mar 26, 2021

Not sure if this is related but just today we started getting "This check failed" errors on our two linux checks without any guidance as to what the issue might be. No logs or anything is running.

Same here, this started around 12 hours ago. I'm now randomly getting the following and our environment tag is gone in the repo settings.

Unable to fetch the information for the environment 'staging' targeted by this job.

@maxim-lobanov
Contributor

@lukepighetti @OmgImAlexis, could you please log a separate issue for this problem, since it is not related to the initial issue? For the investigation we need links to the pipelines (links are useful even if the repo is private).

@lukepighetti

lukepighetti commented Mar 26, 2021

Ours was a billing issue, but the big red X didn't inform us of this. There are logs available if you click on the Actions tab that are not available if you view the action status from the PR. I'm considering my particular issue resolved, but I do think my feedback should be considered. Apologies for the noise in this issue.

@trel
Author

trel commented Apr 20, 2021

We've seen increased success lately. Not sure that's actionable here, but we are seeing fewer timeout failures. Not yet zero, though.

@Darleev
Contributor

Darleev commented Aug 10, 2021

Hello @trel,
We are going to close this issue. If you have any questions, feel free to contact us.

@Darleev closed this as completed Aug 10, 2021
@trel
Author

trel commented Aug 10, 2021

Okay. Thanks for the update - we still see this timeout more than once per week.

@trel
Author

trel commented Sep 20, 2021

Follow-up: It's been more than a month since we upgraded the host itself that was sometimes unreachable. It had been an Ubuntu 14.04 VM and is now CentOS 7. We have seen no errors since the upgrade. Current speculation is that the aging/EOL SSL libraries on Ubuntu 14.04 could have been related to the intermittent errors.
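One quick way to sanity-check that theory (not from the thread) is to see which TLS protocol versions a host will negotiate; a sketch using openssl s_client against the current server:

# prints the negotiated protocol/cipher if the server accepts TLS 1.2
openssl s_client -connect unstable.irods.org:443 -tls1_2 </dev/null 2>/dev/null | grep -E 'Protocol|Cipher'
# repeat with -tls1 or -tls1_1 to see which legacy versions are still accepted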
