Sometimes a lot of HTTP status codes 520 when accessing Galaxy API #2429
Comments
I now got this in a web browser as well; it's an error reported by Cloudflare.
When building a release for Ansible, part of the work is querying the API to retrieve the versions of the collections we're interested in and then downloading them to include in the release tarball. The part where we query the API often fails with 520 errors. Despite the tooling providing exception handling and retries, it still ends up giving up. Can we do something about this?
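For context, a minimal sketch of the kind of per-collection version lookup such release tooling might perform; the `/api/v2/...` path and the `latest_version` field are assumptions about the old Galaxy v2 API layout, not something confirmed in this thread:

```python
# Hypothetical sketch: fetch a collection's latest version from the Galaxy API.
# The /api/v2/collections/... path and the "latest_version" field are assumptions
# about the old Galaxy v2 layout, not confirmed by this issue.
import requests

GALAXY = "https://galaxy.ansible.com"

def latest_version(namespace: str, name: str) -> str:
    url = f"{GALAXY}/api/v2/collections/{namespace}/{name}/"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()  # a Cloudflare 520 surfaces here as an HTTPError
    return resp.json()["latest_version"]["version"]

# One request like this per included collection (80+ of them) adds up quickly.
# print(latest_version("community", "general"))
```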
There are known performance issues with fetching lots of collection data. There may be plans on the radar to flatten the requests needed to make this more performant for sync purposes, but I don't know if that's slated for community Galaxy or only Automation Hub.
I haven't personally run into performance problems, but I learned that the HTTP 520s returned by Cloudflare are likely due to rate limiting, which would make sense given that we make a number of requests in a short time -- there are already over 80 collections included, so it quickly adds up. Ironically, we end up making more requests because we retry on exceptions, which further exacerbates the issue. Edit: my personal experience with performance might not be representative; I'm told it could be much faster :)
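One way to avoid making that worse is to cap the client-side request rate so the per-collection queries (and any retries) stay well below whatever limit is being enforced. A minimal sketch, with an illustrative cap that is an assumption rather than a documented Galaxy limit:

```python
# Sketch: space out the per-collection requests client-side so that dozens of
# version queries do not trip the rate limit; the cap below is an arbitrary,
# conservative assumption, not a documented Galaxy limit.
import time
import requests

MAX_REQUESTS_PER_SECOND = 5

def fetch_all(urls):
    """Fetch each URL, sleeping as needed to stay under the request-rate cap."""
    results = []
    for url in urls:
        start = time.monotonic()
        results.append(requests.get(url, timeout=30))
        elapsed = time.monotonic() - start
        # Sleep for whatever remains of this request's time budget.
        time.sleep(max(0.0, 1.0 / MAX_REQUESTS_PER_SECOND - elapsed))
    return results
```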
I currently get these all the time in community.general's CI (Azure Pipelines). For example, for this backport: ansible-collections/community.general#2002 I had to restart failing CI jobs multiple times before everything finally passed.
To give some numbers: in the first run of ansible-collections/community.general#2004, 75 CI jobs failed because of this (77 succeeded). When rerunning them, 18 failed again. Only on the second rerun did all of them pass.
We regularly see failed CI jobs for Kayobe (which is part of Kolla in OpenStack) due to this error:
Anecdotally, it seems to have become worse in the past few weeks.
It got a lot worse ~2 weeks ago and has basically stayed that bad until now. In community.general, I still have to restart almost all stable-1 CI runs (but not only those; later versions install a lot less from Galaxy) at least once, and usually at least twice. I'm currently thinking of replacing installs from Galaxy with clones of the corresponding git repos. Galaxy is getting pretty unusable :-(
This is getting quite painful for our CI environment. |
We still see flakiness when downloading content from Ansible Galaxy, often HTTP 520. This change increases the retries from 3 to 10, and adds a 5 second delay between attempts. Change-Id: I0c46e5fcc6979027dc6f1bc5cc49e923a205f654 Related: ansible/galaxy#2429
We're hitting 520s in our CI as well while trying to install the amazon.aws collection. The ansible-galaxy CLI is performing quite a number of requests to Galaxy to find the collection to install.
Sadly, the ansible-galaxy install CLI does not have a retry mechanism included, which I see as a bug (not a missing feature...). Just yesterday I had to implement a retry mechanism in ansible-lint specifically because it was randomly failing to install collections. Network operations can fail and will fail; we had better have an option in the galaxy CLI to retry at least twice. That would likely avoid most glitches.
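In the absence of built-in retries, a wrapper along these lines is one way to paper over transient 520s. This is only an illustrative sketch, not ansible-lint's actual implementation:

```python
# Hypothetical wrapper retrying `ansible-galaxy collection install`, in the spirit
# of the workaround described above; not ansible-lint's actual code.
import subprocess
import time

def install_collection(name: str, retries: int = 3, delay: float = 5.0) -> None:
    for attempt in range(1, retries + 1):
        result = subprocess.run(
            ["ansible-galaxy", "collection", "install", name],
            capture_output=True, text=True,
        )
        if result.returncode == 0:
            return
        print(f"attempt {attempt} failed: {result.stderr.strip()}")
        if attempt < retries:
            time.sleep(delay)  # wait before retrying what may be a transient 520
    raise RuntimeError(f"could not install {name} after {retries} attempts")

# install_collection("amazon.aws")
```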
Hmm, I was assuming that …
Hmm, apparently I'm mistaken, it does not seem to set …
We've doubled the rate limit from 10 requests per second to 20 as a temporary fix, and there's an issue for …
Bug Report
SUMMARY
I'm working on the Ansible changelog / porting guide build (ansible-community/antsibull-build#103). Both that build and the ACD build itself query the Galaxy API for all included collections (~60 of them). It often happens that I get a lot of 520 HTTP status codes (which seems to be a Cloudflare internal error code).
After adding code to retry the requests (with some increasing delay), it now almost always completes (before that, I had to run it 2-10 times until it completed).
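For illustration, the retry-with-increasing-delay approach described above might look roughly like this. It is a sketch, not the actual antsibull code, and the URL is left generic:

```python
# Sketch of "retry with an increasing delay" on Cloudflare 520s; not the actual
# antsibull code. The caller supplies the Galaxy API URL.
import time
import requests

def fetch_json(url: str, retries: int = 5) -> dict:
    delay = 1.0
    for _ in range(retries):
        resp = requests.get(url, timeout=30)
        if resp.status_code == 520:  # Cloudflare origin error: back off and retry
            time.sleep(delay)
            delay *= 2               # 1s, 2s, 4s, ...
            continue
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError(f"still getting 520 from {url} after {retries} attempts")
```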