Sometimes a lot of HTTP status codes 520 when accessing Galaxy API #2429

Closed
felixfontein opened this issue Jul 4, 2020 · 15 comments

Comments

@felixfontein

Bug Report

SUMMARY

I'm working on the Ansible changelog / porting guide build (ansible-community/antsibull-build#103). Both that build and the ACD build itself query the Galaxy API for all included collections (~60 of them). I often get a lot of HTTP 520 status codes (this seems to be a Cloudflare internal error code):

WARNING: aio_session.get('https://galaxy.ansible.com/api/v2/collections/community/azure/versions/0.1.0/', params={'format': 'json'}) failed with status code 520, retrying...
WARNING: aio_session.get('https://galaxy.ansible.com/api/v2/collections/community/azure/versions/0.1.0/', params={'format': 'json'}) failed with status code 520, retrying...
WARNING: aio_session.get('https://galaxy.ansible.com/api/v2/collections/cisco/nxos/versions/?format=json&page=2', params={'format': 'json'}) failed with status code 520, retrying...
WARNING: aio_session.get('https://galaxy.ansible.com/api/v2/collections/google/cloud/versions/0.10.1/', params={'format': 'json'}) failed with status code 520, retrying...
WARNING: aio_session.get('https://galaxy.ansible.com/api/v2/collections/cisco/ios/versions/?format=json&format=json&format=json&page=4', params={'format': 'json'}) failed with status code 520, retrying...
WARNING: aio_session.get('https://galaxy.ansible.com/api/v2/collections/junipernetworks/junos/versions/?format=json&format=json&format=json&format=json&page=5', params={'format': 'json'}) failed with status code 520, retrying...
WARNING: aio_session.get('https://galaxy.ansible.com/api/v2/collections/cisco/ios/versions/?format=json&format=json&format=json&page=4', params={'format': 'json'}) failed with status code 520, retrying...
WARNING: aio_session.get('https://galaxy.ansible.com/api/v2/collections/vyos/vyos/versions/?format=json&format=json&format=json&format=json&page=5', params={'format': 'json'}) failed with status code 520, retrying...
WARNING: aio_session.get('https://galaxy.ansible.com/api/v2/collections/junipernetworks/junos/versions/?format=json&format=json&format=json&format=json&page=5', params={'format': 'json'}) failed with status code 520, retrying...
WARNING: aio_session.get('https://galaxy.ansible.com/api/v2/collections/community/vmware/versions/?format=json&format=json&page=3', params={'format': 'json'}) failed with status code 520, retrying...
WARNING: aio_session.get('https://galaxy.ansible.com/api/v2/collections/community/vmware/versions/?format=json&format=json&page=3', params={'format': 'json'}) failed with status code 520, retrying...
WARNING: aio_session.get('https://galaxy.ansible.com/api/v2/collections/arista/eos/versions/?format=json&format=json&format=json&format=json&format=json&format=json&page=7', params={'format': 'json'}) failed with status code 520, retrying...
WARNING: aio_session.get('https://galaxy.ansible.com/api/v2/collections/cisco/ios/versions/?format=json&format=json&format=json&page=4', params={'format': 'json'}) failed with status code 520, retrying...
WARNING: aio_session.get('https://galaxy.ansible.com/api/v2/collections/cisco/nxos/versions/?format=json&format=json&format=json&page=4', params={'format': 'json'}) failed with status code 520, retrying...
WARNING: aio_session.get('https://galaxy.ansible.com/api/v2/collections/cisco/iosxr/versions/?format=json&format=json&format=json&format=json&page=5', params={'format': 'json'}) failed with status code 520, retrying...
WARNING: aio_session.get('https://galaxy.ansible.com/api/v2/collections/community/vmware/versions/?format=json&format=json&page=3', params={'format': 'json'}) failed with status code 520, retrying...
WARNING: aio_session.get('https://galaxy.ansible.com/api/v2/collections/cisco/iosxr/versions/?format=json&format=json&format=json&format=json&page=5', params={'format': 'json'}) failed with status code 520, retrying...
WARNING: aio_session.get('https://galaxy.ansible.com/api/v2/collections/cisco/nxos/versions/?format=json&format=json&format=json&format=json&page=5', params={'format': 'json'}) failed with status code 520, retrying...
WARNING: aio_session.get('https://galaxy.ansible.com/api/v2/collections/community/vmware/versions/?format=json&format=json&page=3', params={'format': 'json'}) failed with status code 520, retrying...

After adding code to retry the requests with an increasing delay, the build now almost always completes. Before that, I had to run it 2-10 times until it completed.
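
A minimal sketch of a retry loop with an increasing delay, not the code actually used in the build. Here `fetch` is a hypothetical coroutine returning a `(status, body)` pair, and the set of retryable status codes, the attempt limit, and the delays are all assumptions chosen for illustration.

```python
import asyncio
import random

# Status codes treated as transient. 520 is the one seen in this issue;
# the others are an assumption about what is worth retrying.
RETRYABLE = {429, 500, 502, 503, 504, 520}


async def get_with_retry(fetch, url, max_attempts=5, base_delay=1.0,
                         sleep=asyncio.sleep):
    """Call the hypothetical coroutine `fetch(url)` with exponential backoff.

    The wait before retry number n is base_delay * 2**n, plus up to 10%
    random jitter so that parallel tasks do not retry in lockstep.
    """
    for attempt in range(max_attempts):
        status, body = await fetch(url)
        if status not in RETRYABLE:
            return status, body
        if attempt + 1 < max_attempts:
            delay = base_delay * (2 ** attempt)
            await sleep(delay + random.uniform(0, delay / 10))
    raise RuntimeError(
        f"{url} still failing with status {status} after {max_attempts} attempts"
    )
```

With the defaults this waits roughly 1, 2, 4 and 8 seconds between the five attempts. Passing `sleep` as a parameter only exists to make the helper easy to test.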

@felixfontein
Author

I have now seen this in a web browser as well. It is an error reported by Cloudflare:

Error 520 Ray ID: 5be2aa9b0df0be1e • 2020-08-05 18:43:54 UTC
Web server is returning an unknown error

You (Browser): Working
Milan (Cloudflare): Working
galaxy.ansible.com (Host): Error

@dmsimard
Contributor

When building a release for Ansible, part of the work is querying the API to retrieve the versions of collections we're interested in and then we download them to include in the release tarball.

The part where we query the API often fails with 520 errors. Although the tooling handles exceptions and retries, it still ends up giving up.

Can we do something about this?

@ironfroggy
Contributor

There are known performance issues with fetching lots of collection data. There may be plans on the radar to flatten the requests needed to make this more performant for sync purposes, but I don't know if that's slated for community galaxy or only automation hub.

@dmsimard
Contributor

dmsimard commented Jan 26, 2021

> There are known performance issues with fetching lots of collection data. There may be plans on the radar to flatten the requests needed to make this more performant for sync purposes, but I don't know if that's slated for community galaxy or only automation hub.

I haven't personally run into performance problems, but I learned that the HTTP 520s returned by Cloudflare are likely due to rate limiting. That would make sense given that we make many requests in a short time: more than 80 collections are already included, so it quickly adds up.

Ironically, we end up making more requests because we retry on exceptions, which further exacerbates the issue.

Edit: my personal experience with regard to performance might not be representative; I'm told it could be much faster :)

@felixfontein
Author

I currently get these all the time in community.general's CI (Azure Pipelines). For example, for this backport (ansible-collections/community.general#2002) I had to restart failing CI jobs multiple times before everything finally passed.

@felixfontein
Author

To give some numbers: in the first run of ansible-collections/community.general#2004, 75 CI jobs failed because of this (77 succeeded). When I reran them, 18 failed again. Only on the second rerun did all of them pass.

@priteau

priteau commented Mar 18, 2021

We regularly see failed CI jobs for Kayobe (which is part of Kolla in OpenStack) due to this error:

<role> was NOT installed successfully: None (HTTP Code: 520, Message: Origin Error)

Anecdotally, it seems to have become worse in the past few weeks.

@felixfontein
Author

It got a lot worse ~2 weeks ago and has basically stayed that bad since. In community.general, I still have to restart almost all stable-1 CI runs at least once, and usually at least twice. (It is not only them, though later versions install a lot less from Galaxy.)

I'm currently thinking of replacing installs from galaxy with clones of the corresponding git repos. Galaxy is getting pretty unusable :-(

@markgoddard

This is getting quite painful for our CI environment.

felixfontein added a commit to felixfontein/community.routeros that referenced this issue Mar 26, 2021
felixfontein added a commit to felixfontein/community.hrobot that referenced this issue Mar 26, 2021
felixfontein added a commit to felixfontein/community.crypto that referenced this issue Mar 26, 2021
felixfontein added a commit to felixfontein/community.docker that referenced this issue Mar 26, 2021
felixfontein added a commit to felixfontein/community.docker that referenced this issue Mar 26, 2021
felixfontein added a commit to ansible-collections/community.crypto that referenced this issue Mar 27, 2021
felixfontein added a commit to ansible-collections/community.hrobot that referenced this issue Mar 27, 2021
felixfontein added a commit to ansible-collections/community.routeros that referenced this issue Mar 27, 2021
felixfontein added a commit to ansible-collections/community.docker that referenced this issue Mar 27, 2021
@ssbarnea
Member

openstack-mirroring pushed a commit to openstack/kayobe that referenced this issue Mar 29, 2021
We still see flakiness when downloading content from Ansible Galaxy,
often HTTP 520. This change increases the retries from 3 to 10, and adds
a 5 second delay between attempts.

Change-Id: I0c46e5fcc6979027dc6f1bc5cc49e923a205f654
Related: ansible/galaxy#2429
openstack-mirroring pushed a commit to openstack/openstack that referenced this issue Mar 29, 2021
* Update kayobe from branch 'master'
  to 557f4f1ad3f275a0623b9663c3cc5557ef3559ea
  - Merge "CI: increase Ansible Galaxy retries & add delay"
  - CI: increase Ansible Galaxy retries & add delay
    
    We still see flakiness when downloading content from Ansible Galaxy,
    often HTTP 520. This change increases the retries from 3 to 10, and adds
    a 5 second delay between attempts.
    
    Change-Id: I0c46e5fcc6979027dc6f1bc5cc49e923a205f654
    Related: ansible/galaxy#2429
openstack-mirroring pushed a commit to openstack/kayobe that referenced this issue Mar 29, 2021
We still see flakiness when downloading content from Ansible Galaxy,
often HTTP 520. This change increases the retries from 3 to 10, and adds
a 5 second delay between attempts.

Change-Id: I0c46e5fcc6979027dc6f1bc5cc49e923a205f654
Related: ansible/galaxy#2429
(cherry picked from commit df00ba2)
@daviddavis

We're hitting 520s in our CI as well while trying to install the amazon.aws collection. The ansible-galaxy CLI is performing quite a number of requests to galaxy to find the collection to install:

$ ansible-galaxy -vvvv collection install amazon.aws
[DEPRECATION WARNING]: Setting verbosity before the arg sub command is deprecated, set the verbosity after the sub command. This feature will be removed from ansible-base in version 2.13. 
Deprecation warnings can be disabled by setting deprecation_warnings=False in ansible.cfg.
ansible-galaxy 2.10.7
  config file = None
  configured module search path = ['/home/daviddavis/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /home/daviddavis/.local/lib/python3.9/site-packages/ansible
  executable location = /home/daviddavis/.local/bin/ansible-galaxy
  python version = 3.9.2 (default, Feb 20 2021, 00:00:00) [GCC 10.2.1 20201125 (Red Hat 10.2.1-9)]
No config file found; using defaults
Starting galaxy collection install process
Found installed collection amazon.aws:1.4.1 at '/home/daviddavis/.ansible/collections/ansible_collections/amazon/aws'
Process install dependency map
Initial connection to galaxy_server: https://galaxy.ansible.com
Opened /home/daviddavis/.ansible/galaxy_token
Calling Galaxy at https://galaxy.ansible.com/api/
Processing requirement collection 'amazon.aws'
Collection requirement 'amazon.aws' is the name of a collection
Found API version 'v1, v2' with Galaxy server default (https://galaxy.ansible.com/api/)
Calling Galaxy at https://galaxy.ansible.com/api/v2/collections/amazon/aws/versions/
Calling Galaxy at https://galaxy.ansible.com/api/v2/collections/amazon/aws/versions/?page=2
Calling Galaxy at https://galaxy.ansible.com/api/v2/collections/amazon/aws/versions/?page=3
Calling Galaxy at https://galaxy.ansible.com/api/v2/collections/amazon/aws/versions/?page=4
Calling Galaxy at https://galaxy.ansible.com/api/v2/collections/amazon/aws/versions/?page=5
Calling Galaxy at https://galaxy.ansible.com/api/v2/collections/amazon/aws/versions/?page=6
Calling Galaxy at https://galaxy.ansible.com/api/v2/collections/amazon/aws/versions/?page=7
Calling Galaxy at https://galaxy.ansible.com/api/v2/collections/amazon/aws/versions/?page=8
Calling Galaxy at https://galaxy.ansible.com/api/v2/collections/amazon/aws/versions/?page=9
Calling Galaxy at https://galaxy.ansible.com/api/v2/collections/amazon/aws/versions/?page=10
Calling Galaxy at https://galaxy.ansible.com/api/v2/collections/amazon/aws/versions/?page=11
Calling Galaxy at https://galaxy.ansible.com/api/v2/collections/amazon/aws/versions/?page=12
Calling Galaxy at https://galaxy.ansible.com/api/v2/collections/amazon/aws/versions/?page=13
Calling Galaxy at https://galaxy.ansible.com/api/v2/collections/amazon/aws/versions/?page=14
Calling Galaxy at https://galaxy.ansible.com/api/v2/collections/amazon/aws/versions/?page=15
Collection 'amazon.aws' obtained from server default https://galaxy.ansible.com/api/
Starting collection install process

@ssbarnea
Member

ssbarnea commented Apr 8, 2021

Sadly, the galaxy install CLI does not include a retry mechanism, which I see as a bug (not a missing feature...). Just yesterday I had to implement a retry mechanism in ansible-lint specifically because it was randomly failing to install collections.

Network operations can and will fail, so the galaxy CLI should have an option to retry at least twice. This would likely avoid most glitches.
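
Until the CLI has such an option, a wrapper can do the retrying from the outside. The sketch below is one way to write it and is not what ansible-lint or Kayobe actually ship. The function name and its parameters are hypothetical, and the defaults of 10 tries with a 5 second pause are borrowed from the Kayobe commit message earlier in this thread, with "10" read as the total number of attempts (an assumption).

```python
import subprocess
import time


def run_with_retries(cmd, retries=10, delay=5.0,
                     run=subprocess.run, sleep=time.sleep):
    """Run `cmd` until it exits with status 0, at most `retries` times.

    Returns the number of the attempt that succeeded and raises
    RuntimeError if every attempt fails.
    """
    for attempt in range(1, retries + 1):
        result = run(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < retries:
            # Fixed pause between attempts; no pause after the last one.
            sleep(delay)
    raise RuntimeError(f"{cmd!r} failed {retries} times")


# Example call (needs ansible-galaxy on PATH, so it is left commented out):
# run_with_retries(["ansible-galaxy", "collection", "install", "amazon.aws"])
```

A fixed delay keeps the wrapper simple. If many CI jobs start at once, a growing or randomized delay would spread the retries out better.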

@felixfontein
Author

Hmm, I was assuming that ansible-galaxy collection install would use a larger page size. Or is that only implemented in stable-2.11 / devel? But anyway, having retries and a more efficient API would really help a lot...

@felixfontein
Author

Hmm, apparently I'm mistaken: it does not seem to set page_size for collection version enumeration; it only does that for some role-related things.
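
For a client that builds its own URLs, a small helper can set `page_size` on a versions URL so that fewer pages are needed. This is a sketch under assumptions: that the v2 versions endpoint honours `page_size` is not confirmed in this thread, the value 100 is arbitrary, and `set_query_params` is a hypothetical name. Because it sets each key exactly once, it also avoids the repeated `format=json` visible in the log at the top of this issue.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def set_query_params(url, **params):
    """Return `url` with each key in `params` present exactly once.

    Existing query parameters are kept; repeated keys collapse to one
    value, and keys given in `params` override whatever was there.
    """
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query.update({key: str(value) for key, value in params.items()})
    return urlunsplit(parts._replace(query=urlencode(query)))


url = ("https://galaxy.ansible.com/api/v2/collections/cisco/ios/versions/"
       "?format=json&format=json&format=json&page=4")
print(set_query_params(url, format="json", page_size=100))
```

Collapsing the query into a dict is fine here because none of the parameters is meant to repeat. An API that relies on repeated keys would need a list of pairs instead.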

@newswangerd
Member

We've doubled the rate limit from 10 requests per second to 20 as a temporary fix and there's an issue for ansible-galaxy to correctly handle situations where it gets rate limited: ansible/ansible#74191
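
On the client side, a tool that issues many requests can pace itself to stay under such a limit instead of relying on retries. The following is a minimal sketch for asyncio-based tools, not part of ansible-galaxy or antsibull. The class name is hypothetical, and which rate to pick (somewhere below the 20 requests per second mentioned above) is left to the caller.

```python
import asyncio
import time


class RateLimiter:
    """Space out calls so that at most `rate` of them start per second."""

    def __init__(self, rate, clock=time.monotonic, sleep=asyncio.sleep):
        self._interval = 1.0 / rate
        self._clock = clock
        self._sleep = sleep
        self._next_slot = 0.0

    async def wait(self):
        # There is no await between reading and updating _next_slot, so
        # concurrent tasks on one event loop cannot claim the same slot.
        now = self._clock()
        slot = max(now, self._next_slot)
        self._next_slot = slot + self._interval
        if slot > now:
            await self._sleep(slot - now)
```

Typical use is to create one shared `RateLimiter(10)` and call `await limiter.wait()` right before each request. The limiter only spaces out request starts and does not cap how many requests are in flight at once.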
