Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Where is the best layer to add throttling to avoid being marked for abuse? #431

Closed
dmitshur opened this issue Sep 21, 2016 · 4 comments
Closed
Labels

Comments

@dmitshur
Copy link
Member

dmitshur commented Sep 21, 2016

GitHub API has some guidelines to follow when accessing their API to avoid hitting abuse rate limit:

https://developer.github.com/guides/best-practices-for-integrators/#dealing-with-abuse-rate-limits

An application should follow those rules. Some of them can be enforced in a general/automated way via throttling, I think, and so they can be factored out into a library (rather than having to be done manually inside the application layer). Specifically:

  1. Make requests for a single user or client ID serially. Do not make requests for a single user or client ID concurrently.

    If we can assume a single *github.Client maps to a single user or client ID, then this can be done by using a mutex or semaphore or similar.

  2. If you're making a large number of POST, PATCH, PUT, or DELETE requests for a single user or client ID, wait at least one second between each request.

    Again, if we can assume a single *github.Client maps to a single user or client ID, then this can be done by keeping track of the last POST, PATCH, PUT, or DELETE request time, and sleeping if neccessary to make the next one happen at least one second later.

  3. When you have been limited, wait the number of seconds specified in the Retry-After response header.

    We can detect abuse rate limit response (it's documented here), read the number of seconds specified in the Retry-After response header, and not allow making network requests until that time has passed (probably returning *RateLimitError error, or an abuse-specific version thereof, right away for all requests).

I see two places where this can be done:

  1. Possibly inside go-github library. Inside Client itself. Probably as a configurable option.
  2. Or outside of this library. As a higher-level wrapper layer around go-github library.

So far, go-github has been a relatively "low-level" API client, meaning it helps you out with type safety and provides structured output, but aside from that, it maps relatively closely to 1:1 to making HTTP requests to GitHub API.

It has taken some steps to be a smarter/higher-level client where it could be done so transparently, without blocking. For example, after #277, RateLimitError type was added, and after #347, go-github is smart enough to track rate limit reset time and not make network calls when the rate limit is known to still be exceeded, returning RateLimitError right away.

So far it has also been a "return-right-away" API, where none of the calls are artificially throttled on the client side, they just return an error if there's a rate limit problem. So it seems adding throttling to the go-github client itself may be out of place, and would belong in a higher-level wrapper around go-github.

On the other hand, if done as an option that can be controlled, perhaps this is the best place to do these things, since it means more people can use the GitHub API correctly with less work from their side, by default. The user of the github client could make calls concurrently, but they would block and get serialized by the client as to follow the GH API guidelines. It would effectively shift some of the throttling that would otherwise be done on GitHub server side, to the go-github library client side.

@willnorris, @gmlewis, what are your thoughts on this? Where is the best place for this functionality?

Related issues: #152, #153, #277, #347, #304.

@gmlewis
Copy link
Collaborator

gmlewis commented Sep 21, 2016

To be honest, every time I write a big API-dependent job, I do this:

githubDelay = 720 * time.Millisecond  // 5000 QPH = 83.3 QPM = 1.38 QPS = 720ms/query

and then call time.Sleep(githubDelay) before each and every API call. That way, no matter how long it takes to run my batch job, I never hit the API quota limit. The downside to this is if you have any other jobs running as that same GitHub owner (of the Personal Access Token), the two jobs might interfere with each other, which Will and I have been talking about a lot. We are hoping that the new Integrations API will help with this situation.

I know this doesn't really help much... but is just another data point.

@gmlewis
Copy link
Collaborator

gmlewis commented Feb 1, 2017

Shortly after I wrote the comment above, I switched to using a time.Ticker:

const githubRate = 720 * time.Millisecond
...
ticker := time.NewTicker(githubRate)
...
<-ticker.C // Don't hammer GitHub API
// call to GitHub API

which allows my tool to run for potentially hours and not abuse the rate limit.

bubg-dev pushed a commit to bubg-dev/go-github that referenced this issue Jun 16, 2017
It's similar to RateLimitError, but it's a different type of rate limit
error. It is documented at:

https://developer.github.com/v3/#abuse-rate-limits

Parse and include the Retry-After header value, if present. It is
documented at:

https://developer.github.com/guides/best-practices-for-integrators/#dealing-with-abuse-rate-limits

According to GitHub support, its type is an integer value representing
seconds.

Helps google#431.
nathanleiby added a commit to Clever/microplane that referenced this issue Dec 8, 2017
We do this because the Github API triggers abuse warnings if a single
user makes lots of concurrent requests. By having a shared ticker, we
ensure that Github API requests don't happen in parallel.

Moreover, we also ensure that there's a global rate limit of 1 request
to the GithubAPI per 720 milliseconds. We're following the approach
shown here: google/go-github#431

This means that we'll stay within the rate limit requirement
of 5000 requests-per-hour for authenticated users.

```
5000 QPH = 83.3 QPM = 1.38 QPS = 720ms/query
```
nathanleiby added a commit to Clever/microplane that referenced this issue Dec 8, 2017
We do this because the Github API triggers abuse warnings if a single
user makes lots of concurrent requests. By having a shared ticker, we
ensure that Github API requests don't happen in parallel.

Moreover, we also ensure that there's a global rate limit of 1 request
to the GithubAPI per 720 milliseconds. We're following the approach
shown here: google/go-github#431

This means that we'll stay within the rate limit requirement
of 5000 requests-per-hour for authenticated users.

```
5000 QPH = 83.3 QPM = 1.38 QPS = 720ms/query
```
@radeksimko
Copy link
Contributor

Here's how we implemented linked best practices in our repository:
https://github.com/terraform-providers/terraform-provider-github/blob/fa73654b66e37b1fd8d886141d9c2974e24ba42f/github/transport.go#L42-L109

I'd be more than happy to raise a PR if there's any interest in bringing these transports upstream along with tests. I don't expect everyone will want/need to use them, but it would be IMO nice to keep it upstream.

@gmlewis
Copy link
Collaborator

gmlewis commented Aug 6, 2021

Closing issue for lack of activity/interest. Feel free to reopen.

@gmlewis gmlewis closed this as completed Aug 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants