
feat: gateway rate limiting durable object #1178

Merged

vasco-santos merged 6 commits into main from feat/gateway-rate-limiting-do on Feb 3, 2022

Conversation

Contributor

@vasco-santos vasco-santos commented Jan 27, 2022

This PR adds a durable object that tracks gateway requests over time so we can avoid being rate limited. For this, it simply keeps state with the timestamps of the last n requests.
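As a rough illustration of that idea (not the PR's actual code: the class name, constants, and response shape are assumptions), a Durable Object can keep the recent timestamps in memory and answer whether the next request would exceed the gateway's limit:

```js
// Illustrative sketch only: sliding-window request tracking in a Durable Object.
const RATE_LIMIT_REQUESTS = 800 // per gateway per window (ipfs.io figure from this PR)
const RATE_LIMIT_TIMEFRAME = 1000 // window size in ms (assumed for illustration)

export class GatewayRateLimits {
  constructor(state, env) {
    this.state = state
    // Timestamps (ms) of the most recent requests forwarded to this gateway.
    this.timestamps = []
  }

  async fetch(request) {
    const now = Date.now()
    // Keep only the timestamps still inside the rate-limit window.
    this.timestamps = this.timestamps.filter(
      (t) => t > now - RATE_LIMIT_TIMEFRAME
    )

    const shouldBlock = this.timestamps.length >= RATE_LIMIT_REQUESTS
    if (!shouldBlock) {
      this.timestamps.push(now)
    }

    // The worker prevents the upstream request when shouldBlock is true.
    return new Response(JSON.stringify({ shouldBlock }))
  }
}
```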

The timing of when the actual request reaches the gateway might work against us, and we could end up being rate limited anyway. While we work on further improvements with the gateways, we will track metrics that allow us to see requests being prevented (for now, rate limiting only) and the status of failed requests (to catch potential rate limiting errors). If we see "rate limit errored requests" we might need to consider having a safety buffer. We could also consider randomly selecting a subset of the gateways to use, to decrease load.

In the unfortunate event of all gateways being rate limited, a redirect to the first gateway (ipfs.io) is performed.

As discussed with @alanshaw, the metrics were changed to a more concise scheme that tracks response types and reasons for preventing requests. Added totalResponsesByStatus and totalRequestsPreventedByReason.

Closes #1165


cloudflare-workers-and-pages bot commented Jan 27, 2022

Deploying with Cloudflare Pages

Latest commit: 07958ec
Status: ✅  Deploy successful!
Preview URL: https://449af741.nft-storage-1at.pages.dev


@vasco-santos vasco-santos force-pushed the feat/gateway-rate-limiting-do branch 3 times, most recently from 393649f to ec65259 Compare January 28, 2022 13:50
@vasco-santos vasco-santos force-pushed the feat/gateway-rate-limiting-do branch 4 times, most recently from 03d17dc to e5d449d Compare January 31, 2022 10:56
@vasco-santos vasco-santos marked this pull request as ready for review January 31, 2022 10:58
@vasco-santos vasco-santos requested a review from alanshaw January 31, 2022 10:59
switch (gatewayUrl) {
case 'https://ipfs.io':
return Object.freeze({
RATE_LIMIT_REQUESTS: 800,
Contributor Author

@vasco-santos vasco-santos Jan 31, 2022

This one is quite difficult to obtain, considering the infrastructure setup.

The ipfs.io public gateway uses several different techniques for load balancing and rate limiting, which makes it difficult to predict the actual rate limit value.

The ipfs.io gateway starts by geo-routing requests, followed by load balancing. In our use case, I expect all requests to be routed to the same geo area.

The rate limit is set per load balancer. For instance, the same {IP_ADDR, URI} pair is limited at 1/second or 15/minute on a particular load balancer. Then there is a global 800/s limit on each load balancer.

Moreover, it also uses bursting techniques that start by delaying responses before actually failing them.

Contributor Author

I am thinking of running the load tests once this new set of PRs is merged in, and seeing if we end up with any gateways rate limited. If so, I'll try to tweak the numbers a bit.

@vasco-santos vasco-santos force-pushed the feat/gateway-rate-limiting-do branch from 3596690 to 74c38fb Compare February 1, 2022 16:10
@vasco-santos vasco-santos force-pushed the feat/gateway-rate-limiting-do branch 3 times, most recently from b68adb5 to 4db974e Compare February 2, 2022 12:29
@vasco-santos vasco-santos requested a review from alanshaw February 2, 2022 12:34
@vasco-santos vasco-santos force-pushed the feat/gateway-rate-limiting-do branch 2 times, most recently from de58738 to d98f541 Compare February 2, 2022 16:32
Comment on lines 114 to 117
metricsCollected.ipfsGateways[gw].totalResponsesByStatus[
HTTP_SUCCESS_CODE
] || 0
Contributor Author

@alanshaw I decided to track this individually so that the success count is persisted under a different Prometheus key (with its own name tag), and the other key is only used for "failed requests".

If you disagree, I'm happy to change it to simply nftgateway_requests_by_status_total and have everything under the same name.

The main reason for this decision was that I wanted a metric that counts all failed requests here, and for Prometheus queries it seemed more reasonable to keep success separate. Let me know what you think.

Contributor

@alanshaw alanshaw Feb 3, 2022

I think it'll be unexpected to not see 200 in nftgateway_requests_by_status_total...

If you want to have a success vs failure metric for convenience then that's fine, but I'd still include 200 in nftgateway_requests_by_status_total. You can match a tag via regexp in Prometheus, so it should be easy to group failures like /^[45][0-9][0-9]$/ and successes like /^2[0-9][0-9]$/ or something. It won't make it loads easier. Personally I wouldn't complicate it here, but it's up to you.

The only other concern with explicitly pulling out 200 is that I don't know whether all successful requests from the gateway are 200...but I guess probably yes.

Contributor Author

Yes, but my current naming is nftgateway_failed_requests_by_status_total. Anyway, I will just remove "failed" and add 200 then.

)
.join('\n')
})
.filter((e) => !!e),
Contributor Author

This filter is annoying, but it was the sanest way to avoid empty lines when error statuses only exist for a subset of the known gateways.

Contributor

Ya, I think not including success is unnecessarily complicated.

Contributor Author

This is not because of success, but because if there is, for example, a status 429 for gateway x but not for gateway y, we end up with an empty entry.
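To make the empty-entry case concrete, here is a rough sketch of the kind of flattening involved; metricsCollected, env.ipfsGateways, and the metric name follow names mentioned in this PR, but the exact shape and the `body` variable are illustrative:

```js
// Illustrative: one Prometheus text line per (gateway, status) pair.
// A gateway with no recorded statuses maps to an empty string, which the
// final .filter((e) => !!e) drops so no blank lines end up in the output.
const body = env.ipfsGateways
  .map((gw) => {
    const byStatus =
      metricsCollected.ipfsGateways[gw].totalResponsesByStatus || {}
    return Object.entries(byStatus)
      .map(
        ([status, count]) =>
          `nftgateway_requests_by_status_total{gateway="${gw}",status="${status}"} ${count}`
      )
      .join('\n')
  })
  .filter((e) => !!e)
  .join('\n')
```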

Contributor

Oh ok sorry!

* @property {number} [responseTime] number of milliseconds to get response
* @property {boolean} [winner]
* @property {boolean} [winner] response was from winner gateway
* @property {number} [requestPreventedCode] request not sent to upstream gateway reason code
Contributor

Personally I'd have gone with a string and "RATE_LIMIT" or something. It just means that when you're building graphs in grafana you don't have to refer back to the document/code that maps code => description.
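A tiny sketch of that string-based alternative; the constant, helper, and metrics object names here are assumptions, not the PR's code, though totalRequestsPreventedByReason is the counter named in the PR description:

```js
// Illustrative: a readable reason string makes the Grafana label self-describing,
// e.g. nftgateway_requests_prevented_total{reason="RATE_LIMIT"}.
export const REASON_RATE_LIMIT = 'RATE_LIMIT'

function recordPreventedRequest(metrics, reason = REASON_RATE_LIMIT) {
  metrics.totalRequestsPreventedByReason[reason] =
    (metrics.totalRequestsPreventedByReason[reason] || 0) + 1
}
```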


const gatewayReqs = env.ipfsGateways.map((gwUrl) =>
_gatewayFetch(gwUrl, cid, {
pathname: reqUrl.pathname,
_gatewayFetch(gwUrl, cid, request, env, {
Contributor

Why do some methods here have an underscore prefix? Not exporting them already makes them private, and the underscore is usually a hint to consumers that something is supposed to be private in cases where that can't be enforced.

Suggested change
_gatewayFetch(gwUrl, cid, request, env, {
gatewayFetch(gwUrl, cid, request, env, {

Comment on lines 89 to 99
// We can already settle requests if Aggregate Error, as all promises were already rejected
if (err instanceof AggregateError) {
responses = await pSettle(gatewayReqs)
}
Contributor

What happens if some promises fulfilled but none were response.ok and the rest rejected? Do we still get an aggregate error?

Yes, the filter causes a FilterError to be thrown here: https://github.com/sindresorhus/p-some/blob/a7030ea6ad9971867ba5dfdb09034a416f3cf8e3/index.js#L60-L62

I think this will always be an AggregateError in our case. Or in other words, all of our promises will have rejected or been filtered (effectively rejected) here. We should always settle here to get the responses:

Suggested change
// We can already settle requests if Aggregate Error, as all promises were already rejected
if (err instanceof AggregateError) {
responses = await pSettle(gatewayReqs)
}
// All promises will have been rejected or filtered (effectively rejected) here.
responses = await pSettle(gatewayReqs)

Contributor Author

That's right, thanks
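For context on the thread above, a minimal sketch of the race-then-settle pattern being discussed, assuming p-any's filter option and p-settle; raceGateways and the return shape are illustrative, not the PR's actual code:

```js
import pAny from 'p-any'
import pSettle from 'p-settle'

/**
 * Race the gateway requests; gatewayReqs is an array of fetch promises.
 */
async function raceGateways(gatewayReqs) {
  try {
    // The first fulfilled request whose response is ok wins.
    const winner = await pAny(gatewayReqs, { filter: (res) => res.ok })
    return { winner }
  } catch (err) {
    // Every promise either rejected or was filtered out (effectively rejected),
    // so settle them all and let the caller inspect the individual outcomes
    // (status codes, rate limiting, metrics, ...).
    const responses = await pSettle(gatewayReqs)
    return { responses }
  }
}
```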

Comment on lines 4 to 5
export const RATE_LIMIT_HTTP_ERROR_CODE = 429
export const HTTP_SUCCESS_CODE = 200
Contributor

Suggested change
export const RATE_LIMIT_HTTP_ERROR_CODE = 429
export const HTTP_SUCCESS_CODE = 200
export const HTTP_STATUS_RATE_LIMITED = 429
export const HTTP_STATUS_SUCCESS = 200

updateGatewayMetrics(request, env, r.value, false)
)
)
})()
)

// Redirect if all failed and at least one gateway was rate limited
Contributor

I think we only want to redirect when all the gateways are rate limiting us (or we prevented the request because it would cause rate limiting)...because we're literally not able to service the request. If we got error responses from everyone, it's probably because the content could not be retrieved OR there's a problem with the content (not a unixfs thing). It's unlikely that all gateways will be down at the same time, and in those cases we should not redirect and make the user send another request, because it's unlikely to work, OR all the gateways are on 🔥 and they don't need extra traffic.

Contributor Author

Yes, if we have one failure that was not rate limit related we should not redirect!

But there is an edge case here that I was including. For instance: what if we prevent two requests from running and the remaining one is rate limited, or, even better, what if we prevent all requests? So what we need here is a conditional where:

all response statuses are HTTP_STATUS_RATE_LIMITED OR we prevented the request from happening.
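A rough sketch of that conditional, assuming the settled responses from p-settle and the requestPreventedCode / HTTP_STATUS_RATE_LIMITED names discussed above; the exact response shape is illustrative:

```js
// Illustrative: redirect only if every attempt was either prevented locally
// (because it would have caused rate limiting) or rate limited upstream.
const allRateLimitedOrPrevented = responses.every((r) => {
  if (!r.isFulfilled) return false // a non-rate-limit failure: do not redirect
  const res = r.value
  return (
    res.requestPreventedCode !== undefined ||
    res.response?.status === HTTP_STATUS_RATE_LIMITED
  )
})

if (allRateLimitedOrPrevented) {
  // Fall back to the first gateway (ipfs.io) as described in the PR summary.
  return Response.redirect(`https://ipfs.io/ipfs/${cid}`, 302)
}
```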


@@ -21,5 +21,6 @@ test('Caches content', async (t) => {
t.is(await response.text(), content)

const cachedRes = await caches.default.match(url)
t.is(await cachedRes.text(), content)
// Miniflare cache sometimes is not yet setup...
cachedRes && t.is(await cachedRes.text(), content)
Contributor

hmm, we need to get to the bottom of that...it will be a false positive. Maybe skip the test and open an issue?
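For reference, skipping it in ava (which the t.is assertions suggest is the test runner) would look roughly like this sketch; the body is elided:

```js
import test from 'ava'

// Sketch: mark the flaky test as skipped (and open a tracking issue) instead of
// guarding the cache assertion, so a broken cache cannot pass silently.
test.skip('Caches content', async (t) => {
  // ...original test body, including the caches.default.match(url) assertion...
})
```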


@vasco-santos vasco-santos force-pushed the feat/gateway-rate-limiting-do branch 2 times, most recently from 7b0240a to 8a5822d Compare February 3, 2022 12:47
@vasco-santos vasco-santos force-pushed the feat/gateway-rate-limiting-do branch from 8a5822d to f84f1cf Compare February 3, 2022 13:34
@vasco-santos vasco-santos requested a review from alanshaw February 3, 2022 13:57
@vasco-santos vasco-santos force-pushed the feat/gateway-rate-limiting-do branch from f84f1cf to a5689bb Compare February 3, 2022 14:09
@vasco-santos vasco-santos force-pushed the feat/gateway-rate-limiting-do branch from a5689bb to b4e5178 Compare February 3, 2022 14:12
Contributor

@alanshaw alanshaw left a comment

LGTM

@@ -1,28 +1,33 @@
/* eslint-env serviceworker, browser */
/* global Response caches */

import pAny from 'p-any'
import pAny, { AggregateError } from 'p-any'
Contributor

Are we not using AggregateError now?

Contributor Author

I think your view was outdated 😅 I removed it in the last commit.

packages/gateway/src/gateway.js (outdated, resolved)
Co-authored-by: Alan Shaw <alan.shaw@protocol.ai>
@vasco-santos vasco-santos merged commit 2b632e2 into main Feb 3, 2022
@vasco-santos vasco-santos deleted the feat/gateway-rate-limiting-do branch February 3, 2022 15:55