feat: gateway rate limiting durable object #1178
Conversation
switch (gatewayUrl) {
  case 'https://ipfs.io':
    return Object.freeze({
      RATE_LIMIT_REQUESTS: 800,
This one is quite difficult to obtain, considering the infrastructure setup.
The ipfs.io public gateway uses several different load-balancing and rate-limiting techniques, which makes it difficult to predict the actual rate limit value.
The ipfs.io gateway starts by geo-routing requests, followed by load balancing. In our use case, I expect all requests will be routed to the same geo area.
The rate limit is set per load balancer. For instance, the same {IP_ADDR, URI} pair is limited to 1/second or 15/minute on a particular load balancer. On top of that, there is a global limit of 800/s on each load balancer.
Moreover, it also uses bursting techniques that start by delaying responses before actually failing them.
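As a rough illustration of how those layers stack, here is a sketch where both a per-{IP_ADDR, URI} limit and a global per-load-balancer limit must allow the request. The thresholds come from the comment above; the helper names and the fixed-window model are hypothetical simplifications (the real gateway bursts and delays rather than hard-failing).

```javascript
// Hypothetical sketch of the layered limits described above: a per-{IP, URI}
// limit AND a global per-load-balancer limit must both allow the request.
const PER_KEY_LIMIT_PER_SECOND = 1 // per {IP_ADDR, URI}, per the comment above
const GLOBAL_LIMIT_PER_SECOND = 800 // per load balancer, per the comment above

function makeLimiter() {
  const perKeyCounts = new Map()
  let globalCount = 0
  return {
    // Returns true (and records the request) if a request for this key
    // is allowed in the current one-second window.
    allow(key) {
      const keyCount = perKeyCounts.get(key) || 0
      if (keyCount >= PER_KEY_LIMIT_PER_SECOND) return false
      if (globalCount >= GLOBAL_LIMIT_PER_SECOND) return false
      perKeyCounts.set(key, keyCount + 1)
      globalCount += 1
      return true
    },
    // Called once per second to start a new window (fixed-window model,
    // a simplification of the real bursting behaviour).
    resetWindow() {
      perKeyCounts.clear()
      globalCount = 0
    },
  }
}
```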
I am thinking of performing the load tests once we have this new set of PRs merged in, and seeing whether any gateways end up rate limited. If so, we can tweak the numbers a bit.
packages/gateway/src/metrics.js
Outdated
metricsCollected.ipfsGateways[gw].totalResponsesByStatus[
  HTTP_SUCCESS_CODE
] || 0
@alanshaw I decided to track this individually so that the success count is persisted under a different key with its own Prometheus name tag, and the other key is only used for failed requests.
If you disagree, I'm happy to change it to simply nftgateway_requests_by_status_total and have everything under the same name.
The main reason for this decision was that I wanted a metric that counts all failed requests, and for Prometheus queries it seemed more reasonable to keep success separate. Let me know what you think.
I think it'll be unexpected not to see 200 in nftgateway_requests_by_status_total.
If you want a success-vs-failure metric for convenience then that's fine, but I'd still include 200 in nftgateway_requests_by_status_total. You can match a tag via regexp in Prometheus, so it should be easy to group failures like /^[45][0-9][0-9]$/ and successes like /^2[0-9][0-9]$/ or something. It won't make it loads easier; personally I wouldn't complicate it here, but it's up to you.
The only other concern with explicitly pulling out 200 is that I don't know whether all successful requests from the gateway are 200... but I guess probably yes.
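For example, collapsing a status-keyed count map into success/failure buckets with those regexps could look like the following plain-JS sketch (the helper name is hypothetical; in Prometheus itself you'd do the equivalent with a label regexp in the query):

```javascript
// Hypothetical helper: group per-status response counts into success vs
// failure buckets using the regexps suggested above.
const SUCCESS_RE = /^2[0-9][0-9]$/
const FAILURE_RE = /^[45][0-9][0-9]$/

function groupByOutcome(totalResponsesByStatus) {
  let success = 0
  let failure = 0
  // Object keys are strings ('200', '429', ...), which the regexps match.
  for (const [status, count] of Object.entries(totalResponsesByStatus)) {
    if (SUCCESS_RE.test(status)) success += count
    else if (FAILURE_RE.test(status)) failure += count
  }
  return { success, failure }
}
```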
Yes, but my current naming is nftgateway_failed_requests_by_status_total. Anyway, I will just remove "failed" and add 200 then.
    )
    .join('\n')
  })
  .filter((e) => !!e),
This filter is annoying, but it was the sanest way of avoiding empty lines when status errors only exist for a subset of the known gateways.
Ya, I think not including success is unnecessarily complicated.
This is not because of success, but because if, for example, status 429 exists for gateway x but not for gateway y, we end up with an empty entry.
Oh ok sorry!
 * @property {number} [responseTime] number of milliseconds to get response
 * @property {boolean} [winner] response was from winner gateway
 * @property {number} [requestPreventedCode] request not sent to upstream gateway reason code
Personally I'd have gone with a string like "RATE_LIMIT" or something. It just means that when you're building graphs in Grafana you don't have to refer back to the document/code that maps code => description.
packages/gateway/src/gateway.js
Outdated
const gatewayReqs = env.ipfsGateways.map((gwUrl) =>
  _gatewayFetch(gwUrl, cid, request, env, {
    pathname: reqUrl.pathname,
Why do some methods here have an underscore prefix? Not exporting a function already makes it private; the underscore is usually a hint to consumers that something is supposed to be private in cases where that can't be enforced.
- _gatewayFetch(gwUrl, cid, request, env, {
+ gatewayFetch(gwUrl, cid, request, env, {
packages/gateway/src/gateway.js
Outdated
// We can already settle requests if Aggregate Error, as all promises were already rejected
if (err instanceof AggregateError) {
  responses = await pSettle(gatewayReqs)
}
What happens if some promises fulfilled but none were response.ok and the rest rejected? Do we still get an aggregate error?
Yes, the filter causes a FilterError to be thrown here: https://github.com/sindresorhus/p-some/blob/a7030ea6ad9971867ba5dfdb09034a416f3cf8e3/index.js#L60-L62
I think this will always be an AggregateError in our case. In other words, all of our promises will have rejected or been filtered (effectively rejected) here. We should always settle here to get the responses:
- // We can already settle requests if Aggregate Error, as all promises were already rejected
- if (err instanceof AggregateError) {
-   responses = await pSettle(gatewayReqs)
- }
+ // All promises will have been rejected or filtered (effectively rejected) here.
+ responses = await pSettle(gatewayReqs)
That's right, thanks
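The race-then-settle pattern being discussed can be sketched with the native equivalents of those libraries (Promise.any in place of p-any/p-some, Promise.allSettled in place of pSettle); this is an illustration of the control flow, not the PR's actual code:

```javascript
// Sketch of the race-then-settle pattern: race the gateway requests for the
// first success; if every promise rejects (or is filtered out, i.e.
// effectively rejected), Promise.any rejects with an AggregateError and we
// settle all promises to inspect each individual outcome.
async function raceGateways(requests) {
  try {
    return { winner: await Promise.any(requests) }
  } catch (err) {
    // All promises rejected; settle to collect every result/reason.
    const responses = await Promise.allSettled(requests)
    return { error: err, responses }
  }
}
```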
packages/gateway/src/constants.js
Outdated
export const RATE_LIMIT_HTTP_ERROR_CODE = 429
export const HTTP_SUCCESS_CODE = 200
- export const RATE_LIMIT_HTTP_ERROR_CODE = 429
- export const HTTP_SUCCESS_CODE = 200
+ export const HTTP_STATUS_RATE_LIMITED = 429
+ export const HTTP_STATUS_SUCCESS = 200
packages/gateway/src/gateway.js
Outdated
        updateGatewayMetrics(request, env, r.value, false)
      )
    )
  })()
)

// Redirect if all failed and at least one gateway was rate limited
I think we only want to redirect when all the gateways are rate limiting us (or we prevented the request because it would cause rate limiting), because in that case we're literally not able to service the request. If we got error responses from every gateway, it's probably because the content could not be retrieved or there is a problem with the content (not a unixfs thing). It's unlikely that all gateways will be down at the same time, and in those cases we should not redirect and make the user send another request: either it's unlikely to work, OR all the gateways are on 🔥 and they don't need extra traffic.
Yes, if we have one failure that was not rate-limit related we should not redirect!
But there is an edge case here that I was including. For instance: what if we prevent two requests from running and the other is rate limited, or even prevent all requests? So what we need here is a conditional where all response statuses are HTTP_STATUS_RATE_LIMITED OR we prevented the request from happening.
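That conditional could be sketched as follows. The constant name follows the rename suggested earlier in this thread; the result shape ({ status, requestPrevented }) and the function name are hypothetical, not the PR's actual types:

```javascript
const HTTP_STATUS_RATE_LIMITED = 429

// Sketch of the redirect condition described above: redirect only when every
// gateway either responded 429 or had its request prevented by our limiter,
// i.e. there was no failure unrelated to rate limiting.
// Each item in `results` is assumed shaped like:
//   { status?: number, requestPrevented?: boolean }
function shouldRedirect(results) {
  return (
    results.length > 0 &&
    results.every(
      (r) => r.requestPrevented || r.status === HTTP_STATUS_RATE_LIMITED
    )
  )
}
```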
packages/gateway/test/cache.spec.js
Outdated
@@ -21,5 +21,6 @@ test('Caches content', async (t) => {
  t.is(await response.text(), content)

  const cachedRes = await caches.default.match(url)
- t.is(await cachedRes.text(), content)
+ // Miniflare cache sometimes is not yet setup...
+ cachedRes && t.is(await cachedRes.text(), content)
Hmm, we need to get to the bottom of that... as written it will be a false positive. Maybe skip the test and open an issue?
LGTM
packages/gateway/src/gateway.js
Outdated
@@ -1,28 +1,33 @@
  /* eslint-env serviceworker, browser */
  /* global Response caches */

- import pAny from 'p-any'
+ import pAny, { AggregateError } from 'p-any'
Are we not using AggregateError now?
I think your view was outdated 😅 I removed it in last commit
Co-authored-by: Alan Shaw <alan.shaw@protocol.ai>
This PR adds a durable object that tracks gateway requests over time to avoid rate limiting. For this, it simply keeps a state with the last n timestamps of the requests.
Timings for when the actual request reaches the gateway might work against us, and we may end up being rate limited anyway. While we work on further improvements with the gateways, we will track metrics that allow us to monitor requests being prevented (for now, rate limiting only) and the status of failed requests (to spot potential rate-limiting errors). If we see rate-limit errored requests, we might need to consider adding a safety buffer. We could also consider randomly selecting a subset of the gateways to decrease load.
In the unfortunate event of all gateways getting rate limited, a redirect to the first gateway (ipfs.io) is performed.
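The sliding-window idea (keeping the last n request timestamps) can be sketched as below. This is a plain-JS illustration of the technique, not the durable object itself; the class name and the limit/window constants are made up for the example:

```javascript
// Illustrative sliding-window limiter that keeps the last N request
// timestamps, as the durable object in this PR does. Numbers are examples.
const RATE_LIMIT_REQUESTS = 800 // max requests per window (hypothetical)
const RATE_LIMIT_WINDOW_MS = 1000 // window length (hypothetical)

class SlidingWindowRateLimiter {
  constructor() {
    /** @type {number[]} timestamps of recent requests, oldest first */
    this.timestamps = []
  }

  // Returns true (and records the request) if sending another request now
  // stays under the limit; false if the request should be prevented.
  tryRequest(now = Date.now()) {
    // Drop timestamps that have fallen out of the window.
    const cutoff = now - RATE_LIMIT_WINDOW_MS
    while (this.timestamps.length && this.timestamps[0] <= cutoff) {
      this.timestamps.shift()
    }
    if (this.timestamps.length >= RATE_LIMIT_REQUESTS) return false
    this.timestamps.push(now)
    return true
  }
}
```

In a real durable object the timestamps array would live in the object's persisted state so all worker instances share one view of recent request counts.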
As discussed with @alanshaw, the metrics changed to a more concise logic for tracking response types and reasons for preventing requests. Added totalResponsesByStatus and totalRequestsPreventedByReason.
Closes #1165