feat: gateway rate limiting durable object #1178
Conversation
switch (gatewayUrl) {
  case 'https://ipfs.io':
    return Object.freeze({
      RATE_LIMIT_REQUESTS: 800,
This one is quite difficult to obtain, considering the infrastructure setup.
The ipfs.io public gateway uses several different load-balancing and rate-limiting techniques, which makes it difficult to predict the actual rate limit value.
The ipfs.io gateway starts by geo-routing requests, followed by load balancing. In our use case, I expect all requests will be routed to the same geo area.
The rate limit is set per load balancer. For instance, the same {IP_ADDR, URI} pair is limited to 1/second or 15/minute on a particular load balancer. On top of that, there is a global limit of 800/s on each load balancer.
Moreover, it also uses bursting techniques that start by delaying responses before actually failing them.
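As a rough illustration of how those layers stack, here is a sketch where both a per-{IP_ADDR, URI} limit and a global per-load-balancer limit must allow the request. The thresholds come from the comment above; the helper names and the fixed-window model are hypothetical simplifications (the real gateway bursts and delays rather than hard-failing).

```javascript
// Hypothetical sketch of the layered limits described above: a per-{IP, URI}
// limit AND a global per-load-balancer limit must both allow the request.
const PER_KEY_LIMIT_PER_SECOND = 1 // per {IP_ADDR, URI}, per the comment above
const GLOBAL_LIMIT_PER_SECOND = 800 // per load balancer, per the comment above

function makeLimiter() {
  const perKeyCounts = new Map()
  let globalCount = 0
  return {
    // Returns true (and records the request) if a request for this key
    // is allowed in the current one-second window.
    allow(key) {
      const keyCount = perKeyCounts.get(key) || 0
      if (keyCount >= PER_KEY_LIMIT_PER_SECOND) return false
      if (globalCount >= GLOBAL_LIMIT_PER_SECOND) return false
      perKeyCounts.set(key, keyCount + 1)
      globalCount += 1
      return true
    },
    // Called once per second to start a new window (fixed-window model,
    // a simplification of the real bursting behaviour).
    resetWindow() {
      perKeyCounts.clear()
      globalCount = 0
    },
  }
}
```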
I am thinking of performing the load tests once we have this new set of PRs merged in, and seeing whether any gateways end up rate limited. If so, we can tweak the numbers a bit.
packages/gateway/src/metrics.js
Outdated
metricsCollected.ipfsGateways[gw].totalResponsesByStatus[
  HTTP_SUCCESS_CODE
] || 0
@alanshaw I decided to track this individually so that the success count is persisted under a different key with its own Prometheus name tag, and the other key is only used for failed requests.
If you disagree, I'm happy to change it to simply nftgateway_requests_by_status_total and have everything under the same name.
The main reason for this decision was that I wanted a metric that counts all failed requests, and for Prometheus queries it seemed more reasonable to keep success separate. Let me know what you think.
I think it'll be unexpected not to see 200 in nftgateway_requests_by_status_total.
If you want a success-vs-failure metric for convenience then that's fine, but I'd still include 200 in nftgateway_requests_by_status_total. You can match a tag via regexp in Prometheus, so it should be easy to group failures like /^[45][0-9][0-9]$/ and successes like /^2[0-9][0-9]$/ or something. It won't make it loads easier; personally I wouldn't complicate it here, but it's up to you.
The only other concern with explicitly pulling out 200 is that I don't know whether all successful requests from the gateway are 200... but I guess probably yes.
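For example, collapsing a status-keyed count map into success/failure buckets with those regexps could look like the following plain-JS sketch (the helper name is hypothetical; in Prometheus itself you'd do the equivalent with a label regexp in the query):

```javascript
// Hypothetical helper: group per-status response counts into success vs
// failure buckets using the regexps suggested above.
const SUCCESS_RE = /^2[0-9][0-9]$/
const FAILURE_RE = /^[45][0-9][0-9]$/

function groupByOutcome(totalResponsesByStatus) {
  let success = 0
  let failure = 0
  // Object keys are strings ('200', '429', ...), which the regexps match.
  for (const [status, count] of Object.entries(totalResponsesByStatus)) {
    if (SUCCESS_RE.test(status)) success += count
    else if (FAILURE_RE.test(status)) failure += count
  }
  return { success, failure }
}
```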
Yes, but my current naming is nftgateway_failed_requests_by_status_total. Anyway, I will just remove "failed" and add 200 then.
    )
    .join('\n')
  })
  .filter((e) => !!e),
This filter is annoying, but it was the sanest way of avoiding empty lines when status errors only exist for a subset of the known gateways.
Ya, I think not including success is unnecessarily complicated.
This is not because of success, but because if, for example, status 429 exists for gateway x but not for gateway y, we end up with an empty entry.
Oh ok sorry!
 * @property {number} [responseTime] number of milliseconds to get response
 * @property {boolean} [winner] response was from winner gateway
 * @property {number} [requestPreventedCode] request not sent to upstream gateway reason code
Personally I'd have gone with a string like "RATE_LIMIT" or something. It just means that when you're building graphs in Grafana you don't have to refer back to the document/code that maps code => description.
packages/gateway/src/gateway.js
Outdated
const gatewayReqs = env.ipfsGateways.map((gwUrl) =>
  _gatewayFetch(gwUrl, cid, request, env, {
    pathname: reqUrl.pathname,
Why do some methods here have an underscore prefix? Not exporting a function already makes it private; the underscore is usually a hint to consumers that something is supposed to be private in cases where that can't be enforced.
- _gatewayFetch(gwUrl, cid, request, env, {
+ gatewayFetch(gwUrl, cid, request, env, {
packages/gateway/src/gateway.js
Outdated
// We can already settle requests if Aggregate Error, as all promises were already rejected
if (err instanceof AggregateError) {
  responses = await pSettle(gatewayReqs)
}
What happens if some promises fulfilled but none were response.ok and the rest rejected? Do we still get an aggregate error?
Yes, the filter causes a FilterError to be thrown here: https://github.com/sindresorhus/p-some/blob/a7030ea6ad9971867ba5dfdb09034a416f3cf8e3/index.js#L60-L62
I think this will always be an AggregateError in our case. In other words, all of our promises will have rejected or been filtered (effectively rejected) here. We should always settle here to get the responses:
- // We can already settle requests if Aggregate Error, as all promises were already rejected
- if (err instanceof AggregateError) {
-   responses = await pSettle(gatewayReqs)
- }
+ // All promises will have been rejected or filtered (effectively rejected) here.
+ responses = await pSettle(gatewayReqs)
That's right, thanks
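The race-then-settle pattern being discussed can be sketched with the native equivalents of those libraries (Promise.any in place of p-any/p-some, Promise.allSettled in place of pSettle); this is an illustration of the control flow, not the PR's actual code:

```javascript
// Sketch of the race-then-settle pattern: race the gateway requests for the
// first success; if every promise rejects (or is filtered out, i.e.
// effectively rejected), Promise.any rejects with an AggregateError and we
// settle all promises to inspect each individual outcome.
async function raceGateways(requests) {
  try {
    return { winner: await Promise.any(requests) }
  } catch (err) {
    // All promises rejected; settle to collect every result/reason.
    const responses = await Promise.allSettled(requests)
    return { error: err, responses }
  }
}
```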
packages/gateway/src/constants.js
Outdated
export const RATE_LIMIT_HTTP_ERROR_CODE = 429
export const HTTP_SUCCESS_CODE = 200
- export const RATE_LIMIT_HTTP_ERROR_CODE = 429
- export const HTTP_SUCCESS_CODE = 200
+ export const HTTP_STATUS_RATE_LIMITED = 429
+ export const HTTP_STATUS_SUCCESS = 200
packages/gateway/src/gateway.js
Outdated
        updateGatewayMetrics(request, env, r.value, false)
      )
    )
  })()
)

// Redirect if all failed and at least one gateway was rate limited
I think we only want to redirect when all the gateways are rate limiting us (or we prevented the request because it would cause rate limiting), because in that case we're literally not able to service the request. If we got error responses from every gateway, it's probably because the content could not be retrieved or there is a problem with the content (not a unixfs thing). It's unlikely that all gateways will be down at the same time, and in those cases we should not redirect and make the user send another request: either it's unlikely to work, OR all the gateways are on 🔥 and they don't need extra traffic.
Yes, if we have one failure that was not rate-limit related we should not redirect!
But there is an edge case here that I was including. For instance: what if we prevent two requests from running and the other is rate limited, or even prevent all requests? So what we need here is a conditional where all response statuses are HTTP_STATUS_RATE_LIMITED OR we prevented the request from happening.
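That conditional could be sketched as follows. The constant name follows the rename suggested earlier in this thread; the result shape ({ status, requestPrevented }) and the function name are hypothetical, not the PR's actual types:

```javascript
const HTTP_STATUS_RATE_LIMITED = 429

// Sketch of the redirect condition described above: redirect only when every
// gateway either responded 429 or had its request prevented by our limiter,
// i.e. there was no failure unrelated to rate limiting.
// Each item in `results` is assumed shaped like:
//   { status?: number, requestPrevented?: boolean }
function shouldRedirect(results) {
  return (
    results.length > 0 &&
    results.every(
      (r) => r.requestPrevented || r.status === HTTP_STATUS_RATE_LIMITED
    )
  )
}
```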
packages/gateway/test/cache.spec.js
Outdated
@@ -21,5 +21,6 @@ test('Caches content', async (t) => {
  t.is(await response.text(), content)

  const cachedRes = await caches.default.match(url)
- t.is(await cachedRes.text(), content)
+ // Miniflare cache sometimes is not yet setup...
+ cachedRes && t.is(await cachedRes.text(), content)
Hmm, we need to get to the bottom of that... as written it will be a false positive. Maybe skip the test and open an issue?
LGTM
packages/gateway/src/gateway.js
Outdated
@@ -1,28 +1,33 @@
  /* eslint-env serviceworker, browser */
  /* global Response caches */

- import pAny from 'p-any'
+ import pAny, { AggregateError } from 'p-any'
Are we not using AggregateError now?
I think your view was outdated 😅 I removed it in last commit
Co-authored-by: Alan Shaw <alan.shaw@protocol.ai>
This PR adds a durable object that tracks gateway requests over time to avoid rate limiting. For this, it simply keeps a state with the last n timestamps of the requests.
Timings for when the actual request reaches the gateway might work against us, and we may end up being rate limited anyway. While we work on further improvements with the gateways, we will track metrics that allow us to monitor requests being prevented (for now, rate limiting only) and the status of failed requests (to spot potential rate-limiting errors). If we see rate-limit errored requests, we might need to consider adding a safety buffer. We could also consider randomly selecting a subset of the gateways to decrease load.
In the unfortunate event of all gateways getting rate limited, a redirect to the first gateway (ipfs.io) is performed.
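The sliding-window idea (keeping the last n request timestamps) can be sketched as below. This is a plain-JS illustration of the technique, not the durable object itself; the class name and the limit/window constants are made up for the example:

```javascript
// Illustrative sliding-window limiter that keeps the last N request
// timestamps, as the durable object in this PR does. Numbers are examples.
const RATE_LIMIT_REQUESTS = 800 // max requests per window (hypothetical)
const RATE_LIMIT_WINDOW_MS = 1000 // window length (hypothetical)

class SlidingWindowRateLimiter {
  constructor() {
    /** @type {number[]} timestamps of recent requests, oldest first */
    this.timestamps = []
  }

  // Returns true (and records the request) if sending another request now
  // stays under the limit; false if the request should be prevented.
  tryRequest(now = Date.now()) {
    // Drop timestamps that have fallen out of the window.
    const cutoff = now - RATE_LIMIT_WINDOW_MS
    while (this.timestamps.length && this.timestamps[0] <= cutoff) {
      this.timestamps.shift()
    }
    if (this.timestamps.length >= RATE_LIMIT_REQUESTS) return false
    this.timestamps.push(now)
    return true
  }
}
```

In a real durable object the timestamps array would live in the object's persisted state so all worker instances share one view of recent request counts.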
As discussed with @alanshaw, the metrics changed to a more concise logic for tracking response types and reasons for preventing requests. Added totalResponsesByStatus and totalRequestsPreventedByReason.
Closes #1165