
Collect & visualise spark-stats API usage based on access tokens #206

Closed
5 of 7 tasks
bajtos opened this issue Dec 2, 2024 · 37 comments · Fixed by filecoin-station/spark-stats-request-metrics#1

@bajtos

bajtos commented Dec 2, 2024

We started handing out API keys to people who signed up with their email for the Spark Data offering. Now we would like to understand which users are using the Spark API and how frequently.

Docs which the users see: https://filspark.com/api-docs

Notes:

  • We are not enforcing the access token yet; anonymous usage is allowed. We want to track anonymous requests too.
  • The API is behind the Cloudflare cache; some requests are served from the cache and never hit our spark-stats service. We want to include the cached requests in the data.
  • Our API docs used to be slightly wrong. We need to look for the API key in the api-key header and the api-key query-string parameter, in addition to the standard Authorization header in Bearer format.
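
A handler that honors all three token locations could extract them along these lines. This is a hedged sketch only: the function name and the precedence order (Bearer header first) are assumptions, not the actual spark-stats code.

```javascript
// Sketch: find the API token in any of the three places the docs allow.
// Precedence (Authorization: Bearer first) is an assumption for illustration.
function getApiToken (request) {
  const auth = request.headers.get('Authorization')
  if (auth && auth.startsWith('Bearer ')) {
    return auth.slice('Bearer '.length)
  }
  const headerKey = request.headers.get('api-key')
  if (headerKey) return headerKey
  const queryKey = new URL(request.url).searchParams.get('api-key')
  if (queryKey) return queryKey
  return null // anonymous request; we still want to track it
}
```

Since the token is not enforced yet, a null result would be recorded as anonymous usage rather than rejected.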

TODO:

Update our internal spark-stats consumers to send an access token:

bajtos mentioned this issue Dec 2, 2024
@pyropy

pyropy commented Dec 4, 2024

Are we currently using any tools for monitoring and collecting logs? Also, what service is responsible for managing API keys and setting rate limits?

@juliangruber

Logs are being collected in Papertrail, which I believe Miro gave you access to.

I believe at the moment API keys are only parsed in Cloudflare, @bajtos correct me if I'm wrong.

@pyropy

pyropy commented Dec 4, 2024

> Logs are being collected in Papertrail, which I believe Miro gave you access to.
>
> I believe at the moment API keys are only parsed in Cloudflare, @bajtos correct me if I'm wrong.

I don't have access to Papertrail yet.

@pyropy

pyropy commented Dec 5, 2024

I believe if we want to track both cached and non-cached requests, we'd have to use Cloudflare Workers. In that case we'd have two options:

  • Push logs directly from the Cloudflare worker to Papertrail
  • Log on the Cloudflare worker and set up Cloudflare Logpush to forward the worker logs over to Papertrail

@juliangruber

Instead of pushing logs to Papertrail, where we would then parse them (I assume), can we make the Cloudflare worker submit information about the request into a DB, via an HTTP API that we create? It could also feed this data into InfluxDB for time-series processing.

@pyropy

pyropy commented Dec 5, 2024

> Instead of pushing logs to Papertrail, where we would then parse them (I assume), can we make the Cloudflare worker submit information about the request into a DB, via an HTTP API that we create? It could also feed this data into InfluxDB for time-series processing.

Yeah, I would also prefer something like that over Papertrail. It will be much easier to aggregate and visualize the data.

@juliangruber

Cool, let's either add a route to spark-api then which persists this information in Postgres, or write it to InfluxDB. If both are viable options to you, best create a small document or GitHub comment describing your decision-making process (pros, cons, etc.).

@bajtos

bajtos commented Dec 5, 2024

Can we put the new route in spark-stats, please? We are measuring spark-stats API usage; nothing from spark-api is involved in this.

@bajtos

bajtos commented Dec 5, 2024

Writing from Cloudflare worker directly to InfluxDB may be the simplest option, as we don't need to write any new REST API 💪🏻
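
Writing straight to InfluxDB from the worker could look roughly like this: format one line-protocol record per request and POST it to InfluxDB's v2 write endpoint. The measurement, tag, and field names, as well as the env bindings, are illustrative assumptions, not the deployed configuration.

```javascript
// Build an InfluxDB line-protocol record: measurement,tags fields timestamp
function toLineProtocol (measurement, tags, fields, timestampMs) {
  // Escape commas, equals signs, and spaces per the line-protocol rules
  const esc = (s) => String(s).replace(/[,= ]/g, (c) => '\\' + c)
  const tagPart = Object.entries(tags)
    .map(([k, v]) => `,${esc(k)}=${esc(v)}`)
    .join('')
  const fieldPart = Object.entries(fields)
    .map(([k, v]) => `${esc(k)}=${typeof v === 'number' ? v : JSON.stringify(v)}`)
    .join(',')
  return `${measurement}${tagPart} ${fieldPart} ${timestampMs}`
}

// Called from the worker's fetch handler after the response is ready.
// The env bindings (INFLUXDB_URL, INFLUXDB_ORG, ...) are hypothetical names.
async function reportRequest (env, request, response) {
  const line = toLineProtocol(
    'spark_stats_requests',
    {
      endpoint: new URL(request.url).pathname,
      cache_status: response.headers.get('CF-Cache-Status') || 'UNKNOWN'
    },
    { count: 1 },
    Date.now()
  )
  await fetch(
    `${env.INFLUXDB_URL}/api/v2/write?org=${env.INFLUXDB_ORG}&bucket=${env.INFLUXDB_BUCKET}&precision=ms`,
    {
      method: 'POST',
      headers: { Authorization: `Token ${env.INFLUXDB_TOKEN}` },
      body: line
    }
  )
}
```

Recording the CF-Cache-Status header as a tag would also answer the "include cached requests" requirement, since the worker sees hits and misses alike.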

@juliangruber

Good point, yes should be in spark-stats

@juliangruber

> Writing from Cloudflare worker directly to InfluxDB may be the simplest option, as we don't need to write any new REST API 💪🏻

It also has the advantage of automatic retention policies

@pyropy

pyropy commented Dec 6, 2024

Do we already have InfluxDB deployed? If so I'd appreciate access to the instance 🙏🏻

@juliangruber

Invite sent 👌

@pyropy

pyropy commented Dec 9, 2024

I have deployed a Cloudflare worker that should report data to InfluxDB. I have yet to configure it properly.

I am proposing adding another CNAME record (i.e. stats-test) on the filspark.com domain that will point to the spark-stats API, and setting up the Cloudflare worker on that route.

After we make sure that the Cloudflare worker is working correctly, we can set up another CNAME record (i.e. stats-api) on the filspark.com domain and point it to the spark-stats API, while configuring the Cloudflare worker route to the original CNAME (stats) so all the traffic is routed through the worker.

@juliangruber

> I am proposing adding another CNAME record (i.e. stats-test) on the filspark.com domain that will point to the spark-stats API, and setting up the Cloudflare worker on that route.

This would be a kind of staging environment, right? We can make requests against that endpoint ourselves and verify they end up in InfluxDB correctly.

As long as the Cloudflare worker doesn't negatively affect requests, I don't see an immediate problem with just using it in prod. Can we just make all requests to api.filspark.com go through your worker?

@pyropy

pyropy commented Dec 9, 2024

Yeah, it's kind of a separate environment, but in this case I didn't use Cloudflare worker environments. Generally it shouldn't affect request performance, as metrics are reported after the request has been executed.

I have created a repository which is itself a fork of another repository I used as a starter template. In the end I didn't use much of the code inside that repository. I am wondering: should we fork my repository into the organization, or could I create a new repository and push the code? Given that it's MIT-licensed, I guess there's no legal problem even if we copy the code into a new repo?

@pyropy

pyropy commented Dec 10, 2024

@juliangruber I've tried adding new CNAME records, but it seems like there's some issue with TLS certificates (I get error 525 when trying to send a request). Are the certificates managed by Cloudflare, or is that done outside of Cloudflare?

@juliangruber

> Generally it shouldn't affect request performance as metrics are reported after the request has been executed.

Ok, let's just attach the worker to the production route then (with manual monitoring that nothing goes wrong), for simplicity.

> I have created a repository which is itself a fork of another repository I used as a starter template. In the end I didn't use much of the code inside that repository. I am wondering: should we fork my repository into the organization, or could I create a new repository and push the code? Given that it's MIT-licensed, I guess there's no legal problem even if we copy the code into a new repo?

Are you able to create new repos in the Filecoin-station org? I think we can then transfer your repository into it. Yes we could just copy the code over, but forks are convenient because you see where the code came from, and that makes it easier to pull in upstream changes.

@juliangruber

> @juliangruber I've tried adding new CNAME records, but it seems like there's some issue with TLS certificates (I get error 525 when trying to send a request). Are the certificates managed by Cloudflare, or is that done outside of Cloudflare?

Can we use the existing domain name instead?

@pyropy

pyropy commented Dec 10, 2024

> @juliangruber I've tried adding new CNAME records, but it seems like there's some issue with TLS certificates (I get error 525 when trying to send a request). Are the certificates managed by Cloudflare, or is that done outside of Cloudflare?
>
> Can we use the existing domain name instead?

Yes, we could do that; I'm only worried that we're going to miss the cache. I switched the worker over to stats.filspark.com for a few minutes this morning and it reported a cache status of DYNAMIC.

(Screenshot from 2024-12-10: the worker reporting cache status DYNAMIC)

@juliangruber

> I'm only worried that we're going to miss the cache

Does this mean the API cache will be inactive with workers, or that it will be invalidated for every worker update?

@pyropy

pyropy commented Dec 10, 2024

> I'm only worried that we're going to miss the cache
>
> Does this mean the API cache will be inactive with workers, or that it will be invalidated for every worker update?

If the spark-stats server remains proxied behind Cloudflare (as it currently is), responses should still be cached (hence why I wanted to add a new CNAME record). The other option would be to cache responses directly in the worker and implement custom cache behavior.

@juliangruber

I don't understand. We want spark-stats to stay proxied, so that we can use Cloudflare's caching behavior. We want that proxy to also submit usage data to the API, through a Cloudflare worker. What am I missing?

@pyropy

pyropy commented Dec 10, 2024

The missing piece is that a Cloudflare worker and the proxy cannot reside on the same route. What I'm suggesting is that we add a new route, stats-api.filspark.com, pointing to the spark-stats server, and have the Cloudflare worker take over the old route, stats.filspark.com.

(Diagram: the Cloudflare worker sitting in front of the proxy)

@juliangruber

Got it, I wasn't aware that the worker needs to sit in front of the proxy. This makes sense then. Your graphic helped clarify this 👏 I will take a look at the certificate setup and get back to you.

@juliangruber

The problem is that the Fly.io deployment requires the hostname to be stats.filspark.com. I will look into it there.

@juliangruber

stats-api isn't the most descriptive name to me; what do you think about stats-backend?

Thinking about this more, are you sure this is the right way to go about things? I.e., is it a documented pattern to put the worker in front of the proxy? Or are there alternatives, like proxying from inside the worker?

@juliangruber

FYI, https://stats-api.filspark.com works now

@pyropy

pyropy commented Dec 10, 2024

> stats-api isn't the most descriptive name to me; what do you think about stats-backend?
>
> Thinking about this more, are you sure this is the right way to go about things? I.e., is it a documented pattern to put the worker in front of the proxy? Or are there alternatives, like proxying from inside the worker?

We could proxy inside the worker, sure. I guess it would be more appropriate and would also require a lot fewer DNS resolutions. I'll see how to use the basic Cloudflare cache setup with workers.

@pyropy

pyropy commented Dec 10, 2024

@juliangruber I have added caching inside the Cloudflare worker here. With that, I guess there's no need for additional CNAME records, so I'm going to delete them. I've yet to add tests and update the docs, but it has been deployed to Cloudflare and seems to be working.
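
For reference, caching inside a worker with Cloudflare's Cache API looks roughly like this. This is a sketch under assumed names (ORIGIN, isCacheable, handleRequest), not the code from the linked repository.

```javascript
// Assumed origin hostname (it appears later in this thread).
const ORIGIN = 'https://spark-stats.fly.dev'

// Cache only successful GET responses; everything else passes through.
function isCacheable (method, status) {
  return method === 'GET' && status === 200
}

// Worker handler sketch; caches.default exists only in the Workers runtime.
async function handleRequest (request, ctx) {
  const cache = caches.default
  const cached = await cache.match(request)
  if (cached) return cached

  const url = new URL(request.url)
  const response = await fetch(ORIGIN + url.pathname + url.search)
  if (isCacheable(request.method, response.status)) {
    // Store a copy without delaying the client's response
    ctx.waitUntil(cache.put(request, response.clone()))
  }
  return response
}
```

As the later comments note, letting a plain fetch() run through Cloudflare's cache turned out to be the simpler route than hand-rolled cache behavior like this.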

@bajtos

bajtos commented Dec 10, 2024

Implementing a proxy inside a worker feels rather suboptimal to me. Doesn't Cloudflare CDN offer any hooks we can observe from a worker? In the future, we will want to validate the access token and reject requests with missing or invalid access tokens.

(EDITED) ^^^ That's no longer relevant based on what I learned while writing this comment. ^^^

Here is some documentation that may be relevant:

Auth with headers:

Using Cloudflare cache from workers:

Quoting from https://developers.cloudflare.com/workers/reference/how-the-cache-works/

> Conceptually, there are two ways to interact with Cloudflare’s Cache using a Worker:
>
> Call to fetch() in a Workers script. Requests proxied through Cloudflare are cached even without Workers according to a zone’s default or configured behavior (for example, static assets like files ending in .jpg are cached by default). Workers can further customize this behavior by:
>
>   • Setting Cloudflare cache rules (that is, operating on the cf object of a request).

If I am understanding this topic correctly, then we should have the following architecture:

[API client] --> [Cloudflare worker calling fetch()] --> [Fly.io app]

In other words:

  • The API client fetches https://stats.filspark.com/{endpoint}
  • Our Cloudflare Worker is invoked and it fetches https://spark-stats.fly.dev/{endpoint}
  • Behind the scenes, this second fetch may be served from Cloudflare's cache (that's how I understand Cloudflare's architecture)
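
The steps above could be sketched as a worker that rewrites the hostname and lets a single fetch() run through Cloudflare's cache. The cf options (cacheEverything, cacheTtl) and the reporting hook are illustrative assumptions, not a confirmed configuration.

```javascript
// Rewrite the public URL to the Fly.io origin, keeping path and query.
function buildOriginUrl (requestUrl) {
  const url = new URL(requestUrl)
  url.hostname = 'spark-stats.fly.dev'
  return url.toString()
}

// Worker sketch: the fetch() below can be served from Cloudflare's cache;
// the cf options customize that behavior. In the real Workers runtime this
// object would be the module's default export.
const worker = {
  async fetch (request, env, ctx) {
    const response = await fetch(buildOriginUrl(request.url), {
      method: request.method,
      headers: request.headers,
      cf: { cacheEverything: true, cacheTtl: 300 }
    })
    // A usage-reporting hook (e.g. a write to InfluxDB) could run here via
    // ctx.waitUntil() so it never delays the client.
    return response
  }
}
```

With this shape the worker sees every request, cached or not, which is what the usage-tracking requirement needs.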

@pyropy

pyropy commented Dec 10, 2024

@bajtos I think you might be right about this. To be honest I was confused by this 👇🏻

> Call to fetch() in a Workers script. Requests proxied through Cloudflare are cached even without Workers according to a zone’s default or configured behavior (for example, static assets like files ending in .jpg are cached by default). Workers can further customize this behavior by...

As I understood it, a request is cached if the server is proxied behind Cloudflare, hence my asking for another CNAME record that we could put our server behind.

@pyropy

pyropy commented Dec 17, 2024

I have added the token to a data source on https://spacemeridian.grafana.net/ that reads the spark-stats API. Are the dashboards built there are using that data source also published on https://grafana.filstation.app/?

If that's the case and I were to set the API token directly on the dashboard, I am afraid we would risk leaking it.

@juliangruber

> Are the dashboards built there are using that data source also published on https://grafana.filstation.app/?

I don't understand this sentence

@pyropy

pyropy commented Dec 17, 2024

> Are the dashboards built there are using that data source also published on https://grafana.filstation.app/?
>
> I don't understand this sentence

Woah, I made a mistake. 🤦

I wanted to ask if dashboards built on https://spacemeridian.grafana.net/ are published on https://grafana.filstation.app/ and use data sources defined in https://spacemeridian.grafana.net/?

@bajtos

bajtos commented Dec 17, 2024

> I wanted to ask if dashboards built on https://spacemeridian.grafana.net/ are published on https://grafana.filstation.app/ and use data sources defined in https://spacemeridian.grafana.net/?

No, these two Grafana instances are completely independent. If we want to use the same datasource from both instances, then we need to configure it twice.

I believe we don't need to expose API usage data publicly yet (if ever), so there is no need to touch https://grafana.filstation.app/.

@pyropy

pyropy commented Dec 17, 2024

All requests to spark-stats are now proxied via the Cloudflare worker, and the collected data is visualized on a Grafana dashboard. Since we're not going to set up an API key on https://grafana.filstation.app/, I'm going to close this issue.

pyropy closed this as completed Dec 17, 2024