
Design a static HTTP API for serving pre-computed information #278

Open · woodruffw opened this issue Dec 12, 2024 · 0 comments
Labels: enhancement (New feature or request), performance

@woodruffw (Owner):

zizmor currently makes a lot of GitHub API calls in online mode, the overwhelming majority of which are redundant or identical across different users.

For example, impostor-commit needs to fetch the entire branch/tag history for each repo, which means in practice that actions/checkout gets hit, over and over, by hundreds of users. This isn't a good use of anybody's time or API quota 🙂

Separately, zizmor currently hardcodes a lot of "coordinates of interest," e.g. use-trusted-publishing hardcodes the rubygems/release-gem and pypa/gh-action-pypi-publish actions (among others). This isn't maintainable/ideal long term, since changes to the list of actions require a new release of zizmor that everybody has to re-download.

The solution to both of these problems is the same: a static, non-quota'd API that zizmor (the CLI client) can hit to retrieve batched information and timely updates relevant to specific audits.

The easiest place for us to serve this static API is probably on https://woodruffw.github.io/zizmor/, since it's already a static website and should be able to easily handle a sub-hierarchy for API routes.

Here's a rough sketch of what I'm thinking (these routes can be relative to whatever):

/data-api
  /v1                 # everything goes under v1 for now
    /last-update.dat  # returns the most recent update to the data files
    /common-refs.dat  # map of slug -> set[ref]
    /known-vulns.dat  # GHSA known vulnerabilities

...and so forth. I'm using .dat as the suffix because I'm not sure how we want to serialize these yet (JSON would be the easy choice, but these files might get large, so something like bincode or maybe rkyv might make sense).
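For a concrete sense of the shape of something like common-refs.dat, here's a minimal sketch assuming the easy JSON-plus-serde choice; the type and field names below are made up for illustration, not decided:

```rust
use std::collections::{HashMap, HashSet};

use serde::{Deserialize, Serialize};

/// Hypothetical contents of /data-api/v1/common-refs.dat, assuming JSON via serde.
/// None of these names are final; bincode or rkyv would change the on-wire format
/// but not the overall shape.
#[derive(Serialize, Deserialize)]
struct CommonRefs {
    /// When this file was generated (Unix timestamp), mirroring last-update.dat.
    generated_at: u64,
    /// `owner/repo` slug -> set of known branch/tag refs,
    /// e.g. "actions/checkout" -> {"main", "v4", "v4.2.2", ...}.
    refs: HashMap<String, HashSet<String>>,
}
```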

We'd also want to cache these locally, so that repeated invocations of zizmor don't have to re-fetch them. This is a current limitation of our GitHub API usage as well.
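A minimal sketch of what the local freshness check could look like (the cache layout and the age threshold are placeholders, not decisions):

```rust
use std::fs;
use std::path::Path;
use std::time::{Duration, SystemTime};

/// Returns true if `path` exists and was modified within `max_age`.
/// Both the cache layout and the age threshold are placeholders.
fn cache_is_fresh(path: &Path, max_age: Duration) -> bool {
    fs::metadata(path)
        .and_then(|meta| meta.modified())
        .ok()
        .and_then(|mtime| SystemTime::now().duration_since(mtime).ok())
        .map(|age| age < max_age)
        .unwrap_or(false)
}
```

Something like `cache_is_fresh(&cache_dir.join("common-refs.dat"), Duration::from_secs(24 * 60 * 60))` (with a hypothetical `cache_dir`) before deciding whether to go to the network at all.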

In addition to being served over HTTP, a copy of these files (or some of them?) would also be baked into the zizmor builds themselves. This would ensure graceful degradation in offline mode.
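For the baked-in copies, something as simple as `include_bytes!` at build time would probably do (the paths here are illustrative only):

```rust
// Hypothetical: data files vendored into the crate at build time, used as the
// fallback of last resort in offline mode. Paths are illustrative only.
static EMBEDDED_COMMON_REFS: &[u8] = include_bytes!("../data/common-refs.dat");
static EMBEDDED_KNOWN_VULNS: &[u8] = include_bytes!("../data/known-vulns.dat");
```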

In sum, the order of operations with these data files would become the following (sketched in code after the list):

  • If offline, use only the embedded copy and/or previously cached copy (and maybe emit a warning if it's more than 24 hours old)
  • If online:
    • Check the age of the cached data, if it exists. If it's fresh, use it.
    • Check the age of the embedded data. If it's fresh, use it.
    • If neither is fresh, attempt to hit the static API. Fail gracefully if the request fails for any reason, and fall back.
    • If the static API's last-update isn't new, fall back.
    • Use the fresh response and cache it.
    • Finally, if the static data doesn't have what we're looking for, use the GitHub APIs.
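
As code, that flow might look roughly like the following; every type and helper name here is hypothetical, and error handling and staleness warnings are elided:

```rust
/// Rough sketch of the lookup order above; nothing here is a real signature.
struct DataFile {
    bytes: Vec<u8>,
    /// "Fresh enough to use without refetching" (e.g. under 24 hours old).
    fresh: bool,
}

struct FetchError;

// Hypothetical helpers corresponding to the steps in the list above.
fn cached_copy(_name: &str) -> Option<DataFile> { todo!() }
fn embedded_copy(_name: &str) -> DataFile { todo!() }
fn fetch_from_static_api(_name: &str) -> Result<DataFile, FetchError> { todo!() }
fn cache_locally(_name: &str, _data: &DataFile) { todo!() }

fn load_data_file(name: &str, offline: bool) -> DataFile {
    if offline {
        // Offline: embedded and/or previously cached copy only
        // (a staleness warning would go here).
        return cached_copy(name).unwrap_or_else(|| embedded_copy(name));
    }

    // Online: prefer whichever local copy is still fresh.
    if let Some(cached) = cached_copy(name).filter(|d| d.fresh) {
        return cached;
    }
    let embedded = embedded_copy(name);
    if embedded.fresh {
        return embedded;
    }

    // Neither local copy is fresh: try the static API, falling back
    // gracefully if the request fails or last-update has nothing newer.
    match fetch_from_static_api(name) {
        Ok(remote) if remote.fresh => {
            cache_locally(name, &remote);
            remote
        }
        _ => cached_copy(name).unwrap_or(embedded),
    }
}
```

The per-audit GitHub API calls would remain the final fallback for anything the static data doesn't cover.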

For the hot path (repos like actions/checkout), this should make things way faster. For the slow path, it'll at least be no slower than it was before.

CC @ubiratansoares for thoughts on the rough sketch above, since you arrived at the same idea 🙂 -- I'm not strongly bound to any of the design pieces above, so I'm curious if you have alternative ideas!
