
Design a static HTTP API for serving pre-computed information #278

Open · woodruffw opened this issue Dec 12, 2024 · 0 comments
Labels: enhancement (New feature or request), performance

@woodruffw (Owner):

zizmor currently makes a lot of GitHub API calls in online mode, the overwhelming majority of which are redundant or identical across different users.

For example, impostor-commit needs to fetch the entire branch/tag history for each repo, which means in practice that actions/checkout gets hit, over and over, by hundreds of users. This isn't a good use of anybody's time or API quota 🙂

Separately, zizmor currently hardcodes a lot of "coordinates of interest," e.g. use-trusted-publishing hardcodes the rubygems/release-gem and pypa/gh-action-pypi-publish actions (among others). This isn't maintainable/ideal long term, since changes to the list of actions require a new release of zizmor that everybody has to re-download.

The solution to both of these problems is the same: a static, non-quota'd API that zizmor (the CLI client) can hit to retrieve batched information and timely updates relevant to specific audits.

The easiest place for us to serve this static API is probably on https://woodruffw.github.io/zizmor/, since it's already a static website and should be able to easily handle a sub-hierarchy for API routes.

Here's a rough sketch of what I'm thinking (these routes can be relative to whatever):

/data-api
  /v1                 # everything goes under v1 for now
    /last-update.dat  # returns the most recent update to the data files
    /common-refs.dat  # map of slug -> set[ref]
    /known-vulns.dat  # GHSA known vulnerabilities

...and so forth. I'm using .dat as the suffix because I'm not sure how we want to serialize these yet (JSON would be the easy choice, but these files might get large, so something like bincode or maybe rkyv might make sense).
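For a concrete sense of the shape of something like common-refs.dat, here's a minimal sketch assuming the easy JSON-plus-serde choice; the type and field names below are made up for illustration, not decided:

```rust
use std::collections::{HashMap, HashSet};

use serde::{Deserialize, Serialize};

/// Hypothetical contents of /data-api/v1/common-refs.dat, assuming JSON via serde.
/// None of these names are final; bincode or rkyv would change the on-wire format
/// but not the overall shape.
#[derive(Serialize, Deserialize)]
struct CommonRefs {
    /// When this file was generated (Unix timestamp), mirroring last-update.dat.
    generated_at: u64,
    /// `owner/repo` slug -> set of known branch/tag refs,
    /// e.g. "actions/checkout" -> {"main", "v4", "v4.2.2", ...}.
    refs: HashMap<String, HashSet<String>>,
}
```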

We'd also want to cache these locally, so that repeated invocations of zizmor don't have to re-fetch them. This is a current limitation of our GitHub API usage as well.
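A minimal sketch of what the local freshness check could look like (the cache layout and the age threshold are placeholders, not decisions):

```rust
use std::fs;
use std::path::Path;
use std::time::{Duration, SystemTime};

/// Returns true if `path` exists and was modified within `max_age`.
/// Both the cache layout and the age threshold are placeholders.
fn cache_is_fresh(path: &Path, max_age: Duration) -> bool {
    fs::metadata(path)
        .and_then(|meta| meta.modified())
        .ok()
        .and_then(|mtime| SystemTime::now().duration_since(mtime).ok())
        .map(|age| age < max_age)
        .unwrap_or(false)
}
```

Something like `cache_is_fresh(&cache_dir.join("common-refs.dat"), Duration::from_secs(24 * 60 * 60))` (with a hypothetical `cache_dir`) before deciding whether to go to the network at all.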

In addition to being served over HTTP, a copy of these files (or some of them?) would also be baked into the zizmor builds themselves. This would ensure graceful degradation in offline mode.
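For the baked-in copies, something as simple as `include_bytes!` at build time would probably do (the paths here are illustrative only):

```rust
// Hypothetical: data files vendored into the crate at build time, used as the
// fallback of last resort in offline mode. Paths are illustrative only.
static EMBEDDED_COMMON_REFS: &[u8] = include_bytes!("../data/common-refs.dat");
static EMBEDDED_KNOWN_VULNS: &[u8] = include_bytes!("../data/known-vulns.dat");
```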

In sum, the order of operations with these data files would become the following (sketched in code after the list):

  • If offline, use only the embedded copy and/or previously cached copy (and maybe emit a warning if it's more than 24 hours old)
  • If online:
    • Check the age of the cached data, if it exists. If it's fresh, use it.
    • Check the age of the embedded data. If it's fresh, use it.
    • If neither is fresh, attempt to hit the static API. Fail gracefully if the request fails for any reason, and fall back.
    • If the static API's last-update isn't new, fall back.
    • Use the fresh response and cache it.
    • Finally, if the static data doesn't have what we're looking for, use the GitHub APIs.
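
As code, that flow might look roughly like the following; every type and helper name here is hypothetical, and error handling and staleness warnings are elided:

```rust
/// Rough sketch of the lookup order above; nothing here is a real signature.
struct DataFile {
    bytes: Vec<u8>,
    /// "Fresh enough to use without refetching" (e.g. under 24 hours old).
    fresh: bool,
}

struct FetchError;

// Hypothetical helpers corresponding to the steps in the list above.
fn cached_copy(_name: &str) -> Option<DataFile> { todo!() }
fn embedded_copy(_name: &str) -> DataFile { todo!() }
fn fetch_from_static_api(_name: &str) -> Result<DataFile, FetchError> { todo!() }
fn cache_locally(_name: &str, _data: &DataFile) { todo!() }

fn load_data_file(name: &str, offline: bool) -> DataFile {
    if offline {
        // Offline: embedded and/or previously cached copy only
        // (a staleness warning would go here).
        return cached_copy(name).unwrap_or_else(|| embedded_copy(name));
    }

    // Online: prefer whichever local copy is still fresh.
    if let Some(cached) = cached_copy(name).filter(|d| d.fresh) {
        return cached;
    }
    let embedded = embedded_copy(name);
    if embedded.fresh {
        return embedded;
    }

    // Neither local copy is fresh: try the static API, falling back
    // gracefully if the request fails or last-update has nothing newer.
    match fetch_from_static_api(name) {
        Ok(remote) if remote.fresh => {
            cache_locally(name, &remote);
            remote
        }
        _ => cached_copy(name).unwrap_or(embedded),
    }
}
```

The per-audit GitHub API calls would remain the final fallback for anything the static data doesn't cover.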

For the hot path (repos like actions/checkout), this should make things way faster. For the slow path, it'll at least be no slower than it was before.

CC @ubiratansoares for thoughts on the rough sketch above, since you arrived at the same idea 🙂 -- I'm not strongly bound to any of the design pieces above, so I'm curious if you have alternative ideas!
