Caching issues breaking fetching binaries #3224

Open
domdomegg opened this issue Mar 15, 2023 · 5 comments

@domdomegg

When a new node version is released, the list of versions at https://nodejs.org/dist/index.json is updated and the binaries are uploaded at https://nodejs.org/dist/vVERSION (e.g. https://nodejs.org/dist/v19.8.1/).

However, the website sits behind a Cloudflare cache and sometimes the data seems out of sync: the API responds saying there is a version, but the binaries are not available.

This then causes tools that use this information to download the latest version to fail, e.g. actions/setup-node#717.

I suspect there might be a few things contributing to this:

  1. We should set Cache-Control: no-cache headers on 404 pages. Otherwise, caching proxies in the middle may fall back to default values, resulting in stale cache data. I think this might be what's happening inside the GitHub Actions environment, but I'm not sure. (A quick way to check what a proxy actually sees is sketched after this list.)
  2. We should ensure things are being purged correctly: having tried a few links from my own computer with no particular proxy cache between me and Cloudflare, it looks like a few URLs (e.g. https://nodejs.org/dist/v19.8.1/win-x64/node.exe) took at least a few hours to update after the binaries should have been there.
  3. (less important, but would be nice) We should ensure the CDN is able to serve the binaries before we update the index.json. Provided the other items are taken care of, this prevents failures during the window where Cloudflare has invalidated the index.json but not yet the binary URLs.
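
For illustration, here's a quick way to observe what Cloudflare (and anything in front of it) returns for one of these 404s. The version below is just a placeholder that doesn't exist; any not-yet-promoted path should behave the same way:

```sh
# Inspect the response headers for a dist path that doesn't exist yet.
# v99.99.99 is a placeholder version, used only to force a 404.
curl -sI https://nodejs.org/dist/v99.99.99/win-x64/node.exe \
  | grep -iE '^(HTTP/|cache-control|cf-cache-status|age)'
```

If the 404 comes back without a Cache-Control header and with cf-cache-status: HIT, that would match the stale-404 behaviour described above.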
@rvagg
Member

rvagg commented Mar 16, 2023

index.* and the release files are "promoted" at the same time and a cache flush is triggered afterward.

I think what you're seeing is a flapping load balancer, which continues to happen and really needs to be diagnosed and fixed (I suspect our main server has I/O issues, triggered by rsync and perhaps other things, so it gets overloaded and can't respond to health checks in time). So what happens is that CF switches from our primary server to the backup server, but the backup server doesn't strictly stay in sync with the main server; that only happens periodically (I don't recall the frequency, probably something like every 15 minutes). So what I think you're experiencing is this:

  1. A new Node.js release (or some other event causing traffic spikes to the backend, which should only be coming from CF edge nodes, btw) triggers a flurry of download attempts
  2. The primary backend has one of its pathetic panic attacks (it's not a tiny server; until a couple of years ago we bypassed CF and served all downloads directly from it)
  3. CF's LB flaps and switches to our backup server
  4. The backup server doesn't yet have the new release files, so it's a legitimate 404
  5. Perhaps even after recovery there might be a sticky 404 recorded by CF? I would hope not, but there's no triggering of a cache flush on our end when this all occurs, so it might be messy if they assume both our LB backends have the same content

Even if we fix the LB flapping, it's probably not good to have them out of sync like this. I've thought for a long time that asset promotion by releasers should also have some step that causes a sync to the secondary server.

Maybe the easiest and most sensible thing to do here (regardless of the LB issues) would be to cause a sync to the secondary server before we ask CF to purge their cache. Here would be the place to do that: https://github.com/nodejs/build/blob/main/ansible/www-standalone/resources/scripts/cdn-purge.sh.j2 - this script runs privileged, so prior to the curl it could call out to the rsync that gets the backend in sync. I just don't recall whether the rsync is a push from the primary or a pull from the secondary, so it might not be as simple as calling another script.
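
Roughly, the idea would be something like the sketch below. The host names, paths, rsync direction and purge payload here are placeholders, not what cdn-purge.sh.j2 actually does today:

```sh
#!/bin/bash
# Hypothetical sketch: bring the secondary backend in sync before purging the CDN.
# Hosts, paths and credentials are placeholders, not the real infrastructure.

# 1. Push the promoted release assets to the secondary server (direction assumed).
rsync -a --delete /home/dist/nodejs/release/ backup.example.org:/home/dist/nodejs/release/

# 2. Only then ask Cloudflare to purge its cache (standard CF purge_cache API).
curl -s -X POST "https://api.cloudflare.com/client/v4/zones/${CF_ZONE_ID}/purge_cache" \
  -H "Authorization: Bearer ${CF_API_TOKEN}" \
  -H "Content-Type: application/json" \
  --data '{"purge_everything":true}'
```

That ordering would at least guarantee that whichever backend CF lands on after the purge has the new files.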

@targos
Member

targos commented Mar 16, 2023

#3226 added Cache-Control: no-cache to 404 responses. Without it CF was keeping the response in cache for 3 minutes: https://developers.cloudflare.com/cache/how-to/configure-cache-status-code/#edge-ttl

@richardlau
Member

I found an issue where one of the index.html files no longer exists after the switch to Next.js, which meant that our check-build-site.sh script was triggering a website rebuild every five minutes, which cannot be helping. #3230
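
For context, the failure mode is roughly this pattern (the path and rebuild command below are placeholders, not what check-build-site.sh actually contains):

```sh
#!/bin/bash
# Hypothetical sketch of a "rebuild if the site looks broken" cron check.
# SENTINEL and the rebuild command are placeholders.
SENTINEL=/path/to/site/en/index.html

if [ ! -f "$SENTINEL" ]; then
  # If the sentinel file was removed (e.g. by the Next.js switch), this branch
  # fires on every run and the site gets rebuilt every five minutes.
  echo "$(date): $SENTINEL missing, triggering rebuild"
  # ./build-site.sh   # placeholder for the real rebuild step
fi
```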


This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.

@github-actions github-actions bot added the stale label Jan 11, 2024
@domdomegg
Author

Commenting to avoid stale closure. I think we've mitigated this somewhat, but this is still an issue.

@github-actions github-actions bot removed the stale label Jan 17, 2024