Rip out current scheduling code - it's not working well #1

aboodman · 2022-01-31T08:53:01Z

We are experiencing a problem with this server where the peer-to-peer latency is unexpectedly high: you move the mouse in one tab and it takes a while to show up in the other tab. Watching the server output, I can see that something is going wrong with scheduling: the server calculates that the frame should play in the future and waits -- sometimes a while! -- to play it.

It seems to get worse the longer the client is connected. I regularly see the server wait a full second, which makes the experience pretty bad.

Cloudflare Workers have hobbled clocks for security reasons: https://developers.cloudflare.com/workers/learning/security-model#there-is-no-fix-for-spectre. It seems very likely that the behavior observed is related to this somehow, but I can't figure out what's going wrong yet.

aboodman · 2022-01-31T08:57:11Z

As a first step I think it would be good to rip out the scheduling code and do something much simpler:

- When mutations come in they get queued per-client
- While there are any mutations pending, the server runs in a constant 16ms interval (e.g., using setInterval)
- Each time around the interval, the server processes all available mutations from all clients
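The three steps above could be sketched roughly as follows. This is a minimal illustration, not the code that was merged: `SimpleScheduler`, `Mutation`, `ClientID`, and the `process` callback are hypothetical names, and the sketch assumes mutations can be applied synchronously.

```typescript
// Hypothetical types for illustration; not taken from the actual codebase.
type ClientID = string;
interface Mutation {
  id: number;
  name: string;
  args: unknown;
}

const FRAME_MS = 16;

class SimpleScheduler {
  private pending = new Map<ClientID, Mutation[]>();
  private timer: ReturnType<typeof setInterval> | undefined = undefined;
  private process: (clientID: ClientID, m: Mutation) => void;

  constructor(process: (clientID: ClientID, m: Mutation) => void) {
    this.process = process;
  }

  get running(): boolean {
    return this.timer !== undefined;
  }

  // Queue a mutation under its client and start the interval if it is idle.
  enqueue(clientID: ClientID, m: Mutation): void {
    const queue = this.pending.get(clientID);
    if (queue) {
      queue.push(m);
    } else {
      this.pending.set(clientID, [m]);
    }
    if (this.timer === undefined) {
      this.timer = setInterval(() => this.tick(), FRAME_MS);
    }
  }

  // One turn of the loop: process everything queued so far, for all clients.
  tick(): void {
    // Swap the map first so mutations enqueued during processing wait a turn.
    const batch = this.pending;
    this.pending = new Map();
    for (const [clientID, queue] of batch) {
      for (const m of queue) {
        this.process(clientID, m);
      }
    }
    // Stop the interval once nothing is pending; enqueue() restarts it.
    if (this.pending.size === 0 && this.timer !== undefined) {
      clearInterval(this.timer);
      this.timer = undefined;
    }
  }
}
```

Stopping the interval when the queues are empty is one way to read "while there are any mutations pending"; leaving the timer running and doing nothing on empty turns would also fit the description.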

This will not work long term for a few reasons:

- We need to preserve the intended timing from clients to get perfect playback
- We need to buffer on the server-side (per client) to account for inconsistent network times
- This proposal would result in us writing to DO every frame, but we've calculated that would be expensive -- we really only want to commit every fourth frame.

... but for now, it will give us a rough idea of how well Durable Objects can work and will enable our customers to begin playing.

It also hopefully provides a baseline that works fairly well to start understanding if we can implement the full gameloop on workers.
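For the third limitation above (writing to the DO every frame), one possible later direction is to buffer writes across frames and commit only on every fourth one. The sketch below is an assumption about how that could look, not something proposed in this thread: `FrameBatcher` and the `commit` callback are hypothetical, and `commit` merely stands in for whatever storage write is actually used.

```typescript
// Commit interval taken from the "every fourth frame" figure in the comment above.
const COMMIT_EVERY_N_FRAMES = 4;

// Hypothetical helper: accumulates per-frame writes and flushes them in batches.
class FrameBatcher<W> {
  private frame = 0;
  private buffered: W[] = [];
  private commit: (writes: W[]) => void;

  constructor(commit: (writes: W[]) => void) {
    this.commit = commit;
  }

  // Call once at the end of each frame with the writes that frame produced.
  endFrame(writes: W[]): void {
    this.buffered.push(...writes);
    this.frame++;
    const due = this.frame % COMMIT_EVERY_N_FRAMES === 0;
    if (due && this.buffered.length > 0) {
      this.commit(this.buffered);
      this.buffered = [];
    }
  }
}
```

The trade-off is that up to three frames of writes sit only in memory between commits, so a real implementation would need to decide what happens to them if the object is evicted.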

phritz · 2022-01-31T21:31:14Z

Though something clock-related seems a likely culprit, multi-tenancy could easily lead to performance problems for us. I don't understand where/how CF runs workers or DOs at this point. I guess this is another thing that could be a dealbreaker; one of us should look into it (me?). Likely starting places:

https://developers.cloudflare.com/workers/learning/security-model
https://blog.cloudflare.com/cloud-computing-without-containers/
etc

This just runs mutations as fast as they come in 🤷‍♂️ and rips out all the complex timing stuff. The goal is to see how fast it can go and sort of verify that what we are seeing is likely logical bugs / maybe bad APIs and not perf limitations, as well as to create a well-performing baseline to build on top of. Fixes #1

aboodman changed the title ~~Very bad end-to-end latency~~ Rip out current scheduling code - it's not working well Jan 31, 2022

aboodman mentioned this issue Feb 1, 2022

Simplify scheduling of running mutations. #11

Merged

aboodman closed this as completed in #11 Feb 1, 2022
