Rip out current scheduling code - it's not working well #1

Closed
aboodman opened this issue Jan 31, 2022 · 2 comments · Fixed by #11

Comments

@aboodman
Contributor

We are experiencing a problem with this server where the peer-to-peer latency is unexpectedly high: you move a mouse on one tab and it takes a while to show up on the other tab. Watching the server output, I can see that something is going wrong with scheduling: the server calculates that the frame should play in the future and waits -- sometimes a while! -- to play it.

It seems to get worse the longer the client is connected. I regularly see the server wait a full second, which makes the experience pretty bad.

Cloudflare Workers have hobbled clocks for security reasons: https://developers.cloudflare.com/workers/learning/security-model#there-is-no-fix-for-spectre. It seems very likely that the behavior observed is related to this somehow, but I can't figure out what's going wrong yet.

@aboodman
Contributor Author

aboodman commented Jan 31, 2022

As a first step I think it would be good to rip out the scheduling code and do something much simpler:

  • When mutations come in they get queued per-client
  • While there are any mutations pending, the server runs in a constant 16ms interval (e.g., using setInterval)
  • Each time around the interval, the server processes all available mutations from all clients
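A minimal sketch of the three steps above, not the actual implementation. The `Mutation` shape, the `SimpleLoop` class, and the `apply` callback are hypothetical names chosen for illustration; only the per-client queue, the 16ms `setInterval`, and "process everything each time around" come from the proposal.

```typescript
// Hypothetical shapes: the issue does not say what a mutation looks like.
type Mutation = { id: number; name: string };
type ClientID = string;

const FRAME_MS = 16; // the constant interval from the proposal

class SimpleLoop {
  private pending = new Map<ClientID, Mutation[]>();
  private timer: ReturnType<typeof setInterval> | undefined;

  // `apply` stands in for whatever actually executes a mutation (assumption).
  constructor(private apply: (clientID: ClientID, m: Mutation) => void) {}

  get running(): boolean {
    return this.timer !== undefined;
  }

  // Queue the mutation per client and make sure the interval is running.
  push(clientID: ClientID, m: Mutation): void {
    let queue = this.pending.get(clientID);
    if (queue === undefined) {
      queue = [];
      this.pending.set(clientID, queue);
    }
    queue.push(m);
    if (this.timer === undefined) {
      this.timer = setInterval(() => this.tick(), FRAME_MS);
    }
  }

  // One turn of the interval: process all available mutations from all clients.
  tick(): void {
    const batch = this.pending;
    this.pending = new Map();
    batch.forEach((queue, clientID) => {
      for (const m of queue) this.apply(clientID, m);
    });
    // Only run while mutations are pending; stop once the queues are empty.
    if (this.pending.size === 0 && this.timer !== undefined) {
      clearInterval(this.timer);
      this.timer = undefined;
    }
  }
}
```

Mutations keep their arrival order within each client, but no attempt is made to preserve the timing the client intended.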

This will not work long term for a few reasons:

  1. We need to preserve the intended timing from clients to get perfect playback
  2. We need to buffer on the server-side (per client) to account for inconsistent network times
  3. This proposal would result in us writing to DO every frame, but we've calculated that would be expensive -- we really only want to commit every fourth frame.

... but for now, it will give us a rough idea of how well Durable Objects can work and will enable our customers to begin playing.

It also hopefully provides a baseline that works fairly well, from which we can start to understand whether we can implement the full game loop on Workers.
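For reference, the "commit every fourth frame" idea from reason 3 could look roughly like the sketch below. This is an illustration under assumptions, not a design: `BatchedWriter` and its `commit` callback are hypothetical, and the callback stands in for whatever write to the Durable Object's storage would really be used.

```typescript
const COMMIT_EVERY_N_FRAMES = 4; // "every fourth frame" from reason 3

class BatchedWriter {
  private dirty = new Map<string, unknown>();
  private framesSinceCommit = 0;

  // `commit` is a placeholder for the real storage write (assumption).
  constructor(private commit: (entries: Map<string, unknown>) => void) {}

  // Record a change in memory; later writes to the same key overwrite earlier ones.
  set(key: string, value: unknown): void {
    this.dirty.set(key, value);
  }

  // Call once at the end of each frame; flushes on every fourth call.
  endFrame(): void {
    this.framesSinceCommit++;
    if (this.framesSinceCommit < COMMIT_EVERY_N_FRAMES) return;
    this.framesSinceCommit = 0;
    if (this.dirty.size === 0) return; // nothing changed, skip the write
    const batch = this.dirty;
    this.dirty = new Map();
    this.commit(batch);
  }
}
```

The tradeoff is that up to three frames of changes live only in memory between commits.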

@aboodman changed the title from "Very bad end-to-end latency" to "Rip out current scheduling code - it's not working well" on Jan 31, 2022
@phritz
Contributor

phritz commented Jan 31, 2022

Though something clock-related seems a likely culprit, multi-tenancy could easily lead to performance problems for us. I don't understand where/how CF runs Workers or DOs at this point. Guess this is another thing that could be a dealbreaker; one of us should look into it (me?). Likely starting places:

https://developers.cloudflare.com/workers/learning/security-model
https://blog.cloudflare.com/cloud-computing-without-containers/
etc

aboodman added a commit that referenced this issue Feb 1, 2022
This just runs mutations as fast as they come in 🤷‍♂️ and rips
out all the complex timing stuff. The goal is to see how fast it
can go and sort of verify that what we are seeing is likely logical
bugs / maybe bad APIs and not perf limitations, as well as to create
a well performing baseline to build on top of.

Fixes #1
2 participants