Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mux prediction events #1405

Merged
merged 22 commits into from
Feb 12, 2024
Merged

Mux prediction events #1405

merged 22 commits into from
Feb 12, 2024

Conversation

technillogue
Copy link
Contributor

@technillogue technillogue commented Nov 29, 2023

A critical part of concurrent predictions is multiplexing several prediction outputs over the same pipe. This takes a stab at that. Once this is done, we might be able to drop some parts of runner.

We tag each _PublicEventType with a prediction id, introduce a Mux, and have a _read_events task responsible for reading events from the pipe and writing them to the mux. The mux adds it to the right queue, and then the places that previously called _wait instead call Mux.read.

We also add a semaphore and keep track of predictions in flight. READY is renamed to IDLE, but that may need to be reworked further.


Some challenges

  • contextvar to tag logs that were emitted from inside a predict()
    • however, logs emitted from cross-prediction stuff (like the actual batching code) have to be discarded to not leak information
  • aioprocessing uses a threadpool and "does not re-implement multiprocessing using asynchronous I/O". it hangs at shutdown. it's still useful for getting the rest of the code in shape for now.
    • aioprocessing itself, especially the part we use, is extremely small (730 loc) so we could vendor/fix it if we wanted to.
    • https://github.com/kchmck/aiopipe/tree/master is closest to what we would need, but looks a little awkward.
    • I also have an example that uses loop.connect_read_pipe correctly but it's a little wordy and would take some hacking to be suitable for duplex use
  • I don't know what to do with the hypothesis tests

[x] mux events
[x] doesn't deadlock
[x] hypothesis tests mostly pass
[ ] serious pipe implementation (future PR?)
[ ] cancellation
[x] READY / PROCESSING semaphore
[~] route predict logs to prediction if only one prediction is running

@technillogue
Copy link
Contributor Author

outstanding questions:

  • how to handle cancellation? it's hard to rely on the behavior of raising exceptions from signal handlers if there's an event loop running -- the exception could be raised in any coroutine or the event loop code instead of specifically inside predict. canceling tasks works, but the asyncio.CancelledError only gets raised on the next await and cannot happen inside blocking C code. my best guess is some combination of a new Cancel event and keeping SIGUSR1.
  • we probably need to have a mapping from prediction_id as used by the cancel endpoint to id as used by worker. why not use the same id for both?
  • I'm a little confused why read_setup_events/read_predict_events were separate hypothesis rules that had an argument. I've removed it for now and it works, but I don't understand why it was there in the first place
  • how do we want to test this stuff? Can we parameterize hypothesis with a few different predictors?
  • even without cancellation this is pretty big PR, I would love suggestions for how to break it up into smaller chunks

python/cog/server/eventtypes.py Outdated Show resolved Hide resolved
python/cog/server/helpers.py Show resolved Hide resolved
python/cog/server/worker.py Show resolved Hide resolved
python/cog/server/worker.py Outdated Show resolved Hide resolved
python/cog/server/helpers.py Outdated Show resolved Hide resolved
trace("recv", event)
except asyncio.CancelledError:
return
if id == "LOG" and "SETUP" in self._mux.outs:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe check self._state here instead? then we can get rid of _mux.outs

python/cog/server/worker.py Show resolved Hide resolved
python/cog/server/worker.py Show resolved Hide resolved
python/cog/server/worker.py Outdated Show resolved Hide resolved
python/cog/server/worker.py Show resolved Hide resolved
…logs

Signed-off-by: technillogue <technillogue@gmail.com>
Signed-off-by: technillogue <technillogue@gmail.com>
Signed-off-by: technillogue <technillogue@gmail.com>
…ad event loop

Signed-off-by: technillogue <technillogue@gmail.com>
Signed-off-by: technillogue <technillogue@gmail.com>
Signed-off-by: technillogue <technillogue@gmail.com>
…er have capacity

Signed-off-by: technillogue <technillogue@gmail.com>
previously this was in _read_events because it's a coroutine that will have the correct event loop. however, _read_events actually gets created in a task, which can run *after* the first mux.read call by setup. since setup is now the first async entrypoint in worker and in tests, we can safely move it there

Signed-off-by: technillogue <technillogue@gmail.com>
…aphore

Signed-off-by: technillogue <technillogue@gmail.com>
Signed-off-by: technillogue <technillogue@gmail.com>
@yorickvP yorickvP added the async label Feb 8, 2024
@technillogue technillogue merged commit fb41455 into async Feb 12, 2024
11 checks passed
@technillogue technillogue deleted the syl/mux branch February 12, 2024 21:09
technillogue added a commit that referenced this pull request Feb 13, 2024
* race utility for racing awaitables
* start mux, tag events with id, read pipe in a task, get events from mux
* use async pipe for async child loop
* _shutting_down vs _terminating
* race with shutdown event
* keep reading events during shutdown, but call terminate after the last Done
* emit heartbeats from mux.read
* don't use _wait. instead, setup reads event from the mux too
* worker semaphore and prediction ctx
* where _wait used to raise a fatal error, have _read_events set an error on Mux, and then Mux.read can raise the error in the right context. otherwise, the exception is stuck in a task and doesn't propagate correctly
* fix event loop errors for <3.9
* keep track of predictions in flight explicitly and use that to route logs
* don't wait for executor shutdown
* progress: check for cancelation in task done_handler
* let mux check if child is alive and set mux shutdown after leaving read event loop
* close pipe when exiting
* predict requires IDLE or PROCESSING
* try adding a BUSY state distinct from PROCESSING when we no longer have capacity
* move resetting events to setup() instead of _read_events()

previously this was in _read_events because it's a coroutine that will have the correct event loop. however, _read_events actually gets created in a task, which can run *after* the first mux.read call by setup. since setup is now the first async entrypoint in worker and in tests, we can safely move it there

* state_from_predictions_in_flight instead of checking the value of semaphore
* make prediction_ctx "private"

Signed-off-by: technillogue <technillogue@gmail.com>
technillogue added a commit that referenced this pull request Feb 13, 2024
* race utility for racing awaitables
* start mux, tag events with id, read pipe in a task, get events from mux
* use async pipe for async child loop
* _shutting_down vs _terminating
* race with shutdown event
* keep reading events during shutdown, but call terminate after the last Done
* emit heartbeats from mux.read
* don't use _wait. instead, setup reads event from the mux too
* worker semaphore and prediction ctx
* where _wait used to raise a fatal error, have _read_events set an error on Mux, and then Mux.read can raise the error in the right context. otherwise, the exception is stuck in a task and doesn't propagate correctly
* fix event loop errors for <3.9
* keep track of predictions in flight explicitly and use that to route logs
* don't wait for executor shutdown
* progress: check for cancelation in task done_handler
* let mux check if child is alive and set mux shutdown after leaving read event loop
* close pipe when exiting
* predict requires IDLE or PROCESSING
* try adding a BUSY state distinct from PROCESSING when we no longer have capacity
* move resetting events to setup() instead of _read_events()

previously this was in _read_events because it's a coroutine that will have the correct event loop. however, _read_events actually gets created in a task, which can run *after* the first mux.read call by setup. since setup is now the first async entrypoint in worker and in tests, we can safely move it there

* state_from_predictions_in_flight instead of checking the value of semaphore
* make prediction_ctx "private"

Signed-off-by: technillogue <technillogue@gmail.com>
technillogue added a commit that referenced this pull request Feb 13, 2024
* race utility for racing awaitables
* start mux, tag events with id, read pipe in a task, get events from mux
* use async pipe for async child loop
* _shutting_down vs _terminating
* race with shutdown event
* keep reading events during shutdown, but call terminate after the last Done
* emit heartbeats from mux.read
* don't use _wait. instead, setup reads event from the mux too
* worker semaphore and prediction ctx
* where _wait used to raise a fatal error, have _read_events set an error on Mux, and then Mux.read can raise the error in the right context. otherwise, the exception is stuck in a task and doesn't propagate correctly
* fix event loop errors for <3.9
* keep track of predictions in flight explicitly and use that to route logs
* don't wait for executor shutdown
* progress: check for cancelation in task done_handler
* let mux check if child is alive and set mux shutdown after leaving read event loop
* close pipe when exiting
* predict requires IDLE or PROCESSING
* try adding a BUSY state distinct from PROCESSING when we no longer have capacity
* move resetting events to setup() instead of _read_events()

previously this was in _read_events because it's a coroutine that will have the correct event loop. however, _read_events actually gets created in a task, which can run *after* the first mux.read call by setup. since setup is now the first async entrypoint in worker and in tests, we can safely move it there

* state_from_predictions_in_flight instead of checking the value of semaphore
* make prediction_ctx "private"

Signed-off-by: technillogue <technillogue@gmail.com>
technillogue added a commit that referenced this pull request Feb 13, 2024
* race utility for racing awaitables
* start mux, tag events with id, read pipe in a task, get events from mux
* use async pipe for async child loop
* _shutting_down vs _terminating
* race with shutdown event
* keep reading events during shutdown, but call terminate after the last Done
* emit heartbeats from mux.read
* don't use _wait. instead, setup reads event from the mux too
* worker semaphore and prediction ctx
* where _wait used to raise a fatal error, have _read_events set an error on Mux, and then Mux.read can raise the error in the right context. otherwise, the exception is stuck in a task and doesn't propagate correctly
* fix event loop errors for <3.9
* keep track of predictions in flight explicitly and use that to route logs
* don't wait for executor shutdown
* progress: check for cancelation in task done_handler
* let mux check if child is alive and set mux shutdown after leaving read event loop
* close pipe when exiting
* predict requires IDLE or PROCESSING
* try adding a BUSY state distinct from PROCESSING when we no longer have capacity
* move resetting events to setup() instead of _read_events()

previously this was in _read_events because it's a coroutine that will have the correct event loop. however, _read_events actually gets created in a task, which can run *after* the first mux.read call by setup. since setup is now the first async entrypoint in worker and in tests, we can safely move it there

* state_from_predictions_in_flight instead of checking the value of semaphore
* make prediction_ctx "private"

Signed-off-by: technillogue <technillogue@gmail.com>
technillogue added a commit that referenced this pull request Feb 21, 2024
* race utility for racing awaitables
* start mux, tag events with id, read pipe in a task, get events from mux
* use async pipe for async child loop
* _shutting_down vs _terminating
* race with shutdown event
* keep reading events during shutdown, but call terminate after the last Done
* emit heartbeats from mux.read
* don't use _wait. instead, setup reads event from the mux too
* worker semaphore and prediction ctx
* where _wait used to raise a fatal error, have _read_events set an error on Mux, and then Mux.read can raise the error in the right context. otherwise, the exception is stuck in a task and doesn't propagate correctly
* fix event loop errors for <3.9
* keep track of predictions in flight explicitly and use that to route logs
* don't wait for executor shutdown
* progress: check for cancelation in task done_handler
* let mux check if child is alive and set mux shutdown after leaving read event loop
* close pipe when exiting
* predict requires IDLE or PROCESSING
* try adding a BUSY state distinct from PROCESSING when we no longer have capacity
* move resetting events to setup() instead of _read_events()

previously this was in _read_events because it's a coroutine that will have the correct event loop. however, _read_events actually gets created in a task, which can run *after* the first mux.read call by setup. since setup is now the first async entrypoint in worker and in tests, we can safely move it there

* state_from_predictions_in_flight instead of checking the value of semaphore
* make prediction_ctx "private"

Signed-off-by: technillogue <technillogue@gmail.com>
technillogue added a commit that referenced this pull request Feb 21, 2024
* race utility for racing awaitables
* start mux, tag events with id, read pipe in a task, get events from mux
* use async pipe for async child loop
* _shutting_down vs _terminating
* race with shutdown event
* keep reading events during shutdown, but call terminate after the last Done
* emit heartbeats from mux.read
* don't use _wait. instead, setup reads event from the mux too
* worker semaphore and prediction ctx
* where _wait used to raise a fatal error, have _read_events set an error on Mux, and then Mux.read can raise the error in the right context. otherwise, the exception is stuck in a task and doesn't propagate correctly
* fix event loop errors for <3.9
* keep track of predictions in flight explicitly and use that to route logs
* don't wait for executor shutdown
* progress: check for cancelation in task done_handler
* let mux check if child is alive and set mux shutdown after leaving read event loop
* close pipe when exiting
* predict requires IDLE or PROCESSING
* try adding a BUSY state distinct from PROCESSING when we no longer have capacity
* move resetting events to setup() instead of _read_events()

previously this was in _read_events because it's a coroutine that will have the correct event loop. however, _read_events actually gets created in a task, which can run *after* the first mux.read call by setup. since setup is now the first async entrypoint in worker and in tests, we can safely move it there

* state_from_predictions_in_flight instead of checking the value of semaphore
* make prediction_ctx "private"

Signed-off-by: technillogue <technillogue@gmail.com>
technillogue added a commit that referenced this pull request Feb 21, 2024
* race utility for racing awaitables
* start mux, tag events with id, read pipe in a task, get events from mux
* use async pipe for async child loop
* _shutting_down vs _terminating
* race with shutdown event
* keep reading events during shutdown, but call terminate after the last Done
* emit heartbeats from mux.read
* don't use _wait. instead, setup reads event from the mux too
* worker semaphore and prediction ctx
* where _wait used to raise a fatal error, have _read_events set an error on Mux, and then Mux.read can raise the error in the right context. otherwise, the exception is stuck in a task and doesn't propagate correctly
* fix event loop errors for <3.9
* keep track of predictions in flight explicitly and use that to route logs
* don't wait for executor shutdown
* progress: check for cancelation in task done_handler
* let mux check if child is alive and set mux shutdown after leaving read event loop
* close pipe when exiting
* predict requires IDLE or PROCESSING
* try adding a BUSY state distinct from PROCESSING when we no longer have capacity
* move resetting events to setup() instead of _read_events()

previously this was in _read_events because it's a coroutine that will have the correct event loop. however, _read_events actually gets created in a task, which can run *after* the first mux.read call by setup. since setup is now the first async entrypoint in worker and in tests, we can safely move it there

* state_from_predictions_in_flight instead of checking the value of semaphore
* make prediction_ctx "private"

Signed-off-by: technillogue <technillogue@gmail.com>
technillogue added a commit that referenced this pull request Jun 19, 2024
* race utility for racing awaitables
* start mux, tag events with id, read pipe in a task, get events from mux
* use async pipe for async child loop
* _shutting_down vs _terminating
* race with shutdown event
* keep reading events during shutdown, but call terminate after the last Done
* emit heartbeats from mux.read
* don't use _wait. instead, setup reads event from the mux too
* worker semaphore and prediction ctx
* where _wait used to raise a fatal error, have _read_events set an error on Mux, and then Mux.read can raise the error in the right context. otherwise, the exception is stuck in a task and doesn't propagate correctly
* fix event loop errors for <3.9
* keep track of predictions in flight explicitly and use that to route logs
* don't wait for executor shutdown
* progress: check for cancelation in task done_handler
* let mux check if child is alive and set mux shutdown after leaving read event loop
* close pipe when exiting
* predict requires IDLE or PROCESSING
* try adding a BUSY state distinct from PROCESSING when we no longer have capacity
* move resetting events to setup() instead of _read_events()

previously this was in _read_events because it's a coroutine that will have the correct event loop. however, _read_events actually gets created in a task, which can run *after* the first mux.read call by setup. since setup is now the first async entrypoint in worker and in tests, we can safely move it there

* state_from_predictions_in_flight instead of checking the value of semaphore
* make prediction_ctx "private"

Signed-off-by: technillogue <technillogue@gmail.com>
technillogue added a commit that referenced this pull request Jun 19, 2024
* race utility for racing awaitables
* start mux, tag events with id, read pipe in a task, get events from mux
* use async pipe for async child loop
* _shutting_down vs _terminating
* race with shutdown event
* keep reading events during shutdown, but call terminate after the last Done
* emit heartbeats from mux.read
* don't use _wait. instead, setup reads event from the mux too
* worker semaphore and prediction ctx
* where _wait used to raise a fatal error, have _read_events set an error on Mux, and then Mux.read can raise the error in the right context. otherwise, the exception is stuck in a task and doesn't propagate correctly
* fix event loop errors for <3.9
* keep track of predictions in flight explicitly and use that to route logs
* don't wait for executor shutdown
* progress: check for cancelation in task done_handler
* let mux check if child is alive and set mux shutdown after leaving read event loop
* close pipe when exiting
* predict requires IDLE or PROCESSING
* try adding a BUSY state distinct from PROCESSING when we no longer have capacity
* move resetting events to setup() instead of _read_events()

previously this was in _read_events because it's a coroutine that will have the correct event loop. however, _read_events actually gets created in a task, which can run *after* the first mux.read call by setup. since setup is now the first async entrypoint in worker and in tests, we can safely move it there

* state_from_predictions_in_flight instead of checking the value of semaphore
* make prediction_ctx "private"

Signed-off-by: technillogue <technillogue@gmail.com>
technillogue added a commit that referenced this pull request Jul 18, 2024
* race utility for racing awaitables
* start mux, tag events with id, read pipe in a task, get events from mux
* use async pipe for async child loop
* _shutting_down vs _terminating
* race with shutdown event
* keep reading events during shutdown, but call terminate after the last Done
* emit heartbeats from mux.read
* don't use _wait. instead, setup reads event from the mux too
* worker semaphore and prediction ctx
* where _wait used to raise a fatal error, have _read_events set an error on Mux, and then Mux.read can raise the error in the right context. otherwise, the exception is stuck in a task and doesn't propagate correctly
* fix event loop errors for <3.9
* keep track of predictions in flight explicitly and use that to route logs
* don't wait for executor shutdown
* progress: check for cancelation in task done_handler
* let mux check if child is alive and set mux shutdown after leaving read event loop
* close pipe when exiting
* predict requires IDLE or PROCESSING
* try adding a BUSY state distinct from PROCESSING when we no longer have capacity
* move resetting events to setup() instead of _read_events()

previously this was in _read_events because it's a coroutine that will have the correct event loop. however, _read_events actually gets created in a task, which can run *after* the first mux.read call by setup. since setup is now the first async entrypoint in worker and in tests, we can safely move it there

* state_from_predictions_in_flight instead of checking the value of semaphore
* make prediction_ctx "private"

Signed-off-by: technillogue <technillogue@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants