Mux prediction events #1405

technillogue · 2023-11-29T20:03:50Z

A critical part of concurrent predictions is multiplexing several prediction outputs over the same pipe. This takes a stab at that. Once this is done, we might be able to drop some parts of runner.

We tag each _PublicEventType with a prediction id, introduce a Mux, and have a _read_events task responsible for reading events from the pipe and writing them to the mux. The mux puts each event on the queue for its prediction id, and the places that previously called _wait now call Mux.read instead.
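To make the shape of this concrete, here is a minimal sketch of the idea, not the code in this PR. It assumes one `asyncio.Queue` per prediction id and uses an in-memory queue as a stand-in for the pipe; the method bodies and the `read_events` and `demo` helpers are illustrative.

```python
import asyncio
from collections import defaultdict
from typing import Any, DefaultDict, List, Tuple


class Mux:
    """Routes (id, event) pairs to one queue per prediction id."""

    def __init__(self) -> None:
        # Queues are created lazily, the first time an id is written or read.
        self.outs: DefaultDict[str, "asyncio.Queue[Any]"] = defaultdict(asyncio.Queue)

    async def write(self, id: str, event: Any) -> None:
        await self.outs[id].put(event)

    async def read(self, id: str) -> Any:
        return await self.outs[id].get()


async def read_events(pipe: "asyncio.Queue[Tuple[str, Any]]", mux: Mux) -> None:
    # Stand-in for the _read_events task: drain the shared pipe into the mux.
    while True:
        id, event = await pipe.get()
        await mux.write(id, event)


async def demo() -> List[Any]:
    pipe: "asyncio.Queue[Tuple[str, Any]]" = asyncio.Queue()  # fake pipe
    mux = Mux()
    reader = asyncio.create_task(read_events(pipe, mux))
    # Events for two predictions arrive interleaved on the same pipe.
    for item in [("a", "log 1"), ("b", "log 2"), ("a", "done")]:
        await pipe.put(item)
    results = [await mux.read("a"), await mux.read("b"), await mux.read("a")]
    reader.cancel()
    return results
```

Each caller only ever sees the events tagged with its own id, in the order they were written to the pipe.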

We also add a semaphore and keep track of predictions in flight. READY is renamed to IDLE, but that may need to be reworked further.
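A rough sketch of how the semaphore and the in-flight set could fit together, assuming the state is derived from the number of predictions in flight. The `Worker` class, the `concurrency` parameter and the sleep standing in for real work are all hypothetical; the BUSY state is the one tried in a later commit of this PR.

```python
import asyncio
import enum
from typing import List, Set


class WorkerState(enum.Enum):
    IDLE = "idle"              # nothing in flight (previously READY)
    PROCESSING = "processing"  # something in flight, capacity left
    BUSY = "busy"              # at capacity


class Worker:
    def __init__(self, concurrency: int = 2) -> None:
        self._concurrency = concurrency
        self._semaphore = asyncio.Semaphore(concurrency)
        self._predictions_in_flight: Set[str] = set()

    def state(self) -> WorkerState:
        # Derive the state from the in-flight set, not the semaphore's value.
        in_flight = len(self._predictions_in_flight)
        if in_flight == 0:
            return WorkerState.IDLE
        if in_flight >= self._concurrency:
            return WorkerState.BUSY
        return WorkerState.PROCESSING

    async def predict(self, id: str, seconds: float = 0.01) -> str:
        if self.state() is WorkerState.BUSY:
            raise RuntimeError("predict requires IDLE or PROCESSING")
        async with self._semaphore:
            self._predictions_in_flight.add(id)
            try:
                await asyncio.sleep(seconds)  # stand-in for the real prediction
                return f"{id} done"
            finally:
                self._predictions_in_flight.discard(id)


async def demo() -> List[WorkerState]:
    # Build the worker inside a coroutine so asyncio primitives bind to the
    # running loop on older Python versions.
    worker = Worker(concurrency=2)
    states = [worker.state()]
    first = asyncio.create_task(worker.predict("a"))
    await asyncio.sleep(0)  # let the first prediction start
    states.append(worker.state())
    second = asyncio.create_task(worker.predict("b"))
    await asyncio.sleep(0)  # let the second prediction start
    states.append(worker.state())
    await asyncio.gather(first, second)
    states.append(worker.state())
    return states
```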

Some challenges

- contextvar to tag logs that were emitted from inside a predict()
  - however, logs emitted from cross-prediction stuff (like the actual batching code) have to be discarded to not leak information
- aioprocessing uses a threadpool and "does not re-implement multiprocessing using asynchronous I/O". it hangs at shutdown. it's still useful for getting the rest of the code in shape for now.
  - aioprocessing itself, especially the part we use, is extremely small (730 loc) so we could vendor/fix it if we wanted to.
  - https://github.com/kchmck/aiopipe/tree/master is closest to what we would need, but looks a little awkward.
  - I also have an example that uses loop.connect_read_pipe correctly but it's a little wordy and would take some hacking to be suitable for duplex use
- I don't know what to do with the hypothesis tests
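The contextvar idea from the first challenge could look roughly like this. It is an in-process sketch only: the real worker captures logs from a child process, and `_prediction_id`, `emit_log` and `captured` are names made up for the example. It relies on the fact that each asyncio task runs in its own copy of the context.

```python
import asyncio
import contextvars
from typing import List, Tuple

# Hypothetical context variable; None means "not inside any predict()".
_prediction_id = contextvars.ContextVar("prediction_id", default=None)

captured: List[Tuple[str, str]] = []


def emit_log(message: str) -> None:
    pid = _prediction_id.get()
    if pid is None:
        # Emitted by cross-prediction code (e.g. batching): discard it so it
        # cannot leak into some unrelated prediction's logs.
        return
    captured.append((pid, message))


async def predict(pid: str) -> None:
    # The value set here is only visible inside this task's context.
    _prediction_id.set(pid)
    emit_log("starting")
    await asyncio.sleep(0)
    emit_log("finished")


async def demo() -> List[Tuple[str, str]]:
    captured.clear()
    emit_log("batching")  # outside any predict(): dropped
    # gather() wraps each coroutine in its own task, so the ids do not mix.
    await asyncio.gather(predict("a"), predict("b"))
    return list(captured)
```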

[x] mux events
[x] doesn't deadlock
[x] hypothesis tests mostly pass
[ ] serious pipe implementation (future PR?)
[ ] cancellation
[x] READY / PROCESSING semaphore
[~] route predict logs to prediction if only one prediction is running

technillogue · 2023-12-05T23:30:46Z

outstanding questions:

- how to handle cancellation? raising exceptions from signal handlers is hard to rely on when an event loop is running: the exception could be raised in any coroutine, or in the event loop code itself, instead of specifically inside predict. canceling tasks works, but the asyncio.CancelledError only gets raised at the next await and cannot happen inside blocking C code. my best guess is some combination of a new Cancel event and keeping SIGUSR1.
- we probably need a mapping from prediction_id as used by the cancel endpoint to the id used by the worker. why not use the same id for both?
- I'm a little confused about why read_setup_events/read_predict_events were separate hypothesis rules that took an argument. I've removed the argument for now and it works, but I don't understand why it was there in the first place
- how do we want to test this stuff? can we parameterize hypothesis with a few different predictors?
- even without cancellation this is a pretty big PR, I would love suggestions for how to break it up into smaller chunks
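The point about task cancellation can be demonstrated with a small standalone script. This is only an illustration: `time.sleep` stands in for blocking C code, and a helper thread stands in for whatever delivers the cancel request while the prediction is busy.

```python
import asyncio
import threading
import time
from typing import List


async def predict(log: List[str], started: threading.Event) -> None:
    try:
        started.set()
        # Blocking call: the event loop cannot run, so a cancel request made
        # now cannot be delivered until this returns.
        time.sleep(0.05)
        log.append("blocking section finished")
        # First await after the cancel request: CancelledError is raised here.
        await asyncio.sleep(10)
        log.append("not reached")
    except asyncio.CancelledError:
        log.append("cancelled at await")
        raise


def request_cancel(loop, task, started: threading.Event) -> None:
    # Wait until predict() is about to block, then ask the loop to cancel it.
    started.wait()
    loop.call_soon_threadsafe(task.cancel)


async def demo() -> List[str]:
    log: List[str] = []
    started = threading.Event()
    loop = asyncio.get_running_loop()
    task = asyncio.create_task(predict(log, started))
    threading.Thread(target=request_cancel, args=(loop, task, started)).start()
    try:
        await task
    except asyncio.CancelledError:
        log.append("task cancelled")
    return log
```

The blocking section always runs to completion before the cancellation is observed, which is why task cancellation alone does not cover a predict() that spends its time in native code.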

yorickvP · 2023-12-06T14:59:55Z

python/cog/server/worker.py

+                trace("recv", event)
+            except asyncio.CancelledError:
+                return
+            if id == "LOG" and "SETUP" in self._mux.outs:


maybe check self._state here instead? then we can get rid of _mux.outs

* race utility for racing awaitables
* start mux, tag events with id, read pipe in a task, get events from mux
* use async pipe for async child loop
* _shutting_down vs _terminating
* race with shutdown event
* keep reading events during shutdown, but call terminate after the last Done
* emit heartbeats from mux.read
* don't use _wait. instead, setup reads event from the mux too
* worker semaphore and prediction ctx
* where _wait used to raise a fatal error, have _read_events set an error on Mux, and then Mux.read can raise the error in the right context. otherwise, the exception is stuck in a task and doesn't propagate correctly
* fix event loop errors for <3.9
* keep track of predictions in flight explicitly and use that to route logs
* don't wait for executor shutdown
* progress: check for cancelation in task done_handler
* let mux check if child is alive and set mux shutdown after leaving read event loop
* close pipe when exiting
* predict requires IDLE or PROCESSING
* try adding a BUSY state distinct from PROCESSING when we no longer have capacity
* move resetting events to setup() instead of _read_events()

  previously this was in _read_events because it's a coroutine that will have the correct event loop. however, _read_events actually gets created in a task, which can run *after* the first mux.read call by setup. since setup is now the first async entrypoint in worker and in tests, we can safely move it there
* state_from_predictions_in_flight instead of checking the value of semaphore
* make prediction_ctx "private"

Signed-off-by: technillogue <technillogue@gmail.com>
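The item about setting an error on the Mux, so that Mux.read raises it in the caller's context, could be sketched as below. This is an assumption about the general technique, not the merged code: it uses a single queue for brevity, and `FatalError`, `set_error` and the `failed` event are hypothetical names. The read races the queue against the failure event, in the spirit of the race utility in the first item.

```python
import asyncio
from typing import Any, Optional


class FatalError(Exception):
    """Illustrative stand-in for a fatal worker error."""


class Mux:
    def __init__(self) -> None:
        self.queue: "asyncio.Queue[Any]" = asyncio.Queue()
        self.fatal: Optional[BaseException] = None
        self.failed = asyncio.Event()

    def set_error(self, exc: BaseException) -> None:
        # Called from the reader task instead of raising there, where the
        # exception would stay stuck inside the task.
        self.fatal = exc
        self.failed.set()

    async def read(self) -> Any:
        # Race "next event" against "the reader reported a fatal error".
        get = asyncio.ensure_future(self.queue.get())
        failed = asyncio.ensure_future(self.failed.wait())
        done, pending = await asyncio.wait(
            {get, failed}, return_when=asyncio.FIRST_COMPLETED
        )
        for task in pending:
            task.cancel()
        if get in done:
            return get.result()
        assert self.fatal is not None
        raise self.fatal  # surfaces in the coroutine that called read()


async def read_events(mux: Mux) -> None:
    # Stand-in for _read_events noticing that the child process died.
    mux.set_error(FatalError("child process died"))


async def demo() -> str:
    mux = Mux()
    reader = asyncio.create_task(read_events(mux))
    try:
        await mux.read()
    except FatalError as exc:
        return f"caller saw: {exc}"
    finally:
        await reader
    return "no error"
```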

technillogue force-pushed the syl/mux branch 4 times, most recently from b835dbe to 41f93a1 Compare November 29, 2023 23:33

technillogue requested review from nickstenning and mattt November 30, 2023 19:50

technillogue force-pushed the syl/mux branch 6 times, most recently from 51a15b2 to a4dba69 Compare December 2, 2023 20:56

yorickvP reviewed Dec 4, 2023

python/cog/server/helpers.py Outdated

technillogue force-pushed the syl/mux branch 7 times, most recently from 6a4f27f to 688b152 Compare December 5, 2023 08:45

technillogue force-pushed the syl/mux branch 2 times, most recently from 59ee830 to 976fba8 Compare December 6, 2023 09:26

nickstenning reviewed Dec 6, 2023

python/tests/server/test_worker.py Outdated

nickstenning reviewed Dec 6, 2023

python/tests/server/test_worker.py Outdated

nickstenning reviewed Dec 6, 2023

python/tests/server/test_worker.py Outdated

yorickvP reviewed Dec 6, 2023

technillogue mentioned this pull request Dec 6, 2023

async worker event pipe #1410

Merged

technillogue force-pushed the syl/mux branch 2 times, most recently from 9073dc6 to 294d603 Compare December 7, 2023 23:40

technillogue added 9 commits February 2, 2024 14:33

- f8ccfd8 keep track of predictions in flight explicitly and use that to route logs
- b2a5fef don't wait for executor shutdown
- 3bd794f progress: check for cancelation in task done_handler
- b0a526b let mux check if child is alive and set mux shutdown after leaving read event loop
- 925fe5e close pipe when exiting
- c2a075d predict requires IDLE or PROCESSING
- 24bf187 idk, try adding a BUSY state distinct from PROCESSING when we no longer have capacity
- ec19c1e state_from_predictions_in_flight instead of checking the value of semaphore

technillogue force-pushed the syl/mux branch from 7559564 to ec19c1e Compare February 2, 2024 19:40

- 6f52b94 make prediction_ctx "private"

Each commit carries the trailer Signed-off-by: technillogue <technillogue@gmail.com>

yorickvP added the async label Feb 8, 2024

technillogue merged commit fb41455 into async Feb 12, 2024
11 checks passed

technillogue deleted the syl/mux branch February 12, 2024 21:09

This was referenced May 17, 2024

fix flaky runner test #1669

Merged

[async] Include prediction id upload request #1680

Closed

technillogue mentioned this pull request Jun 4, 2024

fix upload redirect handling #1714

Merged

technillogue mentioned this pull request Jun 19, 2024

async but refactored #1752

Closed

technillogue mentioned this pull request Jul 23, 2024

syl/fix setup shutdown bug #1819

Merged

aron mentioned this pull request Oct 17, 2024

[async] Support custom filename to be provided to URLFile #1997

Merged
