Get N items from Channel #562
Hey Donald! It's certainly doable, but... could you back up and give some context on what you're trying to do and how you concluded that this will help do it?
So I'm basically processing events that are being fed from one system, doing some processing on them, then sending them off to another system. The problem is when I send them to this other system, I need to send them in batches (because the number of incoming events is reasonably significant, and this other system doesn't want that many single API calls, it wants them batched up). So I have one task that is accepting incoming events, processing them, and then throwing them into a `trio.Queue`. I can do something like:

```python
async def sender(q):
    to_send = []
    while True:
        with trio.move_on_after(30):
            to_send.append(await q.get())
            if len(to_send) < 10000:
                continue
        # TODO: Do the actual sending logic
        to_send = []
```

That works, though it introduces a second queue (but at least this second queue is bounded still), but the big problem with this is if you have multiple instances of `sender`, since each instance ends up sitting on its own partial batch. Another option to get around it is, instead of having the sending tasks directly processing the `trio.Queue` item by item, to give the queue a `get_batch` method so the senders can do:

```python
async def sender(q):
    while True:
        with trio.move_on_after(30):
            to_send = await q.get_batch(10000)
        # TODO: Do the actual sending logic.
```

I think that provides a much cleaner API than the other options provide.
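For what it's worth, `get_batch` doesn't exist today, so here is a minimal sketch of one possible semantics for it, written as a free function over the current API (only `get`, `get_nowait`, and `trio.WouldBlock` are real `trio.Queue` features; everything else is illustrative):

```python
import trio

async def get_batch(q, n):
    # One possible reading of the proposed method: block until at least one
    # item is available, then greedily drain whatever else is already queued,
    # up to n items total, without waiting for stragglers.
    batch = [await q.get()]
    while len(batch) < n:
        try:
            batch.append(q.get_nowait())
        except trio.WouldBlock:
            break
    return batch
```

A block-until-exactly-n variant is also conceivable, but it would interact awkwardly with the surrounding `move_on_after(30)`: if the timeout fires mid-collection, the cancellation discards `get_batch`'s partially gathered list, so the API would need some way to hand a partial batch back.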
You can solve the problem easily if you move the batching logic to the producer side of the queue.

Adding a task that queues partial batches after a timeout is left as an exercise to the reader. ;-)
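A rough sketch of that arrangement (the two-queue layout and every name here are my own illustration, not code from this thread; `in_q` and `out_q` are assumed to be `trio.Queue`-style objects with `get`/`put`):

```python
import trio

async def batcher(in_q, out_q, max_items=10000, max_wait=30):
    # A single batching task sits between the receivers and the senders: it is
    # the only consumer of in_q, and it emits whole batches on out_q. Any
    # number of sender tasks can then take one complete batch per get(), so
    # nobody ends up holding a partial batch.
    while True:
        batch = [await in_q.get()]  # wait (indefinitely) for a batch to start
        with trio.move_on_after(max_wait):
            while len(batch) < max_items:
                batch.append(await in_q.get())
        # Reached with either a full batch or whatever arrived within
        # max_wait; batch survives the timeout because it lives outside
        # the cancel scope.
        await out_q.put(batch)
```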
@dstufft Okay, so... let's see if I have this right. You have a bunch of receivers who take in events from some external sources (a bunch of TCP connections). You want to batch up these events, and then once every X seconds or Y events – whichever comes first – you want to submit a whole batch to some other service. The simplest way to do this would be:

```python
async def sender(q):
    while True:
        batch = []
        # Note that we want to put a limit on the whole batch collection
        # process, so we put the loop inside move_on_after
        with move_on_after(X):
            while len(batch) < Y:
                batch.append(await q.get())
        await send_batch(batch)
```

You also asked about what happens if you have multiple `sender` tasks. The simplest fix is to serialize them with a lock:

```python
async def sender(q, sender_lock):
    while True:
        async with sender_lock:
            batch = []
            with move_on_after(X):
                while len(batch) < Y:
                    batch.append(await q.get())
            await send_batch(batch)
```

So only one `sender` collects and submits a batch at a time.

But let's back up a moment. I'm assuming the reason you want multiple senders is because you want to increase robustness. I think there are two things we would want to watch out for: something going wonky with the batch collection, and something going wonky with the actual sending. Having multiple `sender` tasks serialized behind a lock doesn't really protect against either. So maybe you want something like:

```python
# A robust version of send_batch: we can kick this off, and then it will take care
# of itself, including making sure not to stall out forever
async def send_batch(batch):
    # let's do some retries in case the first try fails
    for _ in range(3):
        try:
            # And if an attempt stalls, treat that as a failure and go to the retry path
            with fail_after(10):
                ...
        except ...:
            ...
        else:
            # success!
            return
    # ...I guess all the tries failed. What should we do with the data?
    # log something and drop it on the floor, I guess?

async def sender(q):
    async with open_nursery() as nursery:
        while True:
            batch = []
            with move_on_after(X):
                while len(batch) < Y:
                    batch.append(await q.get())
            nursery.start_soon(send_batch, batch)
```

So here we might end up with several calls to `send_batch` running at once, but each one takes care of itself.

Given that the actual goal of this batching is to avoid overloading the service you're ultimately sending it to, I wonder if it would be better to drop the maximum batch size limit entirely, and just say "we'll send out a request every X seconds, with whatever we've gathered". That slightly increases how much buffering you're doing, but presumably you can hold, like, 1 or 5 or 10 seconds worth of data in memory, and losing 1 or 5 or 10 seconds of data to a server crash isn't appreciably worse than losing 0.5 seconds or whatever it would be with the smaller batches.

If you do it like that, then you can make this even simpler -- you don't even need a queue:

```python
async def receiver(data_to_send):
    while True:
        data_to_send.append(await get_another_item())

async def sender(data_to_send):
    async with open_nursery() as nursery:
        while True:
            await sleep(X)
            batch = data_to_send.copy()
            data_to_send.clear()
            nursery.start_soon(send_batch, batch)

async def main():
    async with open_nursery() as nursery:
        data_to_send = []
        nursery.start_soon(receiver, data_to_send)
        nursery.start_soon(sender, data_to_send)
```

What do you think? (And returning to the original topic of this issue: I don't think any of my suggestions here would actually benefit from a `get_batch` method.)
Well, that approach could work. The primary benefit I saw in the `get_batch` method is that it keeps all of the buffered items in the one bounded queue, instead of spreading them across extra lists and queues.

Compare this to something like the `data_to_send` list above, which is a second, unbounded buffer. Ultimately though, it's the same sort of thing (I think) that caused trio to have cancel scopes instead of individual timeouts to particular function calls. If you want to buffer at most N items, you have to decide how you're going to split that across all of the various queues/lists, just like with timeouts you'd have to decide how to divvy up your "timeout budget" across multiple calls.
@dstufft Ah, you're totally right of course, I was getting thrown by my own forays into batching-up-logs-to-submit-to-long-term-storage, where for the service I was targeting the main limit was requests/second, not items/request. So my intuition was wrong. But yeah, I was curious so I looked at the BigQuery docs (cough assuming hypothetically that this unnamed project might want to submit things to BigQuery), and I see the same thing you do: they don't care at all about how many insert requests/second you do, as long as each one isn't too big.

Do you want some kind of "if we haven't put together a full batch after X seconds, then go ahead and submit a partial batch" logic? I'm trying to figure out what exactly the `move_on_after(30)` in your example is supposed to accomplish.
Updating the name to reflect the change to use Channels. Not sure if this is still feasible to think about adding; keeping it open for now as a marker to think about it.
It would be super useful to be able to get multiple items from a `trio.Queue` object at once, ideally with a few options to tune whether you want to get exactly N items (and block until N items arrive), or whether you're happy with less than that (and, if you block, something with timeouts to be able to take what you've gotten so far).
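Since the issue has been renamed to refer to channels, here is a minimal sketch of how those two modes might look as a helper over today's memory channels (`receive_batch` itself is hypothetical; `receive`, `receive_nowait`, and `trio.WouldBlock` are real APIs):

```python
import trio

async def receive_batch(receive_channel, n, exact=False):
    # Hypothetical helper for the requested semantics:
    #   exact=True  -> keep blocking until exactly n items have been received
    #   exact=False -> block only for the first item, then take whatever is
    #                  already buffered, up to n items total
    batch = [await receive_channel.receive()]
    while len(batch) < n:
        if exact:
            batch.append(await receive_channel.receive())
        else:
            try:
                batch.append(receive_channel.receive_nowait())
            except trio.WouldBlock:
                break
    return batch
```

The timeout variant is the awkward one: wrapping the `exact=True` call in `trio.move_on_after(...)` cancels it cleanly, but the partially gathered batch is discarded, so a built-in version would need to decide how to hand partial results back.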