server: main loop blocked, server stuck #5851
Comments
I would prefer to go with option #3:
I still have no idea how to deal with context self-extend. Can kv_cache_seq_* be called concurrently? The main loop should only be responsible for llama_decode. Metrics/health might access a read-only slots state updated concurrently by each slot. I would prefer to get rid of the tasks queue at the main loop level, as that approach is designed for non-blocking ops. But first, all of this should be well tested. We still lack multimodal tests.
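For illustration, here is a rough sketch (hypothetical names, not the actual server code) of what "metrics/health access a read-only slots state updated concurrently by each slot" could look like: each slot owner publishes a small snapshot, and the HTTP handlers read it without ever going through the main loop's task queue.

```cpp
#include <array>
#include <atomic>
#include <cstdio>

enum class slot_state { idle, processing_prompt, generating };

struct slot_snapshot {
    slot_state state     = slot_state::idle;
    int        n_past    = 0;  // tokens currently in the KV cache for this slot
    int        n_decoded = 0;  // tokens generated so far
};

constexpr int N_SLOTS = 4;

// Written only by the thread/code that owns the slot, read by /health and /metrics.
// For a struct this size std::atomic may fall back to an internal lock, which is
// still fine here: readers never block the main decode loop.
static std::array<std::atomic<slot_snapshot>, N_SLOTS> g_slots{};

// Called by the slot's processing code after each step.
static void publish_slot(int id, slot_snapshot snap) {
    g_slots[id].store(snap, std::memory_order_release);
}

// Called from the HTTP thread: never waits on the main loop.
static int handle_health() {
    int busy = 0;
    for (const auto & s : g_slots) {
        if (s.load(std::memory_order_acquire).state != slot_state::idle) {
            busy++;
        }
    }
    std::printf("{\"status\":\"ok\",\"slots_busy\":%d}\n", busy);
    return busy;
}

int main() {
    publish_slot(0, {slot_state::generating, 128, 16});
    handle_health();  // answers without queueing a task on the main loop
}
```

The tradeoff discussed in the next comment still applies: such a snapshot can be slightly stale by the time the client reads the response.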
We can think about the
Only the backend can use the
I'm not really sure that the current implementation is problematic. Yes, the response to a health check can get blocked and delayed, but when the client eventually receives the response, they know it is correct. Whereas if the response were asynchronous in some way, the information in it would be immediately outdated. Consider:
So the current implementation has the advantage that health and metrics provide up-to-date information, but the disadvantage of blocking requests. I'm looking for ways to avoid adding more mutexes and locking, since things can easily get complicated this way. So I'm wondering if we really need non-blocking requests to the health and metrics endpoints.
The proposal is clear enough for me, but I prefer the first option because the high-level llamax is something we already discussed quite a long time ago (but it kinda got lost in time). I see a real need for such a library, because it will also ease the development of other examples.
One thing I'm not sure about in your 3rd option though: which call currently takes more than 1 millisecond to process? I think our current blocking point is
The health and metrics endpoint is mostly blocked by
Btw, one trick to have
In my personal implementation, I break the prompt into chunks of 20 tokens each. The inference is done on an old CPU, so the performance loss is negligible.
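A minimal sketch of that chunking trick, assuming the llama.cpp C API as it existed around this time (in particular the llama_batch_get_one(tokens, n_tokens, pos_0, seq_id) form, which has since changed); the function name and default chunk size are illustrative, not taken from the implementation mentioned above.

```cpp
#include "llama.h"

#include <algorithm>
#include <vector>

// Evaluate a long prompt in small chunks so the caller can do other work
// between llama_decode calls instead of blocking on one huge batch.
// Returns false if llama_decode fails.
static bool eval_prompt_chunked(llama_context * ctx, std::vector<llama_token> & prompt,
                                int & n_past, int n_chunk = 20) {
    for (int i = 0; i < (int) prompt.size(); i += n_chunk) {
        const int n_eval = std::min(n_chunk, (int) prompt.size() - i);

        // NOTE: llama_batch_get_one's signature has changed across versions;
        // this assumes the (tokens, n_tokens, pos_0, seq_id) form.
        llama_batch batch = llama_batch_get_one(prompt.data() + i, n_eval, n_past, 0);

        if (llama_decode(ctx, batch) != 0) {
            return false;
        }
        n_past += n_eval;

        // Between chunks the server could run other slots, answer /health, etc.
        // (the real server builds one shared batch across slots instead).
    }
    return true;
}
```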
Thanks for your explanation. Understood and agreed for health. But is it normal that prompt & image processing and context self-extend for one slot block the main loop? Can't they be done concurrently?
It can - it's all about how the
We can improve this at some point, but atm I don't think it is a huge issue.
Closing as it works as expected.
Context
Calls to the following functions block the main loop, and the server gets stuck for all slots / requests, in the update_slots method:
Global:
Per slot:
This happens if the prompt is big enough, or when self extend or continuous batching is enabled.
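To make the failure mode concrete, here is a deliberately simplified, hypothetical sketch of the loop shape (not the real update_slots, which builds one shared batch across slots): any slow per-slot step runs inline on the main thread, so it delays the shared decode and anything queued behind it, including health/metrics tasks.

```cpp
#include <chrono>
#include <thread>
#include <vector>

struct server_slot {
    bool has_new_prompt = false;
};

// Stand-in for a blocking per-slot step: long prompt tokenization, image
// embedding, or a context self-extend pass.
static void process_prompt(server_slot & slot) {
    std::this_thread::sleep_for(std::chrono::seconds(2));  // pretend this is slow
    slot.has_new_prompt = false;
}

// Stand-in for the single shared llama_decode over all slots.
static void decode_shared_batch(std::vector<server_slot> &) {
    std::this_thread::sleep_for(std::chrono::milliseconds(30));
}

int main() {
    std::vector<server_slot> slots(4);
    slots[0].has_new_prompt = true;  // one slot receives a big prompt

    for (int iter = 0; iter < 3; ++iter) {
        for (auto & slot : slots) {
            if (slot.has_new_prompt) {
                process_prompt(slot);   // blocks *all* slots for ~2 s
            }
        }
        decode_shared_batch(slots);     // other slots only progress here
        // a task queue drained here is also stuck behind process_prompt()
    }
}
```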
Proposal
We need to separate slot state management and token retrieval from slot processing, while keeping one batch for the whole server.
First, it should be well tested and reproducible in the server test framework, in a slow test with a real prompt and model (as in the passkey example).
I see 3 options:
n_slots
which will call all these functions asynchronously.
@ggerganov @ngxson please confirm the list of blocking methods, and which ones must be thread-safe (I mean only in the main loop).
I am willing to implement option 2 or 3; assign the issue back to me if you are OK with it.
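Not one of the numbered options above (their descriptions were partly lost here), but as a rough, hypothetical illustration of "doing the slow per-slot step concurrently": the work is started asynchronously and the slot is simply skipped by the shared batch until its result is ready. Only CPU-side work (tokenization, image embedding) is assumed safe to move off-thread; anything touching the KV cache still depends on the concurrency question raised above.

```cpp
#include <chrono>
#include <future>
#include <optional>
#include <vector>

struct prompt_result {
    std::vector<int> tokens;  // stand-in for the processed prompt
};

struct server_slot {
    std::optional<std::future<prompt_result>> pending;  // in-flight prompt work
    bool ready_for_decode = false;
};

// Stand-in for the slow, CPU-side part of prompt handling.
static prompt_result process_prompt_slow() {
    return {std::vector<int>(4096, 0)};
}

static void update_slot(server_slot & slot) {
    if (slot.pending) {
        // Non-blocking poll: if the work is not finished, leave this slot out
        // of the current batch instead of stalling the main loop.
        if (slot.pending->wait_for(std::chrono::seconds(0)) == std::future_status::ready) {
            prompt_result res = slot.pending->get();
            (void) res;  // would be appended to the shared batch here
            slot.pending.reset();
            slot.ready_for_decode = true;
        }
        return;
    }
    if (!slot.ready_for_decode) {
        slot.pending = std::async(std::launch::async, process_prompt_slow);
    }
}

int main() {
    server_slot slot;
    while (!slot.ready_for_decode) {
        update_slot(slot);  // the main loop keeps iterating (and could decode
                            // other slots) while the prompt work runs
    }
}
```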