[need help] a simple python implementation of parallel.cpp #930
I need an HTTP API that supports continuous batching, so I decided to implement it myself.
I ran into some issues while trying to implement continuous batching with the low-level llama.cpp API exposed by this project, so I am posting my implementation here to ask for help.
I mainly referred to https://github.com/ggerganov/llama.cpp/blob/master/examples/parallel/parallel.cpp
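To frame the question, here is a rough sketch of the parallel.cpp loop I am trying to reproduce, written against the low-level llama-cpp-python bindings. It is not the full implementation: prompts are assumed to be already tokenized (the `llama_tokenize` signature differs between versions), sampling is plain greedy argmax, EOS handling is skipped, every sequence generates a fixed number of tokens, and finished slots are not reused for new requests the way real continuous batching would. Function names such as `llama_backend_init` have also changed across releases, so treat this as an assumption-laden outline rather than working reference code.

```python
import llama_cpp


def batch_add(batch, token, pos, seq_ids, want_logits):
    # mirror of llama_batch_add() in common.h: append one token to the batch
    i = batch.n_tokens
    batch.token[i] = token
    batch.pos[i] = pos
    batch.n_seq_id[i] = len(seq_ids)
    for j, seq in enumerate(seq_ids):
        batch.seq_id[i][j] = seq
    batch.logits[i] = int(want_logits)
    batch.n_tokens += 1


def generate_parallel(model_path, prompts, n_predict=32, n_ctx=4096, n_batch=512):
    # prompts: already-tokenized prompts (lists of token ids), one per "client"
    llama_cpp.llama_backend_init(False)  # note: newer bindings take no argument here
    model = llama_cpp.llama_load_model_from_file(
        model_path.encode("utf-8"), llama_cpp.llama_model_default_params()
    )
    cparams = llama_cpp.llama_context_default_params()
    cparams.n_ctx = n_ctx
    cparams.n_batch = n_batch
    ctx = llama_cpp.llama_new_context_with_model(model, cparams)
    n_vocab = llama_cpp.llama_n_vocab(model)

    n_seq = len(prompts)
    batch = llama_cpp.llama_batch_init(n_batch, 0, n_seq)

    # prompt phase: every client's prompt goes into one batch under its own seq_id;
    # in this simplified sketch all prompt tokens together must fit into n_batch
    pos = [0] * n_seq          # next KV position per sequence
    i_batch = [0] * n_seq      # batch index that holds each sequence's logits
    batch.n_tokens = 0
    for s, prompt in enumerate(prompts):
        for k, tok in enumerate(prompt):
            batch_add(batch, tok, pos[s], [s], k == len(prompt) - 1)
            pos[s] += 1
        i_batch[s] = batch.n_tokens - 1

    generated = [[] for _ in range(n_seq)]
    for _ in range(n_predict):
        if llama_cpp.llama_decode(ctx, batch) != 0:
            raise RuntimeError("llama_decode failed (batch too large for n_batch/n_ctx?)")
        # greedy argmax per sequence, just to keep the sketch short (no real sampler)
        next_tokens = []
        for s in range(n_seq):
            logits = llama_cpp.llama_get_logits_ith(ctx, i_batch[s])
            next_tokens.append(max(range(n_vocab), key=lambda t: logits[t]))
        # decode phase: refill the batch with exactly one new token per sequence
        batch.n_tokens = 0
        for s, tok in enumerate(next_tokens):
            generated[s].append(tok)
            i_batch[s] = batch.n_tokens
            batch_add(batch, tok, pos[s], [s], True)
            pos[s] += 1

    # drop each sequence's KV cells and free native resources explicitly
    for s in range(n_seq):
        llama_cpp.llama_kv_cache_seq_rm(ctx, s, -1, -1)
    llama_cpp.llama_batch_free(batch)
    llama_cpp.llama_free(ctx)
    llama_cpp.llama_free_model(model)
    llama_cpp.llama_backend_free()
    return generated
```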
I am experiencing a possible memory leak when performing continuous batch processing with large contexts and batches.
I have raised two separate issues, one in this repository (llama.cpp) and another in llama-cpp-python, to provide more information about the problem.
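While chasing the leak, the first thing I am double-checking is the explicit teardown, since with a large context and batch even a few KV cells or batches that are never released add up quickly. The cleanup I mean looks roughly like the sketch below; the two helper names are mine, not part of llama-cpp-python, and `-1, -1` asks `llama_kv_cache_seq_rm` to clear a sequence's whole position range.

```python
import llama_cpp


def finish_sequence(ctx, seq_id):
    # hypothetical helper: drop every KV cache cell owned by a finished client,
    # so long-running continuous batching does not keep dead sequences around
    llama_cpp.llama_kv_cache_seq_rm(ctx, seq_id, -1, -1)


def shutdown(ctx, model, batch):
    # hypothetical helper: release native memory explicitly instead of relying on GC
    llama_cpp.llama_batch_free(batch)
    llama_cpp.llama_free(ctx)
    llama_cpp.llama_free_model(model)
    llama_cpp.llama_backend_free()
```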
Feel free to point out any errors, and I will fix them as soon as possible.
Notice:
this demo does not support grammar, terminal args, or prompt files