rwkv.cpp server #17
Comments
Hi @abetlen, thanks! Having an OpenAI-compatible inference server is indeed great and will definitely increase the usability of rwkv.cpp. Ideally, the server would be a single file; if it requires more structure, then I'm not sure... Having it inside a subdirectory may not work: last time I checked, Python does not like referencing .py files in subdirectories. Maybe in this case a pip package would be a better fit. Please also don't forget to check out the new quantization format.

I can also suggest implementing a state cache, like this. The idea is to cache states by a hash of the prompt string, so during the next inference call we don't need to go through the whole prompt again -- I can't overstate how slow it is to have long conversations with chatbots on CPU without a cache. You decide though, the cache is an additional complication after all :)
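A minimal sketch of what such a prompt-keyed state cache might look like (class and method names are illustrative, not from the repository):

```python
import hashlib

# Illustrative prompt-state cache: keyed by a hash of the prompt string,
# it stores the RWKV state so a repeated prompt skips re-processing.
class StateCache:
    def __init__(self):
        self._cache = {}

    @staticmethod
    def _key(prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt: str):
        # Returns a cached (state, logits) pair, or None on a miss.
        return self._cache.get(self._key(prompt))

    def put(self, prompt: str, state, logits) -> None:
        self._cache[self._key(prompt)] = (state, logits)
```

On a cache hit, inference can resume from the stored state instead of feeding the whole prompt through the model again.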
Thanks for the reply @saharNooby. I'll see about putting it into a single file; right now it depends on 3 packages: fastapi (framework), sse_starlette (handles server-sent events), and uvicorn (server). And thank you for sharing that cache implementation, I'll be sure to integrate it (something I'm actually working on for the llama.cpp server as well)!

I think a pip package would be very useful; if you need any help putting that together, I'd be happy to assist. I have one for my llama.cpp Python bindings, and the approach I took for the server is to distribute it as a subpackage. To handle the C library dependency I ended up using scikit-build, which has support for building native shared libraries. That way, when users do a pip install, it builds from source on their system, ensuring the proper optimisations are selected. Let me know and I can put together a PR or something to get you going in that regard, and I'll gladly share any bugfixes between the projects.

PS: Will definitely check out that new quantization format, thanks!
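For reference, a minimal sketch of what an OpenAI-style completion endpoint built on those three packages could look like (the request fields and the `generate_text` stub are illustrative, not the actual server code):

```python
# Minimal sketch of an OpenAI-style /v1/completions endpoint using
# fastapi and uvicorn; generate_text() is a stand-in for rwkv.cpp inference.
from fastapi import FastAPI
from pydantic import BaseModel
import uvicorn

app = FastAPI()

class CompletionRequest(BaseModel):
    prompt: str
    max_tokens: int = 128
    temperature: float = 0.8

def generate_text(prompt: str, max_tokens: int, temperature: float) -> str:
    # Stub: replace with a call into the rwkv.cpp Python bindings.
    return prompt

@app.post("/v1/completions")
def completions(req: CompletionRequest):
    text = generate_text(req.prompt, req.max_tokens, req.temperature)
    return {
        "object": "text_completion",
        "choices": [{"text": text, "index": 0, "finish_reason": "length"}],
    }

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```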
Have we merged this change yet?
@ss-zheng As far as I know, adding the server requires #21 to be merged, and it is not merged yet. But there are already OpenAI-compatible REST servers that support rwkv.cpp, like https://github.com/go-skynet/LocalAI
Great, thanks for pointing me to it!
I think it would be great to have a straight Python server rather than LocalAI's Docker build process.
So, I made this fork on my git server https://git.brz9.dev/ed/rwkv.cpp with an extra flask_server.py in rwkv/. It can be run as a standalone script and then queried over HTTP (see the sketch below).
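A rough sketch of what such a flask_server.py and a client call might look like (the /generate route, JSON fields, and port are illustrative assumptions, not necessarily the fork's actual interface):

```python
# Hypothetical sketch of a minimal Flask wrapper around rwkv.cpp inference;
# the /generate route and JSON payload shape are illustrative assumptions.
from flask import Flask, request, jsonify

app = Flask(__name__)

def run_rwkv(prompt: str, max_tokens: int) -> str:
    # Stub: replace with a call into the rwkv.cpp Python bindings.
    return prompt

@app.route("/generate", methods=["POST"])
def generate():
    payload = request.get_json(force=True)
    prompt = payload.get("prompt", "")
    max_tokens = int(payload.get("max_tokens", 64))
    return jsonify({"completion": run_rwkv(prompt, max_tokens)})

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=5000)
```

A client could then query it with a JSON POST to http://127.0.0.1:5000/generate, for example with curl or Python's requests.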
This is still a work in progress, but it would be a small addition to the codebase.
@saharNooby, thank you for your great work.
Hi @saharNooby, first off, amazing work on this repo. I've been looking for a CPU implementation of RWKV to experiment with the pre-trained models (I don't have a large GPU).
I've put together a basic port of my OpenAI-compatible web server from llama-cpp-python and tested it on Linux with your library and the RWKV Raven 3B model in f16, q4_0, and q4_1. I'm going to try some larger models this weekend to test performance and quality. The cool thing about exposing the model through this server is that it opens the project up to being connected to any OpenAI client (langchain, chat UIs, multi-language client libraries); see the example below.
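For instance, once such a server is running locally, an OpenAI client can simply be pointed at it; a sketch using the (pre-1.0) openai Python package, with an illustrative base URL and model name:

```python
# Sketch: pointing the openai Python client (pre-1.0 API) at a local
# OpenAI-compatible server; the base URL and model name are illustrative.
import openai

openai.api_base = "http://localhost:8000/v1"
openai.api_key = "sk-not-needed-for-a-local-server"

response = openai.Completion.create(
    model="rwkv-raven-3b",
    prompt="The quick brown fox",
    max_tokens=32,
)
print(response["choices"][0]["text"])
```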
Let me know if you want me to put up a PR to merge this in somewhere, and if so, the best place to put it.
Cheers, and again great work!