Experiments with the llama.cpp internal API.

This is a low-level binding for llama.cpp compiled to WASM. It exposes low-level APIs such as (de)tokenization, embeddings, and more.
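As a rough illustration of what such a low-level API can look like from JavaScript, here is a hypothetical sketch. The loader name, module shape, and every function below (`loadWllama`, `tokenize`, `detokenize`, `embeddings`) are assumptions for illustration only; the real entry points are defined in wasm/main.js.

```js
// Hypothetical sketch only -- these names are NOT the real exports.
// The actual API is defined by the build output and wasm/main.js.
import { loadWllama } from './wasm/main.js'; // assumed loader

const wllama = await loadWllama('./build/wllama.wasm'); // assumed shape

// Tokenization: text -> array of token ids (assumed signature)
const tokens = wllama.tokenize('Hello, world!');

// Detokenization: token ids -> text (assumed signature)
const text = wllama.detokenize(tokens);

// Embeddings: text -> Float32Array of embedding-dimension size (assumed)
const emb = wllama.embeddings('Hello, world!');
console.log(tokens, text, emb.length);
```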
## How to build
```sh
# requires docker compose to be installed
cd wasm
./build.sh
# the output binary can be found in /build/wllama.wasm
```
## How to use
See wasm/main.html and wasm/main.js.

Due to the CORS limitation, to try the demo, please install http-server and run `cd wasm && http-server . -c-1`.
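Independently of the project's own glue code, the sketch below shows the minimal browser mechanism for fetching and instantiating a .wasm binary. The import object is a placeholder assumption, and the real loading logic in wasm/main.js may differ (for example, an Emscripten-generated loader handles imports and memory itself).

```js
// Minimal sketch: stream-compile and instantiate a .wasm file.
// The import object below is an assumption; a real llama.cpp build
// needs whatever imports its glue code supplies (memory, syscalls, ...).
const { instance } = await WebAssembly.instantiateStreaming(
  fetch('./wllama.wasm'),      // path relative to the served directory
  { env: {} }                  // placeholder imports
);
console.log(Object.keys(instance.exports)); // inspect exported symbols
```

Note that `WebAssembly.instantiateStreaming` requires the file to be served over HTTP with the `application/wasm` MIME type, which is another reason to use a local server such as http-server instead of opening main.html directly from `file://`.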
## TODO
- Support multiple sequences: given the resource limitations of WASM, I don't think having multiple sequences is a good idea.
- Multi-modal: awaiting the refactoring of the LLaVA implementation in llama.cpp.
- Experiment on how to save state to disk as Q4_K, while keeping inference in f16 (to build, run `./build.sh`).