
add KV cache quantization options #1307

Merged (5 commits into abetlen:main on Apr 1, 2024)

Conversation

Limour-dev
Contributor

@ddh0
Contributor

ddh0 commented Apr 1, 2024

Would love for this to get merged!

@abetlen
Owner

abetlen commented Apr 1, 2024

Hey @Limour-dev, thanks for the contribution.

A couple of changes I made before merging:

  • Added ggml_type int enum values to llama_cpp.py and used these instead of the mapped string names for the KV quantization types.
  • Added support for the type_k and type_v options to the server as well (see the usage sketch below).
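For reference, a minimal usage sketch (not part of the PR itself), assuming the type_k / type_v keyword arguments on Llama and the GGML_TYPE_* integer constants added to llama_cpp.py; the model path is a placeholder:

```python
import llama_cpp

# Quantize the KV cache: K cache as q8_0, V cache left at f16
# (quantized V may require additional llama.cpp support, so f16 is the safer choice here).
llm = llama_cpp.Llama(
    model_path="./models/model.gguf",    # placeholder path
    n_ctx=4096,
    type_k=llama_cpp.GGML_TYPE_Q8_0,     # ggml_type for the K cache
    type_v=llama_cpp.GGML_TYPE_F16,      # ggml_type for the V cache
)

out = llm("Q: Name the planets in the solar system. A:", max_tokens=32)
print(out["choices"][0]["text"])
```

The server exposes the same two settings; the corresponding CLI flags (presumably --type_k and --type_v, taking the integer ggml_type values) are assumed from the server's ModelSettings field names rather than taken from the PR itself.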

@abetlen abetlen merged commit f165048 into abetlen:main Apr 1, 2024
16 checks passed
xhedit pushed a commit to xhedit/llama-cpp-conv that referenced this pull request Apr 6, 2024
* add KV cache quantization options

abetlen#1220
abetlen#1305

* Add ggml_type

* Use ggml_type instead of string for quantization

* Add server support

---------

Co-authored-by: Andrei Betlen <abetlen@gmail.com>
xhedit added a commit to xhedit/llama-cpp-conv that referenced this pull request Apr 6, 2024
* feat: add support for KV cache quantization options (abetlen#1307)

* add KV cache quantization options

abetlen#1220
abetlen#1305

* Add ggml_type

* Use ggml_type instead of string for quantization

* Add server support

---------

Co-authored-by: Andrei Betlen <abetlen@gmail.com>

* fix: Changed local API doc references to hosted (abetlen#1317)

* chore: Bump version

* fix: last tokens passing to sample_repetition_penalties function (abetlen#1295)

Co-authored-by: ymikhaylov <ymikhaylov@x5.ru>
Co-authored-by: Andrei <abetlen@gmail.com>

* feat: Update llama.cpp

* fix: segfault when logits_all=False. Closes abetlen#1319

* feat: Binary wheels for CPU, CUDA (12.1 - 12.3), Metal (abetlen#1247)

* Generate binary wheel index on release

* Add total release downloads badge

* Update download label

* Use official cibuildwheel action

* Add workflows to build CUDA and Metal wheels

* Update generate index workflow

* Update workflow name

* feat: Update llama.cpp

* chore: Bump version

* fix(ci): use correct script name

* docs: LLAMA_CUBLAS -> LLAMA_CUDA

* docs: Add docs explaining how to install pre-built wheels.

* docs: Rename cuBLAS section to CUDA

* fix(docs): incorrect tool_choice example (abetlen#1330)

* feat: Update llama.cpp

* fix: missing logprobs in response, incorrect response type for functionary, minor type issues. Closes abetlen#1328 abetlen#1314

* fix: missing logprobs in response, incorrect response type for functionary, minor type issues. Closes abetlen#1328 Closes abetlen#1314

* feat: Update llama.cpp

* fix: Always embed metal library. Closes abetlen#1332

* feat: Update llama.cpp

* chore: Bump version

---------

Co-authored-by: Limour <93720049+Limour-dev@users.noreply.github.com>
Co-authored-by: Andrei Betlen <abetlen@gmail.com>
Co-authored-by: lawfordp2017 <lawfordp@gmail.com>
Co-authored-by: Yuri Mikhailov <bitsharp@gmail.com>
Co-authored-by: ymikhaylov <ymikhaylov@x5.ru>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>