
add KV cache quantization options #1307

Merged (5 commits into abetlen:main on Apr 1, 2024)

Conversation

Limour-dev
Contributor

@ddh0
Contributor

ddh0 commented Apr 1, 2024

Would love for this to get merged!

@abetlen
Owner

abetlen commented Apr 1, 2024

Hey @Limour-dev, thanks for the contribution.

A couple of changes I made before merging:

  • Added ggml_type int enum values to llama_cpp.py and used these instead of the mapped string names for the KV quantization types.
  • Added support for the type_k and type_v options to the server as well (see the usage sketch below).
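For reference, a minimal usage sketch (not part of the PR itself), assuming the type_k / type_v keyword arguments on Llama and the GGML_TYPE_* integer constants added to llama_cpp.py; the model path is a placeholder:

```python
import llama_cpp

# Quantize the KV cache: K cache as q8_0, V cache left at f16
# (quantized V may require additional llama.cpp support, so f16 is the safer choice here).
llm = llama_cpp.Llama(
    model_path="./models/model.gguf",    # placeholder path
    n_ctx=4096,
    type_k=llama_cpp.GGML_TYPE_Q8_0,     # ggml_type for the K cache
    type_v=llama_cpp.GGML_TYPE_F16,      # ggml_type for the V cache
)

out = llm("Q: Name the planets in the solar system. A:", max_tokens=32)
print(out["choices"][0]["text"])
```

The server exposes the same two settings; the corresponding CLI flags (presumably --type_k and --type_v, taking the integer ggml_type values) are assumed from the server's ModelSettings field names rather than taken from the PR itself.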

@abetlen abetlen merged commit f165048 into abetlen:main Apr 1, 2024
16 checks passed
xhedit pushed a commit to xhedit/llama-cpp-conv that referenced this pull request Apr 6, 2024
* add KV cache quantization options

abetlen#1220
abetlen#1305

* Add ggml_type

* Use ggml_type instead of string for quantization

* Add server support

---------

Co-authored-by: Andrei Betlen <abetlen@gmail.com>
xhedit added a commit to xhedit/llama-cpp-conv that referenced this pull request Apr 6, 2024
* feat: add support for KV cache quantization options (abetlen#1307)

* add KV cache quantization options

abetlen#1220
abetlen#1305

* Add ggml_type

* Use ggml_type instead of string for quantization

* Add server support

---------

Co-authored-by: Andrei Betlen <abetlen@gmail.com>

* fix: Changed local API doc references to hosted (abetlen#1317)

* chore: Bump version

* fix: last tokens passing to sample_repetition_penalties function (abetlen#1295)

Co-authored-by: ymikhaylov <ymikhaylov@x5.ru>
Co-authored-by: Andrei <abetlen@gmail.com>

* feat: Update llama.cpp

* fix: segfault when logits_all=False. Closes abetlen#1319

* feat: Binary wheels for CPU, CUDA (12.1 - 12.3), Metal (abetlen#1247)

* Generate binary wheel index on release

* Add total release downloads badge

* Update download label

* Use official cibuildwheel action

* Add workflows to build CUDA and Metal wheels

* Update generate index workflow

* Update workflow name

* feat: Update llama.cpp

* chore: Bump version

* fix(ci): use correct script name

* docs: LLAMA_CUBLAS -> LLAMA_CUDA

* docs: Add docs explaining how to install pre-built wheels.

* docs: Rename cuBLAS section to CUDA

* fix(docs): incorrect tool_choice example (abetlen#1330)

* feat: Update llama.cpp

* fix: missing logprobs in response, incorrect response type for functionary, minor type issues. Closes abetlen#1328 abetlen#1314

* fix: missing logprobs in response, incorrect response type for functionary, minor type issues. Closes abetlen#1328 Closes abetlen#1314

* feat: Update llama.cpp

* fix: Always embed metal library. Closes abetlen#1332

* feat: Update llama.cpp

* chore: Bump version

---------

Co-authored-by: Limour <93720049+Limour-dev@users.noreply.github.com>
Co-authored-by: Andrei Betlen <abetlen@gmail.com>
Co-authored-by: lawfordp2017 <lawfordp@gmail.com>
Co-authored-by: Yuri Mikhailov <bitsharp@gmail.com>
Co-authored-by: ymikhaylov <ymikhaylov@x5.ru>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>