
Segmentation fault (core dumped) - 0.2.58 #1319

Closed · anakin87 opened this issue Apr 2, 2024 · 11 comments
Labels: bug (Something isn't working), high-priority

anakin87 commented Apr 2, 2024

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

I'm trying to create a completion using a GGUF model

Current Behavior

from llama_cpp import Llama

model = Llama(model_path="./tests/models/openchat-3.5-1210.Q3_K_S.gguf",  n_ctx=128, n_batch=128)

questions_and_answers = [
    ("What's the capital of France?", "Paris"),
    ("What is the capital of Canada?", "Ottawa"),
    ("What is the capital of Ghana?", "Accra"),
]

for i, (question, answer) in enumerate(questions_and_answers):
    prompt = f"GPT4 Correct User: Answer in a single word. {question} <|end_of_turn|>\n GPT4 Correct Assistant:"
    result = model.create_completion(prompt=prompt)  # segfaults here on 0.2.58

    print(i)
    print(result)

Segmentation fault (core dumped) on 0.2.58
(works well on 0.2.57)
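
Until a fixed release is out, pinning to the previous version is a straightforward workaround (a plain pip pin, assuming a standard pip-managed environment):

pip install "llama-cpp-python==0.2.57"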

Environment and Context

Failure Information (for bugs)

Segmentation fault (core dumped)

@riedgar-ms

We have seen this issue in the guidance project.

anakin87 commented Apr 2, 2024

Thanks!
Related: guidance-ai/guidance#735

abetlen added the bug and high-priority labels on Apr 2, 2024
sepcnt commented Apr 3, 2024

Here is a rough bug trace. It seems that any context longer than a certain boundary causes the crash.

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA A100-PCIE-40GB, compute capability 8.0, VMM: yes
Generating Responses:   0%|                                                                                                                                                       | 0/96 [00:00<?, ?it/s]Fatal Python error: Segmentation fault

Thread 0x00007f8851089700 (most recent call first):
  File "~/miniforge3/lib/python3.10/threading.py", line 324 in wait
  File "~/miniforge3/lib/python3.10/threading.py", line 607 in wait
  File "~/miniforge3/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run
  File "~/miniforge3/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "~/miniforge3/lib/python3.10/threading.py", line 973 in _bootstrap

Current thread 0x00007f8a06f84740 (most recent call first):
  File "~/miniforge3/lib/python3.10/site-packages/guidance/models/llama_cpp/_llama_cpp.py", line 165 in get_logits
  File "~/miniforge3/lib/python3.10/site-packages/guidance/models/_model.py", line 624 in __call__
  File "~/miniforge3/lib/python3.10/site-packages/guidance/models/_model.py", line 1188 in _run_stateless
  File "~/miniforge3/lib/python3.10/site-packages/guidance/models/_model.py", line 992 in __add__
  File "~/miniforge3/lib/python3.10/site-packages/guidance/models/_model.py", line 984 in __add__
  File "~/workspace/models/inference.py", line 100 in predict


Extension modules: (total: 224)
Segmentation fault (core dumped)

Probably relevant to ggerganov/llama.cpp#6017 (comment)
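
Because the segfault takes the whole interpreter down, narrowing down the prompt length that first triggers it has to happen across processes. A rough bisection sketch along those lines; try_prompt.py is a hypothetical helper that builds a prompt of roughly N tokens and calls create_completion once:

import subprocess
import sys

def crashes(n_tokens: int) -> bool:
    # Run the repro in a child process so a SIGSEGV only kills the child.
    proc = subprocess.run([sys.executable, "try_prompt.py", str(n_tokens)])
    return proc.returncode < 0  # negative return code = killed by a signal (POSIX)

# Assumes the crash is monotone in prompt length and that the upper bound crashes.
lo, hi = 1, 128  # search within the configured n_ctx
while lo < hi:
    mid = (lo + hi) // 2
    if crashes(mid):
        hi = mid
    else:
        lo = mid + 1
print(f"smallest prompt length that crashes: {lo}")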

abetlen commented Apr 3, 2024

I'll work on a fix once I get a chance to repro. In the meantime, let me share the recipe for debugging this (on Linux, at least).

# install the package in debug mode to maintain symbols (may need to add additional cmake flags for specific backends)
python3 -m pip install \
  --verbose \
  --config-settings cmake.args='-DCMAKE_BUILD_TYPE=Debug;-DCMAKE_CXX_FLAGS=-g3;-DCMAKE_C_FLAGS=-g3' \
  --config-settings cmake.verbose=true \
  --config-settings logging.level=INFO \
  --config-settings install.strip=false \
  --editable .
# run test script with gdb
gdb --args python3 test_script.py
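
Once the crash reproduces under gdb, the usual next step (plain gdb usage, nothing specific to this project) is to grab the native backtrace with bt full and check the worker threads with info threads; the Python-level trace alone won't show the faulting llama.cpp frame:

(gdb) run
...Thread 1 "python" received signal SIGSEGV, Segmentation fault.
(gdb) bt full
(gdb) info threads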

anakin87 commented Apr 3, 2024

(gdb) run try_llama.py 
Starting program: /home/anakin87/apps/haystack-core-integrations/integrations/llama_cpp/.hatch/llama-cpp-haystack/bin/python try_llama.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff2dff640 (LWP 25295)]
[New Thread 0x7ffff25fe640 (LWP 25296)]
[New Thread 0x7fffeddfd640 (LWP 25297)]
[New Thread 0x7fffed5fc640 (LWP 25298)]
[New Thread 0x7fffeadfb640 (LWP 25299)]
[New Thread 0x7fffe65fa640 (LWP 25300)]
[New Thread 0x7fffe3df9640 (LWP 25301)]
[New Thread 0x7fffe35f8640 (LWP 25302)]
[New Thread 0x7fffe0df7640 (LWP 25303)]
[New Thread 0x7fffdc5f6640 (LWP 25304)]
[New Thread 0x7fffdbdf5640 (LWP 25305)]
[New Thread 0x7fffd75f4640 (LWP 25306)]
[New Thread 0x7fffd4df3640 (LWP 25307)]
[New Thread 0x7fffd25f2640 (LWP 25308)]
[New Thread 0x7fffcfdf1640 (LWP 25309)]
[New Thread 0x7fffcd5f0640 (LWP 25310)]
[New Thread 0x7fffcadef640 (LWP 25311)]
[New Thread 0x7fffca5ee640 (LWP 25312)]
[New Thread 0x7fffc7ded640 (LWP 25313)]
llama_model_loader: loaded meta data with 23 key-value pairs and 291 tensors from ./tests/models/openchat-3.5-1210.Q3_K_S.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = openchat_openchat-3.5-1210
llama_model_loader: - kv   2:                       llama.context_length u32              = 8192
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 11
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32002]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32002]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32002]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  19:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  20:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  21:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv  22:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q3_K:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 261/32002 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32002
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q3_K - Small
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 2.95 GiB (3.50 BPW) 
llm_load_print_meta: general.name     = openchat_openchat-3.5-1210
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|end_of_turn|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors:        CPU buffer size =  3017.28 MiB
.................................................................................................
llama_new_context_with_model: n_ctx      = 128
llama_new_context_with_model: n_batch    = 128
llama_new_context_with_model: n_ubatch   = 128
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =    16.00 MiB
llama_new_context_with_model: KV self size  =   16.00 MiB, K (f16):    8.00 MiB, V (f16):    8.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
llama_new_context_with_model:        CPU compute buffer size =    20.06 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 1
AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | 
Model metadata: {'tokenizer.chat_template': "{{ bos_token }}{% for message in messages %}{{ 'GPT4 Correct ' + message['role'].title() + ': ' + message['content'] + '<|end_of_turn|>'}}{% endfor %}{% if add_generation_prompt %}{{ 'GPT4 Correct Assistant:' }}{% endif %}", 'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.ggml.padding_token_id': '0', 'tokenizer.ggml.eos_token_id': '32000', 'general.architecture': 'llama', 'llama.rope.freq_base': '10000.000000', 'llama.context_length': '8192', 'general.name': 'openchat_openchat-3.5-1210', 'tokenizer.ggml.add_bos_token': 'true', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '1', 'llama.attention.head_count': '32', 'llama.block_count': '32', 'llama.attention.head_count_kv': '8', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.file_type': '11'}
Using gguf chat template: {{ bos_token }}{% for message in messages %}{{ 'GPT4 Correct ' + message['role'].title() + ': ' + message['content'] + '<|end_of_turn|>'}}{% endfor %}{% if add_generation_prompt %}{{ 'GPT4 Correct Assistant:' }}{% endif %}
Using chat eos_token: <|end_of_turn|>
Using chat bos_token: <s>
[New Thread 0x7fff01f3a640 (LWP 25314)]
[New Thread 0x7fff01739640 (LWP 25315)]
[New Thread 0x7fff00f38640 (LWP 25316)]
[New Thread 0x7fff00737640 (LWP 25317)]
[New Thread 0x7ffefff36640 (LWP 25318)]
[New Thread 0x7ffeff735640 (LWP 25319)]
[New Thread 0x7ffefef34640 (LWP 25320)]
[New Thread 0x7ffefe733640 (LWP 25321)]
[New Thread 0x7ffefdf32640 (LWP 25322)]
[Thread 0x7ffefdf32640 (LWP 25322) exited]
[Thread 0x7ffefe733640 (LWP 25321) exited]
[Thread 0x7ffefef34640 (LWP 25320) exited]
[Thread 0x7ffeff735640 (LWP 25319) exited]
[Thread 0x7ffefff36640 (LWP 25318) exited]
[Thread 0x7fff00737640 (LWP 25317) exited]
[Thread 0x7fff00f38640 (LWP 25316) exited]
[Thread 0x7fff01739640 (LWP 25315) exited]
[Thread 0x7fff01f3a640 (LWP 25314) exited]

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007ffff7ba7d83 in ?? () from /usr/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so

abetlen commented Apr 3, 2024

@anakin87 ah okay, I think I see what it is. Can you set Llama(..., logits_all=True) in your test? Almost certain that's it; I'll work on a fix.
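
For anyone hitting this before the fix, a minimal sketch of the suggested workaround applied to the repro above (same local test model path as in the original report; logits_all is the only change, and it can be dropped once a fixed release ships):

from llama_cpp import Llama

model = Llama(
    model_path="./tests/models/openchat-3.5-1210.Q3_K_S.gguf",  # same test model as the repro
    n_ctx=128,
    n_batch=128,
    logits_all=True,  # workaround for the 0.2.58 segfault; costs extra memory/compute
)

result = model.create_completion(
    prompt="GPT4 Correct User: Answer in a single word. What's the capital of France? <|end_of_turn|>\n GPT4 Correct Assistant:"
)
print(result["choices"][0]["text"])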

abetlen commented Apr 3, 2024

Long-term I'll set up some tests that use Qwen1.5 0.5B or some other small model to smoke test for issues like this.
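
A rough sketch of what such a smoke test could look like; the model path and the SMOKE_TEST_MODEL environment variable are assumptions for illustration, not part of the repository's test suite:

import os

import pytest
from llama_cpp import Llama

# Hypothetical location of a small GGUF model (e.g. a Qwen1.5 0.5B quant) kept only for smoke tests.
MODEL_PATH = os.environ.get("SMOKE_TEST_MODEL", "./tests/models/qwen1_5-0_5b-chat-q4_0.gguf")

@pytest.mark.skipif(not os.path.exists(MODEL_PATH), reason="smoke-test model not available")
def test_completion_does_not_crash():
    # Default settings (logits_all=False) are exactly the configuration that segfaulted in 0.2.58.
    llm = Llama(model_path=MODEL_PATH, n_ctx=128, n_batch=128, verbose=False)
    for prompt in ("What's the capital of France?", "What is the capital of Canada?"):
        result = llm.create_completion(prompt=prompt, max_tokens=8)
        assert result["choices"][0]["text"] is not None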

anakin87 commented Apr 3, 2024

@abetlen setting logits_all=True fixes the problem.

abetlen commented Apr 3, 2024

@anakin87 should be fixed now in v0.2.59

@riedgar-ms

@abetlen I still seem to be seeing this (or a related error) when I try upgrading to llama-cpp-python==0.2.59.

On Windows (and Python 3.12), our tests that try to use a Llama model are failing. Sample output:

>                   sampling_order = torch.multinomial(probs_torch, len(probs_torch)).cpu().numpy()
E                   RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

guidance\models\_model.py:258: RuntimeError
------------------------------------------------------------------------------------------------------------- Captured stderr setup -------------------------------------------------------------------------------------------------------------
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from C:\Users\riedgar\.cache\huggingface\hub\models--TheBloke--Llama-2-7B-GGUF\snapshots\b4e04e128f421c93a5f1e34ac4d7ca9b0af47b80\llama-2-7b.Q5_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 17
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q5_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 4.45 GiB (5.68 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors:        CPU buffer size =  4560.87 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
llama_new_context_with_model:        CPU compute buffer size =    70.50 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 1
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 |
Model metadata: {'general.name': 'LLaMA v2', 'general.architecture': 'llama', 'llama.context_length': '4096', 'llama.rope.dimension_count': '128', 'llama.embedding_length': '4096', 'llama.block_count': '32', 'llama.feed_forward_length': '11008', 'llama.attention.head_count': '32', 'tokenizer.ggml.eos_token_id': '2', 'general.file_type': '17', 'llama.attention.head_count_kv': '32', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'tokenizer.ggml.model': 'llama', 'general.quantization_version': '2', 'tokenizer.ggml.bos_token_id': '1', 'tokenizer.ggml.unknown_token_id': '0'}
Using fallback chat format: None

A similar error when fetching the logits appears in our macOS builds (and I'm just waiting to see if the Ubuntu build also fails).

This continues to work with v0.2.57.
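
On 0.2.59 the failure surfaces as a malformed probability tensor rather than a hard crash, so a defensive check just before sampling makes the root cause easier to spot. This is only an illustrative torch sketch, not guidance's actual sampling code:

import torch

def checked_multinomial(probs: torch.Tensor, num_samples: int) -> torch.Tensor:
    # Fail with a descriptive message instead of torch's generic RuntimeError
    # when the backend hands back non-finite or negative probabilities.
    if not torch.isfinite(probs).all() or (probs < 0).any():
        raise ValueError(
            "malformed probability tensor (inf/nan/negative values); "
            "check the llama-cpp-python version and the logits_all setting"
        )
    return torch.multinomial(probs, num_samples)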

rsoika commented Apr 4, 2024

I think I have the same issue (see #1326) and setting logits_all=True fixes the problem...

xhedit pushed a commit to xhedit/llama-cpp-conv that referenced this issue Apr 6, 2024
xhedit added a commit to xhedit/llama-cpp-conv that referenced this issue Apr 6, 2024
* feat: add support for KV cache quantization options (abetlen#1307)

* add KV cache quantization options

abetlen#1220
abetlen#1305

* Add ggml_type

* Use ggml_type instead of string for quantization

* Add server support

---------

Co-authored-by: Andrei Betlen <abetlen@gmail.com>

* fix: Changed local API doc references to hosted (abetlen#1317)

* chore: Bump version

* fix: last tokens passing to sample_repetition_penalties function (abetlen#1295)

Co-authored-by: ymikhaylov <ymikhaylov@x5.ru>
Co-authored-by: Andrei <abetlen@gmail.com>

* feat: Update llama.cpp

* fix: segfault when logits_all=False. Closes abetlen#1319

* feat: Binary wheels for CPU, CUDA (12.1 - 12.3), Metal (abetlen#1247)

* Generate binary wheel index on release

* Add total release downloads badge

* Update download label

* Use official cibuildwheel action

* Add workflows to build CUDA and Metal wheels

* Update generate index workflow

* Update workflow name

* feat: Update llama.cpp

* chore: Bump version

* fix(ci): use correct script name

* docs: LLAMA_CUBLAS -> LLAMA_CUDA

* docs: Add docs explaining how to install pre-built wheels.

* docs: Rename cuBLAS section to CUDA

* fix(docs): incorrect tool_choice example (abetlen#1330)

* feat: Update llama.cpp

* fix: missing logprobs in response, incorrect response type for functionary, minor type issues. Closes abetlen#1328 abetlen#1314

* fix: missing logprobs in response, incorrect response type for functionary, minor type issues. Closes abetlen#1328 Closes abetlen#1314

* feat: Update llama.cpp

* fix: Always embed metal library. Closes abetlen#1332

* feat: Update llama.cpp

* chore: Bump version

---------

Co-authored-by: Limour <93720049+Limour-dev@users.noreply.github.com>
Co-authored-by: Andrei Betlen <abetlen@gmail.com>
Co-authored-by: lawfordp2017 <lawfordp@gmail.com>
Co-authored-by: Yuri Mikhailov <bitsharp@gmail.com>
Co-authored-by: ymikhaylov <ymikhaylov@x5.ru>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>