
Segmentation fault (core dumped) - 0.2.58 #1319

Closed · anakin87 opened this issue Apr 2, 2024 · 11 comments
Labels: bug (Something isn't working), high-priority

anakin87 commented Apr 2, 2024

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

I'm trying to create a completion using a GGUF model

Current Behavior

from llama_cpp import Llama

model = Llama(model_path="./tests/models/openchat-3.5-1210.Q3_K_S.gguf",  n_ctx=128, n_batch=128)

questions_and_answers = [
    ("What's the capital of France?", "Paris"),
    ("What is the capital of Canada?", "Ottawa"),
    ("What is the capital of Ghana?", "Accra"),
]

for i, (question, answer) in enumerate(questions_and_answers):
    prompt = f"GPT4 Correct User: Answer in a single word. {question} <|end_of_turn|>\n GPT4 Correct Assistant:"
    result = model.create_completion(prompt=prompt)  # segfaults here on 0.2.58

    print(i)
    print(result)

Segmentation fault (core dumped) on 0.2.58
(works well on 0.2.57)
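
Until a fixed release is out, pinning to the previous version is a straightforward workaround (a plain pip pin, assuming a standard pip-managed environment):

pip install "llama-cpp-python==0.2.57"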

Environment and Context

Failure Information (for bugs)

Segmentation fault (core dumped)

@riedgar-ms

We have seen this issue in the guidance project.

anakin87 commented Apr 2, 2024

Thanks!
Related: guidance-ai/guidance#735

abetlen added the bug and high-priority labels on Apr 2, 2024
sepcnt commented Apr 3, 2024

Here is a rough bug trace. It seems that any context longer than a certain boundary causes the crash.

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA A100-PCIE-40GB, compute capability 8.0, VMM: yes
Generating Responses:   0%|                                                                                                                                                       | 0/96 [00:00<?, ?it/s]Fatal Python error: Segmentation fault

Thread 0x00007f8851089700 (most recent call first):
  File "~/miniforge3/lib/python3.10/threading.py", line 324 in wait
  File "~/miniforge3/lib/python3.10/threading.py", line 607 in wait
  File "~/miniforge3/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run
  File "~/miniforge3/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "~/miniforge3/lib/python3.10/threading.py", line 973 in _bootstrap

Current thread 0x00007f8a06f84740 (most recent call first):
  File "~/miniforge3/lib/python3.10/site-packages/guidance/models/llama_cpp/_llama_cpp.py", line 165 in get_logits
  File "~/miniforge3/lib/python3.10/site-packages/guidance/models/_model.py", line 624 in __call__
  File "~/miniforge3/lib/python3.10/site-packages/guidance/models/_model.py", line 1188 in _run_stateless
  File "~/miniforge3/lib/python3.10/site-packages/guidance/models/_model.py", line 992 in __add__
  File "~/miniforge3/lib/python3.10/site-packages/guidance/models/_model.py", line 984 in __add__
  File "~/workspace/models/inference.py", line 100 in predict


Extension modules: (total: 224)
Segmentation fault (core dumped)

Probably relevant to ggerganov/llama.cpp#6017 (comment)
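
Because the segfault takes the whole interpreter down, narrowing down the prompt length that first triggers it has to happen across processes. A rough bisection sketch along those lines; try_prompt.py is a hypothetical helper that builds a prompt of roughly N tokens and calls create_completion once:

import subprocess
import sys

def crashes(n_tokens: int) -> bool:
    # Run the repro in a child process so a SIGSEGV only kills the child.
    proc = subprocess.run([sys.executable, "try_prompt.py", str(n_tokens)])
    return proc.returncode < 0  # negative return code = killed by a signal (POSIX)

# Assumes the crash is monotone in prompt length and that the upper bound crashes.
lo, hi = 1, 128  # search within the configured n_ctx
while lo < hi:
    mid = (lo + hi) // 2
    if crashes(mid):
        hi = mid
    else:
        lo = mid + 1
print(f"smallest prompt length that crashes: {lo}")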

abetlen commented Apr 3, 2024

I'll work on a fix once I get a chance to repro. In the meantime, let me share the recipe for debugging this (on Linux, at least).

# install the package in debug mode to maintain symbols (may need to add additional cmake flags for specific backends)
python3 -m pip install \
  --verbose \
  --config-settings cmake.args='-DCMAKE_BUILD_TYPE=Debug;-DCMAKE_CXX_FLAGS=-g3;-DCMAKE_C_FLAGS=-g3' \
  --config-settings cmake.verbose=true \
  --config-settings logging.level=INFO \
  --config-settings install.strip=false \
  --editable .
# run test script with gdb
gdb --args python3 test_script.py
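
Once the crash reproduces under gdb, the usual next step (plain gdb usage, nothing specific to this project) is to grab the native backtrace with bt full and check the worker threads with info threads; the Python-level trace alone won't show the faulting llama.cpp frame:

(gdb) run
...Thread 1 "python" received signal SIGSEGV, Segmentation fault.
(gdb) bt full
(gdb) info threads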

anakin87 commented Apr 3, 2024

(gdb) run try_llama.py 
Starting program: /home/anakin87/apps/haystack-core-integrations/integrations/llama_cpp/.hatch/llama-cpp-haystack/bin/python try_llama.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff2dff640 (LWP 25295)]
[New Thread 0x7ffff25fe640 (LWP 25296)]
[New Thread 0x7fffeddfd640 (LWP 25297)]
[New Thread 0x7fffed5fc640 (LWP 25298)]
[New Thread 0x7fffeadfb640 (LWP 25299)]
[New Thread 0x7fffe65fa640 (LWP 25300)]
[New Thread 0x7fffe3df9640 (LWP 25301)]
[New Thread 0x7fffe35f8640 (LWP 25302)]
[New Thread 0x7fffe0df7640 (LWP 25303)]
[New Thread 0x7fffdc5f6640 (LWP 25304)]
[New Thread 0x7fffdbdf5640 (LWP 25305)]
[New Thread 0x7fffd75f4640 (LWP 25306)]
[New Thread 0x7fffd4df3640 (LWP 25307)]
[New Thread 0x7fffd25f2640 (LWP 25308)]
[New Thread 0x7fffcfdf1640 (LWP 25309)]
[New Thread 0x7fffcd5f0640 (LWP 25310)]
[New Thread 0x7fffcadef640 (LWP 25311)]
[New Thread 0x7fffca5ee640 (LWP 25312)]
[New Thread 0x7fffc7ded640 (LWP 25313)]
llama_model_loader: loaded meta data with 23 key-value pairs and 291 tensors from ./tests/models/openchat-3.5-1210.Q3_K_S.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = openchat_openchat-3.5-1210
llama_model_loader: - kv   2:                       llama.context_length u32              = 8192
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 11
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32002]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32002]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32002]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  19:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  20:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  21:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv  22:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q3_K:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 261/32002 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32002
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q3_K - Small
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 2.95 GiB (3.50 BPW) 
llm_load_print_meta: general.name     = openchat_openchat-3.5-1210
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|end_of_turn|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors:        CPU buffer size =  3017.28 MiB
.................................................................................................
llama_new_context_with_model: n_ctx      = 128
llama_new_context_with_model: n_batch    = 128
llama_new_context_with_model: n_ubatch   = 128
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =    16.00 MiB
llama_new_context_with_model: KV self size  =   16.00 MiB, K (f16):    8.00 MiB, V (f16):    8.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
llama_new_context_with_model:        CPU compute buffer size =    20.06 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 1
AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | 
Model metadata: {'tokenizer.chat_template': "{{ bos_token }}{% for message in messages %}{{ 'GPT4 Correct ' + message['role'].title() + ': ' + message['content'] + '<|end_of_turn|>'}}{% endfor %}{% if add_generation_prompt %}{{ 'GPT4 Correct Assistant:' }}{% endif %}", 'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.ggml.padding_token_id': '0', 'tokenizer.ggml.eos_token_id': '32000', 'general.architecture': 'llama', 'llama.rope.freq_base': '10000.000000', 'llama.context_length': '8192', 'general.name': 'openchat_openchat-3.5-1210', 'tokenizer.ggml.add_bos_token': 'true', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '1', 'llama.attention.head_count': '32', 'llama.block_count': '32', 'llama.attention.head_count_kv': '8', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.file_type': '11'}
Using gguf chat template: {{ bos_token }}{% for message in messages %}{{ 'GPT4 Correct ' + message['role'].title() + ': ' + message['content'] + '<|end_of_turn|>'}}{% endfor %}{% if add_generation_prompt %}{{ 'GPT4 Correct Assistant:' }}{% endif %}
Using chat eos_token: <|end_of_turn|>
Using chat bos_token: <s>
[New Thread 0x7fff01f3a640 (LWP 25314)]
[New Thread 0x7fff01739640 (LWP 25315)]
[New Thread 0x7fff00f38640 (LWP 25316)]
[New Thread 0x7fff00737640 (LWP 25317)]
[New Thread 0x7ffefff36640 (LWP 25318)]
[New Thread 0x7ffeff735640 (LWP 25319)]
[New Thread 0x7ffefef34640 (LWP 25320)]
[New Thread 0x7ffefe733640 (LWP 25321)]
[New Thread 0x7ffefdf32640 (LWP 25322)]
[Thread 0x7ffefdf32640 (LWP 25322) exited]
[Thread 0x7ffefe733640 (LWP 25321) exited]
[Thread 0x7ffefef34640 (LWP 25320) exited]
[Thread 0x7ffeff735640 (LWP 25319) exited]
[Thread 0x7ffefff36640 (LWP 25318) exited]
[Thread 0x7fff00737640 (LWP 25317) exited]
[Thread 0x7fff00f38640 (LWP 25316) exited]
[Thread 0x7fff01739640 (LWP 25315) exited]
[Thread 0x7fff01f3a640 (LWP 25314) exited]

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007ffff7ba7d83 in ?? () from /usr/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so

abetlen commented Apr 3, 2024

@anakin87 ah okay, I think I see what it is. Can you set Llama(..., logits_all=True) in your test? Almost certain that's it; I'll work on a fix.
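
For anyone hitting this before the fix, a minimal sketch of the suggested workaround applied to the repro above (same local test model path as in the original report; logits_all is the only change, and it can be dropped once a fixed release ships):

from llama_cpp import Llama

model = Llama(
    model_path="./tests/models/openchat-3.5-1210.Q3_K_S.gguf",  # same test model as the repro
    n_ctx=128,
    n_batch=128,
    logits_all=True,  # workaround for the 0.2.58 segfault; costs extra memory/compute
)

result = model.create_completion(
    prompt="GPT4 Correct User: Answer in a single word. What's the capital of France? <|end_of_turn|>\n GPT4 Correct Assistant:"
)
print(result["choices"][0]["text"])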

abetlen commented Apr 3, 2024

Long-term I'll set up some tests that use Qwen1.5 0.5B or some other small model to smoke test for issues like this.
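
A rough sketch of what such a smoke test could look like; the model path and the SMOKE_TEST_MODEL environment variable are assumptions for illustration, not part of the repository's test suite:

import os

import pytest
from llama_cpp import Llama

# Hypothetical location of a small GGUF model (e.g. a Qwen1.5 0.5B quant) kept only for smoke tests.
MODEL_PATH = os.environ.get("SMOKE_TEST_MODEL", "./tests/models/qwen1_5-0_5b-chat-q4_0.gguf")

@pytest.mark.skipif(not os.path.exists(MODEL_PATH), reason="smoke-test model not available")
def test_completion_does_not_crash():
    # Default settings (logits_all=False) are exactly the configuration that segfaulted in 0.2.58.
    llm = Llama(model_path=MODEL_PATH, n_ctx=128, n_batch=128, verbose=False)
    for prompt in ("What's the capital of France?", "What is the capital of Canada?"):
        result = llm.create_completion(prompt=prompt, max_tokens=8)
        assert result["choices"][0]["text"] is not None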

anakin87 commented Apr 3, 2024

@abetlen setting logits_all=True fixes the problem.

abetlen commented Apr 3, 2024

@anakin87 should be fixed now in v0.2.59

@riedgar-ms

@abetlen I still seem to be seeing this (or a related error) when I try upgrading to llama-cpp-python==0.2.59.

On Windows (and Python 3.12), our tests that try to use a Llama model are failing. Sample output:

>                   sampling_order = torch.multinomial(probs_torch, len(probs_torch)).cpu().numpy()
E                   RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

guidance\models\_model.py:258: RuntimeError
------------------------------------------------------------------------------------------------------------- Captured stderr setup -------------------------------------------------------------------------------------------------------------
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from C:\Users\riedgar\.cache\huggingface\hub\models--TheBloke--Llama-2-7B-GGUF\snapshots\b4e04e128f421c93a5f1e34ac4d7ca9b0af47b80\llama-2-7b.Q5_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 17
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q5_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 4.45 GiB (5.68 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors:        CPU buffer size =  4560.87 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
llama_new_context_with_model:        CPU compute buffer size =    70.50 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 1
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 |
Model metadata: {'general.name': 'LLaMA v2', 'general.architecture': 'llama', 'llama.context_length': '4096', 'llama.rope.dimension_count': '128', 'llama.embedding_length': '4096', 'llama.block_count': '32', 'llama.feed_forward_length': '11008', 'llama.attention.head_count': '32', 'tokenizer.ggml.eos_token_id': '2', 'general.file_type': '17', 'llama.attention.head_count_kv': '32', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'tokenizer.ggml.model': 'llama', 'general.quantization_version': '2', 'tokenizer.ggml.bos_token_id': '1', 'tokenizer.ggml.unknown_token_id': '0'}
Using fallback chat format: None

A similar error when fetching the logits appears in our macOS builds (and I'm just waiting to see if the Ubuntu build also fails).

This continues to work with v0.2.57.
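
On 0.2.59 the failure surfaces as a malformed probability tensor rather than a hard crash, so a defensive check just before sampling makes the root cause easier to spot. This is only an illustrative torch sketch, not guidance's actual sampling code:

import torch

def checked_multinomial(probs: torch.Tensor, num_samples: int) -> torch.Tensor:
    # Fail with a descriptive message instead of torch's generic RuntimeError
    # when the backend hands back non-finite or negative probabilities.
    if not torch.isfinite(probs).all() or (probs < 0).any():
        raise ValueError(
            "malformed probability tensor (inf/nan/negative values); "
            "check the llama-cpp-python version and the logits_all setting"
        )
    return torch.multinomial(probs, num_samples)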

rsoika commented Apr 4, 2024

I think I have the same issue (see #1326) and setting logits_all=True fixes the problem...

xhedit pushed a commit to xhedit/llama-cpp-conv that referenced this issue Apr 6, 2024
xhedit added a commit to xhedit/llama-cpp-conv that referenced this issue Apr 6, 2024
* feat: add support for KV cache quantization options (abetlen#1307)

* add KV cache quantization options

abetlen#1220
abetlen#1305

* Add ggml_type

* Use ggml_type instead of string for quantization

* Add server support

---------

Co-authored-by: Andrei Betlen <abetlen@gmail.com>

* fix: Changed local API doc references to hosted (abetlen#1317)

* chore: Bump version

* fix: last tokens passing to sample_repetition_penalties function (abetlen#1295)

Co-authored-by: ymikhaylov <ymikhaylov@x5.ru>
Co-authored-by: Andrei <abetlen@gmail.com>

* feat: Update llama.cpp

* fix: segfault when logits_all=False. Closes abetlen#1319

* feat: Binary wheels for CPU, CUDA (12.1 - 12.3), Metal (abetlen#1247)

* Generate binary wheel index on release

* Add total release downloads badge

* Update download label

* Use official cibuildwheel action

* Add workflows to build CUDA and Metal wheels

* Update generate index workflow

* Update workflow name

* feat: Update llama.cpp

* chore: Bump version

* fix(ci): use correct script name

* docs: LLAMA_CUBLAS -> LLAMA_CUDA

* docs: Add docs explaining how to install pre-built wheels.

* docs: Rename cuBLAS section to CUDA

* fix(docs): incorrect tool_choice example (abetlen#1330)

* feat: Update llama.cpp

* fix: missing logprobs in response, incorrect response type for functionary, minor type issues. Closes abetlen#1328 abetlen#1314

* fix: missing logprobs in response, incorrect response type for functionary, minor type issues. Closes abetlen#1328 Closes abetlen#1314

* feat: Update llama.cpp

* fix: Always embed metal library. Closes abetlen#1332

* feat: Update llama.cpp

* chore: Bump version

---------

Co-authored-by: Limour <93720049+Limour-dev@users.noreply.github.com>
Co-authored-by: Andrei Betlen <abetlen@gmail.com>
Co-authored-by: lawfordp2017 <lawfordp@gmail.com>
Co-authored-by: Yuri Mikhailov <bitsharp@gmail.com>
Co-authored-by: ymikhaylov <ymikhaylov@x5.ru>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>