
Segmentation fault #5655

Closed
TruongGiangBT opened this issue Feb 22, 2024 · 14 comments · Fixed by #5699
Labels: bug (Something isn't working), server/webui

Comments

@TruongGiangBT

TruongGiangBT commented Feb 22, 2024

I apologize for the inconvenience. I am deploying a server with the following parameters: -cb -v --embedding -np 3 -c 8192 --host "0.0.0.0" -ngl 64. When I perform multiple embedding requests, a segmentation fault occurs. I noticed that the fault happens whenever 2 slots perform the embedding task simultaneously. I hope a fix can be found soon. Thank you very much.

Originally posted by @TruongGiangBT in #3876 (comment)

phymbert added a commit to phymbert/llama.cpp that referenced this issue Feb 23, 2024
phymbert added a commit that referenced this issue Feb 24, 2024
* server: tests: init scenarios
 - health and slots endpoints
 - completion endpoint
 - OAI compatible chat completion requests w/ and without streaming
 - completion multi users scenario
 - multi users scenario on OAI compatible endpoint with streaming
 - multi users with total number of tokens to predict exceeds the KV Cache size
 - server wrong usage scenario, like in Infinite loop of "context shift" #3969
 - slots shifting
 - continuous batching
 - embeddings endpoint
 - multi users embedding endpoint: Segmentation fault #5655
 - OpenAI-compatible embeddings API
 - tokenize endpoint
 - CORS and api key scenario

* server: CI GitHub workflow


---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@phymbert
Collaborator

Added in: issues.feature

@ggerganov @ngxson On it 👍

ggerganov added the bug (Something isn't working) label Feb 24, 2024
phymbert added a commit that referenced this issue Feb 24, 2024
…t request.

server: tests: add multi users embeddings as fixed
phymbert added a commit that referenced this issue Feb 24, 2024
#5699)

* server: #5655 - continue to update other slots on embedding concurrent request.

* server: tests: add multi users embeddings as fixed

* server: tests: adding OAI compatible embedding concurrent endpoint

* server: tests: adding OAI compatible embedding with multiple inputs
@phymbert
Collaborator

phymbert commented Feb 24, 2024

@TruongGiangBT Please confirm that you no longer see the error. Feel free to reopen if it still occurs.

@TruongGiangBT
Author

@phymbert The segmentation fault has been fixed. However, when I execute multiple requests with the same input simultaneously, the embedding results differ.
My run command:
docker run -dp 6900:8080 \
  --gpus '"device=1"' \
  --name vistral_llm \
  --restart=always \
  -v /home/gianght/llm_models/:/models mistral-wrapped-llama.cpp:full-cuda \
  --server -m /models/ggml-vistral-7B-chat-q4_1.gguf \
  -cb -v --embedding \
  -np 8 \
  -c 32768 \
  --host "0.0.0.0" \
  -ngl 64 \
  -b 4096
Test code:

(screenshot of the test code; a runnable version appears later in the thread)

@phymbert
Collaborator

Could you please open a dedicated discussion on the embedding feature?

@TruongGiangBT
Author

TruongGiangBT commented Feb 26, 2024

llama.cpp: (screenshot of the proposed change)

server.cpp: (screenshot of the proposed change)

What do you think of fixing it like this?
I have tried it. With the same input, if the tokens of the sequence fit entirely within one batch during decoding, the embedding result is identical. If the tokens of that slot are decoded across multiple batches, the results differ, but the cosine similarity with the correct embedding result is > 0.999.
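
For reference, here is a minimal script to quantify that similarity across concurrent identical requests. It is only a sketch based on the test code above; it assumes the server started with the docker command above is reachable at http://127.0.0.1:6900 and that numpy is installed:

import asyncio
import requests
import numpy as np

async def requests_post_async(*args, **kwargs):
    return await asyncio.to_thread(requests.post, *args, **kwargs)

async def main():
    model_url = "http://127.0.0.1:6900"
    # send the same content in several concurrent requests
    responses = await asyncio.gather(*[requests_post_async(
        url=f"{model_url}/embedding",
        json={"content": "0" * 1024}
    ) for _ in range(8)])

    embeddings = [np.array(r.json()["embedding"]) for r in responses]
    ref = embeddings[0]
    for i, emb in enumerate(embeddings[1:], start=1):
        cos = np.dot(ref, emb) / (np.linalg.norm(ref) * np.linalg.norm(emb))
        print(f"cosine(request 0, request {i}) = {cos:.6f}")

asyncio.run(main())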

@phymbert
Collaborator

@ngxson Hi, any idea on the matter, please?

@ngxson
Collaborator

ngxson commented Feb 26, 2024

It's true that there's a bug on the line that @TruongGiangBT pointed out: llama_get_embeddings does not account for the fact that we now have multiple sequences in a batch.

You see different results because you're using -np 8, which allows 8 sequences per batch, while llama_get_embeddings always returns the embeddings of the first sequence.

llama_get_embeddings_ith should be correct in this case, though I've never used it, so I don't know what the correct argument is.

It's better to check with @ggerganov, I think.

@ngxson
Collaborator

ngxson commented Feb 26, 2024

@TruongGiangBT Can you open a PR with the changes you proposed?

ngxson reopened this Feb 26, 2024
@TruongGiangBT
Author

My changes only help me correctly retrieve the embedding results of multiple sequences in a decode batch; they also change the result of the embedding function. The crux of this issue is that we need to determine the index of the last token of each sequence, while also having a reasonable plan for storing the per-sequence embedding results into ctx->embedding.
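
As a purely conceptual sketch of that indexing (illustrative Python, not the llama.cpp code; the names and shapes below are invented for the example): given the flat per-token output of one decode batch and the sequence id of each batch position, each sequence's embedding should be read at the index of its last token rather than always at index 0.

import numpy as np

# Illustrative only: one decode batch containing tokens from three sequences.
# batch_seq_ids[i] is the sequence id of batch position i;
# outputs[i] is the vector produced for batch position i.
batch_seq_ids = [0, 0, 0, 1, 1, 2, 2, 2, 2]
n_embd = 4
outputs = np.random.rand(len(batch_seq_ids), n_embd)

# Record the last batch position seen for each sequence ...
last_token_idx = {}
for i, seq_id in enumerate(batch_seq_ids):
    last_token_idx[seq_id] = i

# ... and read each sequence's embedding at that position, instead of
# always reading position 0 (which returns sequence 0's data for everyone).
per_seq_embedding = {seq_id: outputs[i] for seq_id, i in last_token_idx.items()}
for seq_id, emb in sorted(per_seq_embedding.items()):
    print(seq_id, emb[:2])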

@phymbert
Collaborator

Can you add a scenario in issues.feature? It will really help to trace your issue.

@ggerganov
Owner

If the tokens of that slot are decoded multiple times, the results will be different, but the cosine similarity with the correct embedding result is > 0.999.

I think this is expected due to the way the KV cache works, but I need to verify before explaining. If you could provide some basic instructions / commands to run this scenario, it would be of great help. Otherwise, I have to invest time creating curl queries and/or bash scripts to match what you are doing.

@TruongGiangBT
Author

@ggerganov I also think it might be due to the way the KV cache operates. Although the results are somewhat different, they can be acceptable. As discussed above, I encountered a problem with extracting embeddings and have temporarily fixed it (I shared the change above). But it only works for my requirements; we need to find a general solution.

Docker command:
docker run -dp 6900:8080 --gpus '"device=1"' --name vistral_llm -v /home/gianght/llm_models/:/models mistral-wrapped-llama.cpp:full-cuda --server -m /models/ggml-vistral-7B-chat-q4_1.gguf -cb -v --embedding -np 8 -c 32768 -ngl 64
Jupyter notebook test code:
import asyncio
import requests

async def requests_post_async(*args, **kwargs):
return await asyncio.to_thread(requests.post, *args, **kwargs)

model_url = "http://127.0.0.1:6900"
responses: list[requests.Response] = await asyncio.gather(*[requests_post_async(
url= f"{model_url}/embedding",
json= {"content": "0"*1024}
) for i in range(8)])
for response in responses:
embedding = response.json()["embedding"]
print(embedding[-4:])

Thank you

@ggerganov
Owner

Ok thanks. I was willing to look more into this, but you are not making it easy for me.

I copied this code into a test.py file and tried to indent it:

import asyncio
import requests

async def requests_post_async(*args, **kwargs):
    return await asyncio.to_thread(requests.post, *args, **kwargs)

model_url = "http://127.0.0.1:6900"
responses: list[requests.Response] = await asyncio.gather(*[requests_post_async(
    url= f"{model_url}/embedding",
    json= {"content": "0"*1024}
    ) for i in range(8)])

for response in responses:
    embedding = response.json()["embedding"]
    print(embedding[-4:])

However I get an error:

$ ▶ python3 test.py 
  File "/Users/ggerganov/development/github/llama.cpp/test.py", line 8
    responses: list[requests.Response] = await asyncio.gather(*[requests_post_async(
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
SyntaxError: 'await' outside function

I have no idea what the issue is and don't want to spend time fixing it.

Providing simple repro instructions goes a long way toward helping maintainers help you.

@TruongGiangBT
Author

Oh, I am sorry. I am using a Jupyter notebook. Here is the code for a .py file:

import asyncio
import requests

async def requests_post_async(*args, **kwargs):
    return await asyncio.to_thread(requests.post, *args, **kwargs)

async def main():
    model_url = "http://127.0.0.1:6900"
    responses: list[requests.Response] = await asyncio.gather(*[requests_post_async(
        url= f"{model_url}/embedding",
        json= {"content": "0"*1024}
    ) for i in range(8)])

    for response in responses:
        embedding = response.json()["embedding"]
        print(embedding[-4:])

asyncio.run(main())

jordankanter pushed a commit to jordankanter/llama.cpp that referenced this issue Mar 13, 2024
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this issue Mar 13, 2024
hodlen pushed a commit to hodlen/llama.cpp that referenced this issue Apr 1, 2024
hodlen pushed a commit to hodlen/llama.cpp that referenced this issue Apr 1, 2024