
Segmentation fault #5655

Closed
TruongGiangBT opened this issue Feb 22, 2024 · 14 comments · Fixed by #5699
Labels: bug (Something isn't working), server/webui

Comments

@TruongGiangBT

TruongGiangBT commented Feb 22, 2024

I apologize for the inconvenience. I am deploying a server with the following parameters: -cb -v --embedding -np 3 -c 8192 --host "0.0.0.0" -ngl 64. When I perform multiple embedding requests, a segmentation fault occurs. I noticed that the fault happens whenever 2 slots perform the embedding task simultaneously. I hope a fix can be found soon. Thank you very much.

Originally posted by @TruongGiangBT in #3876 (comment)

phymbert added a commit to phymbert/llama.cpp that referenced this issue Feb 23, 2024
phymbert added a commit that referenced this issue Feb 24, 2024
* server: tests: init scenarios
 - health and slots endpoints
 - completion endpoint
 - OAI compatible chat completion requests w/ and without streaming
 - completion multi users scenario
 - multi users scenario on OAI compatible endpoint with streaming
 - multi users with total number of tokens to predict exceeds the KV Cache size
 - server wrong usage scenario, like in Infinite loop of "context shift" #3969
 - slots shifting
 - continuous batching
 - embeddings endpoint
 - multi users embedding endpoint: Segmentation fault #5655
 - OpenAI-compatible embeddings API
 - tokenize endpoint
 - CORS and api key scenario

* server: CI GitHub workflow


---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@phymbert
Collaborator

Added in: issues.feature

@ggerganov @ngxson On it 👍

ggerganov added the bug (Something isn't working) label Feb 24, 2024
phymbert added a commit that referenced this issue Feb 24, 2024
…t request.

server: tests: add multi users embeddings as fixed
phymbert added a commit that referenced this issue Feb 24, 2024
#5699)

* server: #5655 - continue to update other slots on embedding concurrent request.

* server: tests: add multi users embeddings as fixed

* server: tests: adding OAI compatible embedding concurrent endpoint

* server: tests: adding OAI compatible embedding with multiple inputs
@phymbert
Collaborator

phymbert commented Feb 24, 2024

@TruongGiangBT Please confirm that you no longer see the error. Feel free to reopen if it still occurs.

@TruongGiangBT
Author

@phymbert The segmentation fault has been fixed. However, when I execute multiple requests with the same input simultaneously, the embedding results differ.
My run command:
docker run -dp 6900:8080 \
  --gpus '"device=1"' \
  --name vistral_llm \
  --restart=always \
  -v /home/gianght/llm_models/:/models mistral-wrapped-llama.cpp:full-cuda \
  --server -m /models/ggml-vistral-7B-chat-q4_1.gguf \
  -cb -v --embedding \
  -np 8 \
  -c 32768 \
  --host "0.0.0.0" \
  -ngl 64 \
  -b 4096
Test code:

(screenshot of the test code; a runnable version appears later in the thread)

@phymbert
Collaborator

Could you please open a dedicated discussion on the embedding feature?

@TruongGiangBT
Author

TruongGiangBT commented Feb 26, 2024

llama.cpp: (screenshot of the proposed change)

server.cpp: (screenshot of the proposed change)

What do you think of fixing it like this?
I have tried it. With the same input, if the tokens of the sequence fit entirely within one batch during decoding, the embedding result is identical. If the tokens of that slot are decoded across multiple batches, the results differ, but the cosine similarity with the correct embedding result is > 0.999.
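
For reference, here is a minimal script to quantify that similarity across concurrent identical requests. It is only a sketch based on the test code above; it assumes the server started with the docker command above is reachable at http://127.0.0.1:6900 and that numpy is installed:

import asyncio
import requests
import numpy as np

async def requests_post_async(*args, **kwargs):
    return await asyncio.to_thread(requests.post, *args, **kwargs)

async def main():
    model_url = "http://127.0.0.1:6900"
    # send the same content in several concurrent requests
    responses = await asyncio.gather(*[requests_post_async(
        url=f"{model_url}/embedding",
        json={"content": "0" * 1024}
    ) for _ in range(8)])

    embeddings = [np.array(r.json()["embedding"]) for r in responses]
    ref = embeddings[0]
    for i, emb in enumerate(embeddings[1:], start=1):
        cos = np.dot(ref, emb) / (np.linalg.norm(ref) * np.linalg.norm(emb))
        print(f"cosine(request 0, request {i}) = {cos:.6f}")

asyncio.run(main())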

@phymbert
Collaborator

@ngxson Hi, any idea on the matter, please?

@ngxson
Collaborator

ngxson commented Feb 26, 2024

It's true that there's a bug on the line that @TruongGiangBT pointed out: llama_get_embeddings does not account for the fact that we now have multiple sequences in a batch.

You see different results because you're using -np 8, which allows 8 sequences per batch, while llama_get_embeddings always returns the embeddings of the first sequence.

llama_get_embeddings_ith should be correct in this case, though I've never used it, so I don't know what the correct argument is.

It's better to check with @ggerganov, I think.

@ngxson
Collaborator

ngxson commented Feb 26, 2024

@TruongGiangBT Can you open a PR with the changes you proposed?

ngxson reopened this Feb 26, 2024
@TruongGiangBT
Author

My changes only help me correctly retrieve the embedding results of multiple sequences in a decode batch; they also change the result of the embedding function. The crux of this issue is that we need to determine the index of the last token of each sequence, while also having a reasonable plan for storing the per-sequence embedding results into ctx->embedding.
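
As a purely conceptual sketch of that indexing (illustrative Python, not the llama.cpp code; the names and shapes below are invented for the example): given the flat per-token output of one decode batch and the sequence id of each batch position, each sequence's embedding should be read at the index of its last token rather than always at index 0.

import numpy as np

# Illustrative only: one decode batch containing tokens from three sequences.
# batch_seq_ids[i] is the sequence id of batch position i;
# outputs[i] is the vector produced for batch position i.
batch_seq_ids = [0, 0, 0, 1, 1, 2, 2, 2, 2]
n_embd = 4
outputs = np.random.rand(len(batch_seq_ids), n_embd)

# Record the last batch position seen for each sequence ...
last_token_idx = {}
for i, seq_id in enumerate(batch_seq_ids):
    last_token_idx[seq_id] = i

# ... and read each sequence's embedding at that position, instead of
# always reading position 0 (which returns sequence 0's data for everyone).
per_seq_embedding = {seq_id: outputs[i] for seq_id, i in last_token_idx.items()}
for seq_id, emb in sorted(per_seq_embedding.items()):
    print(seq_id, emb[:2])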

@phymbert
Collaborator

Can you add a scenario in issues.feature? It will really help to trace your issue.

@ggerganov
Owner

If the tokens of that slot are decoded multiple times, the results will be different, but the cosine similarity with the correct embedding result is > 0.999.

I think this is expected due to the way the KV cache works, but I need to verify before explaining. If you could provide some basic instructions / commands to run this scenario, it would be of great help. Otherwise, I have to invest time creating curl queries and/or bash scripts to match what you are doing.

@TruongGiangBT
Author

@ggerganov I also think it might be due to the way the KV cache operates. Although the results are somewhat different, they can be acceptable. As discussed above, I encountered a problem with extracting embeddings and have temporarily fixed it (I shared the change above). But it only works for my requirements; we need to find a general solution.

Docker command:
docker run -dp 6900:8080 --gpus '"device=1"' --name vistral_llm -v /home/gianght/llm_models/:/models mistral-wrapped-llama.cpp:full-cuda --server -m /models/ggml-vistral-7B-chat-q4_1.gguf -cb -v --embedding -np 8 -c 32768 -ngl 64
Jupyter notebook test code:
import asyncio
import requests

async def requests_post_async(*args, **kwargs):
return await asyncio.to_thread(requests.post, *args, **kwargs)

model_url = "http://127.0.0.1:6900"
responses: list[requests.Response] = await asyncio.gather(*[requests_post_async(
url= f"{model_url}/embedding",
json= {"content": "0"*1024}
) for i in range(8)])
for response in responses:
embedding = response.json()["embedding"]
print(embedding[-4:])

Thank you

@ggerganov
Owner

Ok thanks. I was willing to look more into this, but you are not making it easy for me.

I copied this code into a test.py file and tried to indent it:

import asyncio
import requests

async def requests_post_async(*args, **kwargs):
    return await asyncio.to_thread(requests.post, *args, **kwargs)

model_url = "http://127.0.0.1:6900"
responses: list[requests.Response] = await asyncio.gather(*[requests_post_async(
    url= f"{model_url}/embedding",
    json= {"content": "0"*1024}
    ) for i in range(8)])

for response in responses:
    embedding = response.json()["embedding"]
    print(embedding[-4:])

However I get an error:

$ ▶ python3 test.py 
  File "/Users/ggerganov/development/github/llama.cpp/test.py", line 8
    responses: list[requests.Response] = await asyncio.gather(*[requests_post_async(
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
SyntaxError: 'await' outside function

I have no idea what the issue is and don't want to spend time fixing it.

Providing simple repro instructions goes a long way toward helping maintainers help you.

@TruongGiangBT
Author

Oh, I am sorry. I am using a Jupyter notebook. Here is the code for a .py file:

import asyncio
import requests

async def requests_post_async(*args, **kwargs):
    return await asyncio.to_thread(requests.post, *args, **kwargs)

async def main():
    model_url = "http://127.0.0.1:6900"
    responses: list[requests.Response] = await asyncio.gather(*[requests_post_async(
        url= f"{model_url}/embedding",
        json= {"content": "0"*1024}
    ) for i in range(8)])

    for response in responses:
        embedding = response.json()["embedding"]
        print(embedding[-4:])

asyncio.run(main())

jordankanter pushed a commit to jordankanter/llama.cpp that referenced this issue Mar 13, 2024
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this issue Mar 13, 2024
hodlen pushed a commit to hodlen/llama.cpp that referenced this issue Apr 1, 2024
hodlen pushed a commit to hodlen/llama.cpp that referenced this issue Apr 1, 2024