Bug: Invalid Embeddings if GPU offloaded (CUDA) #3625
For a long time, it has been known that llama models are not good at generating usable embeddings. Furthermore, it seems that the server example provides incorrect embeddings (in comparison to the …
@FSSRepo Thanks for the fast reply and for confirming my wild guess! At this point I would be more than happy if I got usable embeddings, but I will follow your hint and use a different model for embeddings later. Thanks for that.
@adrianliechti I am working on improving the server example, but the generation of embeddings is not working for me, and I haven't found the reason yet. #3589
For usable embeddings, you need an encoder. Many LLMs these days, including Llama, don't have encoders. For local inference, I'm using one of these: https://mteb-leaderboard.hf.space/?__theme=dark
@shibe2 Or is there an encoder/tokenizer model to use in llama.cpp? Or is the /embedding endpoint meant for another use case, and I just made a wrong assumption?
@FSSRepo I'm keeping my fingers crossed for you and am curious about the result.
I tested the very same thing with llama-cpp-python... and here the result looks much better in a very first small test!

docker run --rm -it -p 8080:8000 -v $(pwd):/models -e MODEL=/models/codellama-7b-instruct.gguf ghcr.io/abetlen/llama-cpp-python:latest

Since this is just a wrapper, I have a feeling that having usable embeddings in the example server is feasible :)
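For reference, here is roughly how I query the wrapper from Go. This is a minimal sketch that assumes the container above is listening on localhost:8080 and exposes llama-cpp-python's OpenAI-compatible /v1/embeddings route (field names follow the OpenAI response shape):

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Request body for the OpenAI-style embeddings route.
	body, err := json.Marshal(map[string]string{"input": "hello"})
	if err != nil {
		panic(err)
	}
	resp, err := http.Post("http://localhost:8080/v1/embeddings", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Decode only the part we care about: the embedding vector itself.
	var out struct {
		Data []struct {
			Embedding []float32 `json:"embedding"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	if len(out.Data) == 0 {
		panic("no embedding returned")
	}
	fmt.Println("dimensions:", len(out.Data[0].Embedding))
}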
hmmmm, new observation…

I can confirm that…
@adrianliechti I can confirm this. Running:

embedding.exe -m ..\models\vicuna-7b.gguf -p "Which is your dog?"

Embedding without GPU offloading:
2.071708 0.094213 -0.879323 -0.868794 -1.877190 0.149166 0.456107 -1.180170 -1.899972 0.324895 0.137798 1.375898 -0.929520 -0.724631 -3.895488 0.835485 2.534587 -1.334827 -1.068944 -1.522295 -2.574473 -1.089735 0.195261 0.207192 0.607332 -0.160022

Embedding with GPU offloading:
0.001389 0.025688 -0.022911 0.001389 0.001389 -0.022911 0.001389 0.001389 0.001389 -0.022911 -0.022911 -0.022911 0.001389 0.001389 0.025688 0.025688 0.001389 0.001389 0.001389 -0.022911 0.025688 0.001389 -0.047211 0.001389 -0.054474 0.013565 -0.020454 0.013565 -0.020454

Possibly the embeddings tensor is not being downloaded from the GPU.
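Side note: the broken vector cycles through only a handful of values (0.001389, -0.022911, ...), so a cheap sanity check is to count distinct components; a healthy embedding has almost entirely distinct floats. A minimal sketch in Go — looksDegenerate is a hypothetical helper, not anything from llama.cpp:

package main

import "fmt"

// looksDegenerate reports whether fewer than half of an embedding's
// components are distinct values; the broken GPU output above repeats
// a handful of numbers, while a healthy embedding almost never does.
func looksDegenerate(embedding []float32) bool {
	distinct := make(map[float32]struct{}, len(embedding))
	for _, v := range embedding {
		distinct[v] = struct{}{}
	}
	return len(distinct)*2 < len(embedding)
}

func main() {
	broken := []float32{
		0.001389, 0.025688, -0.022911, 0.001389, 0.001389, -0.022911,
		0.001389, 0.001389, 0.001389, -0.022911, -0.022911, -0.022911,
	}
	fmt.Println(looksDegenerate(broken)) // true (3 distinct values out of 12)
}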
There is a TODO referencing the embeddings layer, but I don't recall what it means (lines 3438 to 3441 at commit e74c705).
We are intentionally not offloading that layer (see that the …
@FSSRepo @adrianliechti Can you give #3657 a try and see if it fixes the issue with CUDA?
Ciao @slaren @ggerganov, just gave this version a spin, and the embeddings look much better now (also the count in the array)!! Thank you very very much!

{
"embedding": [
-0.4961467981338501,
0.04445655643939972,
-0.6152415871620178,
-0.37043583393096924,
-0.27538755536079407,
-0.8489164710044861,
-1.178662896156311,
... // SNAP // ...
-0.5386767983436584,
0.9538010954856873,
-0.49098142981529236,
-0.589079737663269,
-0.18938979506492615,
0.2513873875141144,
0.8637508749961853,
-0.10916957259178162,
-1.1789305210113525
]
}
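To verify the fixed embeddings end to end, one option is to embed a related and an unrelated pair of texts and compare cosine similarities (a common vector-DB metric); the related pair should score clearly higher. A self-contained sketch, with toy vectors standing in for real embeddings:

package main

import (
	"fmt"
	"math"
)

// cosine returns the cosine similarity of two equal-length vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	// Toy vectors for illustration; in practice these come from /embedding.
	related := cosine([]float64{1, 0, 1}, []float64{0.9, 0.1, 1.1})
	unrelated := cosine([]float64{1, 0, 1}, []float64{-1, 1, 0})
	fmt.Printf("related=%.3f unrelated=%.3f\n", related, unrelated)
}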
Update: Disabling GPU offloading (changing --n-gpu-layers 83 to --n-gpu-layers 0) seems to "fix" my issue with embeddings.

Dear Llama Community,
I might need a hint about the embeddings API on the (example) server. Thanks a lot in advance for taking your time, and for any help!
I'm sending a string (e.g. "hello") to the /embedding endpoint of the llama-example-server against a 70b llama2 model.
In the raw result, I see a JSON object with a flat "embedding" array of numbers, including some null values.
My goal is to convert this to an OpenAI-compatible format and use it with LangChain and a Chroma DB.
(see here for a llama.cpp-to-openai-server: https://github.com/adrianliechti/llama/blob/main/llama-openai/provider/llama/llama.go)
Currently the vector DB search returns no results when using the llama embeddings (but works fine using the OpenAI embeddings).
My assumption is that I somehow have to convert or format this result... I am not very sure about the "null" values, since I haven't seen such values in OpenAI results...
Do these embedding numbers look normal or weird to you? Do you have a hint on how to properly convert them? If I parse them as float32 in Go, each "null" would become "0". Would that make sense? (See the conversion sketch after this message.)
Thanks a ton!! Adrian
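As a follow-up on the null question, here is a minimal conversion sketch in Go. It assumes the raw /embedding response shape shown above ({"embedding": [...]}) and targets the OpenAI-style list/data response. Decoding into []*float32 distinguishes a JSON null from a real 0, since encoding/json silently leaves null elements of a plain []float32 at their zero value:

package main

import (
	"encoding/json"
	"fmt"
)

// llamaResponse matches the raw /embedding response; pointers let a
// JSON null be told apart from a genuine 0.
type llamaResponse struct {
	Embedding []*float32 `json:"embedding"`
}

// openAIDatum / openAIResponse mimic the OpenAI embeddings response shape.
type openAIDatum struct {
	Object    string    `json:"object"`
	Index     int       `json:"index"`
	Embedding []float32 `json:"embedding"`
}

type openAIResponse struct {
	Object string        `json:"object"`
	Data   []openAIDatum `json:"data"`
}

func convert(raw []byte) (*openAIResponse, error) {
	var in llamaResponse
	if err := json.Unmarshal(raw, &in); err != nil {
		return nil, err
	}
	vec := make([]float32, len(in.Embedding))
	for i, v := range in.Embedding {
		if v == nil {
			// A null component points at the broken-offload bug above;
			// better to reject it than to silently store 0 in the vector DB.
			return nil, fmt.Errorf("embedding[%d] is null: embedding looks invalid", i)
		}
		vec[i] = *v
	}
	return &openAIResponse{
		Object: "list",
		Data:   []openAIDatum{{Object: "embedding", Index: 0, Embedding: vec}},
	}, nil
}

func main() {
	if _, err := convert([]byte(`{"embedding":[-0.496, null, 0.044]}`)); err != nil {
		fmt.Println(err) // embedding[1] is null: embedding looks invalid
	}
}

With this in place, a null component surfaces as an error instead of quietly becoming a 0 in the vector DB, which would match the observation that searches against the offloaded embeddings return nothing.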