
Bug: Invalid Embeddings if GPU offloaded (CUDA) #3625

Closed
adrianliechti opened this issue Oct 14, 2023 · 13 comments · Fixed by #3657


adrianliechti commented Oct 14, 2023

Update:
Disabling GPU offloading (changing --n-gpu-layers 83 to --n-gpu-layers 0) seems to "fix" my issue with embeddings.

Dear Llama Community,

I might need a hint about the embeddings API of the example server. Thanks a lot in advance for taking the time to help.

I'm sending a string (e.g. "hello") to the /embedding endpoint of the llama example server, running against a 70B Llama 2 model.
In the raw result, I see this JSON:

{"embedding":[3.1069589551009844e-41,4.1016006050787396e-42,4.736388809417882e-43,3.8956097308229915e-43,5.1834030195374983e-42,4.200111887120774e-41,1.0165019060212223e-41,4.1883409800204457e-41,0.0883883461356163,7.370829922348538e-43,3.685414961174269e-43,1.1832564232758755e-41,4.188761369559743e-41,4.75040179406113e-42,1.8483126744444337e-42,4.512181055125911e-43,0.0,0.0,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,1.401298464324817e-45,0.0,0.0,null,null,null,null,null,0.009384572505950928,-0.029291629791259766,0.007542848587036133,0.025960087776184082,-0.005349218845367432,0.014909744262695313,0.007542848587036133,-0.0035074949264526367,-0.0016657710075378418,-0.0163995623588562,-0.012716114521026611,...

My goal is to convert this to an OpenAI-compatible format and use it with LangChain and a Chroma DB.
(see here for a llama.cpp-to-OpenAI server: https://github.com/adrianliechti/llama/blob/main/llama-openai/provider/llama/llama.go)

Currently, the vector DB search returns no results when using the llama.cpp embeddings (but works fine with the OpenAI embeddings).

My assumption is that I somehow have to convert or format this result. I am not sure about the "null" values, since I haven't seen such values in OpenAI results.

Do these embedding numbers look normal or weird to you? Do you have a hint on how to convert them properly? If I parse them as float32 in Go, the "null" values would become 0; would that make sense?
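To make the question concrete, here is a minimal Go sketch of the conversion I have in mind. The request body ({"content": "..."}) and the port are assumptions on my side, so please correct me if the endpoint expects something else. As far as I can tell, encoding/json leaves JSON null array elements at their zero value, so the "null" entries really would come out as 0.

// Minimal sketch: fetch a vector from the example server's /embedding
// endpoint and re-wrap it in an OpenAI-style embeddings response.
// Assumptions: server on localhost:8080, request body {"content": "..."}.
package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
)

// Response of the example server; encoding/json leaves JSON null array
// elements at the zero value (0) for float32, without an error.
type llamaEmbedding struct {
    Embedding []float32 `json:"embedding"`
}

// The OpenAI-style shape that LangChain / Chroma clients expect.
type openAIEmbeddingItem struct {
    Object    string    `json:"object"`
    Index     int       `json:"index"`
    Embedding []float32 `json:"embedding"`
}

type openAIEmbeddings struct {
    Object string                `json:"object"`
    Data   []openAIEmbeddingItem `json:"data"`
}

func main() {
    body, _ := json.Marshal(map[string]string{"content": "hello"})

    resp, err := http.Post("http://localhost:8080/embedding", "application/json", bytes.NewReader(body))
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    var in llamaEmbedding
    if err := json.NewDecoder(resp.Body).Decode(&in); err != nil {
        panic(err)
    }

    out := openAIEmbeddings{
        Object: "list",
        Data:   []openAIEmbeddingItem{{Object: "embedding", Index: 0, Embedding: in.Embedding}},
    }

    fmt.Println("dimensions:", len(out.Data[0].Embedding))
}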

Thanks a ton!! Adrian

FSSRepo (Collaborator) commented Oct 14, 2023

It has been known for a long time that Llama models are not good at generating usable embeddings. Furthermore, it seems that the server example produces incorrect embeddings (compared to the embedding example).

adrianliechti (Author) commented

@FSSRepo thanks for the fast reply and for confirming my wild guess!

At this point I would be more than happy to get usable embeddings at all, but I will follow your hint and use a different model for embeddings later on. Thanks for that.

adrianliechti changed the title from "Question about Server Embeddings Result" to "Question / Bug about Server Embeddings Result" on Oct 14, 2023
FSSRepo (Collaborator) commented Oct 14, 2023

@adrianliechti I am working on improving the server example, but the generation of embeddings is not working for me, and I haven't found the reason yet. #3589

shibe2 (Contributor) commented Oct 15, 2023

For usable embeddings, you need an encoder. Many LLMs these days, including Llama, don't have encoders. For local inference, I'm using one of these: https://mteb-leaderboard.hf.space/?__theme=dark

adrianliechti (Author) commented Oct 15, 2023

@shibe2
This seems convincing to me, also considering that e.g. OpenAI uses a dedicated model for embeddings.
I will experiment with another embedder directly in my code instead of the llama.cpp endpoint, then.

Or is there an encoder/tokenizer model I could use within llama.cpp? Or is the /embedding endpoint meant for another use case and I just made a wrong assumption?

adrianliechti (Author) commented

@FSSRepo I'm keeping my fingers crossed for you and am curious about the result.
Since my C++ is not where it needs to be for this project, I can probably only help with testing and validating your version.

adrianliechti (Author) commented Oct 15, 2023

I tested the very same thing with the llama-cpp-python version, and there the result looks much better in a first small test!

docker run --rm -it -p 8080:8000 -v $(pwd):/models -e MODEL=/models/codellama-7b-instruct.gguf ghcr.io/abetlen/llama-cpp-python:latest

Since this is just a wrapper around llama.cpp, I have a feeling that usable embeddings in the example server should be feasible :)

adrianliechti (Author) commented Oct 15, 2023

Hmmm, new observation:

  1. llama-2-70b-chat.Q5_K_M.gguf running on an NVIDIA A100 -> embeddings contain these weird null fields / extremely small values
  2. codellama-7b-instruct.Q4_K_M.gguf locally on a MacBook M1 -> embeddings look good / seem to work with ChromaDB

adrianliechti (Author) commented

I can confirm that --n-gpu-layers 0 changes the embedding results for the better.

adrianliechti changed the title from "Question / Bug about Server Embeddings Result" to "Bug: Invalid Server Embeddings when GPU offloaded (CUDA)" on Oct 15, 2023
FSSRepo (Collaborator) commented Oct 16, 2023

@adrianliechti I can confirm this; the embedding example has this problem too.

embedding.exe -m ..\models\vicuna-7b.gguf -p "Which is your dog?"

Embedding with -ngl 0 (correct):

2.071708 0.094213 -0.879323 -0.868794 -1.877190 0.149166 0.456107 -1.180170 -1.899972 0.324895 0.137798 1.375898 -0.929520 -0.724631 -3.895488 0.835485 2.534587 -1.334827 -1.068944 -1.522295 -2.574473 -1.089735 0.195261 0.207192 0.607332 -0.160022 

Embedding with -ngl 33:

0.001389 0.025688 -0.022911 0.001389 0.001389 -0.022911 0.001389 0.001389 0.001389 -0.022911 -0.022911 -0.022911 0.001389 0.001389 0.025688 0.025688 0.001389 0.001389 0.001389 -0.022911 0.025688 0.001389 -0.047211 0.001389 -0.054474 0.013565 -0.020454 0.013565 -0.020454

Possibly the embeddings tensor is not being downloaded from the GPU.
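A cheap way to catch this kind of degenerate output before it silently breaks a vector search is to count the distinct values in the returned vector. Rough Go sketch; the 1/10 threshold is an arbitrary assumption, not anything llama.cpp defines:

// Rough sketch: flag embedding vectors that collapse to a handful of
// repeated values, like the -ngl 33 output above.
package main

import "fmt"

func looksDegenerate(embedding []float32) bool {
    distinct := make(map[float32]struct{})
    for _, v := range embedding {
        distinct[v] = struct{}{}
    }
    // A healthy embedding has (nearly) as many distinct values as dimensions;
    // the broken output above keeps repeating the same few numbers.
    return len(distinct) < len(embedding)/10
}

func main() {
    // Simulate the repeating pattern of the -ngl 33 output across 4096 dims.
    broken := make([]float32, 0, 4096)
    for i := 0; i < 1024; i++ {
        broken = append(broken, 0.001389, 0.025688, -0.022911, 0.001389)
    }
    fmt.Println(looksDegenerate(broken)) // true: only 3 distinct values
}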

adrianliechti changed the title from "Bug: Invalid Server Embeddings when GPU offloaded (CUDA)" to "Bug: Invalid Embeddings if GPU offloaded (CUDA)" on Oct 17, 2023
ggerganov (Owner) commented

There is a TODO referencing the embeddings layer, but I don't recall what it means:

llama.cpp/llama.cpp, lines 3438 to 3441 at e74c705:

// cur = cur*norm(broadcasted)
cur = ggml_mul(ctx0, cur, model.output_norm);
// offload_func_nr(cur); // TODO CPU + GPU mirrored backend
ggml_set_name(cur, "result_norm");

We are intentionally not offloading that layer (see that the offload_func_nr is commented out) in order to be able to read the embeddings, but there might be some issue still.

ggerganov (Owner) commented

@FSSRepo @adrianliechti Can you give #3657 a try and see if it fixes the issue with CUDA?

adrianliechti (Author) commented Oct 17, 2023

Hi @slaren @ggerganov,

I just gave this version a spin, and the embeddings look much better now (also the number of values in the array)!
I am really excited and grateful! What a great project and community!

Thank you very very much!

{
    "embedding": [
        -0.4961467981338501,
        0.04445655643939972,
        -0.6152415871620178,
        -0.37043583393096924,
        -0.27538755536079407,
        -0.8489164710044861,
        -1.178662896156311,
        ... // SNAP // ...
        -0.5386767983436584,
        0.9538010954856873,
        -0.49098142981529236,
        -0.589079737663269,
        -0.18938979506492615,
        0.2513873875141144,
        0.8637508749961853,
        -0.10916957259178162,
        -1.1789305210113525
    ]
}
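For anyone else verifying the fix: a quick sanity check is to compare cosine similarities of a related and an unrelated sentence pair, since with the degenerate GPU output unrelated pairs can look just as similar as related ones. Rough Go sketch, with toy vectors standing in for real /embedding responses:

// Sketch: cosine similarity as a sanity check for embedding quality.
// The vectors below are made up for illustration; in practice they would
// come from the /embedding endpoint.
package main

import (
    "fmt"
    "math"
)

func cosine(a, b []float32) float64 {
    var dot, na, nb float64
    for i := range a {
        dot += float64(a[i]) * float64(b[i])
        na += float64(a[i]) * float64(a[i])
        nb += float64(b[i]) * float64(b[i])
    }
    return dot / (math.Sqrt(na)*math.Sqrt(nb) + 1e-12)
}

func main() {
    hello := []float32{0.9, 0.1, 0.3}
    hi := []float32{0.8, 0.2, 0.25}
    invoice := []float32{-0.1, 0.9, -0.6}

    fmt.Printf("related:   %.3f\n", cosine(hello, hi))      // expected: close to 1
    fmt.Printf("unrelated: %.3f\n", cosine(hello, invoice)) // expected: clearly lower
}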
