
Bug: Invalid Embeddings if GPU offloaded (CUDA) #3625

Closed
adrianliechti opened this issue Oct 14, 2023 · 13 comments · Fixed by #3657


adrianliechti commented Oct 14, 2023

Update:
Disabling GPU offloading (changing --n-gpu-layers 83 to --n-gpu-layers 0) seems to "fix" my issue with embeddings.

Dear Llama Community,

I might need a hint about the embeddings API of the example server. Thanks a lot in advance for taking the time to help.

I'm sending a string (e.g. "hello") to the /embedding endpoint of the llama example server, running against a 70B Llama 2 model.
In the raw result, I see this JSON:

{"embedding":[3.1069589551009844e-41,4.1016006050787396e-42,4.736388809417882e-43,3.8956097308229915e-43,5.1834030195374983e-42,4.200111887120774e-41,1.0165019060212223e-41,4.1883409800204457e-41,0.0883883461356163,7.370829922348538e-43,3.685414961174269e-43,1.1832564232758755e-41,4.188761369559743e-41,4.75040179406113e-42,1.8483126744444337e-42,4.512181055125911e-43,0.0,0.0,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,1.401298464324817e-45,0.0,0.0,null,null,null,null,null,0.009384572505950928,-0.029291629791259766,0.007542848587036133,0.025960087776184082,-0.005349218845367432,0.014909744262695313,0.007542848587036133,-0.0035074949264526367,-0.0016657710075378418,-0.0163995623588562,-0.012716114521026611,...

My goal is to convert this to an OpenAI-compatible format and use it with LangChain and a Chroma DB.
(see here for a llama.cpp-to-OpenAI server: https://github.com/adrianliechti/llama/blob/main/llama-openai/provider/llama/llama.go)

Currently, the vector DB search returns no results when using the llama.cpp embeddings (but works fine with the OpenAI embeddings).

My assumption is that I somehow have to convert or format this result. I am not sure about the "null" values, since I haven't seen such values in OpenAI results.

Do these embedding numbers look normal or weird to you? Do you have a hint on how to convert them properly? If I parse them as float32 in Go, the "null" values would become 0; would that make sense?
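To make the question concrete, here is a minimal Go sketch of the conversion I have in mind. The request body ({"content": "..."}) and the port are assumptions on my side, so please correct me if the endpoint expects something else. As far as I can tell, encoding/json leaves JSON null array elements at their zero value, so the "null" entries really would come out as 0.

// Minimal sketch: fetch a vector from the example server's /embedding
// endpoint and re-wrap it in an OpenAI-style embeddings response.
// Assumptions: server on localhost:8080, request body {"content": "..."}.
package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
)

// Response of the example server; encoding/json leaves JSON null array
// elements at the zero value (0) for float32, without an error.
type llamaEmbedding struct {
    Embedding []float32 `json:"embedding"`
}

// The OpenAI-style shape that LangChain / Chroma clients expect.
type openAIEmbeddingItem struct {
    Object    string    `json:"object"`
    Index     int       `json:"index"`
    Embedding []float32 `json:"embedding"`
}

type openAIEmbeddings struct {
    Object string                `json:"object"`
    Data   []openAIEmbeddingItem `json:"data"`
}

func main() {
    body, _ := json.Marshal(map[string]string{"content": "hello"})

    resp, err := http.Post("http://localhost:8080/embedding", "application/json", bytes.NewReader(body))
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    var in llamaEmbedding
    if err := json.NewDecoder(resp.Body).Decode(&in); err != nil {
        panic(err)
    }

    out := openAIEmbeddings{
        Object: "list",
        Data:   []openAIEmbeddingItem{{Object: "embedding", Index: 0, Embedding: in.Embedding}},
    }

    fmt.Println("dimensions:", len(out.Data[0].Embedding))
}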

Thanks a ton!! Adrian

FSSRepo (Collaborator) commented Oct 14, 2023

It has been known for a long time that Llama models are not good at generating usable embeddings. Furthermore, it seems that the server example produces incorrect embeddings (compared to the embedding example).

adrianliechti (Author) commented

@FSSRepo thanks for the fast reply and for confirming my wild guess!

At this point I would be more than happy to get usable embeddings at all, but I will follow your hint and use a different model for embeddings later on. Thanks for that.

adrianliechti changed the title from "Question about Server Embeddings Result" to "Question / Bug about Server Embeddings Result" on Oct 14, 2023
FSSRepo (Collaborator) commented Oct 14, 2023

@adrianliechti I am working on improving the server example, but the generation of embeddings is not working for me, and I haven't found the reason yet. #3589

shibe2 (Contributor) commented Oct 15, 2023

For usable embeddings, you need an encoder. Many LLMs these days, including Llama, don't have encoders. For local inference, I'm using one of these: https://mteb-leaderboard.hf.space/?__theme=dark

adrianliechti (Author) commented Oct 15, 2023

@shibe2
This seems convincing to me, also considering that e.g. OpenAI uses a dedicated model for embeddings.
I will experiment with another embedder directly in my code instead of the llama.cpp endpoint, then.

Or is there an encoder/tokenizer model I could use within llama.cpp? Or is the /embedding endpoint meant for another use case and I just made a wrong assumption?

adrianliechti (Author) commented

@FSSRepo I'm keeping my fingers crossed for you and am curious about the result.
Since my C++ is not where it needs to be for this project, I can probably only help with testing and validating your version.

adrianliechti (Author) commented Oct 15, 2023

I tested the very same thing with the llama-cpp-python version, and there the result looks much better in a first small test!

docker run --rm -it -p 8080:8000 -v $(pwd):/models -e MODEL=/models/codellama-7b-instruct.gguf ghcr.io/abetlen/llama-cpp-python:latest

Since this is just a wrapper around llama.cpp, I have a feeling that usable embeddings in the example server should be feasible :)

adrianliechti (Author) commented Oct 15, 2023

Hmmm, new observation:

  1. llama-2-70b-chat.Q5_K_M.gguf running on an NVIDIA A100 -> embeddings contain these weird null fields / extremely small values
  2. codellama-7b-instruct.Q4_K_M.gguf locally on a MacBook M1 -> embeddings look good / seem to work with ChromaDB

adrianliechti (Author) commented

I can confirm that --n-gpu-layers 0 changes the embedding results for the better.

adrianliechti changed the title from "Question / Bug about Server Embeddings Result" to "Bug: Invalid Server Embeddings when GPU offloaded (CUDA)" on Oct 15, 2023
FSSRepo (Collaborator) commented Oct 16, 2023

@adrianliechti I can confirm this; the embedding example has this problem too.

embedding.exe -m ..\models\vicuna-7b.gguf -p "Which is your dog?"

Embedding with -ngl 0 (correct):

2.071708 0.094213 -0.879323 -0.868794 -1.877190 0.149166 0.456107 -1.180170 -1.899972 0.324895 0.137798 1.375898 -0.929520 -0.724631 -3.895488 0.835485 2.534587 -1.334827 -1.068944 -1.522295 -2.574473 -1.089735 0.195261 0.207192 0.607332 -0.160022 

Embedding with -ngl 33:

0.001389 0.025688 -0.022911 0.001389 0.001389 -0.022911 0.001389 0.001389 0.001389 -0.022911 -0.022911 -0.022911 0.001389 0.001389 0.025688 0.025688 0.001389 0.001389 0.001389 -0.022911 0.025688 0.001389 -0.047211 0.001389 -0.054474 0.013565 -0.020454 0.013565 -0.020454

Possibly the embeddings tensor is not being downloaded from the GPU.
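A cheap way to catch this kind of degenerate output before it silently breaks a vector search is to count the distinct values in the returned vector. Rough Go sketch; the 1/10 threshold is an arbitrary assumption, not anything llama.cpp defines:

// Rough sketch: flag embedding vectors that collapse to a handful of
// repeated values, like the -ngl 33 output above.
package main

import "fmt"

func looksDegenerate(embedding []float32) bool {
    distinct := make(map[float32]struct{})
    for _, v := range embedding {
        distinct[v] = struct{}{}
    }
    // A healthy embedding has (nearly) as many distinct values as dimensions;
    // the broken output above keeps repeating the same few numbers.
    return len(distinct) < len(embedding)/10
}

func main() {
    // Simulate the repeating pattern of the -ngl 33 output across 4096 dims.
    broken := make([]float32, 0, 4096)
    for i := 0; i < 1024; i++ {
        broken = append(broken, 0.001389, 0.025688, -0.022911, 0.001389)
    }
    fmt.Println(looksDegenerate(broken)) // true: only 3 distinct values
}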

adrianliechti changed the title from "Bug: Invalid Server Embeddings when GPU offloaded (CUDA)" to "Bug: Invalid Embeddings if GPU offloaded (CUDA)" on Oct 17, 2023
ggerganov (Owner) commented

There is a TODO referencing the embeddings layer, but I don't recall what it means:

llama.cpp/llama.cpp, lines 3438 to 3441 at e74c705:

// cur = cur*norm(broadcasted)
cur = ggml_mul(ctx0, cur, model.output_norm);
// offload_func_nr(cur); // TODO CPU + GPU mirrored backend
ggml_set_name(cur, "result_norm");

We are intentionally not offloading that layer (see that the offload_func_nr is commented out) in order to be able to read the embeddings, but there might be some issue still.

ggerganov (Owner) commented

@FSSRepo @adrianliechti Can you give #3657 a try and see if it fixes the issue with CUDA?

adrianliechti (Author) commented Oct 17, 2023

Hi @slaren @ggerganov,

I just gave this version a spin, and the embeddings look much better now (also the number of values in the array)!
I am really excited and grateful! What a great project and community!

Thank you very very much!

{
    "embedding": [
        -0.4961467981338501,
        0.04445655643939972,
        -0.6152415871620178,
        -0.37043583393096924,
        -0.27538755536079407,
        -0.8489164710044861,
        -1.178662896156311,
        ... // SNAP // ...
        -0.5386767983436584,
        0.9538010954856873,
        -0.49098142981529236,
        -0.589079737663269,
        -0.18938979506492615,
        0.2513873875141144,
        0.8637508749961853,
        -0.10916957259178162,
        -1.1789305210113525
    ]
}
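For anyone else verifying the fix: a quick sanity check is to compare cosine similarities of a related and an unrelated sentence pair, since with the degenerate GPU output unrelated pairs can look just as similar as related ones. Rough Go sketch, with toy vectors standing in for real /embedding responses:

// Sketch: cosine similarity as a sanity check for embedding quality.
// The vectors below are made up for illustration; in practice they would
// come from the /embedding endpoint.
package main

import (
    "fmt"
    "math"
)

func cosine(a, b []float32) float64 {
    var dot, na, nb float64
    for i := range a {
        dot += float64(a[i]) * float64(b[i])
        na += float64(a[i]) * float64(a[i])
        nb += float64(b[i]) * float64(b[i])
    }
    return dot / (math.Sqrt(na)*math.Sqrt(nb) + 1e-12)
}

func main() {
    hello := []float32{0.9, 0.1, 0.3}
    hi := []float32{0.8, 0.2, 0.25}
    invoice := []float32{-0.1, 0.9, -0.6}

    fmt.Printf("related:   %.3f\n", cosine(hello, hi))      // expected: close to 1
    fmt.Printf("unrelated: %.3f\n", cosine(hello, invoice)) // expected: clearly lower
}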
