Question: Why is the prompt run through the network before generating new tokens? #719
Comments
LLMs essentially continue the text they were previously given, so any LLM first has to run on the whole prompt before it can predict the next token. In that sense huggingface and llama.cpp should behave similarly. The difference between the HF and llama.cpp behaviour probably lies somewhere else. For example, the huggingface class is usually initialized well in advance, so you don't see it loading weights when you start generate. Conversely, running ./main in llama.cpp first loads the weights, which can be time-consuming.
I'm not talking about the weights. Even the original weights load into memory in a few seconds on my machine. If you put a long prompt in, you can see the initial prompt being printed slowly before it starts to generate new tokens.
Increase the batch size and it will process a larger part of the prompt in one go.
It's faster with a larger batch size, but I still don't understand why it needs to do anything with the prompt. transformers pretty much starts to generate new tokens immediately.
You're right, there's something odd about it that's not quite working right. I've run across discussions about this in the past where people had the same reasoning as you; unfortunately, I can't seem to find them right now. Basically it seemed like Georgi and others were aware of the issue and were trying to figure out how to resolve it. It's been a while, but I'm sure they're planning to look into it more at some point. It's not a simple adjust-an-argument fix.
Your prompt is split into groups of b tokens, and those are processed in parallel using n threads until the whole prompt has been processed. If you have the spare memory you can use a larger -b, though I'm not sure you actually win performance that way (I don't think so). Maybe someone familiar with the Python transformers implementation can explain what they do differently.
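To make that concrete, here is a minimal sketch of such a loop, assuming the llama_eval() API roughly as it looked at the time of this thread; the helper name process_prompt and the surrounding setup are illustrative, not the actual main.cpp code:

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>
#include "llama.h"

// Illustrative helper: push the whole prompt through the network in chunks
// of n_batch tokens, filling the KV cache, before any new token is sampled.
static bool process_prompt(llama_context * ctx,
                           const std::vector<llama_token> & prompt,
                           int n_batch, int n_threads) {
    int n_past = 0;
    for (int i = 0; i < (int) prompt.size(); i += n_batch) {
        const int n_eval = std::min(n_batch, (int) prompt.size() - i);
        // One full forward pass over n_eval tokens; their keys/values are
        // written into the cache at positions n_past .. n_past + n_eval - 1.
        if (llama_eval(ctx, prompt.data() + i, n_eval, n_past, n_threads) != 0) {
            fprintf(stderr, "llama_eval failed\n");
            return false;
        }
        n_past += n_eval;
    }
    return true; // sampling of new tokens only starts after this returns
}
```

Each chunk still costs a full forward pass through every layer, which is why a long prompt takes noticeable time even though no new tokens are being produced yet.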
I wonder, if I describe the issue to ChatGPT and give it some context with code, whether it could help narrow it down. 😂 But we should definitely leave this to a more experienced person :)
It looks to me like eval is being called on the "prompt" embeddings just to load them into the attention K/V memory (it doesn't look like they are kept anywhere else, and the K/V calculations don't actually depend on the layer activations). That could probably be factored out into its own, much faster function that mutates the model state, instead of calculating and then throwing away the predictions for tokens you already have.
Yep, it seems like the KV cache becomes the state, so it must run the network on the prompt. 🤔 Here's the code with
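In other words (standard attention notation, nothing llama.cpp-specific): to generate the token at position t, every attention layer needs the keys and values of all earlier positions, and those are projections of the hidden states, so they can only be obtained by running the prompt through the network once:

$$
\mathrm{Attn}(q_t) = \mathrm{softmax}\!\left(\frac{q_t K_{1:t}^{\top}}{\sqrt{d_k}}\right) V_{1:t},
\qquad k_i = h_i W_K, \quad v_i = h_i W_V .
$$

The logits computed for the prompt positions during that pass are a by-product that gets discarded; what is kept is the KV cache.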
@python273 @notswileynet did you ever figure out why
@LostRuins |
Here's my attempt at benchmarking, with https://gist.github.com/python273/ca23361caf1cde9dc06bbc9acd44b22d

tl;dr, on an AMD Ryzen 9 5950X 16-Core Processor:
- 7B q4
- 7B q4 BLAS (seems to be slower)
- 7B 8bit generation time: 5748.10601 ms
- 7B 4bit
- if generating only 4 tokens in python: 8bit / 4bit
Okay, maybe 10x is an exaggeration, especially considering BLAS; apologies for the hyperbole, but it is still significantly faster. With the above prompt (1151 tokens) and generating only one (1) extra token, I am getting: No BLAS = 151 s (this is fully on CPU; on GPU the pytorch one is much, much faster).
Can you post your Python code to run on CPU? Also, the full output from compiling and running llama.cpp might be useful.
I am running it through KoboldAI: https://github.com/0cc4m/KoboldAI
Some messing around: Inside

It's kinda significant, because with both functions enabled (BLAS) it takes about 70 ms per token, so the overhead is costing more than both the mat mul and the dequantization combined.
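For anyone who wants to collect that kind of per-call timing, here is a minimal sketch of one way to do it (my own illustration, using only ggml's public ggml_time_init()/ggml_time_us() timer functions; it is not the instrumentation used for the numbers above):

```cpp
#include <cstdint>
#include <cstdio>
#include "ggml.h"

// Crude profiling pattern: accumulate wall-clock time spent in a suspected
// hot spot using ggml's own microsecond timer, then print the total.
static int64_t g_hot_us = 0;  // total time spent in the wrapped code

static void hot_spot(void) {
    const int64_t t0 = ggml_time_us();
    // ... call the suspected function here (e.g. the quantized mat mul
    //     kernel or the dequantization routine) ...
    g_hot_us += ggml_time_us() - t0;
}

int main(void) {
    ggml_time_init();   // initialize the ggml timers before use
    hot_spot();
    fprintf(stderr, "hot spot total: %.3f ms\n", g_hot_us / 1000.0);
    return 0;
}
```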
@LostRuins I tried to replicate that on my computer, and the overhead I got was between 10 and 20 ms. Most of it seems to be in small matrix multiplications. You can try the steps described at https://github.com/ggerganov/llama.cpp/wiki/GGML-Tips-&-Tricks to get a breakdown per operation.
It doesn't seem to me that this was actually resolved. I've seen this comment about how quickly transformers starts to generate tokens vs. llama.cpp, and I've never really seen an answer. I think the core of the issue is described in the first few messages of the thread, and it might have been overshadowed by LostRuins' 10x exaggeration. But if this is solved, it will have a significant impact for many people.
Yes, I admit it was naive of me to say 10x faster, especially without comparing against BLAS. I should have just said "significantly faster". I think further gains are possible, but it may still be somewhat slower than a GPU alternative.
These numbers look reasonable to me. Based on discussion with @guillaumekln (see ggerganov/whisper.cpp#589 (reply in thread)), the MKL implementation is considerably faster than OpenBLAS. I think if we "plug" MKL into
Would be awesome if that is possible and works just as well, at least for the Intel users.
Since MKL is CBLAS-compatible, it's an easy drop-in replacement for OpenBLAS. I added this to the Makefile using the MKL link advisor, with the default libmkl 2020 on Ubuntu 22. I used OpenMP threading and did not see a performance difference with TBB threading.
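For context on why it is a drop-in (a sketch under the assumption that the BLAS path goes through the portable CBLAS interface, which both OpenBLAS and MKL implement): the call looks the same either way, and only the link line changes.

```cpp
#include <cblas.h>  // header from OpenBLAS; MKL ships an equivalent (mkl_cblas.h)

// Single-precision GEMM through the CBLAS interface:
//   C = 1.0 * A * B + 0.0 * C, with A (m x k), B (k x n), C (m x n), row-major.
// The symbol and signature are identical in OpenBLAS and MKL, so switching
// libraries is purely a matter of which one the binary is linked against.
void sgemm_example(int m, int n, int k,
                   const float * A, const float * B, float * C) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k,
                1.0f, A, k,
                      B, n,
                0.0f, C, n);
}
```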
We can confirm that MKL is linked properly in the binary if compiled with LLAMA_MKL=on
I ran my usual test script, which has a ~320 token prompt, on Llama 13B with a batch size of 1024. Averaged out, the timings at the prompt eval stage with MKL were very similar to OpenBLAS (around 150 ms/token on a 16GB i5-6500). Example MKL result:
Example OpenBLAS result:
For fun, here's a run with no BLAS lib (generally I see a 2x improvement in prompt eval speed with OpenBLAS).
As I'm using an older architecture and an older (2020) version of MKL, I'm curious whether people are seeing actual performance improvements with a newer setup.
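One plausible explanation for why a BLAS library only helps at the prompt eval stage (reasoning from the shapes involved, not from these measurements): during prompt eval each weight multiplication is a genuine matrix-matrix product over the whole batch, while during generation it collapses to a matrix-vector product per token, leaving little work over which to amortize the BLAS call overhead:

$$
\underbrace{(n_{\text{batch}} \times d)\,(d \times d')}_{\text{prompt eval}} \;\approx\; 2\,n_{\text{batch}}\,d\,d' \ \text{FLOPs}
\qquad\text{vs.}\qquad
\underbrace{(1 \times d)\,(d \times d')}_{\text{per generated token}} \;\approx\; 2\,d\,d' \ \text{FLOPs}
$$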
Ok, so based on the results from @LostRuins and the MKL test by @eiery,

@eiery Maybe wrap this
Rerunning tests...
@0cc4m What is the batch and prompt size in your experiments? (i.e. the
@ggerganov These are results from a few days back, and I just noticed some inconsistencies. I'll retest and update them, alongside adding the batch and prompt size used.
@ggerganov I apologize, my last results were wrong. Here is what I found:
I think there must be some architectural advantage to how PyTorch/transformers handles the context.
@eiery I want to try MKL to compare, but I can't seem to find where to actually obtain the MKL library for Windows. Any idea?
@ggerganov The linking process for MKL is complex (hence why there's a link advisor), and users need different commands depending on OS, MKL version, and so on. My current LLAMA_MKL option only works on Ubuntu with the MKL version from the package manager, and probably won't work with, say, Mac or Windows. If people are interested in testing, I would recommend they add the lines to the Makefile themselves, replacing my lines with ones that match their setup. @LostRuins You can get the Windows version here. Note that I haven't tested it on Windows.
It would also be an interesting experiment for someone who has it all set up to try compiling llama.cpp with the full Intel oneAPI suite (ICC, MKL, etc.) to see what gains can be achieved. I used GCC for my tests and am not sure whether using the full suite would provide additional improvements.
As I understand it, the NN doesn't have a state, so you should be able to put whatever tokens into the context and start generating new tokens immediately. But right now it seems to first run the NN on the prompt, so with a long prompt it takes some time until it starts to generate new tokens.
I thought I was missing something, but huggingface transformers starts to generate the tokens immediately.