Token generation speed less than 1 token/minute on A6000 machine (48GB VRAM, 32GB RAM) #1665
Comments
Have you tried tweaking the thread count? The number of threads (32) is excessive; the VM seems to come with 7 cores, which explains the 700% CPU use. llama.cpp detects 32 from the physical CPU. Try running with fewer threads, matching the core count.
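For reference, a minimal sketch of such an invocation; the `-t 7` value is an assumption matching the 7-core VM mentioned above, and the rest of the command is taken from the original report:

```
# Same inference command as in the report, but with the thread count pinned
# to the number of physical cores instead of the auto-detected 32.
./main -m path/to/Wizard-Vicuna-30B-Uncensored.ggmlv3.q4_0.bin -t 7 -n 50 -ngl 2000000 -p "Hey, can you please "
```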
@SlyEcho Thank you for the reply.
It was added in PR #1530.
@SlyEcho Current performance for …
Thank you so much for your help. I'll close this issue now. I'll also look into the build env params you've linked to in #1530 to see whether they help squeeze out even more tokens/s. Do you think it makes sense for me to open a documentation PR that adds a short performance FAQ to the readme?
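For illustration only, a sketch of how such build-time parameters are typically passed to `make`; the variable names `LLAMA_CUDA_DMMV_X` and `LLAMA_CUDA_DMMV_Y` are assumptions based on the CUDA build options documented around that time, not values taken from #1530:

```
# Rebuild with cuBLAS plus CUDA tuning variables (names and values are assumptions;
# check the linked PR and the Makefile for the exact parameters and sensible values).
make clean
make LLAMA_CUBLAS=1 LLAMA_CUDA_DMMV_X=64 LLAMA_CUDA_DMMV_Y=2
```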
I had to use …
Sure.
@SlyEcho Will do this during the weekend. Thank you for your help!
Hey, my question revolves around calculating that …
Hi all,
Apologies if this is the wrong place.
My goal is to reach a token generation speed of 10+ tokens/second with a 30B-parameter model.
I've tried to follow the readme instructions precisely in order to run llama.cpp with GPU acceleration, but I can't seem to get any reasonable generation speed; I'm currently at less than 1 token/minute.
My installation steps:
- Cloned llama.cpp with `git clone https://github.com/ggerganov/llama.cpp`
- Built with `make LLAMA_CUBLAS=1`, since I have a CUDA-enabled nVidia graphics card
- Downloaded a quantized model (`Wizard-Vicuna-30B-Uncensored.ggmlv3.q4_0.bin`)

My inference command:

```
./main -m path/to/Wizard-Vicuna-30B-Uncensored.ggmlv3.q4_0.bin -n 50 -ngl 2000000 -p "Hey, can you please "
```
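For clarity, the same command with each flag spelled out (these are standard `./main` options; reading `-ngl 2000000` as "offload as many layers as possible" is an interpretation, not something stated in the original report):

```
# Flags used in the command above:
#   -m    path to the GGML model file
#   -n    number of tokens to generate (50 here)
#   -ngl  number of layers to offload to the GPU; an oversized value such as
#         2000000 effectively asks for every layer to be offloaded
#   -p    the prompt text
./main -m path/to/Wizard-Vicuna-30B-Uncensored.ggmlv3.q4_0.bin -n 50 -ngl 2000000 -p "Hey, can you please "
```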
Expected behavior
Inference works with at least 1 token / second (maybe even 10/second with this "beefy" machine?)
Actual behavior
Inference works, but token generation speed is about 1 token/minute.
- llama.cpp claims that work is being offloaded to the GPU.
- CPU usage is 700% (according to `top`).
- The GPU is not being used (according to `watch nvidia-smi`).
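For reference, a lighter-weight way to watch GPU utilization and memory while `./main` runs (standard `nvidia-smi` query options; the 1-second interval is just an example):

```
# Poll GPU utilization and memory use once per second while the model is generating.
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv -l 1
```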