
Token generation speed less than 1/minute on A6000 machine (48GB VRAM, 32GB RAM) #1665

Closed
Yuval-Peled opened this issue Jun 1, 2023 · 8 comments

@Yuval-Peled
Contributor

Hi all,
Apologies if this is the wrong place.

My goal is to reach a token generation speed of 10+ tokens/second with a 30B-parameter model.

I've tried to follow the README instructions precisely in order to run llama.cpp with GPU acceleration, but I can't get anywhere near a usable generation speed; I'm currently at less than 1 token per minute.

My installation steps (consolidated as a command sketch after the list):

  1. Provisioned an A6000 machine from jarvislabs.ai. It has 48 GB VRAM, 32 GB RAM, and a 100 GB SSD. It comes preinstalled with the CUDA toolkit, python3, git, and everything needed to get started, as far as I'm aware.
  2. Cloned the latest llama.cpp with git clone https://github.com/ggerganov/llama.cpp
  3. Ran make LLAMA_CUBLAS=1 since I have a CUDA-enabled NVIDIA graphics card.
  4. Downloaded a 30B Q4 GGML Vicuna model (it's called Wizard-Vicuna-30B-Uncensored.ggmlv3.q4_0.bin).
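
(For illustration, the steps above consolidated into a runnable sketch; it assumes the CUDA toolkit, git, and make are already on the machine, as the jarvislabs.ai image provides, and omits the model download since its source isn't given here.)

# Build llama.cpp with cuBLAS support (sketch of steps 2-3 above).
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make LLAMA_CUBLAS=1
# Step 4: place the 30B Q4_0 GGML model at path/to/ (download source not specified in this issue).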

My inference command

./main -m path/to/Wizard-Vicuna-30B-Uncensored.ggmlv3.q4_0.bin -n 50 -ngl 2000000 -p "Hey, can you please "

Expected behavior

Inference runs at 1 token/second or more (maybe even 10/second with this "beefy" machine?)

Actual behavior

Inference works, but token generation speed is about 1 token / minute.

llama.cpp claims that work is being offloaded to GPU

main: build = 607 (ffb06a3)
main: seed  = 1685616701
llama.cpp: loading model from path/to/Wizard-Vicuna-30B-Uncensored.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 6656
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 52
llama_model_load_internal: n_layer    = 60
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 17920
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size =    0.13 MB
llama_model_load_internal: mem required  = 2532.67 MB (+ 3124.00 MB per state)
llama_model_load_internal: [cublas] offloading 60 layers to GPU
llama_model_load_internal: [cublas] offloading output layer to GPU
llama_model_load_internal: [cublas] total VRAM used: 17223 MB
....................................................................................................
llama_init_from_file: kv self size  =  780.00 MB

system_info: n_threads = 32 / 64 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 50, n_keep = 0

CPU usage is 700% (according to top)

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                       
 5762 root      20   0   48.4g  21.2g  20.3g R 675.7   8.4   5:04.59 main                                                          

GPU is not being used (according to watch nvidia-smi)

Thu Jun  1 10:53:13 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    Off  | 00000000:B2:00.0 Off |                  Off |
| 30%   32C    P2    67W / 300W |  18750MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
@SlyEcho
Collaborator

SlyEcho commented Jun 1, 2023

Have you tried tweaking the LLAMA_CUDA_DMMV_X and LLAMA_CUDA_DMMV_Y values (they are options to make, but remember to run make clean first)?

The number of threads (32) is excessive; the VM seems to come with 7 cores, which explains the 700% CPU usage. llama.cpp detects 32 from the physical CPU. Try running with -t 7, or whatever works best.
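
(For illustration only: a minimal sketch of the suggested rebuild and run. The DMMV values below are placeholders, not recommendations from this thread; the model path and prompt are taken from the original report.)

# Rebuild with the DMMV options from PR #1530 (values here are illustrative only).
make clean
make LLAMA_CUBLAS=1 LLAMA_CUDA_DMMV_X=64 LLAMA_CUDA_DMMV_Y=2
# Run with a thread count matching the VM's 7 cores.
./main -m path/to/Wizard-Vicuna-30B-Uncensored.ggmlv3.q4_0.bin -n 50 -t 7 -ngl 2000000 -p "Hey, can you please "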

@Yuval-Peled
Contributor Author

@SlyEcho Thank you for the reply.
I'll try running with 7 threads.
I'm not familiar with LLAMA_CUDA_DMMV_X and LLAMA_CUDA_DMMV_Y, and googling did not turn up any documentation or explanation of them. Do you have a link where I can learn a bit, so I know which values I can play with and how?

@SlyEcho
Collaborator

SlyEcho commented Jun 1, 2023

They were added in PR #1530.

@Yuval-Peled
Contributor Author

@SlyEcho
Running with 7 threads did the trick!

Current performance for Wizard-Vicuna-30B-Uncensored.ggmlv3.q4_0.bin model:

| Command | Tokens/second |
| --- | --- |
| -ngl 2000000 | N/A (less than 0.1) |
| -t 7 | 1.6 |
| -t 7 -ngl 2000000 | 8.5 |
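
(For reference, the fastest row above corresponds to an invocation along these lines; this is a usage sketch assembled from the flags and model path already given in this thread, not output copied from it.)

./main -m path/to/Wizard-Vicuna-30B-Uncensored.ggmlv3.q4_0.bin -n 50 -t 7 -ngl 2000000 -p "Hey, can you please "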

Thank you so much for your help. I'll close this issue now.

I'll also look into the build-time parameters you've linked to in #1530 to see if they help squeeze out even more tokens/s.

Do you think it makes sense for me to open a documentation PR that adds a short performance FAQ to the readme?

@SlyEcho
Collaborator

SlyEcho commented Jun 1, 2023

I am testing this myself right now, and it does seem abnormally slow.

I had to use -t 4; even 7 was too much.

> Do you think it makes sense for me to open a documentation PR that adds a short performance FAQ to the readme?

Sure.

@Yuval-Peled
Contributor Author

@SlyEcho Will do this during the weekend.

Thank you for your help!

@Yuval-Peled
Contributor Author

@SlyEcho
Opened a documentation PR here. Happy to hear your thoughts on it!

@sujantkumarkv

sujantkumarkv commented Dec 25, 2023

> | Command | Tokens/second |
> | --- | --- |
> | -ngl 2000000 | N/A (less than 0.1) |
> | -t 7 | 1.6 |
> | -t 7 -ngl 2000000 | 8.5 |

Hey, my question is about how that tokens/s metric is calculated. llama.cpp doesn't output it by itself, right? So how are you calculating it?
For example, the transformers API combines the input and output tensors, so to count only the generated tokens I had to extract the output tokens and divide by the inference time to get tokens/s. How does that work for llama.cpp?

cc @SlyEcho @Yuval-Peled @tobi @sw
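
(A note on measuring this: independently of whatever timing statistics the program itself prints at the end of a run, a rough external estimate can be made by timing the run and dividing the number of generated tokens by the elapsed wall-clock time. The sketch below assumes GNU date and bc are available, uses the -n value as the token count, and reuses the flags and model path from this thread; all of that is illustrative.)

# Rough external tokens/s estimate (sketch). Note: the elapsed time includes
# model loading and prompt processing, so this underestimates pure generation speed.
N_PREDICT=50
START=$(date +%s.%N)
./main -m path/to/Wizard-Vicuna-30B-Uncensored.ggmlv3.q4_0.bin -n "$N_PREDICT" -t 7 -ngl 2000000 -p "Hey, can you please " > /dev/null
END=$(date +%s.%N)
echo "approx tokens/s: $(echo "$N_PREDICT / ($END - $START)" | bc -l)"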
