
Token generation speed less than 1/minute on A6000 machine (48GB VRAM, 32GB RAM) #1665

Closed
Yuval-Peled opened this issue Jun 1, 2023 · 8 comments

@Yuval-Peled
Contributor

Hi all,
Apologies if this is the wrong place.

My goal is to reach a token generation speed of 10+ tokens/second with a 30B-parameter model.

I've tried to follow the README instructions precisely in order to run llama.cpp with GPU acceleration, but I can't get anywhere near a usable generation speed; I'm currently at less than 1 token per minute.

My installation steps (consolidated as a command sketch after the list):

  1. Provisioned an A6000 machine from jarvislabs.ai. It has 48 GB VRAM, 32 GB RAM, and a 100 GB SSD. It comes preinstalled with the CUDA toolkit, python3, git, and everything needed to get started, as far as I'm aware.
  2. Cloned the latest llama.cpp with git clone https://github.com/ggerganov/llama.cpp
  3. Ran make LLAMA_CUBLAS=1 since I have a CUDA-enabled NVIDIA graphics card.
  4. Downloaded a 30B Q4 GGML Vicuna model (it's called Wizard-Vicuna-30B-Uncensored.ggmlv3.q4_0.bin).
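
(For illustration, the steps above consolidated into a runnable sketch; it assumes the CUDA toolkit, git, and make are already on the machine, as the jarvislabs.ai image provides, and omits the model download since its source isn't given here.)

# Build llama.cpp with cuBLAS support (sketch of steps 2-3 above).
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make LLAMA_CUBLAS=1
# Step 4: place the 30B Q4_0 GGML model at path/to/ (download source not specified in this issue).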

My inference command

./main -m path/to/Wizard-Vicuna-30B-Uncensored.ggmlv3.q4_0.bin -n 50 -ngl 2000000 -p "Hey, can you please "

Expected behavior

Inference runs at 1 token/second or more (maybe even 10/second with this "beefy" machine?)

Actual behavior

Inference works, but token generation speed is about 1 token / minute.

llama.cpp claims that work is being offloaded to GPU

main: build = 607 (ffb06a3)
main: seed  = 1685616701
llama.cpp: loading model from path/to/Wizard-Vicuna-30B-Uncensored.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 6656
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 52
llama_model_load_internal: n_layer    = 60
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 17920
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size =    0.13 MB
llama_model_load_internal: mem required  = 2532.67 MB (+ 3124.00 MB per state)
llama_model_load_internal: [cublas] offloading 60 layers to GPU
llama_model_load_internal: [cublas] offloading output layer to GPU
llama_model_load_internal: [cublas] total VRAM used: 17223 MB
....................................................................................................
llama_init_from_file: kv self size  =  780.00 MB

system_info: n_threads = 32 / 64 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 50, n_keep = 0

CPU usage is 700% (according to top)

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                       
 5762 root      20   0   48.4g  21.2g  20.3g R 675.7   8.4   5:04.59 main                                                          

GPU is not being used (according to watch nvidia-smi)

Thu Jun  1 10:53:13 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    Off  | 00000000:B2:00.0 Off |                  Off |
| 30%   32C    P2    67W / 300W |  18750MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
@SlyEcho
Collaborator

SlyEcho commented Jun 1, 2023

Have you tried tweaking the LLAMA_CUDA_DMMV_X and LLAMA_CUDA_DMMV_Y values (they are options to make, but remember to run make clean first)?

The number of threads (32) is excessive; the VM seems to come with 7 cores, which explains the 700% CPU usage. llama.cpp detects 32 from the physical CPU. Try running with -t 7, or whatever works best.
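
(For illustration only: a minimal sketch of the suggested rebuild and run. The DMMV values below are placeholders, not recommendations from this thread; the model path and prompt are taken from the original report.)

# Rebuild with the DMMV options from PR #1530 (values here are illustrative only).
make clean
make LLAMA_CUBLAS=1 LLAMA_CUDA_DMMV_X=64 LLAMA_CUDA_DMMV_Y=2
# Run with a thread count matching the VM's 7 cores.
./main -m path/to/Wizard-Vicuna-30B-Uncensored.ggmlv3.q4_0.bin -n 50 -t 7 -ngl 2000000 -p "Hey, can you please "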

@Yuval-Peled
Contributor Author

@SlyEcho Thank you for the reply.
I'll try running with 7 threads.
I'm not familiar with LLAMA_CUDA_DMMV_X and LLAMA_CUDA_DMMV_Y, and googling did not turn up any documentation or explanation of them. Do you have a link where I can learn a bit, so I know which values I can play with and how?

@SlyEcho
Collaborator

SlyEcho commented Jun 1, 2023

They were added in PR #1530.

@Yuval-Peled
Contributor Author

@SlyEcho
Running with 7 threads did the trick!

Current performance for Wizard-Vicuna-30B-Uncensored.ggmlv3.q4_0.bin model:

| Command | Tokens/second |
| --- | --- |
| -ngl 2000000 | N/A (less than 0.1) |
| -t 7 | 1.6 |
| -t 7 -ngl 2000000 | 8.5 |
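
(For reference, the fastest row above corresponds to an invocation along these lines; this is a usage sketch assembled from the flags and model path already given in this thread, not output copied from it.)

./main -m path/to/Wizard-Vicuna-30B-Uncensored.ggmlv3.q4_0.bin -n 50 -t 7 -ngl 2000000 -p "Hey, can you please "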

Thank you so much for your help. I'll close this issue now.

I'll also look into the build-time parameters you've linked to in #1530 to see if they help squeeze out even more tokens/s.

Do you think it makes sense for me to open a documentation PR that adds a short performance FAQ to the readme?

@SlyEcho
Collaborator

SlyEcho commented Jun 1, 2023

I am testing this myself right now, and it does seem abnormally slow.

I had to use -t 4; even 7 was too much.

> Do you think it makes sense for me to open a documentation PR that adds a short performance FAQ to the readme?

Sure.

@Yuval-Peled
Contributor Author

@SlyEcho Will do this during the weekend.

Thank you for your help!

@Yuval-Peled
Contributor Author

@SlyEcho
Opened a documentation PR here. Happy to hear your thoughts on it!

@sujantkumarkv

sujantkumarkv commented Dec 25, 2023

> | Command | Tokens/second |
> | --- | --- |
> | -ngl 2000000 | N/A (less than 0.1) |
> | -t 7 | 1.6 |
> | -t 7 -ngl 2000000 | 8.5 |

Hey, my question is about how that tokens/s metric is calculated. llama.cpp doesn't output it by itself, right? So how are you calculating it?
For example, the transformers API combines the input and output tensors, so to count only the generated tokens I had to extract the output tokens and divide by the inference time to get tokens/s. How does that work for llama.cpp?

cc @SlyEcho @Yuval-Peled @tobi @sw
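
(A note on measuring this: independently of whatever timing statistics the program itself prints at the end of a run, a rough external estimate can be made by timing the run and dividing the number of generated tokens by the elapsed wall-clock time. The sketch below assumes GNU date and bc are available, uses the -n value as the token count, and reuses the flags and model path from this thread; all of that is illustrative.)

# Rough external tokens/s estimate (sketch). Note: the elapsed time includes
# model loading and prompt processing, so this underestimates pure generation speed.
N_PREDICT=50
START=$(date +%s.%N)
./main -m path/to/Wizard-Vicuna-30B-Uncensored.ggmlv3.q4_0.bin -n "$N_PREDICT" -t 7 -ngl 2000000 -p "Hey, can you please " > /dev/null
END=$(date +%s.%N)
echo "approx tokens/s: $(echo "$N_PREDICT / ($END - $START)" | bc -l)"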
