
CUDA performance optimizations #1530

Merged (8 commits, May 25, 2023)

Conversation

JohannesGaessler
Collaborator

@JohannesGaessler JohannesGaessler commented May 19, 2023

This PR adds performance optimizations for GPU accelerated token generation, mostly benefiting fast GPUs like the RTX 3090. Performance optimizations can be enabled via the options LLAMA_CUDA_BY=2 and LLAMA_CUDA_UNROLL=1 (make) or LLAMA_CUDA_UNROLL=ON (cmake) at compile time. These options degrade performance on my GTX 1070. Build instructions (Linux):

git clone https://github.com/JohannesGaessler/llama.cpp llama.cpp-johannesgaessler
cd llama.cpp-johannesgaessler                               
git fetch
git switch dfyz-xor-hack
make LLAMA_CUBLAS=1 LLAMA_CUDA_BY=2 LLAMA_CUDA_UNROLL=1

Implementation details

  • As suggested by @dfyz in a previous PR, I have eliminated the shared memory from my CUDA kernel and am using black xor magic (a warp-level shuffle reduction) to sum up the partial sums at the end. This was universally faster on my RTX 3090 and my GTX 1070, but because it breaks HIP compatibility for #1087 it is only used if GGML_USE_HIPBLAS is not defined.
  • Also suggested by @dfyz: larger blocks for the CUDA kernels by assigning two rows to each block. The option LLAMA_CUDA_BY sets the number of rows per block. On my RTX 3090 setting this option to 2 is faster, but higher values give slightly worse performance. On my GTX 1070 a value of 2 or higher degrades performance.
  • Loop unrolling: the matrices used in llama.cpp always have the same sizes, so the loops used during inference can be unrolled if the compiler is told how large the matrices are. This is done by moving ncols from a regular argument to a template argument and adding a switch statement for the various matrix sizes (8 in total). On my RTX 3090 this is faster, but on my GTX 1070 it's slower. Enabling this option significantly increases compile time.
  • Larger blocks in x direction: the option LLAMA_CUDA_BX can be set to determine the block size in x direction. The default value is 32; 64 was faster on the RTX 3090.
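The "black xor magic" above is a standard warp-level butterfly reduction: each of the 32 lanes repeatedly exchanges its partial sum with the lane whose index differs in one bit, halving the mask each step, so after log2(32) = 5 steps every lane holds the full sum without touching shared memory. A host-side sketch in plain C++ that simulates the shuffle exchange with an array (names are illustrative, not from the PR):

```cpp
#include <array>
#include <cassert>
#include <numeric>

// Simulate a 32-lane warp: lanes[i] starts with lane i's partial sum.
// Each step, lane i adds the value held by lane (i ^ mask) -- on the GPU
// this exchange is done by __shfl_xor_sync without any shared memory.
std::array<float, 32> butterfly_reduce(std::array<float, 32> lanes) {
    for (int mask = 16; mask > 0; mask >>= 1) {
        std::array<float, 32> shuffled = lanes; // snapshot, like a lockstep warp
        for (int i = 0; i < 32; ++i) {
            lanes[i] += shuffled[i ^ mask];
        }
    }
    return lanes; // every lane now holds the sum of all 32 partial sums
}
```

After the loop every lane holds the identical total, so any one thread (e.g. lane 0) can write the result directly.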

Results

For the RTX 3090 I used LLAMA_CUDA_BY=2 LLAMA_CUDA_UNROLL=1, for the GTX 1070 I did not use these options.

GPU        Model      ms/t master   ms/t PR   ms/t no unroll
RTX 3090   7b q4_0    23.49         20.36     21.04
RTX 3090   7b q8_0    24.54         22.05     -
RTX 3090   13b q4_0   37.95         32.85     34.04
RTX 3090   33b q4_0   83.15         69.76     73.46
GTX 1070   7b q4_0    69.20         67.80     -
GTX 1070   13b q4_0   141.30        139.39    -

@SlyEcho
Collaborator

SlyEcho commented May 19, 2023

But because this breaks HIP compatibility for #1087 it is only being used if GGML_USE_HIPBLAS is not defined.

I would not worry about HIP right now; it is still in draft, and this is one of the reasons: it's not certain whether it is fully compatible with CUDA (so far it has been).

I was able to solve the issue with this kind of change:

- __shfl_xor_sync(0xffffffff, tmp, mask, 32)
+ __shfl_xor(tmp, mask, 32)

Well actually, I changed the 32 to 64, etc., because of the different warp size. It would be nice if it were a define or something.
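A minimal sketch of such a define (hypothetical, not code from this PR): select the warp size and shuffle intrinsic per platform once, so the reduction loop can be written a single time:

```cpp
// Hypothetical portability shim: AMD wavefronts are 64 lanes wide on these
// GPUs, CUDA warps are 32, and the shuffle intrinsics differ in name and
// signature between HIP and CUDA.
#if defined(GGML_USE_HIPBLAS)
#define WARP_SIZE 64
#define WARP_SHFL_XOR(x, mask) __shfl_xor((x), (mask), WARP_SIZE)
#else
#define WARP_SIZE 32
#define WARP_SHFL_XOR(x, mask) __shfl_xor_sync(0xffffffff, (x), (mask), WARP_SIZE)
#endif

// The reduction loop then becomes platform-independent:
// for (int mask = WARP_SIZE/2; mask > 0; mask >>= 1)
//     tmp += WARP_SHFL_XOR(tmp, mask);
```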

@bluefireexplosion

Excellent work! Interesting analytics:

  • 7B shows a 15.3% speedup in ms/t
  • 13B shows a 15.5% speedup in ms/t
  • 33B shows a 19.1% speedup in ms/t

It would be interesting to see if the performance gain for 65B follows the trend of slight percentage increases as the parameter count increases.
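These figures follow from the table by computing the gain relative to the PR's ms/t, i.e. the tokens-per-second speedup (a quick check in plain C++; the function name is illustrative):

```cpp
#include <cassert>
#include <cmath>

// Speedup in percent when ms/token drops from ms_master to ms_pr.
// Throughput (tokens/s) is the reciprocal of ms/t, so the relative
// throughput gain is ms_master / ms_pr - 1.
double speedup_percent(double ms_master, double ms_pr) {
    return (ms_master / ms_pr - 1.0) * 100.0;
}
```

With the table's numbers this gives about 15.4% (7B), 15.5% (13B), and 19.2% (33B), agreeing with the figures above to within rounding.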

@JohannesGaessler JohannesGaessler added the performance Speed related topics label May 19, 2023
CMakeLists.txt (resolved review thread)
@ggerganov
Owner

The unroll indeed makes the compilation super long. Not sure we want to support it.
Couldn't you achieve the same thing via:

for (int i = 0; i < n; ++i) {
...
}

replace with:

assert(n % 32 == 0);
for (int io = 0; io < n; io += 32) {
    #pragma unroll
    for (int i = 0; i < 32; ++i) {
    ....
    }
}

Or something along these lines
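The strip-mining idea above can be sketched as a runnable host-side example (illustrative, not the actual kernel): n stays a runtime value, but the inner loop has a fixed trip count that the compiler can unroll (with #pragma unroll in CUDA):

```cpp
#include <cassert>
#include <vector>

// Strip-mined sum: n need not be a compile-time constant, only a
// multiple of 32. The inner loop's trip count is fixed at 32, so the
// compiler can fully unroll it regardless of n.
float strided_sum(const std::vector<float>& x) {
    const int n = (int) x.size();
    assert(n % 32 == 0);
    float sum = 0.0f;
    for (int io = 0; io < n; io += 32) {
        for (int i = 0; i < 32; ++i) {  // fixed trip count -> unrollable
            sum += x[io + i];
        }
    }
    return sum;
}
```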

@JohannesGaessler
Collaborator Author

The unroll makes the compilation longer but honestly I don't care about 2 minutes longer compilation if it means I get a few % more performance.

@JohannesGaessler
Collaborator Author

I've added additional performance numbers to the OP. The difference from unrolling is ~5%. Keep in mind that as more performance optimizations are added this number will increase. I personally favor the current solution that just unrolls the big loop (unless there are performance differences). It's easier to maintain and for debugging purposes it's always possible to just compile without it. I can quickly test the proposed approach though.

@JohannesGaessler
Collaborator Author

I've pushed an alternative version to this branch. The value for istride cannot be pushed very high; the value on the branch is the max value for 33b that is bug free. Notably unrolling the inner loop does nothing to improve performance. Unrolling both the inner and the outer loop gives you the same performance and compile time as this PR. I would argue that it's fundamentally impossible to get the performance uplift without the increase in compile time in the first place: the compiler spends that time reordering and optimizing the code which was previously impossible due to the loop.
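For comparison, the template-argument unrolling this PR uses (moving ncols into a template parameter plus a dispatch switch, as described in the OP) can be sketched on the host roughly like this; the sizes and function names here are illustrative, not the actual kernel code:

```cpp
#include <cassert>

// ncols is a template parameter, so inside the loop it is a compile-time
// constant and the compiler is free to fully unroll the loop.
template <int ncols>
float sum_row(const float* row) {
    float sum = 0.0f;
    for (int i = 0; i < ncols; ++i) {
        sum += row[i];
    }
    return sum;
}

// Runtime dispatch over the known matrix widths (only two shown here;
// the PR description mentions 8 sizes in total).
float sum_row_dispatch(const float* row, int ncols) {
    switch (ncols) {
        case 4096:  return sum_row<4096>(row);
        case 11008: return sum_row<11008>(row);
        default:    assert(false && "unexpected ncols"); return 0.0f;
    }
}
```

Each case instantiates a separately optimized function, which is exactly why compile time grows with this approach.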

CMakeLists.txt Outdated
@@ -67,6 +67,8 @@ endif()
option(LLAMA_ACCELERATE "llama: enable Accelerate framework" ON)
option(LLAMA_OPENBLAS "llama: use OpenBLAS" OFF)
option(LLAMA_CUBLAS "llama: use cuBLAS" OFF)
set(LLAMA_CUDA_BY "1" CACHE STRING "llama: y block size for dmmv CUDA kernels")
Collaborator

@howard0su howard0su May 20, 2023

Can we avoid introducing more and more options, and instead check the compute capability to decide whether these features should be enabled dynamically?

Collaborator Author

Automating performance optimizations is something that I would like to do long-term but right now I don't think we have the data necessary to judge which options should be enabled under which circumstances.

Collaborator

I think there should be more options. There are plenty already for runtime tuning.

@SlyEcho
Collaborator

SlyEcho commented May 20, 2023

I am gonna try to bring #1087 into compatibility with this.

The shuffle function seems the biggest hurdle right now, but it doesn't seem impossible.

ggml-cuda.cu (resolved review thread)
@SlyEcho
Collaborator

SlyEcho commented May 20, 2023

For AMD this is roughly 20% faster:

#define GGML_CUDA_DMMV_BLOCK_X 64 // dmmv = dequantize_mul_mat_vec
// ...
for (int mask = GGML_CUDA_DMMV_BLOCK_X/2; mask > 0; mask >>= 1) {
    tmp += __shfl_xor_sync(0xffffffff, tmp, mask, GGML_CUDA_DMMV_BLOCK_X);
}

And LLAMA_CUDA_BY=4 LLAMA_CUDA_UNROLL=ON adds another couple of percent. But the unrolled code is very slow to compile, especially if all architectures are enabled, as they are by default.


EDIT: master with #define CUDA_DMMV_BLOCK_SIZE 128 is still faster than all of the above.

@JohannesGaessler
Collaborator Author

I think I've found an optimization option that's 5% faster than loop unrolling on my RTX 3090 and where loop unrolling actually degrades performance. It essentially works by processing more values per iteration in the loop instead of unrolling the loop. Implementation could be a little tricky. So you can either wait until I've worked out the kinks or merge this PR without unrolling.

@ggerganov
Owner

I think I've found an optimization option that's 5% faster than loop unrolling on my RTX 3090 and where loop unrolling actually degrades performance. It essentially works by processing more values per iteration in the loop instead of unrolling the loop. Implementation could be a little tricky. So you can either wait until I've worked out the kinks or merge this PR without unrolling.

Ok, I have mixed feelings about this unrolling anyway, so an alternative would be welcome

@JohannesGaessler
Collaborator Author

JohannesGaessler commented May 21, 2023

I ended up doing an implementation kind of similar to what ggerganov proposed. It's now possible to set an option LLAMA_CUDA_BX that sets how many data values in the x direction are processed in each outer loop iteration. This then also increases the amount of data processed in the unrolled inner loop. The optimal parameters for my RTX 3090 are LLAMA_CUDA_BX=64 LLAMA_CUDA_BY=2, which result in 14.53 t/s for 33b q4_0.

In terms of features I think this PR is complete but I just noticed that "BLOCK_X" is maybe a bad name for this value since it does not actually set the block size; that value is always WARP_SIZE = 32. Suggestions for better naming?

@JohannesGaessler
Collaborator Author

JohannesGaessler commented May 21, 2023

Maybe just LLAMA_CUDA_DMMV_X / LLAMA_CUDA_DMMV_Y?

Owner

@ggerganov ggerganov left a comment

Looks great. Will do some tests later

Owner

@ggerganov ggerganov left a comment

Minor fixes in Makefile

Works fine on 4080. LLAMA_CUDA_DMMV_X=64 LLAMA_CUDA_DMMV_Y=1 seems to be optimal on this card, but haven't done extensive testing

Makefile Outdated
else
NVCCFLAGS += -DGGML_CUDA_DMMV_X=32
endif # LLAMA_CUDA_DMMV_X
ifdef LLAMA_CUDA_BY
Owner

Suggested change:
- ifdef LLAMA_CUDA_BY
+ ifdef LLAMA_CUDA_DMMV_Y

ggml-cuda.cu Outdated
Comment on lines 91 to 98

// dmmv = dequantize_mul_mat_vec
#ifndef GGML_CUDA_DMMV_X
#define GGML_CUDA_DMMV_X 32 // can by set by compiler option LLAMA_CUDA_BY
#endif
#ifndef GGML_CUDA_DMMV_Y
#define GGML_CUDA_DMMV_Y 1 // can by set by compiler option LLAMA_CUDA_BY
#endif
Owner

@ggerganov ggerganov May 23, 2023

Suggested change:
- // dmmv = dequantize_mul_mat_vec
- #ifndef GGML_CUDA_DMMV_X
- #define GGML_CUDA_DMMV_X 32 // can by set by compiler option LLAMA_CUDA_BY
- #endif
- #ifndef GGML_CUDA_DMMV_Y
- #define GGML_CUDA_DMMV_Y 1 // can by set by compiler option LLAMA_CUDA_BY
- #endif
+ // dmmv = dequantize_mul_mat_vec
+ #ifndef GGML_CUDA_DMMV_X
+ #define GGML_CUDA_DMMV_X 32
+ #endif
+ #ifndef GGML_CUDA_DMMV_Y
+ #define GGML_CUDA_DMMV_Y 1
+ #endif

@ggerganov
Owner

This looks ready to merge, correct?

@JohannesGaessler
Collaborator Author

From my side yes.

@ggerganov ggerganov merged commit 1fcdcc2 into ggerganov:master May 25, 2023
@KerfuffleV2
Collaborator

(Sorry about that, I accidentally hit a random key with the page focused.)

@SlyEcho
Collaborator

SlyEcho commented May 25, 2023

This is amazing now, <70 ms/t for 13b Q4_0 on my old graphics card.

The variables need to be tuned though for different systems.
