
CUDA performance optimizations #1530

Merged (8 commits, May 25, 2023)

Conversation

JohannesGaessler
Collaborator

@JohannesGaessler JohannesGaessler commented May 19, 2023

This PR adds performance optimizations for GPU accelerated token generation, mostly benefiting fast GPUs like the RTX 3090. Performance optimizations can be enabled via the options LLAMA_CUDA_BY=2 and LLAMA_CUDA_UNROLL=1 (make) or LLAMA_CUDA_UNROLL=ON (cmake) at compile time. These options degrade performance on my GTX 1070. Build instructions (Linux):

git clone https://github.com/JohannesGaessler/llama.cpp llama.cpp-johannesgaessler
cd llama.cpp-johannesgaessler                               
git fetch
git switch dfyz-xor-hack
make LLAMA_CUBLAS=1 LLAMA_CUDA_BY=2 LLAMA_CUDA_UNROLL=1

Implementation details

  • As suggested by @dfyz in a previous PR, I have eliminated the shared memory from my CUDA kernel and am using black xor magic (a warp-level shuffle reduction) to sum up the partial sums at the end. This was universally faster on my RTX 3090 and my GTX 1070, but because it breaks HIP compatibility for #1087 it is only used if GGML_USE_HIPBLAS is not defined.
  • Also suggested by @dfyz: larger blocks for the CUDA kernels by assigning two rows to each block. The option LLAMA_CUDA_BY sets the number of rows per block. On my RTX 3090 setting this option to 2 is faster, but higher values give slightly worse performance. On my GTX 1070 a value of 2 or higher degrades performance.
  • Loop unrolling: the matrices used in llama.cpp always have the same sizes, so the loops used during inference can be unrolled if the compiler is told how large the matrices are. This is done by moving ncols from a regular argument to a template argument and adding a switch statement for the various matrix sizes (8 in total). On my RTX 3090 this is faster, but on my GTX 1070 it's slower. Enabling this option significantly increases compile time.
  • Larger blocks in x direction: the option LLAMA_CUDA_BX can be set to determine the block size in x direction. The default value is 32; 64 was faster on the RTX 3090.
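The "black xor magic" above is a standard warp-level butterfly reduction: each of the 32 lanes repeatedly exchanges its partial sum with the lane whose index differs in one bit, halving the mask each step, so after log2(32) = 5 steps every lane holds the full sum without touching shared memory. A host-side sketch in plain C++ that simulates the shuffle exchange with an array (names are illustrative, not from the PR):

```cpp
#include <array>
#include <cassert>
#include <numeric>

// Simulate a 32-lane warp: lanes[i] starts with lane i's partial sum.
// Each step, lane i adds the value held by lane (i ^ mask) -- on the GPU
// this exchange is done by __shfl_xor_sync without any shared memory.
std::array<float, 32> butterfly_reduce(std::array<float, 32> lanes) {
    for (int mask = 16; mask > 0; mask >>= 1) {
        std::array<float, 32> shuffled = lanes; // snapshot, like a lockstep warp
        for (int i = 0; i < 32; ++i) {
            lanes[i] += shuffled[i ^ mask];
        }
    }
    return lanes; // every lane now holds the sum of all 32 partial sums
}
```

After the loop every lane holds the identical total, so any one thread (e.g. lane 0) can write the result directly.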

Results

For the RTX 3090 I used LLAMA_CUDA_BY=2 LLAMA_CUDA_UNROLL=1, for the GTX 1070 I did not use these options.

GPU        Model      ms/t master   ms/t PR   ms/t no unroll
RTX 3090   7b q4_0    23.49         20.36     21.04
RTX 3090   7b q8_0    24.54         22.05     -
RTX 3090   13b q4_0   37.95         32.85     34.04
RTX 3090   33b q4_0   83.15         69.76     73.46
GTX 1070   7b q4_0    69.20         67.80     -
GTX 1070   13b q4_0   141.30        139.39    -

@SlyEcho
Collaborator

SlyEcho commented May 19, 2023

But because this breaks HIP compatibility for #1087 it is only being used if GGML_USE_HIPBLAS is not defined.

I would not worry about HIP right now; it is still in draft, and this is one of the reasons: it's not certain whether it is fully compatible with CUDA (so far it has been).

I was able to solve the issue with this kind of change:

- __shfl_xor_sync(0xffffffff, tmp, mask, 32)
+ __shfl_xor(tmp, mask, 32)

Well actually, I changed the 32 to 64, etc., because of the different warp size. It would be nice if it were a define or something.
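A minimal sketch of such a define (hypothetical, not code from this PR): select the warp size and shuffle intrinsic per platform once, so the reduction loop can be written a single time:

```cpp
// Hypothetical portability shim: AMD wavefronts are 64 lanes wide on these
// GPUs, CUDA warps are 32, and the shuffle intrinsics differ in name and
// signature between HIP and CUDA.
#if defined(GGML_USE_HIPBLAS)
#define WARP_SIZE 64
#define WARP_SHFL_XOR(x, mask) __shfl_xor((x), (mask), WARP_SIZE)
#else
#define WARP_SIZE 32
#define WARP_SHFL_XOR(x, mask) __shfl_xor_sync(0xffffffff, (x), (mask), WARP_SIZE)
#endif

// The reduction loop then becomes platform-independent:
// for (int mask = WARP_SIZE/2; mask > 0; mask >>= 1)
//     tmp += WARP_SHFL_XOR(tmp, mask);
```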

@bluefireexplosion

Excellent work! Interesting analytics:

  • 7B shows a 15.3% speedup in ms/t
  • 13B shows a 15.5% speedup in ms/t
  • 33B shows a 19.1% speedup in ms/t

It would be interesting to see if the performance gain for 65B follows the trend of slight percentage increases as the parameter count increases.
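These figures follow from the table by computing the gain relative to the PR's ms/t, i.e. the tokens-per-second speedup (a quick check in plain C++; the function name is illustrative):

```cpp
#include <cassert>
#include <cmath>

// Speedup in percent when ms/token drops from ms_master to ms_pr.
// Throughput (tokens/s) is the reciprocal of ms/t, so the relative
// throughput gain is ms_master / ms_pr - 1.
double speedup_percent(double ms_master, double ms_pr) {
    return (ms_master / ms_pr - 1.0) * 100.0;
}
```

With the table's numbers this gives about 15.4% (7B), 15.5% (13B), and 19.2% (33B), agreeing with the figures above to within rounding.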

@JohannesGaessler JohannesGaessler added the performance Speed related topics label May 19, 2023
CMakeLists.txt (resolved review thread)
@ggerganov
Owner

The unroll indeed makes the compilation super long. Not sure we want to support it.
Couldn't you achieve the same thing via:

for (int i = 0; i < n; ++i) {
...
}

replace with:

assert(n % 32 == 0);
for (int io = 0; io < n; io += 32) {
    #pragma unroll
    for (int i = 0; i < 32; ++i) {
    ....
    }
}

Or something along these lines
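The strip-mining idea above can be sketched as a runnable host-side example (illustrative, not the actual kernel): n stays a runtime value, but the inner loop has a fixed trip count that the compiler can unroll (with #pragma unroll in CUDA):

```cpp
#include <cassert>
#include <vector>

// Strip-mined sum: n need not be a compile-time constant, only a
// multiple of 32. The inner loop's trip count is fixed at 32, so the
// compiler can fully unroll it regardless of n.
float strided_sum(const std::vector<float>& x) {
    const int n = (int) x.size();
    assert(n % 32 == 0);
    float sum = 0.0f;
    for (int io = 0; io < n; io += 32) {
        for (int i = 0; i < 32; ++i) {  // fixed trip count -> unrollable
            sum += x[io + i];
        }
    }
    return sum;
}
```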

@JohannesGaessler
Collaborator Author

The unroll makes the compilation longer but honestly I don't care about 2 minutes longer compilation if it means I get a few % more performance.

@JohannesGaessler
Collaborator Author

I've added additional performance numbers to the OP. The difference from unrolling is ~5%. Keep in mind that as more performance optimizations are added this number will increase. I personally favor the current solution that just unrolls the big loop (unless there are performance differences). It's easier to maintain and for debugging purposes it's always possible to just compile without it. I can quickly test the proposed approach though.

@JohannesGaessler
Collaborator Author

I've pushed an alternative version to this branch. The value for istride cannot be pushed very high; the value on the branch is the max value for 33b that is bug free. Notably unrolling the inner loop does nothing to improve performance. Unrolling both the inner and the outer loop gives you the same performance and compile time as this PR. I would argue that it's fundamentally impossible to get the performance uplift without the increase in compile time in the first place: the compiler spends that time reordering and optimizing the code which was previously impossible due to the loop.
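For comparison, the template-argument unrolling this PR uses (moving ncols into a template parameter plus a dispatch switch, as described in the OP) can be sketched on the host roughly like this; the sizes and function names here are illustrative, not the actual kernel code:

```cpp
#include <cassert>

// ncols is a template parameter, so inside the loop it is a compile-time
// constant and the compiler is free to fully unroll the loop.
template <int ncols>
float sum_row(const float* row) {
    float sum = 0.0f;
    for (int i = 0; i < ncols; ++i) {
        sum += row[i];
    }
    return sum;
}

// Runtime dispatch over the known matrix widths (only two shown here;
// the PR description mentions 8 sizes in total).
float sum_row_dispatch(const float* row, int ncols) {
    switch (ncols) {
        case 4096:  return sum_row<4096>(row);
        case 11008: return sum_row<11008>(row);
        default:    assert(false && "unexpected ncols"); return 0.0f;
    }
}
```

Each case instantiates a separately optimized function, which is exactly why compile time grows with this approach.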

CMakeLists.txt Outdated
@@ -67,6 +67,8 @@ endif()
option(LLAMA_ACCELERATE "llama: enable Accelerate framework" ON)
option(LLAMA_OPENBLAS "llama: use OpenBLAS" OFF)
option(LLAMA_CUBLAS "llama: use cuBLAS" OFF)
set(LLAMA_CUDA_BY "1" CACHE STRING "llama: y block size for dmmv CUDA kernels")
Collaborator

@howard0su howard0su May 20, 2023

Can we avoid introducing more and more options, and instead check the compute capability to decide whether these features should be enabled dynamically?

Collaborator Author

Automating performance optimizations is something that I would like to do long-term but right now I don't think we have the data necessary to judge which options should be enabled under which circumstances.

Collaborator

I think there should be more options. There are plenty already for runtime tuning.

@SlyEcho
Collaborator

SlyEcho commented May 20, 2023

I am gonna try to bring #1087 into compatibility with this.

The shuffle function seems the biggest hurdle right now, but it doesn't seem impossible.

ggml-cuda.cu (resolved review thread)
@SlyEcho
Collaborator

SlyEcho commented May 20, 2023

For AMD this is roughly 20% faster:

#define GGML_CUDA_DMMV_BLOCK_X 64 // dmmv = dequantize_mul_mat_vec
// ...
for (int mask = GGML_CUDA_DMMV_BLOCK_X/2; mask > 0; mask >>= 1) {
    tmp += __shfl_xor_sync(0xffffffff, tmp, mask, GGML_CUDA_DMMV_BLOCK_X);
}

And LLAMA_CUDA_BY=4 LLAMA_CUDA_UNROLL=ON adds another couple of percent. But the unrolled code is very slow to compile, especially if all architectures are enabled, as they are by default.


EDIT: master with #define CUDA_DMMV_BLOCK_SIZE 128 is still faster than all of the above.

@JohannesGaessler
Collaborator Author

I think I've found an optimization option that's 5% faster than loop unrolling on my RTX 3090 and where loop unrolling actually degrades performance. It essentially works by processing more values per iteration in the loop instead of unrolling the loop. Implementation could be a little tricky. So you can either wait until I've worked out the kinks or merge this PR without unrolling.

@ggerganov
Owner

I think I've found an optimization option that's 5% faster than loop unrolling on my RTX 3090 and where loop unrolling actually degrades performance. It essentially works by processing more values per iteration in the loop instead of unrolling the loop. Implementation could be a little tricky. So you can either wait until I've worked out the kinks or merge this PR without unrolling.

Ok, I have mixed feelings about this unrolling anyway, so an alternative would be welcome

@JohannesGaessler
Collaborator Author

JohannesGaessler commented May 21, 2023

I ended up doing an implementation kind of similar to what ggerganov proposed. It's now possible to set an option LLAMA_CUDA_BX that sets how many data values in the x direction are processed in each outer loop iteration. This then also increases the amount of data processed in the unrolled inner loop. The optimal parameters for my RTX 3090 are LLAMA_CUDA_BX=64 LLAMA_CUDA_BY=2, which result in 14.53 t/s for 33b q4_0.

In terms of features I think this PR is complete but I just noticed that "BLOCK_X" is maybe a bad name for this value since it does not actually set the block size; that value is always WARP_SIZE = 32. Suggestions for better naming?

@JohannesGaessler
Collaborator Author

JohannesGaessler commented May 21, 2023

Maybe just LLAMA_CUDA_DMMV_X / LLAMA_CUDA_DMMV_Y?

Owner

@ggerganov ggerganov left a comment

Looks great. Will do some tests later

Owner

@ggerganov ggerganov left a comment

Minor fixes in Makefile

Works fine on 4080. LLAMA_CUDA_DMMV_X=64 LLAMA_CUDA_DMMV_Y=1 seems to be optimal on this card, but haven't done extensive testing

Makefile Outdated
else
NVCCFLAGS += -DGGML_CUDA_DMMV_X=32
endif # LLAMA_CUDA_DMMV_X
ifdef LLAMA_CUDA_BY
Owner

Suggested change:
- ifdef LLAMA_CUDA_BY
+ ifdef LLAMA_CUDA_DMMV_Y

ggml-cuda.cu Outdated
Comment on lines 91 to 98

// dmmv = dequantize_mul_mat_vec
#ifndef GGML_CUDA_DMMV_X
#define GGML_CUDA_DMMV_X 32 // can by set by compiler option LLAMA_CUDA_BY
#endif
#ifndef GGML_CUDA_DMMV_Y
#define GGML_CUDA_DMMV_Y 1 // can by set by compiler option LLAMA_CUDA_BY
#endif
Owner

@ggerganov ggerganov May 23, 2023

Suggested change:
- // dmmv = dequantize_mul_mat_vec
- #ifndef GGML_CUDA_DMMV_X
- #define GGML_CUDA_DMMV_X 32 // can by set by compiler option LLAMA_CUDA_BY
- #endif
- #ifndef GGML_CUDA_DMMV_Y
- #define GGML_CUDA_DMMV_Y 1 // can by set by compiler option LLAMA_CUDA_BY
- #endif
+ // dmmv = dequantize_mul_mat_vec
+ #ifndef GGML_CUDA_DMMV_X
+ #define GGML_CUDA_DMMV_X 32
+ #endif
+ #ifndef GGML_CUDA_DMMV_Y
+ #define GGML_CUDA_DMMV_Y 1
+ #endif

@ggerganov
Owner

This looks ready to merge, correct?

@JohannesGaessler
Collaborator Author

From my side yes.

@ggerganov ggerganov merged commit 1fcdcc2 into ggerganov:master May 25, 2023
@KerfuffleV2
Collaborator

(Sorry about that, I accidentally hit a random key with the page focused.)

@SlyEcho
Collaborator

SlyEcho commented May 25, 2023

This is amazing now, <70 ms/t for 13b Q4_0 on my old graphics card.

The variables need to be tuned though for different systems.
