CUDA performance optimizations #1530
Conversation
I would not worry about HIP right now; it is still in draft, and this is one of the reasons: it is not certain that it is fully compatible with CUDA (so far it has been). I was able to solve the issue with this kind of change:

```diff
- __shfl_xor_sync(0xffffffff, tmp, mask, 32)
+ __shfl_xor(tmp, mask, 32)
```

Well, actually I changed the 32 to 64, etc., because of the different warp size. It would be nice if it were a define or something.
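The define suggested above could look something like this. This is a hypothetical sketch, not code from the PR; `WARP_SIZE` and `SHFL_XOR` are illustrative names:

```cpp
#include <cassert>

// Pick the warp/wavefront width and matching shuffle intrinsic per platform,
// so the same reduction loop compiles for both backends.
#if defined(__HIP_PLATFORM_AMD__)
#define WARP_SIZE 64 // AMD wavefronts are 64 lanes wide
#define SHFL_XOR(var, mask) __shfl_xor((var), (mask), WARP_SIZE)
#elif defined(__CUDACC__)
#define WARP_SIZE 32 // CUDA warps are 32 lanes wide
#define SHFL_XOR(var, mask) __shfl_xor_sync(0xffffffff, (var), (mask), WARP_SIZE)
#else
#define WARP_SIZE 32 // plain host build: no shuffle intrinsic available
#endif
```

A reduction loop written against `WARP_SIZE` and `SHFL_XOR` would then need no per-backend edits.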
Excellent work! Interesting analytics:

It would be interesting to see whether the performance gain for 65B follows the trend of slightly larger percentage increases as the parameter count increases.
The unroll indeed makes the compilation super long. Not sure we want to support it.

```cpp
for (int i = 0; i < n; ++i) {
    ...
}
```

replace with:

```cpp
assert(n % 32 == 0);
for (int io = 0; io < n; io += 32) {
#pragma unroll
    for (int i = 0; i < 32; ++i) {
        ...
    }
}
```

Or something along these lines.
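The proposed split can be sketched as a host-side loop (illustrative only; the real kernel would keep `#pragma unroll` on the fixed-size inner loop):

```cpp
#include <cassert>
#include <vector>

// Outer loop walks 32-element chunks; the inner loop has a compile-time
// trip count of 32, so the compiler can unroll it without knowing n.
float sum_chunked(const std::vector<float> & v) {
    const int n = (int) v.size();
    assert(n % 32 == 0); // precondition from the proposal above
    float sum = 0.0f;
    for (int io = 0; io < n; io += 32) {
        // in the CUDA kernel this inner loop would carry #pragma unroll
        for (int i = 0; i < 32; ++i) {
            sum += v[io + i];
        }
    }
    return sum;
}
```

This keeps compile time bounded because only the 32-iteration inner loop is unrolled, regardless of `n`.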
The unroll makes the compilation longer, but honestly I don't care about a 2-minute-longer compilation if it means I get a few percent more performance.
I've added additional performance numbers to the OP. The difference from unrolling is ~5%. Keep in mind that this number will increase as more performance optimizations are added. I personally favor the current solution that just unrolls the big loop (unless there are performance differences). It's easier to maintain, and for debugging purposes it's always possible to just compile without it. I can quickly test the proposed approach though.
I've pushed an alternative version to this branch. The value for |
CMakeLists.txt
Outdated

```diff
@@ -67,6 +67,8 @@ endif()
 option(LLAMA_ACCELERATE "llama: enable Accelerate framework" ON)
 option(LLAMA_OPENBLAS "llama: use OpenBLAS" OFF)
 option(LLAMA_CUBLAS "llama: use cuBLAS" OFF)
+set(LLAMA_CUDA_BY "1" CACHE STRING "llama: y block size for dmmv CUDA kernels")
```
Can we avoid introducing more and more options? Instead, we could check the compute capability to decide whether to enable these features dynamically.
Automating performance optimizations is something that I would like to do long-term, but right now I don't think we have the data necessary to judge which options should be enabled under which circumstances.
I think there should be more options. There are plenty already for runtime tuning.
I am going to try to bring #1087 into compatibility with this. The shuffle function seems like the biggest hurdle right now, but it doesn't seem impossible.
Force-pushed from ba10833 to b00c58c
For AMD this is roughly 20% faster:

```cpp
#define GGML_CUDA_DMMV_BLOCK_X 64 // dmmv = dequantize_mul_mat_vec
// ...
for (int mask = GGML_CUDA_DMMV_BLOCK_X/2; mask > 0; mask >>= 1) {
    tmp += __shfl_xor_sync(0xffffffff, tmp, mask, GGML_CUDA_DMMV_BLOCK_X);
}
```

And EDIT: master with
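What the xor-shuffle loop computes can be checked with a small CPU simulation (illustrative only, not from the PR): after log2(width) butterfly steps, every lane holds the sum over all lanes.

```cpp
#include <vector>

// Simulate the warp-wide butterfly reduction: lanes[i] plays the role of
// the per-thread value tmp, and lanes[i ^ mask] is what __shfl_xor returns.
std::vector<float> butterfly_reduce(std::vector<float> lanes) {
    const int width = (int) lanes.size(); // 32 for a CUDA warp, 64 on AMD
    for (int mask = width / 2; mask > 0; mask >>= 1) {
        std::vector<float> next(width);
        for (int lane = 0; lane < width; ++lane) {
            next[lane] = lanes[lane] + lanes[lane ^ mask];
        }
        lanes = next;
    }
    return lanes; // every entry now equals the sum of the original entries
}
```

This is why the loop bound has to match the actual warp width: with a width of 64, using 32 would leave two disjoint half-wavefront sums.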
I think I've found an optimization option that's 5% faster than loop unrolling on my RTX 3090, and where loop unrolling actually degrades performance. It essentially works by processing more values per iteration of the loop instead of unrolling the loop. The implementation could be a little tricky, so you can either wait until I've worked out the kinks or merge this PR without unrolling.
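The idea of processing more values per loop iteration can be sketched like this (a hypothetical host-side illustration, not the PR's kernel code):

```cpp
#include <cassert>

// Each iteration consumes two elements instead of one, reducing loop
// overhead without asking the compiler to unroll anything.
float sum_pairs(const float * v, int n) {
    assert(n % 2 == 0); // assume an even element count for this sketch
    float sum = 0.0f;
    for (int i = 0; i < n; i += 2) {
        sum += v[i] + v[i + 1];
    }
    return sum;
}
```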
OK, I have mixed feelings about this unrolling anyway, so an alternative would be welcome.
Force-pushed from 960ee21 to a06f7ec
I ended up doing an implementation kind of similar to what ggerganov proposed. It's now possible to set an option. In terms of features I think this PR is complete, but I just noticed that "BLOCK_X" is maybe a bad name for this value, since it does not actually set the block size; that value is always
Maybe just |
Looks great. Will do some tests later.
Force-pushed from 10d9967 to 3698cd0
Minor fixes in Makefile.

Works fine on the 4080. `LLAMA_CUDA_DMMV_X=64 LLAMA_CUDA_DMMV_Y=1` seems to be optimal on this card, but I haven't done extensive testing.
Makefile
Outdated

```makefile
else
    NVCCFLAGS += -DGGML_CUDA_DMMV_X=32
endif # LLAMA_CUDA_DMMV_X
ifdef LLAMA_CUDA_BY
```

Suggested change:

```diff
-ifdef LLAMA_CUDA_BY
+ifdef LLAMA_CUDA_DMMV_Y
```
ggml-cuda.cu
Outdated

Suggested change (drop the outdated comments):

```diff
 // dmmv = dequantize_mul_mat_vec
 #ifndef GGML_CUDA_DMMV_X
-#define GGML_CUDA_DMMV_X 32 // can by set by compiler option LLAMA_CUDA_BY
+#define GGML_CUDA_DMMV_X 32
 #endif
 #ifndef GGML_CUDA_DMMV_Y
-#define GGML_CUDA_DMMV_Y 1 // can by set by compiler option LLAMA_CUDA_BY
+#define GGML_CUDA_DMMV_Y 1
 #endif
```
Force-pushed from 3698cd0 to d45df1b
This looks ready to merge, correct?

From my side, yes.

(Sorry about that, I accidentally hit a random key with the page focused.)

This is amazing now: <70 ms/t for 13B Q4_0 on my old graphics card. The variables need to be tuned for different systems, though.
This PR adds performance optimizations for GPU-accelerated token generation, mostly benefiting fast GPUs like the RTX 3090. The optimizations can be enabled at compile time via the options `LLAMA_CUDA_BY=2` and `LLAMA_CUDA_UNROLL=1` (make) or `LLAMA_CUDA_UNROLL=ON` (cmake). These options degrade performance on my GTX 1070.

Build instructions (Linux):

Implementation details:

- `GGML_USE_HIPBLAS` is not defined.
- `LLAMA_CUDA_BY` sets the number of rows per block. On my RTX 3090 setting this option to 2 is faster, but higher values have slightly worse performance. On my GTX 1070 a value of 2 or higher causes performance degradation.
- Loop unrolling: the matrices used in llama.cpp always have the same sizes, so the loops used during inference can be unrolled if the compiler is told how large the matrices are. This is done by moving `ncols` from a regular argument to a template argument and adding a switch statement for the various matrix sizes (8 in total). On my RTX 3090 this is faster, but on my GTX 1070 it's slower. Enabling this option significantly increases compile time.
- `LLAMA_CUDA_BX` can be set to determine the block size in x direction. The default value is 32; 64 was faster on the RTX 3090.

Results:

For the RTX 3090 I used `LLAMA_CUDA_BY=2 LLAMA_CUDA_UNROLL=1`; for the GTX 1070 I did not use these options.
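The `ncols` template trick described above can be illustrated with a plain C++ sketch (hypothetical names; the actual kernels and the set of sizes differ):

```cpp
#include <vector>

// With ncols as a template argument the trip count is a compile-time
// constant, so the compiler is free to fully unroll the loop.
template <int ncols>
float dot_fixed(const float * x, const float * y) {
    float sum = 0.0f;
    for (int i = 0; i < ncols; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}

// Runtime dispatch: one switch over the matrix widths that actually occur.
float dot_dispatch(const float * x, const float * y, int ncols) {
    switch (ncols) {
        case 4096: return dot_fixed<4096>(x, y); // e.g. 7B hidden size
        case 5120: return dot_fixed<5120>(x, y); // e.g. 13B hidden size
        default: { // generic fallback with a runtime trip count
            float sum = 0.0f;
            for (int i = 0; i < ncols; ++i) {
                sum += x[i] * y[i];
            }
            return sum;
        }
    }
}
```

Each instantiated case is compiled and optimized separately, which is also why enabling this option significantly increases compile time.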