Fix OpenCL kernels for the new formats #1422

Merged
merged 2 commits into ggerganov:master on May 13, 2023

Conversation

SlyEcho (Collaborator) commented May 12, 2023

This should fix the CLBlast-related errors with the new formats.

I also rewrote the kernels to be almost identical to the CUDA versions, so future updates should be easier.

Should fix #1417 and #1415.

I also figured out a solution for Q5_0, which previously required a pre-conversion to a different format with f32 (and a malloc!). The problem was, of course, an alignment issue, which an __attribute__((packed)) on the block struct, as allowed by the OpenCL 1.1 spec, solved.
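
A minimal sketch of the idea (the field layout below is illustrative and only approximates the llama.cpp definition): the half-precision scale is 2 bytes, so without packing the compiler inserts 2 bytes of padding before the 4-byte qh field, the device-side struct grows from 22 to 24 bytes, and every block after the first is dequantized from the wrong offset. Packing keeps the byte layout identical to what the host writes, so the blocks can be uploaded as-is.

    // Illustrative sketch, not the exact llama.cpp struct.
    #pragma OPENCL EXTENSION cl_khr_fp16 : enable

    struct __attribute__((packed)) block_q5_0 {
        half  d;        // scale (2 bytes)
        uint  qh;       // 5th bit of each of the 32 quants (4 bytes)
        uchar qs[16];   // low 4 bits, two quants per byte (16 bytes)
    };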

Test results

Test models:

417111a40c36bff7ae6c6b3f773ac6efdb1c46584ef1077a1f3404d668e3944f  llama-7b-q4_0.bin
0fc3f4925923cafe4681370e863319e8ff8f2d760e6b3f5435b415a407aa8d56  llama-7b-q4_1.bin
1226673013a28d61acb94d46eeb15d3623bf0f1472a99ecaf0da8076d680fdf8  llama-7b-q5_0.bin
72040d380ab1067dc08c28d5f16269453bf1d4d7172c24424d4300d8474b42b6  llama-7b-q5_1.bin
7dbee72e1b9d541ed75911488d305d239a0fc4fe86bd287ee002fe75f6423859  llama-7b-q8_0.bin
666a4bb533b303bdaf89e1b6a3b6f93535d868de31d903afdc20983dc526c847  llama-7b-f16.bin

Test data:

head -n 102 wiki.test.raw > wiki.test.mini

Test command:

for q in q4_0 q4_1 q5_0 q5_1 q8_0 f16; do 
    ./build-clblast/bin/perplexity -m ./models/llama-7b-$q.bin --no-mmap -f ./models/wiki.test.mini;
done

Test outputs:

7B Q4_0
main: build = 534 (3243b99)
main: seed  = 1683929450
llama.cpp: loading model from ./models/llama-7b-q4_0.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 4113748.20 KB
llama_model_load_internal: mem required  = 5809.33 MB (+ 1026.00 MB per state)

Initializing CLBlast (First Run)...
Attempting to use: Platform=0, Device=0 (If invalid, program will crash)
Using Platform: AMD Accelerated Parallel Processing Device: gfx900:xnack-
....................................................................................................
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
perplexity: calculating perplexity over 12 chunks, batch_size=512
perplexity: 7.99 seconds per pass - ETA 1 minutes
[1]4.4536,[2]5.4657,[3]6.5626,[4]7.2850,[5]7.1903,[6]7.1784,[7]7.3922,[8]7.5547,[9]7.6881,[10]8.0088,[11]8.2590,[12]8.1878,

llama_print_timings:        load time =  9274.76 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings: prompt eval time = 88911.93 ms /  6144 tokens (   14.47 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings:       total time = 90911.25 ms
7B Q4_1
main: build = 534 (3243b99)
main: seed  = 1683929541
llama.cpp: loading model from ./models/llama-7b-q4_1.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 3 (mostly Q4_1)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 4936276.20 KB
llama_model_load_internal: mem required  = 6612.58 MB (+ 1026.00 MB per state)

Initializing CLBlast (First Run)...
Attempting to use: Platform=0, Device=0 (If invalid, program will crash)
Using Platform: AMD Accelerated Parallel Processing Device: gfx900:xnack-
....................................................................................................
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
perplexity: calculating perplexity over 12 chunks, batch_size=512
perplexity: 8.20 seconds per pass - ETA 1 minutes
[1]4.4737,[2]5.3596,[3]6.4475,[4]7.1958,[5]7.1243,[6]7.1576,[7]7.3686,[8]7.4989,[9]7.6096,[10]7.9235,[11]8.1710,[12]8.1030,

llama_print_timings:        load time = 11045.48 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings: prompt eval time = 90558.13 ms /  6144 tokens (   14.74 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings:       total time = 94128.86 ms
7B Q5_0
main: build = 534 (3243b99)
main: seed  = 1683938023
llama.cpp: loading model from ./models/llama-7b-q5_0.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 8 (mostly Q5_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 4525012.20 KB
llama_model_load_internal: mem required  = 6210.96 MB (+ 1026.00 MB per state)

Initializing CLBlast (First Run)...
Attempting to use: Platform=0, Device=0 (If invalid, program will crash)
Using Platform: AMD Accelerated Parallel Processing Device: gfx900:xnack-
....................................................................................................
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
perplexity: calculating perplexity over 12 chunks, batch_size=512
perplexity: 8.31 seconds per pass - ETA 1 minutes
[1]4.2785,[2]5.1975,[3]6.3031,[4]7.0360,[5]6.9772,[6]7.0087,[7]7.2125,[8]7.3533,[9]7.4901,[10]7.7904,[11]8.0330,[12]7.9637,

llama_print_timings:        load time =  9311.35 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings: prompt eval time = 89214.59 ms /  6144 tokens (   14.52 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings:       total time = 90805.86 ms
7B Q5_1
main: build = 534 (3243b99)
main: seed  = 1683929741
llama.cpp: loading model from ./models/llama-7b-q5_1.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 4936276.20 KB
llama_model_load_internal: mem required  = 6612.58 MB (+ 1026.00 MB per state)

Initializing CLBlast (First Run)...
Attempting to use: Platform=0, Device=0 (If invalid, program will crash)
Using Platform: AMD Accelerated Parallel Processing Device: gfx900:xnack-
....................................................................................................
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
perplexity: calculating perplexity over 12 chunks, batch_size=512
perplexity: 8.23 seconds per pass - ETA 1 minutes
[1]4.3009,[2]5.1991,[3]6.3072,[4]7.0070,[5]6.9728,[6]6.9970,[7]7.2182,[8]7.3313,[9]7.4692,[10]7.7641,[11]8.0044,[12]7.9289,

llama_print_timings:        load time = 11141.63 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings: prompt eval time = 94402.18 ms /  6144 tokens (   15.36 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings:       total time = 98093.45 ms
7B Q8_0
main: build = 534 (3243b99)
main: seed  = 1683929839
llama.cpp: loading model from ./models/llama-7b-q8_0.bin
llama_model_load_internal: format     = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 7 (mostly Q8_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 7403860.20 KB
llama_model_load_internal: mem required  = 9022.33 MB (+ 1026.00 MB per state)

Initializing CLBlast (First Run)...
Attempting to use: Platform=0, Device=0 (If invalid, program will crash)
Using Platform: AMD Accelerated Parallel Processing Device: gfx900:xnack-
....................................................................................................
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
perplexity: calculating perplexity over 12 chunks, batch_size=512
perplexity: 9.32 seconds per pass - ETA 1 minutes
[1]4.2510,[2]5.1592,[3]6.2440,[4]6.9448,[5]6.8957,[6]6.9294,[7]7.1463,[8]7.2631,[9]7.3949,[10]7.6989,[11]7.9383,[12]7.8694,

llama_print_timings:        load time = 13512.95 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings: prompt eval time = 96608.18 ms /  6144 tokens (   15.72 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings:       total time = 101527.87 ms
7B F16
main: build = 534 (3243b99)
main: seed  = 1683929941
llama.cpp: loading model from ./models/llama-7b-f16.bin
llama_model_load_internal: format     = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 1 (mostly F16)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 13161556.20 KB
llama_model_load_internal: mem required  = 14645.08 MB (+ 1026.00 MB per state)

Initializing CLBlast (First Run)...
Attempting to use: Platform=0, Device=0 (If invalid, program will crash)
Using Platform: AMD Accelerated Parallel Processing Device: gfx900:xnack-
....................................................................................................
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
perplexity: calculating perplexity over 12 chunks, batch_size=512
perplexity: 12.80 seconds per pass - ETA 2 minutes
[1]4.2555,[2]5.1663,[3]6.2488,[4]6.9504,[5]6.8991,[6]6.9306,[7]7.1476,[8]7.2655,[9]7.3974,[10]7.7015,[11]7.9415,[12]7.8721,

llama_print_timings:        load time = 47153.02 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings: prompt eval time = 141875.01 ms /  6144 tokens (   23.09 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings:       total time = 176944.30 ms

skidd-level-100 left a comment

Looks legit, please merge!

Edit: downloaded and compiled with 'make LLAMA_CLBLAST=1', and it works!

FNsi (Contributor) commented May 12, 2023

That works fine for me!

mqy (Contributor) commented May 13, 2023

My device failed to create an out-of-order queue, so I fall back to an in-order queue with this patch:

    // try an out-of-order queue first
    queue = clCreateCommandQueue(context, device, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
    if (err == CL_INVALID_VALUE) {
        // the device rejects the property, so fall back to a default in-order queue
        queue = clCreateCommandQueue(context, device, 0, &err);
    }

Now it fails with program source errors (10 long error blocks):

Using Platform: Apple Device: Intel(R) UHD Graphics 630
<program source>:1:133: error: variable length arrays are not supported in OpenCL
...
<program source>:1:635: error: automatic variable qualified with an address space
...

These two statements seem to matter:

const uint j = get_local_id(0);
const float d = x[i].d; y[i*qk + j] = x[i].qs[j]*d; 
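
For reference, a generic, hedged illustration of the two kinds of rejected constructs (this kernel is made up for the example and is not the llama.cpp code); the offending lines are shown as comments, with accepted forms below them:

    __kernel void example(__global float *dst, const int qk) {
        // float tmp[qk];           // error: variable length arrays are not supported
        // __global float g;        // error: automatic variable qualified with an address space
        float tmp[32];              // array sizes must be compile-time constants
        __global float *p = dst;    // a private pointer to __global memory is fine
        tmp[0] = (float) qk;
        p[get_global_id(0)] = tmp[0];
    }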

EDIT: platform and device info:

  • Platform: Apple
  • CL_DEVICE_NAME: Intel(R) UHD Graphics 630
  • CL_DEVICE_OPENCL_C_VERSION: OpenCL C 1.2
  • CL_DRIVER_VERSION: 1.2(Mar 14 2023 21:43:06)

ggerganov merged commit 699b1ad into ggerganov:master on May 13, 2023
0cc4m (Collaborator) left a comment

Thanks, you found the q5_1 problem and fixed the kernels, nice. I tested it and didn't find any issues on AMD and Nvidia.

RTX 3060, Llama 7B:
q5_0: 8.18 ms per token on CLBlast, 4.63 ms per token on cuBLAS
q5_1: 8.25 ms per token on CLBlast, 4.54 ms per token on cuBLAS

LostRuins (Collaborator)

@SlyEcho @0cc4m this works for me, but I have noticed a few people mentioning that they get the error regarding variable length arrays. #1429 (comment)

I also noticed that previously the array lengths were indeed hard-coded with a constant. Perhaps this is a platform limitation?
[screenshot: previous kernel source with the array lengths hard-coded as constants]
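
A hedged sketch of that approach (the kernel and the QK name are made up for illustration, not taken from llama.cpp): the array length is fixed when the program is built, either written as a literal in the kernel source or injected through the clBuildProgram options string, e.g. "-DQK=32".

    #ifndef QK
    #define QK 32                   // compile-time constant, e.g. injected with -DQK=32
    #endif

    __kernel void dequantize_example(__global const float *x, __global float *y) {
        const uint i = get_global_id(0);
        float tmp[QK];              // constant size, so no variable length array
        for (uint j = 0; j < QK; j++) {
            tmp[j] = x[i * QK + j];
            y[i * QK + j] = tmp[j];
        }
    }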

SlyEcho (Collaborator, Author) commented May 13, 2023

@LostRuins I will take care of it.

LostRuins (Collaborator)

@SlyEcho Another thing to add: some people seem to be reporting that the q8_0 dequant kernel is not working correctly, and this is the case for me too. Have you observed similar issues? It works correctly on OpenBLAS; only CLBlast returns gibberish, and only for q8_0.

SlyEcho deleted the clkernels branch on May 13, 2023 at 19:31