Fix OpenCL kernels for the new formats #1422

Merged
merged 2 commits into ggerganov:master on May 13, 2023

Conversation

SlyEcho (Collaborator) commented May 12, 2023

This should fix the CLBlast-related errors with the new formats.

I also rewrote the kernels to be almost identical to the CUDA versions, so future updates should be easier.

Should fix #1417 and #1415.

I also figured out a solution for Q5_0, which previously required a pre-conversion to a different format with f32 (and a malloc!). The problem was, of course, an alignment issue, which an __attribute__((packed)) on the block struct, as allowed by the OpenCL 1.1 spec, solved.
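
A minimal sketch of the idea (the field layout below is illustrative and only approximates the llama.cpp definition): the half-precision scale is 2 bytes, so without packing the compiler inserts 2 bytes of padding before the 4-byte qh field, the device-side struct grows from 22 to 24 bytes, and every block after the first is dequantized from the wrong offset. Packing keeps the byte layout identical to what the host writes, so the blocks can be uploaded as-is.

    // Illustrative sketch, not the exact llama.cpp struct.
    #pragma OPENCL EXTENSION cl_khr_fp16 : enable

    struct __attribute__((packed)) block_q5_0 {
        half  d;        // scale (2 bytes)
        uint  qh;       // 5th bit of each of the 32 quants (4 bytes)
        uchar qs[16];   // low 4 bits, two quants per byte (16 bytes)
    };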

Test results

Test models:

417111a40c36bff7ae6c6b3f773ac6efdb1c46584ef1077a1f3404d668e3944f  llama-7b-q4_0.bin
0fc3f4925923cafe4681370e863319e8ff8f2d760e6b3f5435b415a407aa8d56  llama-7b-q4_1.bin
1226673013a28d61acb94d46eeb15d3623bf0f1472a99ecaf0da8076d680fdf8  llama-7b-q5_0.bin
72040d380ab1067dc08c28d5f16269453bf1d4d7172c24424d4300d8474b42b6  llama-7b-q5_1.bin
7dbee72e1b9d541ed75911488d305d239a0fc4fe86bd287ee002fe75f6423859  llama-7b-q8_0.bin
666a4bb533b303bdaf89e1b6a3b6f93535d868de31d903afdc20983dc526c847  llama-7b-f16.bin

Test data:

head -n 102 wiki.test.raw > wiki.test.mini

Test command:

for q in q4_0 q4_1 q5_0 q5_1 q8_0 f16; do 
    ./build-clblast/bin/perplexity -m ./models/llama-7b-$q.bin --no-mmap -f ./models/wiki.test.mini;
done

Test outputs:

7B Q4_0
main: build = 534 (3243b99)
main: seed  = 1683929450
llama.cpp: loading model from ./models/llama-7b-q4_0.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 4113748.20 KB
llama_model_load_internal: mem required  = 5809.33 MB (+ 1026.00 MB per state)

Initializing CLBlast (First Run)...
Attempting to use: Platform=0, Device=0 (If invalid, program will crash)
Using Platform: AMD Accelerated Parallel Processing Device: gfx900:xnack-
....................................................................................................
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
perplexity: calculating perplexity over 12 chunks, batch_size=512
perplexity: 7.99 seconds per pass - ETA 1 minutes
[1]4.4536,[2]5.4657,[3]6.5626,[4]7.2850,[5]7.1903,[6]7.1784,[7]7.3922,[8]7.5547,[9]7.6881,[10]8.0088,[11]8.2590,[12]8.1878,

llama_print_timings:        load time =  9274.76 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings: prompt eval time = 88911.93 ms /  6144 tokens (   14.47 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings:       total time = 90911.25 ms
7B Q4_1
main: build = 534 (3243b99)
main: seed  = 1683929541
llama.cpp: loading model from ./models/llama-7b-q4_1.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 3 (mostly Q4_1)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 4936276.20 KB
llama_model_load_internal: mem required  = 6612.58 MB (+ 1026.00 MB per state)

Initializing CLBlast (First Run)...
Attempting to use: Platform=0, Device=0 (If invalid, program will crash)
Using Platform: AMD Accelerated Parallel Processing Device: gfx900:xnack-
....................................................................................................
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
perplexity: calculating perplexity over 12 chunks, batch_size=512
perplexity: 8.20 seconds per pass - ETA 1 minutes
[1]4.4737,[2]5.3596,[3]6.4475,[4]7.1958,[5]7.1243,[6]7.1576,[7]7.3686,[8]7.4989,[9]7.6096,[10]7.9235,[11]8.1710,[12]8.1030,

llama_print_timings:        load time = 11045.48 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings: prompt eval time = 90558.13 ms /  6144 tokens (   14.74 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings:       total time = 94128.86 ms
7B Q5_0
main: build = 534 (3243b99)
main: seed  = 1683938023
llama.cpp: loading model from ./models/llama-7b-q5_0.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 8 (mostly Q5_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 4525012.20 KB
llama_model_load_internal: mem required  = 6210.96 MB (+ 1026.00 MB per state)

Initializing CLBlast (First Run)...
Attempting to use: Platform=0, Device=0 (If invalid, program will crash)
Using Platform: AMD Accelerated Parallel Processing Device: gfx900:xnack-
....................................................................................................
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
perplexity: calculating perplexity over 12 chunks, batch_size=512
perplexity: 8.31 seconds per pass - ETA 1 minutes
[1]4.2785,[2]5.1975,[3]6.3031,[4]7.0360,[5]6.9772,[6]7.0087,[7]7.2125,[8]7.3533,[9]7.4901,[10]7.7904,[11]8.0330,[12]7.9637,

llama_print_timings:        load time =  9311.35 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings: prompt eval time = 89214.59 ms /  6144 tokens (   14.52 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings:       total time = 90805.86 ms
7B Q5_1
main: build = 534 (3243b99)
main: seed  = 1683929741
llama.cpp: loading model from ./models/llama-7b-q5_1.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 4936276.20 KB
llama_model_load_internal: mem required  = 6612.58 MB (+ 1026.00 MB per state)

Initializing CLBlast (First Run)...
Attempting to use: Platform=0, Device=0 (If invalid, program will crash)
Using Platform: AMD Accelerated Parallel Processing Device: gfx900:xnack-
....................................................................................................
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
perplexity: calculating perplexity over 12 chunks, batch_size=512
perplexity: 8.23 seconds per pass - ETA 1 minutes
[1]4.3009,[2]5.1991,[3]6.3072,[4]7.0070,[5]6.9728,[6]6.9970,[7]7.2182,[8]7.3313,[9]7.4692,[10]7.7641,[11]8.0044,[12]7.9289,

llama_print_timings:        load time = 11141.63 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings: prompt eval time = 94402.18 ms /  6144 tokens (   15.36 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings:       total time = 98093.45 ms
7B Q8_0
main: build = 534 (3243b99)
main: seed  = 1683929839
llama.cpp: loading model from ./models/llama-7b-q8_0.bin
llama_model_load_internal: format     = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 7 (mostly Q8_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 7403860.20 KB
llama_model_load_internal: mem required  = 9022.33 MB (+ 1026.00 MB per state)

Initializing CLBlast (First Run)...
Attempting to use: Platform=0, Device=0 (If invalid, program will crash)
Using Platform: AMD Accelerated Parallel Processing Device: gfx900:xnack-
....................................................................................................
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
perplexity: calculating perplexity over 12 chunks, batch_size=512
perplexity: 9.32 seconds per pass - ETA 1 minutes
[1]4.2510,[2]5.1592,[3]6.2440,[4]6.9448,[5]6.8957,[6]6.9294,[7]7.1463,[8]7.2631,[9]7.3949,[10]7.6989,[11]7.9383,[12]7.8694,

llama_print_timings:        load time = 13512.95 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings: prompt eval time = 96608.18 ms /  6144 tokens (   15.72 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings:       total time = 101527.87 ms
7B F16
main: build = 534 (3243b99)
main: seed  = 1683929941
llama.cpp: loading model from ./models/llama-7b-f16.bin
llama_model_load_internal: format     = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 1 (mostly F16)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 13161556.20 KB
llama_model_load_internal: mem required  = 14645.08 MB (+ 1026.00 MB per state)

Initializing CLBlast (First Run)...
Attempting to use: Platform=0, Device=0 (If invalid, program will crash)
Using Platform: AMD Accelerated Parallel Processing Device: gfx900:xnack-
....................................................................................................
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
perplexity: calculating perplexity over 12 chunks, batch_size=512
perplexity: 12.80 seconds per pass - ETA 2 minutes
[1]4.2555,[2]5.1663,[3]6.2488,[4]6.9504,[5]6.8991,[6]6.9306,[7]7.1476,[8]7.2655,[9]7.3974,[10]7.7015,[11]7.9415,[12]7.8721,

llama_print_timings:        load time = 47153.02 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings: prompt eval time = 141875.01 ms /  6144 tokens (   23.09 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings:       total time = 176944.30 ms

skidd-level-100 left a comment

Looks legit, please merge!

Edit: downloaded and compiled with 'make LLAMA_CLBLAST=1', and it works!

FNsi (Contributor) commented May 12, 2023

That works fine for me!

mqy (Contributor) commented May 13, 2023

My device failed to create an out-of-order queue, so I fall back to an in-order queue with this patch:

    // try an out-of-order queue first
    queue = clCreateCommandQueue(context, device, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
    if (err == CL_INVALID_VALUE) {
        // the device rejects the property, so fall back to a default in-order queue
        queue = clCreateCommandQueue(context, device, 0, &err);
    }

Now it fails with program source errors (10 long error blocks):

Using Platform: Apple Device: Intel(R) UHD Graphics 630
<program source>:1:133: error: variable length arrays are not supported in OpenCL
...
<program source>:1:635: error: automatic variable qualified with an address space
...

These two statements seem to matter:

const uint j = get_local_id(0);
const float d = x[i].d; y[i*qk + j] = x[i].qs[j]*d; 
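
For reference, a generic, hedged illustration of the two kinds of rejected constructs (this kernel is made up for the example and is not the llama.cpp code); the offending lines are shown as comments, with accepted forms below them:

    __kernel void example(__global float *dst, const int qk) {
        // float tmp[qk];           // error: variable length arrays are not supported
        // __global float g;        // error: automatic variable qualified with an address space
        float tmp[32];              // array sizes must be compile-time constants
        __global float *p = dst;    // a private pointer to __global memory is fine
        tmp[0] = (float) qk;
        p[get_global_id(0)] = tmp[0];
    }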

EDIT: platform and device info:

  • Platform: Apple
  • CL_DEVICE_NAME: Intel(R) UHD Graphics 630
  • CL_DEVICE_OPENCL_C_VERSION: OpenCL C 1.2
  • CL_DRIVER_VERSION: 1.2(Mar 14 2023 21:43:06)

ggerganov merged commit 699b1ad into ggerganov:master on May 13, 2023
0cc4m (Collaborator) left a comment

Thanks, you found the q5_1 problem and fixed the kernels, nice. I tested it and didn't find any issues on AMD and Nvidia.

RTX 3060, Llama 7B:
q5_0: 8.18 ms per token on CLBlast, 4.63 ms per token on cuBLAS
q5_1: 8.25 ms per token on CLBlast, 4.54 ms per token on cuBLAS

LostRuins (Collaborator)

@SlyEcho @0cc4m this works for me, but I have noticed a few people mentioning that they get the error regarding variable length arrays. #1429 (comment)

I also noticed that previously the array lengths were indeed hard-coded with a constant. Perhaps this is a platform limitation?
[screenshot: previous kernel source with the array lengths hard-coded as constants]
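
A hedged sketch of that approach (the kernel and the QK name are made up for illustration, not taken from llama.cpp): the array length is fixed when the program is built, either written as a literal in the kernel source or injected through the clBuildProgram options string, e.g. "-DQK=32".

    #ifndef QK
    #define QK 32                   // compile-time constant, e.g. injected with -DQK=32
    #endif

    __kernel void dequantize_example(__global const float *x, __global float *y) {
        const uint i = get_global_id(0);
        float tmp[QK];              // constant size, so no variable length array
        for (uint j = 0; j < QK; j++) {
            tmp[j] = x[i * QK + j];
            y[i * QK + j] = tmp[j];
        }
    }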

SlyEcho (Collaborator, Author) commented May 13, 2023

@LostRuins I will take care of it.

LostRuins (Collaborator)

@SlyEcho Another thing to add: some people seem to be reporting that the q8_0 dequant kernel is not working correctly, and this is the case for me too. Have you observed similar issues? It works correctly on OpenBLAS; only CLBlast returns gibberish, and only for q8_0.

SlyEcho deleted the clkernels branch on May 13, 2023 at 19:31