Fix OpenCL kernels for the new formats #1422
Conversation
Looks legit, please merge!
edit: downloaded it and compiled with `make LLAMA_CLBLAST=1`
it works!
That works fine for me!
My device failed to create an out-of-order queue, so I fall back to an in-order queue with this patch:
Now it fails with program source errors (10 long error blocks):
It seems these two statements matter:
EDIT: platform and device info:
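For context, falling back from an out-of-order to an in-order queue is a common OpenCL host-side pattern: request `CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE`, and if the device rejects it, retry with default properties. A minimal sketch of what such a patch might look like (OpenCL 1.x host API; the variable names are illustrative, not the actual patch):

```
// Sketch only: requires <CL/cl.h>; context and device are assumed
// to have been created already.
cl_int err;
cl_command_queue queue = clCreateCommandQueue(
    context, device, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
if (err != CL_SUCCESS) {
    // Some devices do not support out-of-order execution
    // (CL_INVALID_QUEUE_PROPERTIES); retry with an in-order queue.
    queue = clCreateCommandQueue(context, device, 0, &err);
}
```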
Thanks, you found the q5_1 problem and fixed the kernels, nice. I tested it and didn't find any issues on AMD and Nvidia.
RTX 3060, Llama 7B:
q5_0: 8.18 ms per token on CLBlast, 4.63 ms per token on cuBLAS
q5_1: 8.25 ms per token on CLBlast, 4.54 ms per token on cuBLAS
@SlyEcho @0cc4m this works for me, but I have noticed a few people mentioning that they get the error regarding variable-length arrays. #1429 (comment) I also noticed that previously the array lengths were indeed hard-coded with a constant. Perhaps this is a platform limitation?
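The variable-length-array error is a language limitation: OpenCL C does not support VLAs, so a private array whose size is not a compile-time constant fails to build on stricter drivers. A hedged illustration of the failure mode and the usual workaround (not the actual kernel code from this PR):

```
// OpenCL C kernel fragment. This fails on platforms that reject VLAs:
//     const int n = get_local_size(0);
//     float tmp[n];                 // error: variable length array
// A compile-time constant (or a -D define passed at clBuildProgram
// time) avoids the error:
#define BLOCK_SIZE 32
__kernel void vla_example(__global const float *in, __global float *out) {
    float tmp[BLOCK_SIZE];          // fixed size: portable
    const int i = get_global_id(0);
    tmp[i % BLOCK_SIZE] = in[i];
    out[i] = tmp[i % BLOCK_SIZE];
}
```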
@LostRuins I will take care of it.
@SlyEcho Another thing to add: it seems like some people are reporting that the q8_0 dequantization kernel is not working correctly, and this seems to be the case for me too. Have you observed similar issues? It works correctly on OpenBLAS, though; only CLBlast is returning gibberish, and only for q8_0.
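As a reference point when comparing CLBlast output against a known-good path: q8_0 dequantization is just scale times int8 per element. A minimal CPU-side sketch (the struct layout and the fp16 handling here are simplified assumptions for illustration, not the exact ggml definitions):

```c
#include <stdint.h>
#include <string.h>

#define QK8_0 32

// Assumed block layout for this sketch: one fp16 scale plus 32 quants.
typedef struct __attribute__((packed)) {
    uint16_t d;          // scale as raw fp16 bits
    int8_t   qs[QK8_0];  // 32 signed 8-bit quants
} block_q8_0;

/* Minimal fp16 -> fp32 conversion for normal numbers only
   (no NaN/Inf/subnormal handling; enough for this sketch). */
static float half_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h >> 15) << 31;
    uint32_t exp  = (h >> 10) & 0x1F;
    uint32_t man  = h & 0x3FF;
    uint32_t bits = (exp == 0) ? sign  /* treat zero/subnormal as 0 */
                               : sign | ((exp + 112) << 23) | (man << 13);
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Dequantize one q8_0 block: y[i] = scale * qs[i]. */
static void dequantize_q8_0(const block_q8_0 *b, float *y) {
    const float d = half_to_float(b->d);
    for (int i = 0; i < QK8_0; ++i) {
        y[i] = d * b->qs[i];
    }
}
```

If the CLBlast path returns gibberish while a scalar loop like this gives sane values for the same block, the bug is in the kernel or the buffer layout rather than in the quantized data.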
This should fix the CLBlast-related errors with the new formats.
I also rewrote them to be almost identical to the CUDA versions, so future updates could be easier.
Should fix #1417 #1415
I also figured out the solution for Q5_0, which previously required preconversion to a different format with f32 (and malloc!). The issue was, of course, an alignment issue, which an `__attribute__((packed))` as per the OpenCL 1.1 spec solved.

Test results
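The alignment issue from the description can be demonstrated with plain C on the host: without packing, the compiler may insert padding between a 2-byte fp16 scale and a following 4-byte field, so the struct no longer matches the byte-exact on-disk block. A minimal sketch (the field types are illustrative, not the exact kernel structs):

```c
#include <stdint.h>

#define QK5_0 32

// Natural layout: 2 bytes of padding are typically inserted after d
// so that the 4-byte qh is aligned, giving sizeof == 24 on common ABIs.
typedef struct {
    uint16_t d;              // scale (raw fp16 bits)
    uint32_t qh;             // high bits of the 5-bit quants
    uint8_t  qs[QK5_0 / 2];  // low nibbles
} block_q5_0_unpacked;

// Packed layout (__attribute__((packed)), as allowed by the OpenCL 1.1
// spec): byte-exact, sizeof == 22, matching the serialized model data.
typedef struct __attribute__((packed)) {
    uint16_t d;
    uint32_t qh;
    uint8_t  qs[QK5_0 / 2];
} block_q5_0_packed;
```

Reading the unpacked variant over byte-exact model data would shift every field after the padding, which is exactly the kind of silent corruption the packed attribute prevents.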
Test models:
Test data:
head -n 102 wiki.test.raw > wiki.test.mini
Test command:
Test outputs:
7B Q4_0
7B Q4_1
7B Q5_0
7B Q5_1
7B Q8_0
7B F16