Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add reference implementation for q4k and q5k #586

Merged
merged 7 commits into from
Aug 26, 2023

Conversation

LLukas22
Copy link
Contributor

With this k-quant models should be able to be executed, i tried to run some q4_k_M models but the results were gibberish. I then extended the unit tests to check against the results produced by the ggml matmul unit tests but everything seams fine, although the current implementation is off by one at the sixth or seventh decimal place, which probably shouldn't influence the model results.

@LaurentMazare
Copy link
Collaborator

Thanks, would probably be good to understand why the text generation does not work.
Which model file did you try out? If you don't have the time to try to line it up, I can probably take a stab at it at some point.

@LLukas22
Copy link
Contributor Author

I believe my vec-dot implementations are funky, but i don't know why as they pass ggml's unit tests and the random tensor mat-muls seam reasonable. The problem im facing is that the llama 2 k-quant models contain multiple different quant types per file meaning i can't test them in isolation.

Here are some results from TheBlokes llama-2 models:

working contains quants
llama-2-7b.ggmlv3.q2_K.bin X Q2K,Q4K,Q6K
llama-2-7b.ggmlv3.q3_K_M.bin X Q3K,Q4K,Q6K
llama-2-7b.ggmlv3.q4_K_M.bin X Q4K,Q6K
llama-2-7b.ggmlv3.q5_K_M.bin X Q5K,Q6K
llama-2-7b.ggmlv3.q6_K.bin ✔️ Q6K

From the test i would conclude that i probably got something wrong in the q4k and q5k implementations, but i don't know what 🤔

@LaurentMazare
Copy link
Collaborator

LaurentMazare commented Aug 25, 2023

I looked a bit at this and here is a patch that seems to fix things for llama-2-7b.ggmlv3.q4_K_M.bin.
Overall I would be in favor of never using transmute or transmute_copy and instead use a combination of LittleEndian::{read_u32,read_u32_into,write_u32,write_u32_into}. All these are safe ops and much easier to reason about.

Let me know if you have the time to do this on your patch or if you're already too much in armored core and if the latter I'll take a stab at fixing these and merge.

edit: I've merged #607 that convert most existing transmutes to byteorder versions so might give some good examples. Haven't done anything on the ones this PR introduces.

diff --git a/candle-core/src/quantized/k_quants.rs b/candle-core/src/quantized/k_quants.rs
index c5ce97f..3852b59 100644
--- a/candle-core/src/quantized/k_quants.rs
+++ b/candle-core/src/quantized/k_quants.rs
@@ -4,6 +4,7 @@ use super::utils::{
 };
 use super::GgmlDType;
 use crate::Result;
+use byteorder::{ByteOrder, LittleEndian};
 use half::f16;
 use rayon::prelude::*;
 
@@ -926,11 +927,7 @@ impl GgmlType for BlockQ4K {
                 q4 = &q4[32..];
             }
 
-            let utmp_raw = unsafe {
-                std::mem::transmute::<&mut [u8; 12], &mut [u32; 3]>(&mut x.scales.clone())
-            };
-
-            utmp[0..3].copy_from_slice(utmp_raw);
+            LittleEndian::read_u32_into(&x.scales, &mut utmp[0..3]);
 
             utmp[3] = ((utmp[2] >> 4) & KMASK2) | (((utmp[1] >> 6) & KMASK3) << 4);
             let uaux = utmp[1] & KMASK1;

@LLukas22
Copy link
Contributor Author

I looked a bit at this and here is a patch that seems to fix things for llama-2-7b.ggmlv3.q4_K_M.bin.
Overall I would be in favor of never using transmute or transmute_copy and instead use a combination of LittleEndian::{read_u32,read_u32_into,write_u32,write_u32_into}. All these are safe ops and much easier to reason about.

This change seems quite sensible. I initially used transmute mainly due to my limited familiarity with Rust 😅. Feel free to commit the patch directly to this PR since you should have write access. Alternatively, I can replace the transmutes on my end and test it with my collection of GGML models tomorrow. Just let me know what works best for you.

@LLukas22
Copy link
Contributor Author

Alright i replaced the transmutes and now everything seams to work just fine. 👍
I tested it again against my collection of models and the output looks reasonable, but i haven't compared it against llama.cpp's output.

@LaurentMazare
Copy link
Collaborator

Amazing, I'll have a quick look at the PR later today and hopefully merge this, I'll also try it vs llama.cpp just to check that everything is more or less in sync (from my experience so far with q4_0/q8_0, it's likely to be a bug on our side when it's not the case :) ).

@LaurentMazare LaurentMazare merged commit c72eb3d into huggingface:main Aug 26, 2023
@LaurentMazare
Copy link
Collaborator

Merged, thanks a lot for all the hard work!

@LaurentMazare
Copy link
Collaborator

Also just merged #609 that gets rid of the remaining transmute and of using intermediary values that you may find helpful. My feeling after this is that the main drawback of transmute is that they do both aliasing (possibly mutable) and typecasting whereas in this case we mostly care about the casting bit. Anyway, great to have all these quantizations available now!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants