ggml : remove bit shuffling #1405
Conversation
This reverts commit 948d124.
**Hot topics:**

- Qauntization formats `Q4` and `Q5` have changed - requantize any old models [(info)](https://github.com/ggerganov/llama.cpp/pull/1405)
Is "qauntization" a typo? 🤔
Is there a script to upgrade the old models to new? I don't have the source models because they're huge.
#1384 does not work for NEON because when we remove the … This is the relevant section before this PR (lines 3335 to 3360 in b608b55):

We were ORing the 5th bit after the …
Hello all, could someone please share how to requantize?

Update: it also seems the README.md lacks the non-q4 variants, and a table matching the `-n` value to the quantization of the selected model.
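For anyone asking the same thing: the usual route is to requantize from the original f16 (or f32) GGML file with the repo's `quantize` tool after rebuilding at this commit; the paths below are hypothetical placeholders:

```shell
# Hypothetical paths -- adjust to wherever your models live.
# Requantize from the original f16 GGML file into one of the new formats:
./quantize models/7B/ggml-model-f16.bin models/7B/ggml-model-q5_0.bin q5_0
```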
Doesn't seem like it though, I don't see references where file version 2 is set during quantization either. Edit: I'm wrong. It's set in …
Check my repos again. I've re-quantised all my GGMLs using the latest code, in q4_0, q5_0, q5_1 and q8_0 variants. So no need to do it yourself unless you want to.
Close #1241
- Drop `Q4_2` support
- `Q4` and `Q5` formats have changed (breaking change)

New timings:
Old timings:
Overall, all these numbers seem to have about ±10% variability from run to run. Not an ideal benchmark, but I'm not sure what else to do.