
ggml : add Q8_0 quantization format (rename the old one to Q8_1) (ARM NEON) #1179

Merged: 10 commits merged into master from q8_0 on Apr 25, 2023

Conversation

@ggerganov ggerganov (Owner) commented Apr 25, 2023

8-bit integer quantization support

Perplexity: 5.9563

main: seed = 1682448271
llama.cpp: loading model from ../models/7B/ggml-model-q8_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 7 (mostly Q8_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 7403851.11 KB
llama_model_load_internal: mem required  = 9022.32 MB (+ 1026.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 12 / 64 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
perplexity : calculating perplexity over 655 chunks, batch_size=512
2.16 seconds per pass - ETA 23 minutes
[1]4.2384,[2]4.7355,[3]5.5889,[4]6.1733,[5]6.3007,[6]6.2664,[7]6.4584,[8]6.5534,[9]6.8821,[10]7.1241,[11]7.3343,[12]7.3544,[13]7.2708,[14]7.3211,[15]7.5611,[16]7.1921,[17]7.0824,[18]7.0290,[19]6.6805,[20]6.6714,[21]6.5799,[22]6.4063,[23]6.3776,[24]6.2865,[25]6.2832,[26]6.1237,[27]5.9534,[28]5.8555,[29]5.7683,[30]5.6163,[31]5.5870,[32]5.6074,[33]5.5519,[34]5.5823,[35]5.6052,[36]5.6417,[37]5.6421,[38]5.6530,[39]5.6856,[40]5.7357,[41]5.7442,[42]5.7817,[43]5.7443,[44]5.8007,[45]5.8034,[46]5.7780,[47]5.7980,[48]5.7735,[49]5.7746,[50]5.7359,[51]5.7319,[52]5.7225,[53]5.7672,[54]5.7516,[55]5.7301,[56]5.7587,[57]5.7784,[58]5.7973,[59]5.8144,[60]5.8550,[61]5.8479,[62]5.9050,[63]5.9359,[64]5.9492,[65]5.9908,[66]5.9990,[67]6.0162,[68]6.0307,[69]6.0542,[70]6.0843,[71]6.1051,[72]6.1363,[73]6.1946,[74]6.1988,[75]6.2121,[76]6.2241,[77]6.2353,[78]6.2208,[79]6.2482,[80]6.2417,[81]6.2527,[82]6.2567,[83]6.2067,[84]6.1891,[85]6.1764,[86]6.1553,[87]6.0913,[88]6.0665,[89]6.0472,[90]6.0334,[91]6.0560,[92]6.0502,[93]6.0510,[94]6.0486,[95]6.0757,[96]6.0756,[97]6.0701,[98]6.0644,[99]6.0516,[100]6.0506,[101]6.0743,[102]6.0695,[103]6.0895,[104]6.0968,[105]6.0968,[106]6.1132,[107]6.1125,[108]6.1257,[109]6.1208,[110]6.1173,[111]6.1394,[112]6.1595,[113]6.1616,[114]6.1578,[115]6.1637,[116]6.1548,[117]6.1596,[118]6.1877,[119]6.2092,[120]6.2432,[121]6.2577,[122]6.2818,[123]6.3180,[124]6.3350,[125]6.3260,[126]6.3639,[127]6.3994,[128]6.4290,[129]6.4143,[130]6.4225,[131]6.4188,[132]6.4116,[133]6.3988,[134]6.4088,[135]6.4048,[136]6.3944,[137]6.3872,[138]6.3698,[139]6.3597,[140]6.3562,[141]6.3275,[142]6.3241,[143]6.2945,[144]6.2745,[145]6.2656,[146]6.2541,[147]6.2576,[148]6.2580,[149]6.2528,[150]6.2486,[151]6.2506,[152]6.2410,[153]6.2254,[154]6.2170,[155]6.2237,[156]6.2190,[157]6.2355,[158]6.2396,[159]6.2440,[160]6.2466,[161]6.2584,[162]6.2308,[163]6.2197,[164]6.1968,[165]6.1670,[166]6.1406,[167]6.1045,[168]6.0748,[169]6.0614,[170]6.0508,[171]6.0248,[172]6.0083,[173]5.9923,[174]5.9631,[175]5.9419,[176]5.9307,[177]5.9112,[178]5.8890,[179]5.8725,[180]5.8632,[181]5.8422,[182]5.8248,[183]5.8115,[184]5.8107,[185]5.8036,[186]5.8047,[187]5.8108,[188]5.8071,[189]5.8239,[190]5.8247,[191]5.8453,[192]5.8610,[193]5.8772,[194]5.8879,[195]5.9087,[196]5.9240,[197]5.9445,[198]5.9593,[199]5.9623,[200]5.9671,[201]5.9619,[202]5.9801,[203]5.9872,[204]5.9855,[205]5.9956,[206]6.0024,[207]5.9986,[208]6.0069,[209]6.0108,[210]6.0159,[211]6.0265,[212]6.0334,[213]6.0436,[214]6.0458,[215]6.0483,[216]6.0622,[217]6.0800,[218]6.0927,[219]6.0925,[220]6.0890,[221]6.0840,[222]6.0818,[223]6.0725,[224]6.0655,[225]6.0617,[226]6.0818,[227]6.0896,[228]6.0948,[229]6.1008,[230]6.0976,[231]6.1139,[232]6.1026,[233]6.0866,[234]6.0722,[235]6.0523,[236]6.0459,[237]6.0366,[238]6.0393,[239]6.0249,[240]6.0151,[241]6.0169,[242]6.0206,[243]6.0189,[244]6.0079,[245]6.0050,[246]5.9942,[247]5.9829,[248]5.9759,[249]5.9735,[250]5.9781,[251]5.9713,[252]5.9681,[253]5.9587,[254]5.9536,[255]5.9429,[256]5.9255,[257]5.9139,[258]5.9060,[259]5.9038,[260]5.8959,[261]5.8918,[262]5.8864,[263]5.8813,[264]5.8592,[265]5.8587,[266]5.8569,[267]5.8505,[268]5.8591,[269]5.8572,[270]5.8583,[271]5.8658,[272]5.8691,[273]5.8694,[274]5.8719,[275]5.8799,[276]5.8858,[277]5.9012,[278]5.9110,[279]5.9202,[280]5.9230,[281]5.9326,[282]5.9383,[283]5.9527,[284]5.9605,[285]5.9688,[286]5.9823,[287]5.9818,[288]5.9875,[289]5.9795,[290]5.9642,[291]5.9497,[292]5.9353,[293]5.9224,[294]5.9246,[295]5.9238,[296]5.9284,[297]5.9271,[298]5.9300,[299]5.9276,[300]5.9171,[301]5.9172,[302]5.9096,[303]5.9013,[304]5.8931,[305]5.8897,[30
6]5.8775,[307]5.8797,[308]5.8827,[309]5.8674,[310]5.8621,[311]5.8559,[312]5.8581,[313]5.8526,[314]5.8510,[315]5.8357,[316]5.8306,[317]5.8148,[318]5.7952,[319]5.8067,[320]5.8187,[321]5.8231,[322]5.8192,[323]5.8126,[324]5.8099,[325]5.8199,[326]5.8201,[327]5.8222,[328]5.8260,[329]5.8318,[330]5.8344,[331]5.8465,[332]5.8437,[333]5.8504,[334]5.8451,[335]5.8393,[336]5.8430,[337]5.8409,[338]5.8402,[339]5.8352,[340]5.8311,[341]5.8389,[342]5.8417,[343]5.8464,[344]5.8465,[345]5.8470,[346]5.8446,[347]5.8487,[348]5.8520,[349]5.8543,[350]5.8511,[351]5.8520,[352]5.8520,[353]5.8463,[354]5.8464,[355]5.8514,[356]5.8544,[357]5.8510,[358]5.8598,[359]5.8624,[360]5.8591,[361]5.8587,[362]5.8656,[363]5.8765,[364]5.8824,[365]5.8875,[366]5.8888,[367]5.8972,[368]5.8949,[369]5.8958,[370]5.8972,[371]5.8921,[372]5.8968,[373]5.9013,[374]5.8998,[375]5.9000,[376]5.9065,[377]5.9022,[378]5.9049,[379]5.9106,[380]5.9029,[381]5.8996,[382]5.8946,[383]5.8940,[384]5.8935,[385]5.8925,[386]5.8920,[387]5.8919,[388]5.8883,[389]5.8833,[390]5.8766,[391]5.8692,[392]5.8654,[393]5.8638,[394]5.8663,[395]5.8651,[396]5.8581,[397]5.8649,[398]5.8686,[399]5.8762,[400]5.8764,[401]5.8777,[402]5.8787,[403]5.8806,[404]5.8870,[405]5.8776,[406]5.8744,[407]5.8740,[408]5.8757,[409]5.8869,[410]5.8976,[411]5.9087,[412]5.9241,[413]5.9349,[414]5.9423,[415]5.9477,[416]5.9553,[417]5.9671,[418]5.9705,[419]5.9771,[420]5.9857,[421]5.9970,[422]6.0010,[423]6.0080,[424]6.0184,[425]6.0268,[426]6.0331,[427]6.0375,[428]6.0456,[429]6.0505,[430]6.0586,[431]6.0723,[432]6.0760,[433]6.0753,[434]6.0713,[435]6.0722,[436]6.0747,[437]6.0841,[438]6.0914,[439]6.0883,[440]6.0875,[441]6.0826,[442]6.0811,[443]6.0824,[444]6.0829,[445]6.0811,[446]6.0834,[447]6.0863,[448]6.0904,[449]6.0881,[450]6.0889,[451]6.0850,[452]6.0716,[453]6.0632,[454]6.0576,[455]6.0586,[456]6.0632,[457]6.0651,[458]6.0630,[459]6.0636,[460]6.0720,[461]6.0693,[462]6.0679,[463]6.0717,[464]6.0706,[465]6.0680,[466]6.0604,[467]6.0606,[468]6.0603,[469]6.0623,[470]6.0627,[471]6.0580,[472]6.0623,[473]6.0572,[474]6.0583,[475]6.0523,[476]6.0539,[477]6.0468,[478]6.0457,[479]6.0512,[480]6.0556,[481]6.0573,[482]6.0530,[483]6.0489,[484]6.0509,[485]6.0488,[486]6.0431,[487]6.0428,[488]6.0405,[489]6.0359,[490]6.0336,[491]6.0307,[492]6.0252,[493]6.0226,[494]6.0209,[495]6.0204,[496]6.0166,[497]6.0111,[498]6.0094,[499]6.0053,[500]5.9962,[501]5.9897,[502]5.9899,[503]5.9893,[504]5.9808,[505]5.9829,[506]5.9837,[507]5.9780,[508]5.9741,[509]5.9735,[510]5.9769,[511]5.9815,[512]5.9849,[513]5.9869,[514]5.9930,[515]5.9877,[516]5.9867,[517]5.9878,[518]5.9875,[519]5.9904,[520]5.9929,[521]5.9941,[522]5.9967,[523]5.9974,[524]6.0030,[525]6.0061,[526]6.0070,[527]6.0088,[528]6.0038,[529]6.0043,[530]5.9995,[531]5.9984,[532]6.0029,[533]6.0052,[534]6.0036,[535]6.0057,[536]6.0004,[537]5.9984,[538]6.0033,[539]6.0044,[540]6.0080,[541]6.0083,[542]6.0094,[543]6.0110,[544]6.0121,[545]6.0102,[546]6.0110,[547]6.0070,[548]6.0024,[549]6.0025,[550]5.9996,[551]5.9963,[552]5.9941,[553]5.9906,[554]5.9886,[555]5.9856,[556]5.9852,[557]5.9875,[558]5.9837,[559]5.9834,[560]5.9833,[561]5.9835,[562]5.9814,[563]5.9810,[564]5.9853,[565]5.9873,[566]5.9871,[567]5.9850,[568]5.9856,[569]5.9843,[570]5.9871,[571]5.9876,[572]5.9886,[573]5.9886,[574]5.9851,[575]5.9844,[576]5.9843,[577]5.9829,[578]5.9811,[579]5.9816,[580]5.9753,[581]5.9717,[582]5.9707,[583]5.9715,[584]5.9718,[585]5.9644,[586]5.9577,[587]5.9583,[588]5.9631,[589]5.9682,[590]5.9712,[591]5.9733,[592]5.9721,[593]5.9690,[594]5.9700,[595]5.9677,[596]5.9709,[597]5.9689,[598]5.9660,[599]5.9681,[600]5.9676,[601]5.9661,[602]5
.9670,[603]5.9697,[604]5.9705,[605]5.9739,[606]5.9759,[607]5.9742,[608]5.9710,[609]5.9718,[610]5.9753,[611]5.9736,[612]5.9762,[613]5.9727,[614]5.9678,[615]5.9608,[616]5.9635,[617]5.9577,[618]5.9530,[619]5.9478,[620]5.9345,[621]5.9280,[622]5.9264,[623]5.9280,[624]5.9285,[625]5.9287,[626]5.9276,[627]5.9298,[628]5.9299,[629]5.9295,[630]5.9327,[631]5.9382,[632]5.9438,[633]5.9424,[634]5.9458,[635]5.9464,[636]5.9431,[637]5.9396,[638]5.9420,[639]5.9390,[640]5.9399,[641]5.9401,[642]5.9466,[643]5.9487,[644]5.9499,[645]5.9481,[646]5.9520,[647]5.9480,[648]5.9489,[649]5.9492,[650]5.9529,[651]5.9581,[652]5.9592,[653]5.9631,[654]5.9569,[655]5.9563,
llama_print_timings:        load time =  5233.52 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time = 1351323.55 ms / 335360 tokens (    4.03 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings:       total time = 1385072.75 ms
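
For reference, a minimal sketch of the two block layouts this PR distinguishes: Q8_0 carries only a scale, while Q8_1 (the renamed old format) additionally carries the partial sums mentioned in the review below. Field names, the use of plain `float` for the scale, and the exact meaning of `s0`/`s1` are assumptions for illustration, not copied from the final code.

```c
#include <stdint.h>

#define QK8_0 32
#define QK8_1 32

// Q8_0: 32 signed 8-bit quants plus a single scale, no offset or sum fields (sketch)
typedef struct {
    float  d;           // delta (scale)
    int8_t qs[QK8_0];   // quants
} block_q8_0;

// Q8_1: same quants, but with precomputed partial sums s0/s1 that Q4_1-style
// dot products can reuse (sketch; meaning of s0/s1 assumed)
typedef struct {
    float  d;           // delta (scale)
    float  s0;          // d * sum of the first  16 quants (assumed)
    float  s1;          // d * sum of the second 16 quants (assumed)
    int8_t qs[QK8_1];   // quants
} block_q8_1;
```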

@sw sw (Contributor) commented Apr 25, 2023

For AVX2/AVX/scalar, we might want to keep ggml_vec_dot_q4_0_q8_0 and ggml_vec_dot_q4_2_q8_0, so as not to waste cycles and memory for s0 and s1, which aren't used.

I'm actually surprised that they're worth using on ARM NEON, as the alternative is simply subtracting 8 from the Q4 quants.
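
For context, a rough scalar version of what a `ggml_vec_dot_q4_0_q8_0` computes: subtract the fixed offset of 8 from each Q4 nibble and accumulate against the Q8 quants. This assumes the `block_q8_0` layout sketched above and an interleaved nibble order (low nibble = even element); names and ordering are illustrative only.

```c
#include <stdint.h>

#define QK4_0 32

typedef struct {
    float   d;               // delta (scale)
    uint8_t qs[QK4_0 / 2];   // two 4-bit quants packed per byte
} block_q4_0;

// Scalar sketch of a Q4_0 x Q8_0 dot product over n values.
static void vec_dot_q4_0_q8_0_scalar(int n, float * s,
                                     const block_q4_0 * x, const block_q8_0 * y) {
    const int nb = n / QK8_0;
    float sumf = 0.0f;
    for (int i = 0; i < nb; i++) {
        int sumi = 0;
        for (int j = 0; j < QK8_0/2; j++) {
            const int v0 = (x[i].qs[j] & 0x0F) - 8; // low nibble, offset removed
            const int v1 = (x[i].qs[j] >>   4) - 8; // high nibble, offset removed
            sumi += v0 * y[i].qs[2*j] + v1 * y[i].qs[2*j + 1];
        }
        sumf += x[i].d * y[i].d * sumi;
    }
    *s = sumf;
}
```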

@ggerganov ggerganov (Owner, Author)

@sw there is no noticeable difference between the two. Still, changed to use Q8_0 as suggested.

@ggerganov ggerganov added the generation quality label Apr 25, 2023
@ggerganov ggerganov self-assigned this Apr 25, 2023
@sw sw (Contributor) commented Apr 25, 2023

I guess it's not finished? You're using block_q8_1 in ggml_vec_dot_q4_0_q8_0; it just happens to work but doesn't do what it should. Maybe we need a field in quantize_fns to indicate the quantization type for the dot product, which can then be used instead of hard-coding GGML_TYPE_SIZE[GGML_TYPE_Q8_1] etc.
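
One way to express that suggestion, purely illustrative with names assumed, would be to extend the per-type function table with the quantization type expected by the dot product, so callers can look up `GGML_TYPE_SIZE[vec_dot_type]` instead of hard-coding `GGML_TYPE_Q8_1`:

```c
typedef void (*dequantize_row_q_t)(const void * x, float * y, int k);
typedef void (*quantize_row_q_t)  (const float * x, void * y, int k);
typedef void (*vec_dot_q_t)       (int n, float * s, const void * x, const void * y);

// Hypothetical extension of quantize_fns_t: vec_dot_type records which format
// quantize_row_q_dot produces (e.g. GGML_TYPE_Q8_0 vs GGML_TYPE_Q8_1).
typedef struct {
    dequantize_row_q_t dequantize_row_q;
    quantize_row_q_t   quantize_row_q;
    quantize_row_q_t   quantize_row_q_reference;
    quantize_row_q_t   quantize_row_q_dot;
    vec_dot_q_t        vec_dot_q;
    enum ggml_type     vec_dot_type;   // quantization type used for the dot product
} quantize_fns_t;
```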

@ggerganov ggerganov (Owner, Author)

Wow - this is difficult 😄 I keep messing something up

@sw sw (Contributor) commented Apr 25, 2023

Looks good now; I think it's very slightly slower for Q4_0 and Q4_2 because we're now missing the SIMD optimizations for quantize_row_q8_0.
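
As a rough sketch of what the scalar fallback for `quantize_row_q8_0` computes, and what dedicated SIMD paths would speed up: find the per-block absolute maximum, derive the scale so quants fit in [-127, 127], then round each value to int8. Names and the rounding mode are assumptions for illustration.

```c
#include <math.h>
#include <stdint.h>

// Scalar sketch of row quantization to Q8_0 blocks of QK8_0 values.
static void quantize_row_q8_0_scalar(const float * x, block_q8_0 * y, int k) {
    const int nb = k / QK8_0;
    for (int i = 0; i < nb; i++) {
        float amax = 0.0f;                       // absolute max in the block
        for (int j = 0; j < QK8_0; j++) {
            const float v = fabsf(x[i*QK8_0 + j]);
            if (v > amax) amax = v;
        }
        const float d  = amax / 127.0f;          // scale
        const float id = (d != 0.0f) ? 1.0f/d : 0.0f;
        y[i].d = d;
        for (int j = 0; j < QK8_0; j++) {
            y[i].qs[j] = (int8_t) roundf(x[i*QK8_0 + j] * id);
        }
    }
}
```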

@ggerganov ggerganov (Owner, Author)

Ok, will merge now and we can finish the AVX stuff from master

@ggerganov ggerganov merged commit 7a32fcb into master Apr 25, 2023
@ggerganov ggerganov deleted the q8_0 branch April 25, 2023 20:40
@mofosyne mofosyne added the Tensor Encoding Scheme and Review Complexity : High labels May 25, 2024