ggml : use 8-bit precision for Q4_1 intermediate results #1047

ggerganov · 2023-04-18T19:15:09Z

This is the same as #951 but for Q4_1

Also, in this PR we will retire the old ggml_vec_dot_q4_0() and ggml_vec_dot_q4_1() as they are no longer used.

Please send PRs with AVX implementations into this branch.
Will merge when we have:

Perplexity

Without BLAS 655 iter: 6.1299

$  make clean && LLAMA_NO_ACCELERATE=1 make -j perplexity && time ./perplexity -m ./models/7B/ggml-model-q4_1.bin -f ./build/wiki.test.raw -t 8 > ppl-q4_1.txt
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

rm -vf *.o main quantize quantize-stats perplexity embedding benchmark-q4_0-matmult
common.o
ggml.o
llama.o
perplexity
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS:  
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

cc  -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread   -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -c llama.cpp -o llama.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -c examples/common.cpp -o common.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread examples/perplexity/perplexity.cpp ggml.o llama.o common.o -o perplexity 
main: seed = 1681899573
llama.cpp: loading model from ./models/7B/ggml-model-q4_1.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 3 (mostly Q4_1)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 6612.57 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | 
perplexity : calculating perplexity over 655 chunks, batch_size=512
28.30 seconds per pass - ETA 5.15 hours
[1]4.4313,[2]4.8883,[3]5.7758,[4]6.3821,[5]6.4915,[6]6.4625,[7]6.6557,[8]6.7585,[9]7.0848,[10]7.3396,[11]7.5613,[12]7.6055,[13]7.5332,[14]7.5968,[15]7.8425,[16]7.4423,[17]7.3195,[18]7.2629,[19]6.8898,[20]6.8689,[21]6.7725,[22]6.5999,[23]6.5705,[24]6.4810,[25]6.4837,[26]6.3192,[27]6.1387,[28]6.0324,[29]5.9425,[30]5.7805,[31]5.7512,[32]5.7688,[33]5.7112,[34]5.7445,[35]5.7666,[36]5.8064,[37]5.8099,[38]5.8132,[39]5.8480,[40]5.8961,[41]5.9090,[42]5.9498,[43]5.9092,[44]5.9677,[45]5.9746,[46]5.9474,[47]5.9666,[48]5.9401,[49]5.9386,[50]5.8976,[51]5.8918,[52]5.8798,[53]5.9284,[54]5.9114,[55]5.8900,[56]5.9181,[57]5.9368,[58]5.9557,[59]5.9743,[60]6.0175,[61]6.0063,[62]6.0634,[63]6.0934,[64]6.1038,[65]6.1485,[66]6.1585,[67]6.1774,[68]6.1907,[69]6.2145,[70]6.2428,[71]6.2632,[72]6.2947,[73]6.3515,[74]6.3548,[75]6.3700,[76]6.3830,[77]6.3949,[78]6.3815,[79]6.4100,[80]6.4041,[81]6.4208,[82]6.4263,[83]6.3739,[84]6.3583,[85]6.3460,[86]6.3247,[87]6.2633,[88]6.2405,[89]6.2192,[90]6.2036,[91]6.2271,[92]6.2207,[93]6.2198,[94]6.2165,[95]6.2455,[96]6.2445,[97]6.2390,[98]6.2326,[99]6.2180,[100]6.2162,[101]6.2413,[102]6.2352,[103]6.2548,[104]6.2631,[105]6.2622,[106]6.2788,[107]6.2791,[108]6.2908,[109]6.2846,[110]6.2808,[111]6.3024,[112]6.3227,[113]6.3264,[114]6.3228,[115]6.3284,[116]6.3189,[117]6.3242,[118]6.3518,[119]6.3745,[120]6.4104,[121]6.4254,[122]6.4494,[123]6.4867,[124]6.5053,[125]6.4956,[126]6.5352,[127]6.5719,[128]6.6040,[129]6.5878,[130]6.5977,[131]6.5938,[132]6.5850,[133]6.5724,[134]6.5820,[135]6.5786,[136]6.5672,[137]6.5602,[138]6.5438,[139]6.5329,[140]6.5288,[141]6.5002,[142]6.4980,[143]6.4698,[144]6.4487,[145]6.4407,[146]6.4290,[147]6.4331,[148]6.4334,[149]6.4290,[150]6.4250,[151]6.4278,[152]6.4168,[153]6.4012,[154]6.3925,[155]6.3989,[156]6.3939,[157]6.4102,[158]6.4135,[159]6.4189,[160]6.4221,[161]6.4343,[162]6.4067,[163]6.3953,[164]6.3718,[165]6.3405,[166]6.3133,[167]6.2753,[168]6.2451,[169]6.2323,[170]6.2209,[171]6.1949,[172]6.1779,[173]6.1621,[174]6.1323,[175]6.1114,[176]6.0998,[177]6.0801,[178]6.0570,[179]6.0405,[180]6.0304,[181]6.0089,[182]5.9916,[183]5.9782,[184]5.9776,[185]5.9704,[186]5.9712,[187]5.9768,[188]5.9726,[189]5.9906,[190]5.9922,[191]6.0134,[192]6.0290,[193]6.0458,[194]6.0574,[195]6.0792,[196]6.0952,[197]6.1161,[198]6.1320,[199]6.1352,[200]6.1404,[201]6.1358,[202]6.1547,[203]6.1625,[204]6.1615,[205]6.1724,[206]6.1794,[207]6.1761,[208]6.1847,[209]6.1892,[210]6.1935,[211]6.2044,[212]6.2125,[213]6.2226,[214]6.2259,[215]6.2284,[216]6.2425,[217]6.2612,[218]6.2752,[219]6.2757,[220]6.2718,[221]6.2658,[222]6.2637,[223]6.2535,[224]6.2466,[225]6.2427,[226]6.2631,[227]6.2719,[228]6.2777,[229]6.2835,[230]6.2804,[231]6.2968,[232]6.2851,[233]6.2684,[234]6.2531,[235]6.2343,[236]6.2281,[237]6.2183,[238]6.2205,[239]6.2054,[240]6.1947,[241]6.1971,[242]6.2000,[243]6.1979,[244]6.1868,[245]6.1838,[246]6.1730,[247]6.1610,[248]6.1535,[249]6.1500,[250]6.1543,[251]6.1474,[252]6.1433,[253]6.1340,[254]6.1294,[255]6.1184,[256]6.1005,[257]6.0879,[258]6.0795,[259]6.0773,[260]6.0690,[261]6.0645,[262]6.0590,[263]6.0531,[264]6.0325,[265]6.0318,[266]6.0305,[267]6.0238,[268]6.0318,[269]6.0307,[270]6.0305,[271]6.0384,[272]6.0418,[273]6.0421,[274]6.0445,[275]6.0532,[276]6.0589,[277]6.0740,[278]6.0839,[279]6.0926,[280]6.0952,[281]6.1055,[282]6.1113,[283]6.1264,[284]6.1340,[285]6.1419,[286]6.1548,[287]6.1541,[288]6.1600,[289]6.1512,[290]6.1353,[291]6.1197,[292]6.1047,[293]6.0918,[294]6.0938,[295]6.0931,[296]6.0981,[297]6.0975,[298]6.1011,[299]6.0988,[300]6.0877,[301]6.0873,[302]6.0796,[303]6.0707,[304]6.0621,[305]6.0585,[306]6.0461,[307]6.0482,[308]6.0511,[309]6.0351,[310]6.0293,[311]6.0231,[312]6.0253,[313]6.0194,[314]6.0178,[315]6.0021,[316]5.9974,[317]5.9811,[318]5.9607,[319]5.9725,[320]5.9848,[321]5.9889,[322]5.9848,[323]5.9779,[324]5.9746,[325]5.9855,[326]5.9855,[327]5.9876,[328]5.9909,[329]5.9966,[330]5.9994,[331]6.0116,[332]6.0088,[333]6.0160,[334]6.0103,[335]6.0039,[336]6.0071,[337]6.0048,[338]6.0038,[339]5.9985,[340]5.9945,[341]6.0024,[342]6.0052,[343]6.0099,[344]6.0101,[345]6.0102,[346]6.0072,[347]6.0112,[348]6.0149,[349]6.0172,[350]6.0143,[351]6.0152,[352]6.0152,[353]6.0090,[354]6.0094,[355]6.0148,[356]6.0178,[357]6.0147,[358]6.0240,[359]6.0264,[360]6.0234,[361]6.0229,[362]6.0298,[363]6.0409,[364]6.0473,[365]6.0523,[366]6.0541,[367]6.0628,[368]6.0601,[369]6.0613,[370]6.0631,[371]6.0578,[372]6.0629,[373]6.0674,[374]6.0660,[375]6.0661,[376]6.0728,[377]6.0683,[378]6.0707,[379]6.0766,[380]6.0688,[381]6.0654,[382]6.0609,[383]6.0600,[384]6.0596,[385]6.0585,[386]6.0582,[387]6.0583,[388]6.0546,[389]6.0495,[390]6.0430,[391]6.0353,[392]6.0309,[393]6.0295,[394]6.0323,[395]6.0309,[396]6.0235,[397]6.0301,[398]6.0340,[399]6.0416,[400]6.0412,[401]6.0427,[402]6.0439,[403]6.0458,[404]6.0521,[405]6.0431,[406]6.0400,[407]6.0397,[408]6.0415,[409]6.0531,[410]6.0642,[411]6.0756,[412]6.0916,[413]6.1027,[414]6.1103,[415]6.1154,[416]6.1232,[417]6.1354,[418]6.1388,[419]6.1462,[420]6.1553,[421]6.1667,[422]6.1707,[423]6.1776,[424]6.1881,[425]6.1968,[426]6.2035,[427]6.2081,[428]6.2163,[429]6.2218,[430]6.2298,[431]6.2436,[432]6.2477,[433]6.2469,[434]6.2424,[435]6.2434,[436]6.2459,[437]6.2558,[438]6.2633,[439]6.2601,[440]6.2590,[441]6.2542,[442]6.2522,[443]6.2532,[444]6.2538,[445]6.2517,[446]6.2539,[447]6.2569,[448]6.2612,[449]6.2588,[450]6.2596,[451]6.2557,[452]6.2435,[453]6.2352,[454]6.2294,[455]6.2301,[456]6.2354,[457]6.2376,[458]6.2356,[459]6.2362,[460]6.2447,[461]6.2420,[462]6.2406,[463]6.2449,[464]6.2436,[465]6.2409,[466]6.2335,[467]6.2341,[468]6.2339,[469]6.2362,[470]6.2367,[471]6.2320,[472]6.2370,[473]6.2317,[474]6.2330,[475]6.2273,[476]6.2292,[477]6.2222,[478]6.2213,[479]6.2270,[480]6.2313,[481]6.2331,[482]6.2285,[483]6.2244,[484]6.2261,[485]6.2242,[486]6.2181,[487]6.2178,[488]6.2158,[489]6.2110,[490]6.2088,[491]6.2061,[492]6.2005,[493]6.1977,[494]6.1958,[495]6.1956,[496]6.1919,[497]6.1862,[498]6.1847,[499]6.1802,[500]6.1708,[501]6.1644,[502]6.1644,[503]6.1639,[504]6.1550,[505]6.1573,[506]6.1581,[507]6.1528,[508]6.1489,[509]6.1483,[510]6.1519,[511]6.1567,[512]6.1605,[513]6.1624,[514]6.1687,[515]6.1634,[516]6.1626,[517]6.1636,[518]6.1632,[519]6.1664,[520]6.1685,[521]6.1699,[522]6.1727,[523]6.1736,[524]6.1794,[525]6.1827,[526]6.1836,[527]6.1851,[528]6.1801,[529]6.1807,[530]6.1755,[531]6.1738,[532]6.1787,[533]6.1810,[534]6.1795,[535]6.1816,[536]6.1764,[537]6.1742,[538]6.1793,[539]6.1802,[540]6.1838,[541]6.1840,[542]6.1847,[543]6.1863,[544]6.1874,[545]6.1853,[546]6.1861,[547]6.1823,[548]6.1774,[549]6.1771,[550]6.1744,[551]6.1707,[552]6.1684,[553]6.1647,[554]6.1625,[555]6.1594,[556]6.1589,[557]6.1611,[558]6.1573,[559]6.1571,[560]6.1570,[561]6.1575,[562]6.1551,[563]6.1548,[564]6.1594,[565]6.1616,[566]6.1616,[567]6.1597,[568]6.1601,[569]6.1586,[570]6.1614,[571]6.1617,[572]6.1623,[573]6.1619,[574]6.1585,[575]6.1580,[576]6.1579,[577]6.1560,[578]6.1538,[579]6.1539,[580]6.1477,[581]6.1439,[582]6.1431,[583]6.1440,[584]6.1442,[585]6.1368,[586]6.1299,[587]6.1305,[588]6.1352,[589]6.1408,[590]6.1437,[591]6.1459,[592]6.1446,[593]6.1413,[594]6.1424,[595]6.1400,[596]6.1434,[597]6.1412,[598]6.1387,[599]6.1409,[600]6.1409,[601]6.1397,[602]6.1416,[603]6.1440,[604]6.1450,[605]6.1487,[606]6.1508,[607]6.1492,[608]6.1456,[609]6.1461,[610]6.1497,[611]6.1482,[612]6.1508,[613]6.1471,[614]6.1424,[615]6.1349,[616]6.1375,[617]6.1314,[618]6.1265,[619]6.1210,[620]6.1071,[621]6.1003,[622]6.0988,[623]6.1005,[624]6.1010,[625]6.1011,[626]6.1002,[627]6.1028,[628]6.1029,[629]6.1025,[630]6.1055,[631]6.1111,[632]6.1169,[633]6.1153,[634]6.1188,[635]6.1193,[636]6.1159,[637]6.1124,[638]6.1150,[639]6.1118,[640]6.1128,[641]6.1129,[642]6.1194,[643]6.1213,[644]6.1224,[645]6.1206,[646]6.1249,[647]6.1211,[648]6.1223,[649]6.1224,[650]6.1265,[651]6.1319,[652]6.1331,[653]6.1369,[654]6.1306,[655]6.1299,

With BLAS 655 iter: 6.1286

$  make clean && make -j perplexity && time ./perplexity -m ./models/7B/ggml-model-q4_1.bin -f ./build/wiki.test.raw -t 8 > ppl-q4_1-blas.txt
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

rm -vf *.o main quantize quantize-stats perplexity embedding benchmark-q4_0-matmult
common.o
ggml.o
llama.o
perplexity
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

cc  -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -DGGML_USE_ACCELERATE   -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -c llama.cpp -o llama.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -c examples/common.cpp -o common.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread examples/perplexity/perplexity.cpp ggml.o llama.o common.o -o perplexity  -framework Accelerate
main: seed = 1681916726
llama.cpp: loading model from ./models/7B/ggml-model-q4_1.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 3 (mostly Q4_1)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 6612.57 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
perplexity : calculating perplexity over 655 chunks, batch_size=512
7.91 seconds per pass - ETA 1.44 hours
[1]4.4308,[2]4.8847,[3]5.7740,[4]6.3800,[5]6.4911,[6]6.4631,[7]6.6541,[8]6.7572,[9]7.0836,[10]7.3390,[11]7.5610,[12]7.6048,[13]7.5321,[14]7.5962,[15]7.8405,[16]7.4406,[17]7.3183,[18]7.2617,[19]6.8885,[20]6.8674,[21]6.7698,[22]6.5975,[23]6.5677,[24]6.4785,[25]6.4809,[26]6.3162,[27]6.1360,[28]6.0295,[29]5.9399,[30]5.7778,[31]5.7484,[32]5.7660,[33]5.7090,[34]5.7425,[35]5.7645,[36]5.8048,[37]5.8079,[38]5.8115,[39]5.8458,[40]5.8938,[41]5.9065,[42]5.9472,[43]5.9068,[44]5.9654,[45]5.9723,[46]5.9451,[47]5.9644,[48]5.9380,[49]5.9366,[50]5.8957,[51]5.8900,[52]5.8782,[53]5.9265,[54]5.9097,[55]5.8882,[56]5.9164,[57]5.9353,[58]5.9541,[59]5.9727,[60]6.0158,[61]6.0046,[62]6.0618,[63]6.0918,[64]6.1022,[65]6.1470,[66]6.1571,[67]6.1761,[68]6.1893,[69]6.2129,[70]6.2411,[71]6.2615,[72]6.2929,[73]6.3495,[74]6.3530,[75]6.3681,[76]6.3810,[77]6.3931,[78]6.3793,[79]6.4079,[80]6.4020,[81]6.4183,[82]6.4240,[83]6.3717,[84]6.3560,[85]6.3436,[86]6.3223,[87]6.2610,[88]6.2381,[89]6.2170,[90]6.2015,[91]6.2251,[92]6.2187,[93]6.2179,[94]6.2147,[95]6.2436,[96]6.2427,[97]6.2371,[98]6.2307,[99]6.2162,[100]6.2145,[101]6.2394,[102]6.2334,[103]6.2530,[104]6.2612,[105]6.2602,[106]6.2769,[107]6.2771,[108]6.2885,[109]6.2822,[110]6.2784,[111]6.3000,[112]6.3202,[113]6.3238,[114]6.3201,[115]6.3257,[116]6.3163,[117]6.3216,[118]6.3493,[119]6.3719,[120]6.4077,[121]6.4227,[122]6.4468,[123]6.4841,[124]6.5027,[125]6.4930,[126]6.5325,[127]6.5694,[128]6.6014,[129]6.5853,[130]6.5951,[131]6.5913,[132]6.5826,[133]6.5700,[134]6.5797,[135]6.5762,[136]6.5648,[137]6.5579,[138]6.5414,[139]6.5304,[140]6.5264,[141]6.4978,[142]6.4955,[143]6.4675,[144]6.4465,[145]6.4386,[146]6.4269,[147]6.4311,[148]6.4314,[149]6.4270,[150]6.4231,[151]6.4259,[152]6.4150,[153]6.3994,[154]6.3907,[155]6.3970,[156]6.3920,[157]6.4084,[158]6.4117,[159]6.4171,[160]6.4203,[161]6.4325,[162]6.4049,[163]6.3934,[164]6.3699,[165]6.3385,[166]6.3115,[167]6.2735,[168]6.2433,[169]6.2304,[170]6.2190,[171]6.1931,[172]6.1760,[173]6.1603,[174]6.1305,[175]6.1096,[176]6.0980,[177]6.0784,[178]6.0552,[179]6.0386,[180]6.0286,[181]6.0071,[182]5.9898,[183]5.9765,[184]5.9758,[185]5.9686,[186]5.9693,[187]5.9750,[188]5.9708,[189]5.9889,[190]5.9905,[191]6.0117,[192]6.0273,[193]6.0441,[194]6.0557,[195]6.0774,[196]6.0933,[197]6.1143,[198]6.1301,[199]6.1332,[200]6.1385,[201]6.1340,[202]6.1529,[203]6.1606,[204]6.1597,[205]6.1705,[206]6.1775,[207]6.1743,[208]6.1828,[209]6.1873,[210]6.1917,[211]6.2026,[212]6.2107,[213]6.2208,[214]6.2241,[215]6.2266,[216]6.2407,[217]6.2594,[218]6.2734,[219]6.2738,[220]6.2700,[221]6.2641,[222]6.2619,[223]6.2516,[224]6.2448,[225]6.2409,[226]6.2612,[227]6.2700,[228]6.2757,[229]6.2816,[230]6.2784,[231]6.2948,[232]6.2832,[233]6.2666,[234]6.2512,[235]6.2324,[236]6.2262,[237]6.2164,[238]6.2186,[239]6.2035,[240]6.1928,[241]6.1952,[242]6.1982,[243]6.1961,[244]6.1850,[245]6.1820,[246]6.1712,[247]6.1593,[248]6.1518,[249]6.1483,[250]6.1526,[251]6.1456,[252]6.1416,[253]6.1323,[254]6.1277,[255]6.1167,[256]6.0988,[257]6.0862,[258]6.0778,[259]6.0755,[260]6.0673,[261]6.0628,[262]6.0573,[263]6.0514,[264]6.0309,[265]6.0302,[266]6.0290,[267]6.0222,[268]6.0302,[269]6.0290,[270]6.0289,[271]6.0368,[272]6.0402,[273]6.0405,[274]6.0428,[275]6.0515,[276]6.0572,[277]6.0723,[278]6.0822,[279]6.0909,[280]6.0936,[281]6.1039,[282]6.1096,[283]6.1247,[284]6.1323,[285]6.1403,[286]6.1531,[287]6.1523,[288]6.1583,[289]6.1495,[290]6.1336,[291]6.1180,[292]6.1030,[293]6.0901,[294]6.0921,[295]6.0914,[296]6.0964,[297]6.0957,[298]6.0993,[299]6.0970,[300]6.0860,[301]6.0855,[302]6.0779,[303]6.0689,[304]6.0603,[305]6.0568,[306]6.0444,[307]6.0465,[308]6.0493,[309]6.0333,[310]6.0275,[311]6.0213,[312]6.0236,[313]6.0177,[314]6.0161,[315]6.0005,[316]5.9958,[317]5.9795,[318]5.9590,[319]5.9709,[320]5.9831,[321]5.9872,[322]5.9831,[323]5.9762,[324]5.9729,[325]5.9839,[326]5.9838,[327]5.9859,[328]5.9892,[329]5.9949,[330]5.9978,[331]6.0099,[332]6.0072,[333]6.0143,[334]6.0086,[335]6.0023,[336]6.0055,[337]6.0032,[338]6.0022,[339]5.9969,[340]5.9928,[341]6.0007,[342]6.0036,[343]6.0082,[344]6.0084,[345]6.0085,[346]6.0055,[347]6.0095,[348]6.0133,[349]6.0155,[350]6.0127,[351]6.0135,[352]6.0135,[353]6.0073,[354]6.0077,[355]6.0130,[356]6.0160,[357]6.0129,[358]6.0221,[359]6.0245,[360]6.0216,[361]6.0211,[362]6.0280,[363]6.0391,[364]6.0455,[365]6.0505,[366]6.0523,[367]6.0610,[368]6.0583,[369]6.0595,[370]6.0613,[371]6.0560,[372]6.0611,[373]6.0657,[374]6.0642,[375]6.0644,[376]6.0711,[377]6.0666,[378]6.0690,[379]6.0749,[380]6.0671,[381]6.0638,[382]6.0593,[383]6.0584,[384]6.0579,[385]6.0569,[386]6.0566,[387]6.0567,[388]6.0531,[389]6.0479,[390]6.0414,[391]6.0337,[392]6.0294,[393]6.0280,[394]6.0308,[395]6.0294,[396]6.0220,[397]6.0287,[398]6.0325,[399]6.0402,[400]6.0398,[401]6.0412,[402]6.0425,[403]6.0444,[404]6.0507,[405]6.0416,[406]6.0386,[407]6.0382,[408]6.0401,[409]6.0516,[410]6.0627,[411]6.0741,[412]6.0901,[413]6.1011,[414]6.1088,[415]6.1139,[416]6.1217,[417]6.1339,[418]6.1373,[419]6.1446,[420]6.1538,[421]6.1652,[422]6.1692,[423]6.1761,[424]6.1866,[425]6.1953,[426]6.2019,[427]6.2066,[428]6.2148,[429]6.2203,[430]6.2282,[431]6.2421,[432]6.2461,[433]6.2453,[434]6.2408,[435]6.2418,[436]6.2443,[437]6.2542,[438]6.2618,[439]6.2585,[440]6.2574,[441]6.2526,[442]6.2506,[443]6.2516,[444]6.2523,[445]6.2501,[446]6.2523,[447]6.2553,[448]6.2596,[449]6.2572,[450]6.2580,[451]6.2541,[452]6.2419,[453]6.2336,[454]6.2278,[455]6.2285,[456]6.2337,[457]6.2359,[458]6.2339,[459]6.2346,[460]6.2430,[461]6.2403,[462]6.2390,[463]6.2433,[464]6.2420,[465]6.2393,[466]6.2319,[467]6.2326,[468]6.2324,[469]6.2347,[470]6.2352,[471]6.2306,[472]6.2356,[473]6.2303,[474]6.2316,[475]6.2259,[476]6.2278,[477]6.2208,[478]6.2198,[479]6.2255,[480]6.2299,[481]6.2317,[482]6.2271,[483]6.2230,[484]6.2247,[485]6.2228,[486]6.2167,[487]6.2164,[488]6.2145,[489]6.2096,[490]6.2075,[491]6.2048,[492]6.1992,[493]6.1964,[494]6.1945,[495]6.1943,[496]6.1906,[497]6.1850,[498]6.1835,[499]6.1790,[500]6.1696,[501]6.1632,[502]6.1632,[503]6.1627,[504]6.1538,[505]6.1561,[506]6.1569,[507]6.1516,[508]6.1477,[509]6.1471,[510]6.1507,[511]6.1555,[512]6.1592,[513]6.1611,[514]6.1675,[515]6.1621,[516]6.1614,[517]6.1623,[518]6.1619,[519]6.1652,[520]6.1672,[521]6.1686,[522]6.1714,[523]6.1722,[524]6.1781,[525]6.1814,[526]6.1822,[527]6.1838,[528]6.1788,[529]6.1794,[530]6.1741,[531]6.1725,[532]6.1774,[533]6.1796,[534]6.1781,[535]6.1803,[536]6.1751,[537]6.1729,[538]6.1780,[539]6.1789,[540]6.1825,[541]6.1827,[542]6.1834,[543]6.1850,[544]6.1860,[545]6.1840,[546]6.1848,[547]6.1809,[548]6.1761,[549]6.1758,[550]6.1731,[551]6.1693,[552]6.1671,[553]6.1634,[554]6.1612,[555]6.1581,[556]6.1576,[557]6.1598,[558]6.1560,[559]6.1558,[560]6.1557,[561]6.1562,[562]6.1538,[563]6.1535,[564]6.1581,[565]6.1603,[566]6.1603,[567]6.1584,[568]6.1587,[569]6.1572,[570]6.1601,[571]6.1604,[572]6.1610,[573]6.1606,[574]6.1572,[575]6.1567,[576]6.1566,[577]6.1547,[578]6.1525,[579]6.1526,[580]6.1464,[581]6.1426,[582]6.1418,[583]6.1427,[584]6.1430,[585]6.1355,[586]6.1287,[587]6.1293,[588]6.1339,[589]6.1395,[590]6.1425,[591]6.1446,[592]6.1434,[593]6.1401,[594]6.1411,[595]6.1387,[596]6.1421,[597]6.1400,[598]6.1375,[599]6.1396,[600]6.1396,[601]6.1384,[602]6.1402,[603]6.1427,[604]6.1437,[605]6.1474,[606]6.1495,[607]6.1479,[608]6.1443,[609]6.1448,[610]6.1484,[611]6.1469,[612]6.1495,[613]6.1458,[614]6.1411,[615]6.1335,[616]6.1362,[617]6.1301,[618]6.1252,[619]6.1197,[620]6.1058,[621]6.0990,[622]6.0975,[623]6.0992,[624]6.0997,[625]6.0998,[626]6.0989,[627]6.1015,[628]6.1016,[629]6.1012,[630]6.1042,[631]6.1098,[632]6.1156,[633]6.1140,[634]6.1175,[635]6.1180,[636]6.1145,[637]6.1111,[638]6.1137,[639]6.1104,[640]6.1115,[641]6.1116,[642]6.1181,[643]6.1200,[644]6.1211,[645]6.1193,[646]6.1236,[647]6.1198,[648]6.1209,[649]6.1211,[650]6.1252,[651]6.1306,[652]6.1317,[653]6.1356,[654]6.1293,[655]6.1286,

llama_print_timings:        load time =  8467.50 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time = 4915033.47 ms / 335360 tokens (   14.66 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings:       total time = 4948330.37 ms

real	82m28.799s
user	125m40.303s
sys	4m5.800s

dfyz · 2023-04-18T23:28:49Z

@ggerganov

AVX512 (optional)

As demonstrated by ggml_vec_dot_q4_0(), I think that an AVX-512 implementation of quantized dot product can provide a significant boost over AVX2, but it might require some careful tuning (for instance, the results from my initial experiments with ggml_vec_dot_q4_0_q8_0() are not as promising as they were for ggml_vec_dot_q4_0()). It also add a non-trivial maintenance burden, since lots of smart people are constantly optimizing the AVX2 code paths. So they either have to find a way to port their improvements to AVX-512 implementation every time (which might be hard to test because AVX-512 hardware is relatively rare), or the AVX-512 implementation will lag behind (this is exactly what happened to ggml_vec_dot_q4_0()).

So I think AVX-512-wise it might be better to focus on the "default" quantized dot product function (e.g., the one used in the quantization method recommended in the README file for converting models), so that most users get the speedup, and the maintenance burden is not too bad.

Which quantization method do you think is more likely to become the new default? I think we now have Q4_0, Q4_1, Q4_2, and at this point I'm not really sure which one is the best in the long run.

dfyz · 2023-04-18T23:34:30Z

in this PR we will retire the old ggml_vec_dot_q4_0()

Do you think that maybe this one can be dropped right now in a separate PR, without waiting for SIMD implementations of ggml_vec_dot_q4_1_q8_0()? ggml_vec_dot_q4_0() is currently already unused and IMO only clutters the code.

ggerganov · 2023-04-19T05:03:00Z

in this PR we will retire the old ggml_vec_dot_q4_0()

Do you think that maybe this one can be dropped right now in a separate PR, without waiting for SIMD implementations of ggml_vec_dot_q4_1_q8_0()? ggml_vec_dot_q4_0() is currently already unused and IMO only clutters the code.

Is the suggestion to keep ggml_vec_dot_q4_0() so that AVX512 implementation can be improved more easily and later ported to the q4_x_q8_0() calls?

Because, atm AVX512 is not used since ggml_vec_dot_q4_0() is no longer used (even on master).

Which quantization method do you think is more likely to become the new default?

I think it is likely that we will end up using Q4_3 (when it gets added), but need to implement it and get the perplexity numbers first

dfyz · 2023-04-19T10:55:01Z

Is the suggestion to keep ggml_vec_dot_q4_0() so that AVX512 implementation can be improved more easily and later ported to the q4_x_q8_0() calls?

Uh, no, the exact opposite actually. :) The suggestion is to remove ggml_vec_dot_q4_0() along with the AVX-512 implementation, because, as you say, ggml_vec_dot_q4_0() is completely unused on master. You have already removed ggml_vec_dot_q4_0() in this PR, I'm just saying the removal is unrelated to Q4_1 quantization and can be done in a separate PR right now.

After the removal of ggml_vec_dot_q4_0(), a new AVX-512 implementation for the best q4_x_q8_0() method available can be added whenever it is ready (and shows improvements over AVX2).

I think it is likely that we will end up using Q4_3 (when it gets added)

Thanks! So a Q4_3 block would look like this?

#define QK4_3 16
typedef struct {
    ggml_fp16_t   d;
    ggml_fp16_t   m;
    uint8_t qs[QK4_3 / 2];
} block_q4_3;

56 ms/token with Q4_1 !

ggerganov · 2023-04-19T16:40:13Z

Sorry - had to rebase on top of master. It was becoming a bit messy
Hope didn't make you work harder if someone has been working on this

So, shall we merge it like this and proceed with Q4_3 from master?

teaalltr · 2023-04-19T18:09:26Z

This is the AVX version, if you trust ChatGPT 😄 (need to add that into an #if defined of course)

static void ggml_vec_dot_q4_1_q8_0(const int n, float * restrict s, const void * restrict vx, const void * restrict vy) {
    const int nb = n / QK8_0;

    //assert(n % QK8_0 == 0);
    //assert(nb % 2 == 0);

    const block_q4_1 * restrict x = vx;
    const block_q8_0 * restrict y = vy;

    float sumf = 0.0;

    __m256 sum = _mm256_setzero_ps();

    for (int i = 0; i < nb; i++) {
        const float d0 = x[i].d;
        const float m0 = x[i].m;
        const float d1 = y[i].d;

        const uint8_t * restrict p0 = x[i].qs;
        const int8_t * restrict p1 = y[i].qs;

        for (int j = 0; j < QK8_0/16; j++) {
            const __m128i v0 = _mm_loadu_si128((__m128i const*)(p0 + j*16));
            const __m256 f0 = _mm256_cvtepi32_ps(_mm256_cvtepu8_epi32(v0 & _mm_set1_epi32(0x0f)));
            const __m256 f1 = _mm256_cvtepi32_ps(_mm256_cvtepu8_epi32(_mm_srli_epi32(v0, 4)));

            const __m256 f2 = _mm256_cvtepi32_ps(_mm256_cvtepi8_epi32(_mm_loadu_si128((__m128i const*)(p1 + 2*j*8))));
            const __m256 f3 = _mm256_cvtepi32_ps(_mm256_cvtepi8_epi32(_mm_loadu_si128((__m128i const*)(p1 + (2*j+1)*8))));

            const __m256 f23 = _mm256_permute2f128_ps(f2, f3, 0x21);
            const __m256 p = _mm256_mul_ps(_mm256_add_ps(_mm256_mul_ps(f0, _mm256_set1_ps(d0)), _mm256_set1_ps(m0)), _mm256_mul_ps(_mm256_set1_ps(d1), f23));
            sum = _mm256_add_ps(sum, p);
        }
    }

    sum = _mm256_hadd_ps(sum, sum);
    sum = _mm256_hadd_ps(sum, sum);
    sum = _mm256_hadd_ps(sum, sum);
    sumf = _mm_cvtss_f32(_mm256_extractf128_ps(sum, 0));

    *s = sumf;
}

Godbolt for that: https://godbolt.org/z/TjK51b7vd

ggerganov added high priority Very important issue generation quality Quality of model output labels Apr 18, 2023

ggerganov and others added 4 commits April 19, 2023 19:37

ggml : use 8-bit precision for Q4_1 intermediate results (ARM)

e9c07f7

ggml : optimize ggml_vec_dot_q4_1_q8_0() via vmalq_n_f32

4262305

56 ms/token with Q4_1 !

ggml : AVX2 implementation of ggml_vec_dot_q4_1_q8_0 (#1051)

ad7007a

gitignore : ignore ppl-*.txt files

e582f2a

ggerganov force-pushed the q4_1xq8_0 branch from dcb187a to e582f2a Compare April 19, 2023 16:38

ggerganov merged commit 884e7d7 into master Apr 19, 2023

ggerganov deleted the q4_1xq8_0 branch April 19, 2023 17:10

ggerganov assigned ggerganov and slaren Apr 22, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ggml : use 8-bit precision for Q4_1 intermediate results #1047

ggml : use 8-bit precision for Q4_1 intermediate results #1047

ggerganov commented Apr 18, 2023 •

edited

Loading

dfyz commented Apr 18, 2023

dfyz commented Apr 18, 2023

ggerganov commented Apr 19, 2023 •

edited

Loading

dfyz commented Apr 19, 2023

ggerganov commented Apr 19, 2023

teaalltr commented Apr 19, 2023 •

edited

Loading

ggml : use 8-bit precision for Q4_1 intermediate results #1047

ggml : use 8-bit precision for Q4_1 intermediate results #1047

Conversation

ggerganov commented Apr 18, 2023 • edited Loading

Perplexity

dfyz commented Apr 18, 2023

dfyz commented Apr 18, 2023

ggerganov commented Apr 19, 2023 • edited Loading

dfyz commented Apr 19, 2023

ggerganov commented Apr 19, 2023

teaalltr commented Apr 19, 2023 • edited Loading

ggerganov commented Apr 18, 2023 •

edited

Loading

ggerganov commented Apr 19, 2023 •

edited

Loading

teaalltr commented Apr 19, 2023 •

edited

Loading