ggml : use 8-bit precision for Q4_1 intermediate results #1047

Merged: 4 commits into master from q4_1xq8_0 on Apr 19, 2023

Conversation

ggerganov
Owner

@ggerganov ggerganov commented Apr 18, 2023

This is the same as #951, but for Q4_1.

Also, in this PR we will retire the old ggml_vec_dot_q4_0() and ggml_vec_dot_q4_1() as they are no longer used.

Please send PRs with AVX implementations into this branch.
Will merge when we have:

  • Reference
  • ARM NEON
  • AVX
  • AVX2
  • AVX512 (optional)
  • WASM SIMD (optional)
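
For reference, here is the scalar math each of these implementations has to reproduce. This is a minimal sketch rather than the actual code in this PR, and it assumes the current block layouts: float scale d and min m plus two packed 4-bit quants per byte for Q4_1, and float scale d plus 32 signed 8-bit quants for Q8_0, with the low nibble holding the even element.

#include <stdint.h>

#define QK4_1 32
#define QK8_0 32

typedef struct {
    float   d;              // scale
    float   m;              // min
    uint8_t qs[QK4_1 / 2];  // two 4-bit quants per byte
} block_q4_1;

typedef struct {
    float  d;               // scale
    int8_t qs[QK8_0];       // 8-bit quants
} block_q8_0;

// Scalar reference: s = sum over all elements of (x.d*q4 + x.m) * (y.d*q8)
static void vec_dot_q4_1_q8_0_ref(const int n, float * s,
                                  const block_q4_1 * x, const block_q8_0 * y) {
    const int nb = n / QK8_0;
    float sumf = 0.0f;

    for (int i = 0; i < nb; i++) {
        int sumi = 0; // sum of q4*q8 products
        int sumy = 0; // sum of q8 quants (picks up the Q4_1 min term)

        for (int j = 0; j < QK8_0/2; j++) {
            const int v0 = x[i].qs[j] & 0x0F; // low nibble  -> element 2*j
            const int v1 = x[i].qs[j] >> 4;   // high nibble -> element 2*j + 1

            sumi += v0*y[i].qs[2*j + 0] + v1*y[i].qs[2*j + 1];
            sumy +=    y[i].qs[2*j + 0] +    y[i].qs[2*j + 1];
        }

        // (d0*q4 + m0)*(d1*q8) summed over a block splits into two terms:
        sumf += x[i].d*y[i].d*sumi + x[i].m*y[i].d*sumy;
    }

    *s = sumf;
}

The SIMD variants in the checklist then mostly need to vectorize the inner loop over the quant pairs.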

Perplexity

Without BLAS (655 iterations): 6.1299
$  make clean && LLAMA_NO_ACCELERATE=1 make -j perplexity && time ./perplexity -m ./models/7B/ggml-model-q4_1.bin -f ./build/wiki.test.raw -t 8 > ppl-q4_1.txt
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

rm -vf *.o main quantize quantize-stats perplexity embedding benchmark-q4_0-matmult
common.o
ggml.o
llama.o
perplexity
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS:  
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

cc  -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread   -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -c llama.cpp -o llama.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -c examples/common.cpp -o common.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread examples/perplexity/perplexity.cpp ggml.o llama.o common.o -o perplexity 
main: seed = 1681899573
llama.cpp: loading model from ./models/7B/ggml-model-q4_1.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 3 (mostly Q4_1)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 6612.57 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | 
perplexity : calculating perplexity over 655 chunks, batch_size=512
28.30 seconds per pass - ETA 5.15 hours
[1]4.4313,[2]4.8883,[3]5.7758,[4]6.3821,[5]6.4915,[6]6.4625,[7]6.6557,[8]6.7585,[9]7.0848,[10]7.3396,[11]7.5613,[12]7.6055,[13]7.5332,[14]7.5968,[15]7.8425,[16]7.4423,[17]7.3195,[18]7.2629,[19]6.8898,[20]6.8689,[21]6.7725,[22]6.5999,[23]6.5705,[24]6.4810,[25]6.4837,[26]6.3192,[27]6.1387,[28]6.0324,[29]5.9425,[30]5.7805,[31]5.7512,[32]5.7688,[33]5.7112,[34]5.7445,[35]5.7666,[36]5.8064,[37]5.8099,[38]5.8132,[39]5.8480,[40]5.8961,[41]5.9090,[42]5.9498,[43]5.9092,[44]5.9677,[45]5.9746,[46]5.9474,[47]5.9666,[48]5.9401,[49]5.9386,[50]5.8976,[51]5.8918,[52]5.8798,[53]5.9284,[54]5.9114,[55]5.8900,[56]5.9181,[57]5.9368,[58]5.9557,[59]5.9743,[60]6.0175,[61]6.0063,[62]6.0634,[63]6.0934,[64]6.1038,[65]6.1485,[66]6.1585,[67]6.1774,[68]6.1907,[69]6.2145,[70]6.2428,[71]6.2632,[72]6.2947,[73]6.3515,[74]6.3548,[75]6.3700,[76]6.3830,[77]6.3949,[78]6.3815,[79]6.4100,[80]6.4041,[81]6.4208,[82]6.4263,[83]6.3739,[84]6.3583,[85]6.3460,[86]6.3247,[87]6.2633,[88]6.2405,[89]6.2192,[90]6.2036,[91]6.2271,[92]6.2207,[93]6.2198,[94]6.2165,[95]6.2455,[96]6.2445,[97]6.2390,[98]6.2326,[99]6.2180,[100]6.2162,[101]6.2413,[102]6.2352,[103]6.2548,[104]6.2631,[105]6.2622,[106]6.2788,[107]6.2791,[108]6.2908,[109]6.2846,[110]6.2808,[111]6.3024,[112]6.3227,[113]6.3264,[114]6.3228,[115]6.3284,[116]6.3189,[117]6.3242,[118]6.3518,[119]6.3745,[120]6.4104,[121]6.4254,[122]6.4494,[123]6.4867,[124]6.5053,[125]6.4956,[126]6.5352,[127]6.5719,[128]6.6040,[129]6.5878,[130]6.5977,[131]6.5938,[132]6.5850,[133]6.5724,[134]6.5820,[135]6.5786,[136]6.5672,[137]6.5602,[138]6.5438,[139]6.5329,[140]6.5288,[141]6.5002,[142]6.4980,[143]6.4698,[144]6.4487,[145]6.4407,[146]6.4290,[147]6.4331,[148]6.4334,[149]6.4290,[150]6.4250,[151]6.4278,[152]6.4168,[153]6.4012,[154]6.3925,[155]6.3989,[156]6.3939,[157]6.4102,[158]6.4135,[159]6.4189,[160]6.4221,[161]6.4343,[162]6.4067,[163]6.3953,[164]6.3718,[165]6.3405,[166]6.3133,[167]6.2753,[168]6.2451,[169]6.2323,[170]6.2209,[171]6.1949,[172]6.1779,[173]6.1621,[174]6.1323,[175]6.1114,[176]6.0998,[177]6.0801,[178]6.0570,[179]6.0405,[180]6.0304,[181]6.0089,[182]5.9916,[183]5.9782,[184]5.9776,[185]5.9704,[186]5.9712,[187]5.9768,[188]5.9726,[189]5.9906,[190]5.9922,[191]6.0134,[192]6.0290,[193]6.0458,[194]6.0574,[195]6.0792,[196]6.0952,[197]6.1161,[198]6.1320,[199]6.1352,[200]6.1404,[201]6.1358,[202]6.1547,[203]6.1625,[204]6.1615,[205]6.1724,[206]6.1794,[207]6.1761,[208]6.1847,[209]6.1892,[210]6.1935,[211]6.2044,[212]6.2125,[213]6.2226,[214]6.2259,[215]6.2284,[216]6.2425,[217]6.2612,[218]6.2752,[219]6.2757,[220]6.2718,[221]6.2658,[222]6.2637,[223]6.2535,[224]6.2466,[225]6.2427,[226]6.2631,[227]6.2719,[228]6.2777,[229]6.2835,[230]6.2804,[231]6.2968,[232]6.2851,[233]6.2684,[234]6.2531,[235]6.2343,[236]6.2281,[237]6.2183,[238]6.2205,[239]6.2054,[240]6.1947,[241]6.1971,[242]6.2000,[243]6.1979,[244]6.1868,[245]6.1838,[246]6.1730,[247]6.1610,[248]6.1535,[249]6.1500,[250]6.1543,[251]6.1474,[252]6.1433,[253]6.1340,[254]6.1294,[255]6.1184,[256]6.1005,[257]6.0879,[258]6.0795,[259]6.0773,[260]6.0690,[261]6.0645,[262]6.0590,[263]6.0531,[264]6.0325,[265]6.0318,[266]6.0305,[267]6.0238,[268]6.0318,[269]6.0307,[270]6.0305,[271]6.0384,[272]6.0418,[273]6.0421,[274]6.0445,[275]6.0532,[276]6.0589,[277]6.0740,[278]6.0839,[279]6.0926,[280]6.0952,[281]6.1055,[282]6.1113,[283]6.1264,[284]6.1340,[285]6.1419,[286]6.1548,[287]6.1541,[288]6.1600,[289]6.1512,[290]6.1353,[291]6.1197,[292]6.1047,[293]6.0918,[294]6.0938,[295]6.0931,[296]6.0981,[297]6.0975,[298]6.1011,[299]6.0988,[300]6.0877,[301]6.0873,[302]6.0796,[303]6.0707,[304]6.0621,[305]6.0585,[30
6]6.0461,[307]6.0482,[308]6.0511,[309]6.0351,[310]6.0293,[311]6.0231,[312]6.0253,[313]6.0194,[314]6.0178,[315]6.0021,[316]5.9974,[317]5.9811,[318]5.9607,[319]5.9725,[320]5.9848,[321]5.9889,[322]5.9848,[323]5.9779,[324]5.9746,[325]5.9855,[326]5.9855,[327]5.9876,[328]5.9909,[329]5.9966,[330]5.9994,[331]6.0116,[332]6.0088,[333]6.0160,[334]6.0103,[335]6.0039,[336]6.0071,[337]6.0048,[338]6.0038,[339]5.9985,[340]5.9945,[341]6.0024,[342]6.0052,[343]6.0099,[344]6.0101,[345]6.0102,[346]6.0072,[347]6.0112,[348]6.0149,[349]6.0172,[350]6.0143,[351]6.0152,[352]6.0152,[353]6.0090,[354]6.0094,[355]6.0148,[356]6.0178,[357]6.0147,[358]6.0240,[359]6.0264,[360]6.0234,[361]6.0229,[362]6.0298,[363]6.0409,[364]6.0473,[365]6.0523,[366]6.0541,[367]6.0628,[368]6.0601,[369]6.0613,[370]6.0631,[371]6.0578,[372]6.0629,[373]6.0674,[374]6.0660,[375]6.0661,[376]6.0728,[377]6.0683,[378]6.0707,[379]6.0766,[380]6.0688,[381]6.0654,[382]6.0609,[383]6.0600,[384]6.0596,[385]6.0585,[386]6.0582,[387]6.0583,[388]6.0546,[389]6.0495,[390]6.0430,[391]6.0353,[392]6.0309,[393]6.0295,[394]6.0323,[395]6.0309,[396]6.0235,[397]6.0301,[398]6.0340,[399]6.0416,[400]6.0412,[401]6.0427,[402]6.0439,[403]6.0458,[404]6.0521,[405]6.0431,[406]6.0400,[407]6.0397,[408]6.0415,[409]6.0531,[410]6.0642,[411]6.0756,[412]6.0916,[413]6.1027,[414]6.1103,[415]6.1154,[416]6.1232,[417]6.1354,[418]6.1388,[419]6.1462,[420]6.1553,[421]6.1667,[422]6.1707,[423]6.1776,[424]6.1881,[425]6.1968,[426]6.2035,[427]6.2081,[428]6.2163,[429]6.2218,[430]6.2298,[431]6.2436,[432]6.2477,[433]6.2469,[434]6.2424,[435]6.2434,[436]6.2459,[437]6.2558,[438]6.2633,[439]6.2601,[440]6.2590,[441]6.2542,[442]6.2522,[443]6.2532,[444]6.2538,[445]6.2517,[446]6.2539,[447]6.2569,[448]6.2612,[449]6.2588,[450]6.2596,[451]6.2557,[452]6.2435,[453]6.2352,[454]6.2294,[455]6.2301,[456]6.2354,[457]6.2376,[458]6.2356,[459]6.2362,[460]6.2447,[461]6.2420,[462]6.2406,[463]6.2449,[464]6.2436,[465]6.2409,[466]6.2335,[467]6.2341,[468]6.2339,[469]6.2362,[470]6.2367,[471]6.2320,[472]6.2370,[473]6.2317,[474]6.2330,[475]6.2273,[476]6.2292,[477]6.2222,[478]6.2213,[479]6.2270,[480]6.2313,[481]6.2331,[482]6.2285,[483]6.2244,[484]6.2261,[485]6.2242,[486]6.2181,[487]6.2178,[488]6.2158,[489]6.2110,[490]6.2088,[491]6.2061,[492]6.2005,[493]6.1977,[494]6.1958,[495]6.1956,[496]6.1919,[497]6.1862,[498]6.1847,[499]6.1802,[500]6.1708,[501]6.1644,[502]6.1644,[503]6.1639,[504]6.1550,[505]6.1573,[506]6.1581,[507]6.1528,[508]6.1489,[509]6.1483,[510]6.1519,[511]6.1567,[512]6.1605,[513]6.1624,[514]6.1687,[515]6.1634,[516]6.1626,[517]6.1636,[518]6.1632,[519]6.1664,[520]6.1685,[521]6.1699,[522]6.1727,[523]6.1736,[524]6.1794,[525]6.1827,[526]6.1836,[527]6.1851,[528]6.1801,[529]6.1807,[530]6.1755,[531]6.1738,[532]6.1787,[533]6.1810,[534]6.1795,[535]6.1816,[536]6.1764,[537]6.1742,[538]6.1793,[539]6.1802,[540]6.1838,[541]6.1840,[542]6.1847,[543]6.1863,[544]6.1874,[545]6.1853,[546]6.1861,[547]6.1823,[548]6.1774,[549]6.1771,[550]6.1744,[551]6.1707,[552]6.1684,[553]6.1647,[554]6.1625,[555]6.1594,[556]6.1589,[557]6.1611,[558]6.1573,[559]6.1571,[560]6.1570,[561]6.1575,[562]6.1551,[563]6.1548,[564]6.1594,[565]6.1616,[566]6.1616,[567]6.1597,[568]6.1601,[569]6.1586,[570]6.1614,[571]6.1617,[572]6.1623,[573]6.1619,[574]6.1585,[575]6.1580,[576]6.1579,[577]6.1560,[578]6.1538,[579]6.1539,[580]6.1477,[581]6.1439,[582]6.1431,[583]6.1440,[584]6.1442,[585]6.1368,[586]6.1299,[587]6.1305,[588]6.1352,[589]6.1408,[590]6.1437,[591]6.1459,[592]6.1446,[593]6.1413,[594]6.1424,[595]6.1400,[596]6.1434,[597]6.1412,[598]6.1387,[599]6.1409,[600]6.1409,[601]6.1397,[602]6
.1416,[603]6.1440,[604]6.1450,[605]6.1487,[606]6.1508,[607]6.1492,[608]6.1456,[609]6.1461,[610]6.1497,[611]6.1482,[612]6.1508,[613]6.1471,[614]6.1424,[615]6.1349,[616]6.1375,[617]6.1314,[618]6.1265,[619]6.1210,[620]6.1071,[621]6.1003,[622]6.0988,[623]6.1005,[624]6.1010,[625]6.1011,[626]6.1002,[627]6.1028,[628]6.1029,[629]6.1025,[630]6.1055,[631]6.1111,[632]6.1169,[633]6.1153,[634]6.1188,[635]6.1193,[636]6.1159,[637]6.1124,[638]6.1150,[639]6.1118,[640]6.1128,[641]6.1129,[642]6.1194,[643]6.1213,[644]6.1224,[645]6.1206,[646]6.1249,[647]6.1211,[648]6.1223,[649]6.1224,[650]6.1265,[651]6.1319,[652]6.1331,[653]6.1369,[654]6.1306,[655]6.1299,
With BLAS (655 iterations): 6.1286
$  make clean && make -j perplexity && time ./perplexity -m ./models/7B/ggml-model-q4_1.bin -f ./build/wiki.test.raw -t 8 > ppl-q4_1-blas.txt
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

rm -vf *.o main quantize quantize-stats perplexity embedding benchmark-q4_0-matmult
common.o
ggml.o
llama.o
perplexity
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

cc  -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -DGGML_USE_ACCELERATE   -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -c llama.cpp -o llama.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -c examples/common.cpp -o common.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread examples/perplexity/perplexity.cpp ggml.o llama.o common.o -o perplexity  -framework Accelerate
main: seed = 1681916726
llama.cpp: loading model from ./models/7B/ggml-model-q4_1.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 3 (mostly Q4_1)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 6612.57 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
perplexity : calculating perplexity over 655 chunks, batch_size=512
7.91 seconds per pass - ETA 1.44 hours
[1]4.4308,[2]4.8847,[3]5.7740,[4]6.3800,[5]6.4911,[6]6.4631,[7]6.6541,[8]6.7572,[9]7.0836,[10]7.3390,[11]7.5610,[12]7.6048,[13]7.5321,[14]7.5962,[15]7.8405,[16]7.4406,[17]7.3183,[18]7.2617,[19]6.8885,[20]6.8674,[21]6.7698,[22]6.5975,[23]6.5677,[24]6.4785,[25]6.4809,[26]6.3162,[27]6.1360,[28]6.0295,[29]5.9399,[30]5.7778,[31]5.7484,[32]5.7660,[33]5.7090,[34]5.7425,[35]5.7645,[36]5.8048,[37]5.8079,[38]5.8115,[39]5.8458,[40]5.8938,[41]5.9065,[42]5.9472,[43]5.9068,[44]5.9654,[45]5.9723,[46]5.9451,[47]5.9644,[48]5.9380,[49]5.9366,[50]5.8957,[51]5.8900,[52]5.8782,[53]5.9265,[54]5.9097,[55]5.8882,[56]5.9164,[57]5.9353,[58]5.9541,[59]5.9727,[60]6.0158,[61]6.0046,[62]6.0618,[63]6.0918,[64]6.1022,[65]6.1470,[66]6.1571,[67]6.1761,[68]6.1893,[69]6.2129,[70]6.2411,[71]6.2615,[72]6.2929,[73]6.3495,[74]6.3530,[75]6.3681,[76]6.3810,[77]6.3931,[78]6.3793,[79]6.4079,[80]6.4020,[81]6.4183,[82]6.4240,[83]6.3717,[84]6.3560,[85]6.3436,[86]6.3223,[87]6.2610,[88]6.2381,[89]6.2170,[90]6.2015,[91]6.2251,[92]6.2187,[93]6.2179,[94]6.2147,[95]6.2436,[96]6.2427,[97]6.2371,[98]6.2307,[99]6.2162,[100]6.2145,[101]6.2394,[102]6.2334,[103]6.2530,[104]6.2612,[105]6.2602,[106]6.2769,[107]6.2771,[108]6.2885,[109]6.2822,[110]6.2784,[111]6.3000,[112]6.3202,[113]6.3238,[114]6.3201,[115]6.3257,[116]6.3163,[117]6.3216,[118]6.3493,[119]6.3719,[120]6.4077,[121]6.4227,[122]6.4468,[123]6.4841,[124]6.5027,[125]6.4930,[126]6.5325,[127]6.5694,[128]6.6014,[129]6.5853,[130]6.5951,[131]6.5913,[132]6.5826,[133]6.5700,[134]6.5797,[135]6.5762,[136]6.5648,[137]6.5579,[138]6.5414,[139]6.5304,[140]6.5264,[141]6.4978,[142]6.4955,[143]6.4675,[144]6.4465,[145]6.4386,[146]6.4269,[147]6.4311,[148]6.4314,[149]6.4270,[150]6.4231,[151]6.4259,[152]6.4150,[153]6.3994,[154]6.3907,[155]6.3970,[156]6.3920,[157]6.4084,[158]6.4117,[159]6.4171,[160]6.4203,[161]6.4325,[162]6.4049,[163]6.3934,[164]6.3699,[165]6.3385,[166]6.3115,[167]6.2735,[168]6.2433,[169]6.2304,[170]6.2190,[171]6.1931,[172]6.1760,[173]6.1603,[174]6.1305,[175]6.1096,[176]6.0980,[177]6.0784,[178]6.0552,[179]6.0386,[180]6.0286,[181]6.0071,[182]5.9898,[183]5.9765,[184]5.9758,[185]5.9686,[186]5.9693,[187]5.9750,[188]5.9708,[189]5.9889,[190]5.9905,[191]6.0117,[192]6.0273,[193]6.0441,[194]6.0557,[195]6.0774,[196]6.0933,[197]6.1143,[198]6.1301,[199]6.1332,[200]6.1385,[201]6.1340,[202]6.1529,[203]6.1606,[204]6.1597,[205]6.1705,[206]6.1775,[207]6.1743,[208]6.1828,[209]6.1873,[210]6.1917,[211]6.2026,[212]6.2107,[213]6.2208,[214]6.2241,[215]6.2266,[216]6.2407,[217]6.2594,[218]6.2734,[219]6.2738,[220]6.2700,[221]6.2641,[222]6.2619,[223]6.2516,[224]6.2448,[225]6.2409,[226]6.2612,[227]6.2700,[228]6.2757,[229]6.2816,[230]6.2784,[231]6.2948,[232]6.2832,[233]6.2666,[234]6.2512,[235]6.2324,[236]6.2262,[237]6.2164,[238]6.2186,[239]6.2035,[240]6.1928,[241]6.1952,[242]6.1982,[243]6.1961,[244]6.1850,[245]6.1820,[246]6.1712,[247]6.1593,[248]6.1518,[249]6.1483,[250]6.1526,[251]6.1456,[252]6.1416,[253]6.1323,[254]6.1277,[255]6.1167,[256]6.0988,[257]6.0862,[258]6.0778,[259]6.0755,[260]6.0673,[261]6.0628,[262]6.0573,[263]6.0514,[264]6.0309,[265]6.0302,[266]6.0290,[267]6.0222,[268]6.0302,[269]6.0290,[270]6.0289,[271]6.0368,[272]6.0402,[273]6.0405,[274]6.0428,[275]6.0515,[276]6.0572,[277]6.0723,[278]6.0822,[279]6.0909,[280]6.0936,[281]6.1039,[282]6.1096,[283]6.1247,[284]6.1323,[285]6.1403,[286]6.1531,[287]6.1523,[288]6.1583,[289]6.1495,[290]6.1336,[291]6.1180,[292]6.1030,[293]6.0901,[294]6.0921,[295]6.0914,[296]6.0964,[297]6.0957,[298]6.0993,[299]6.0970,[300]6.0860,[301]6.0855,[302]6.0779,[303]6.0689,[304]6.0603,[305]6.0568,[30
6]6.0444,[307]6.0465,[308]6.0493,[309]6.0333,[310]6.0275,[311]6.0213,[312]6.0236,[313]6.0177,[314]6.0161,[315]6.0005,[316]5.9958,[317]5.9795,[318]5.9590,[319]5.9709,[320]5.9831,[321]5.9872,[322]5.9831,[323]5.9762,[324]5.9729,[325]5.9839,[326]5.9838,[327]5.9859,[328]5.9892,[329]5.9949,[330]5.9978,[331]6.0099,[332]6.0072,[333]6.0143,[334]6.0086,[335]6.0023,[336]6.0055,[337]6.0032,[338]6.0022,[339]5.9969,[340]5.9928,[341]6.0007,[342]6.0036,[343]6.0082,[344]6.0084,[345]6.0085,[346]6.0055,[347]6.0095,[348]6.0133,[349]6.0155,[350]6.0127,[351]6.0135,[352]6.0135,[353]6.0073,[354]6.0077,[355]6.0130,[356]6.0160,[357]6.0129,[358]6.0221,[359]6.0245,[360]6.0216,[361]6.0211,[362]6.0280,[363]6.0391,[364]6.0455,[365]6.0505,[366]6.0523,[367]6.0610,[368]6.0583,[369]6.0595,[370]6.0613,[371]6.0560,[372]6.0611,[373]6.0657,[374]6.0642,[375]6.0644,[376]6.0711,[377]6.0666,[378]6.0690,[379]6.0749,[380]6.0671,[381]6.0638,[382]6.0593,[383]6.0584,[384]6.0579,[385]6.0569,[386]6.0566,[387]6.0567,[388]6.0531,[389]6.0479,[390]6.0414,[391]6.0337,[392]6.0294,[393]6.0280,[394]6.0308,[395]6.0294,[396]6.0220,[397]6.0287,[398]6.0325,[399]6.0402,[400]6.0398,[401]6.0412,[402]6.0425,[403]6.0444,[404]6.0507,[405]6.0416,[406]6.0386,[407]6.0382,[408]6.0401,[409]6.0516,[410]6.0627,[411]6.0741,[412]6.0901,[413]6.1011,[414]6.1088,[415]6.1139,[416]6.1217,[417]6.1339,[418]6.1373,[419]6.1446,[420]6.1538,[421]6.1652,[422]6.1692,[423]6.1761,[424]6.1866,[425]6.1953,[426]6.2019,[427]6.2066,[428]6.2148,[429]6.2203,[430]6.2282,[431]6.2421,[432]6.2461,[433]6.2453,[434]6.2408,[435]6.2418,[436]6.2443,[437]6.2542,[438]6.2618,[439]6.2585,[440]6.2574,[441]6.2526,[442]6.2506,[443]6.2516,[444]6.2523,[445]6.2501,[446]6.2523,[447]6.2553,[448]6.2596,[449]6.2572,[450]6.2580,[451]6.2541,[452]6.2419,[453]6.2336,[454]6.2278,[455]6.2285,[456]6.2337,[457]6.2359,[458]6.2339,[459]6.2346,[460]6.2430,[461]6.2403,[462]6.2390,[463]6.2433,[464]6.2420,[465]6.2393,[466]6.2319,[467]6.2326,[468]6.2324,[469]6.2347,[470]6.2352,[471]6.2306,[472]6.2356,[473]6.2303,[474]6.2316,[475]6.2259,[476]6.2278,[477]6.2208,[478]6.2198,[479]6.2255,[480]6.2299,[481]6.2317,[482]6.2271,[483]6.2230,[484]6.2247,[485]6.2228,[486]6.2167,[487]6.2164,[488]6.2145,[489]6.2096,[490]6.2075,[491]6.2048,[492]6.1992,[493]6.1964,[494]6.1945,[495]6.1943,[496]6.1906,[497]6.1850,[498]6.1835,[499]6.1790,[500]6.1696,[501]6.1632,[502]6.1632,[503]6.1627,[504]6.1538,[505]6.1561,[506]6.1569,[507]6.1516,[508]6.1477,[509]6.1471,[510]6.1507,[511]6.1555,[512]6.1592,[513]6.1611,[514]6.1675,[515]6.1621,[516]6.1614,[517]6.1623,[518]6.1619,[519]6.1652,[520]6.1672,[521]6.1686,[522]6.1714,[523]6.1722,[524]6.1781,[525]6.1814,[526]6.1822,[527]6.1838,[528]6.1788,[529]6.1794,[530]6.1741,[531]6.1725,[532]6.1774,[533]6.1796,[534]6.1781,[535]6.1803,[536]6.1751,[537]6.1729,[538]6.1780,[539]6.1789,[540]6.1825,[541]6.1827,[542]6.1834,[543]6.1850,[544]6.1860,[545]6.1840,[546]6.1848,[547]6.1809,[548]6.1761,[549]6.1758,[550]6.1731,[551]6.1693,[552]6.1671,[553]6.1634,[554]6.1612,[555]6.1581,[556]6.1576,[557]6.1598,[558]6.1560,[559]6.1558,[560]6.1557,[561]6.1562,[562]6.1538,[563]6.1535,[564]6.1581,[565]6.1603,[566]6.1603,[567]6.1584,[568]6.1587,[569]6.1572,[570]6.1601,[571]6.1604,[572]6.1610,[573]6.1606,[574]6.1572,[575]6.1567,[576]6.1566,[577]6.1547,[578]6.1525,[579]6.1526,[580]6.1464,[581]6.1426,[582]6.1418,[583]6.1427,[584]6.1430,[585]6.1355,[586]6.1287,[587]6.1293,[588]6.1339,[589]6.1395,[590]6.1425,[591]6.1446,[592]6.1434,[593]6.1401,[594]6.1411,[595]6.1387,[596]6.1421,[597]6.1400,[598]6.1375,[599]6.1396,[600]6.1396,[601]6.1384,[602]6
.1402,[603]6.1427,[604]6.1437,[605]6.1474,[606]6.1495,[607]6.1479,[608]6.1443,[609]6.1448,[610]6.1484,[611]6.1469,[612]6.1495,[613]6.1458,[614]6.1411,[615]6.1335,[616]6.1362,[617]6.1301,[618]6.1252,[619]6.1197,[620]6.1058,[621]6.0990,[622]6.0975,[623]6.0992,[624]6.0997,[625]6.0998,[626]6.0989,[627]6.1015,[628]6.1016,[629]6.1012,[630]6.1042,[631]6.1098,[632]6.1156,[633]6.1140,[634]6.1175,[635]6.1180,[636]6.1145,[637]6.1111,[638]6.1137,[639]6.1104,[640]6.1115,[641]6.1116,[642]6.1181,[643]6.1200,[644]6.1211,[645]6.1193,[646]6.1236,[647]6.1198,[648]6.1209,[649]6.1211,[650]6.1252,[651]6.1306,[652]6.1317,[653]6.1356,[654]6.1293,[655]6.1286,

llama_print_timings:        load time =  8467.50 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time = 4915033.47 ms / 335360 tokens (   14.66 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings:       total time = 4948330.37 ms

real	82m28.799s
user	125m40.303s
sys	4m5.800s

@ggerganov ggerganov added the high priority (Very important issue) and generation quality (Quality of model output) labels on Apr 18, 2023
@dfyz
Collaborator

dfyz commented Apr 18, 2023

@ggerganov

AVX512 (optional)

As demonstrated by ggml_vec_dot_q4_0(), I think that an AVX-512 implementation of the quantized dot product can provide a significant boost over AVX2, but it might require some careful tuning (for instance, the results from my initial experiments with ggml_vec_dot_q4_0_q8_0() are not as promising as they were for ggml_vec_dot_q4_0()). It also adds a non-trivial maintenance burden, since lots of smart people are constantly optimizing the AVX2 code paths. So they either have to find a way to port their improvements to the AVX-512 implementation every time (which might be hard to test, because AVX-512 hardware is relatively rare), or the AVX-512 implementation will lag behind (this is exactly what happened to ggml_vec_dot_q4_0()).

So I think AVX-512-wise it might be better to focus on the "default" quantized dot product function (e.g., the one used in the quantization method recommended in the README file for converting models), so that most users get the speedup, and the maintenance burden is not too bad.

Which quantization method do you think is more likely to become the new default? I think we now have Q4_0, Q4_1, Q4_2, and at this point I'm not really sure which one is the best in the long run.
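
For context (my own summary, not something stated in this thread), the three existing 4-bit block layouts look roughly like this; the trade-off is between per-block overhead and how fine-grained the scale/min is:

typedef struct { float       d;          uint8_t qs[16]; } block_q4_0; // 32 weights, 20 bytes (5.0 bits/weight)
typedef struct { float       d; float m; uint8_t qs[16]; } block_q4_1; // 32 weights, 24 bytes (6.0 bits/weight)
typedef struct { ggml_fp16_t d;          uint8_t qs[ 8]; } block_q4_2; // 16 weights, 10 bytes (5.0 bits/weight)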

@dfyz
Collaborator

dfyz commented Apr 18, 2023

in this PR we will retire the old ggml_vec_dot_q4_0()

Do you think that maybe this one can be dropped right now in a separate PR, without waiting for SIMD implementations of ggml_vec_dot_q4_1_q8_0()? ggml_vec_dot_q4_0() is currently already unused and IMO only clutters the code.

@ggerganov
Owner Author

ggerganov commented Apr 19, 2023

in this PR we will retire the old ggml_vec_dot_q4_0()

Do you think that maybe this one can be dropped right now in a separate PR, without waiting for SIMD implementations of ggml_vec_dot_q4_1_q8_0()? ggml_vec_dot_q4_0() is currently already unused and IMO only clutters the code.

Is the suggestion to keep ggml_vec_dot_q4_0() so that the AVX512 implementation can be improved more easily and later ported to the q4_x_q8_0() calls?

Because, at the moment, AVX512 is not used at all, since ggml_vec_dot_q4_0() is no longer called (even on master).

Which quantization method do you think is more likely to become the new default?

I think it is likely that we will end up using Q4_3 (when it gets added), but we need to implement it and get the perplexity numbers first.

@dfyz
Collaborator

dfyz commented Apr 19, 2023

Is the suggestion to keep ggml_vec_dot_q4_0() so that AVX512 implementation can be improved more easily and later ported to the q4_x_q8_0() calls?

Uh, no, the exact opposite actually. :) The suggestion is to remove ggml_vec_dot_q4_0() along with the AVX-512 implementation, because, as you say, ggml_vec_dot_q4_0() is completely unused on master. You have already removed ggml_vec_dot_q4_0() in this PR; I'm just saying the removal is unrelated to Q4_1 quantization and can be done in a separate PR right now.

After the removal of ggml_vec_dot_q4_0(), a new AVX-512 implementation for the best q4_x_q8_0() method available can be added whenever it is ready (and shows improvements over AVX2).

I think it is likely that we will end up using Q4_3 (when it gets added)

Thanks! So a Q4_3 block would look like this?

#define QK4_3 16
typedef struct {
    ggml_fp16_t d;              // delta (scale)
    ggml_fp16_t m;              // min
    uint8_t     qs[QK4_3 / 2];  // packed 4-bit quants
} block_q4_3;
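
If so, that would be 2 × 2 bytes of fp16 plus 16/2 = 8 bytes of packed quants, i.e. 12 bytes per 16 weights (6 bits per weight): the same storage cost as Q4_1's 24 bytes per 32 weights, but with scale/min granularity twice as fine.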

@ggerganov
Owner Author

Sorry, I had to rebase on top of master; it was becoming a bit messy.
I hope this didn't create extra work for anyone who has been working on this branch.

So, shall we merge it like this and proceed with Q4_3 from master?

@ggerganov ggerganov merged commit 884e7d7 into master Apr 19, 2023
@ggerganov ggerganov deleted the q4_1xq8_0 branch April 19, 2023 17:10
@teaalltr

teaalltr commented Apr 19, 2023

This is the AVX version, if you trust ChatGPT 😄 (it needs to go inside an #if defined(__AVX__) block, of course)

static void ggml_vec_dot_q4_1_q8_0(const int n, float * restrict s, const void * restrict vx, const void * restrict vy) {
    const int nb = n / QK8_0;

    //assert(n % QK8_0 == 0);
    //assert(nb % 2 == 0);

    const block_q4_1 * restrict x = vx;
    const block_q8_0 * restrict y = vy;

    float sumf = 0.0;

    __m256 sum = _mm256_setzero_ps();

    for (int i = 0; i < nb; i++) {
        const float d0 = x[i].d;
        const float m0 = x[i].m;
        const float d1 = y[i].d;

        const uint8_t * restrict p0 = x[i].qs;
        const int8_t * restrict p1 = y[i].qs;

        for (int j = 0; j < QK8_0/16; j++) {
            const __m128i v0 = _mm_loadu_si128((__m128i const*)(p0 + j*16));
            const __m256 f0 = _mm256_cvtepi32_ps(_mm256_cvtepu8_epi32(v0 & _mm_set1_epi32(0x0f)));
            const __m256 f1 = _mm256_cvtepi32_ps(_mm256_cvtepu8_epi32(_mm_srli_epi32(v0, 4)));

            const __m256 f2 = _mm256_cvtepi32_ps(_mm256_cvtepi8_epi32(_mm_loadu_si128((__m128i const*)(p1 + 2*j*8))));
            const __m256 f3 = _mm256_cvtepi32_ps(_mm256_cvtepi8_epi32(_mm_loadu_si128((__m128i const*)(p1 + (2*j+1)*8))));

            const __m256 f23 = _mm256_permute2f128_ps(f2, f3, 0x21);
            const __m256 p = _mm256_mul_ps(_mm256_add_ps(_mm256_mul_ps(f0, _mm256_set1_ps(d0)), _mm256_set1_ps(m0)), _mm256_mul_ps(_mm256_set1_ps(d1), f23));
            sum = _mm256_add_ps(sum, p);
        }
    }

    sum = _mm256_hadd_ps(sum, sum);
    sum = _mm256_hadd_ps(sum, sum);
    sum = _mm256_hadd_ps(sum, sum);
    sumf = _mm_cvtss_f32(_mm256_extractf128_ps(sum, 0));

    *s = sumf;
}

Godbolt for that: https://godbolt.org/z/TjK51b7vd
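
Regardless of which SIMD variant ends up in the tree, a quick way to sanity-check a generated kernel like this is to compare it against a naive scalar evaluation of the same blocks before wiring it into ggml. A minimal standalone sketch follows; the struct definitions only mirror the in-tree ones to keep the example self-contained, and the nibble ordering (low nibble = even element) is my assumption:

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define QK 32

typedef struct { float d, m; uint8_t qs[QK / 2]; } blk_q4_1; // mirrors block_q4_1
typedef struct { float d;    int8_t  qs[QK];     } blk_q8_0; // mirrors block_q8_0

typedef void (*vec_dot_fn)(int n, float * s, const void * vx, const void * vy);

// Naive per-element evaluation: (d0*q4 + m0) * (d1*q8), assuming the low nibble
// of byte j is element 2*j and the high nibble is element 2*j + 1.
static float dot_ref(const blk_q4_1 * x, const blk_q8_0 * y) {
    float sum = 0.0f;
    for (int j = 0; j < QK/2; j++) {
        const int v0 = x->qs[j] & 0x0F;
        const int v1 = x->qs[j] >> 4;
        sum += (x->d*v0 + x->m) * (y->d * y->qs[2*j + 0]);
        sum += (x->d*v1 + x->m) * (y->d * y->qs[2*j + 1]);
    }
    return sum;
}

// Fills one random block pair and returns 0 if the kernel matches the reference.
static int check(vec_dot_fn kernel) {
    blk_q4_1 x = { .d = 0.05f, .m = -0.40f };
    blk_q8_0 y = { .d = 0.02f };
    for (int j = 0; j < QK/2; j++) x.qs[j] = (uint8_t)(rand() & 0xFF);
    for (int j = 0; j < QK;   j++) y.qs[j] = (int8_t)(rand() % 255 - 127);

    float got = 0.0f;
    kernel(QK, &got, &x, &y);

    const float want = dot_ref(&x, &y);
    printf("reference = %f, kernel = %f\n", (double) want, (double) got);
    return fabsf(got - want) < 1e-3f ? 0 : 1;
}

Running check(ggml_vec_dot_q4_1_q8_0) a few times should return 0 every time if the kernel is correct under these assumptions.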
