New Q4_0 implementation using 2x F16 instead of 1x F32 #1026

Closed
ggerganov wants to merge 4 commits from the q4_0-f16 branch

Conversation

ggerganov
Owner

@ggerganov ggerganov commented Apr 17, 2023

ref #959

ARM NEON-only implementation

Timing

Time per token ~55 ms
Up from ~50 ms on Q4_0 master

Perplexity

Without BLAS
25 iters: 6.5251
$  make clean && LLAMA_NO_ACCELERATE=1 make -j perplexity && time ./perplexity -m ./models/7B/ggml-model-q4_0-new.bin -f ./build/wiki.test.raw -t 8
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

rm -vf *.o main quantize quantize-stats perplexity embedding benchmark-q4_0-matmult
common.o
ggml.o
llama.o
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS:  
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

cc  -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread   -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -c llama.cpp -o llama.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -c examples/common.cpp -o common.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread examples/perplexity/perplexity.cpp ggml.o llama.o common.o -o perplexity 
main: seed = 1681742115
llama.cpp: loading model from ./models/7B/ggml-model-q4_0-new.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | 
perplexity : calculating perplexity over 655 chunks, batch_size=512
27.17 seconds per pass - ETA 4.94 hours
[1]4.3877,[2]4.9469,[3]5.8130,[4]6.4349,[5]6.5214,[6]6.4978,[7]6.6796,[8]6.7642,[9]7.1418,[10]7.3900,[11]7.6105,[12]7.6448,[13]7.5666,[14]7.6375,[15]7.8840,[16]7.4854,[17]7.3630,[18]7.3180,[19]6.9464,[20]6.9355,[21]6.8325,[22]6.6521,[23]6.6150,[24]6.5219,[25]6.5251,^C

real	11m26.727s
user	88m33.729s
sys	0m14.893s
655 iters: 6.2319
$  make clean && LLAMA_NO_ACCELERATE=1 make -j perplexity && time ./perplexity -m ./models/7B/ggml-model-q4_2.bin -f ./build/wiki.test.raw -t 8 > ppl-q4_2.txt
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

rm -vf *.o main quantize quantize-stats perplexity embedding benchmark-q4_0-matmult
common.o
ggml.o
llama.o
perplexity
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS:  
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

cc  -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread   -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -c llama.cpp -o llama.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -c examples/common.cpp -o common.o
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread examples/perplexity/perplexity.cpp ggml.o llama.o common.o -o perplexity 
main: seed = 1681883628
llama.cpp: loading model from ./models/7B/ggml-model-q4_2.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 5 (mostly Q4_2)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | 
perplexity : calculating perplexity over 655 chunks, batch_size=512
22.42 seconds per pass - ETA 4.08 hours
[1]4.3745,[2]4.9398,[3]5.8008,[4]6.4193,[5]6.5062,[6]6.4855,[7]6.6679,[8]6.7545,[9]7.1310,[10]7.3804,[11]7.6023,[12]7.6349,[13]7.5577,[14]7.6301,[15]7.8752,[16]7.4754,[17]7.3526,[18]7.3060,[19]6.9352,[20]6.9239,[21]6.8201,[22]6.6404,[23]6.6030,[24]6.5102,[25]6.5136,[26]6.3446,[27]6.1694,[28]6.0698,[29]5.9837,[30]5.8245,[31]5.7918,[32]5.8126,[33]5.7556,[34]5.7925,[35]5.8202,[36]5.8612,[37]5.8618,[38]5.8801,[39]5.9167,[40]5.9750,[41]5.9870,[42]6.0249,[43]5.9856,[44]6.0397,[45]6.0454,[46]6.0177,[47]6.0385,[48]6.0104,[49]6.0145,[50]5.9742,[51]5.9711,[52]5.9625,[53]6.0058,[54]5.9891,[55]5.9664,[56]6.0016,[57]6.0255,[58]6.0502,[59]6.0655,[60]6.1103,[61]6.1038,[62]6.1642,[63]6.1973,[64]6.2145,[65]6.2606,[66]6.2698,[67]6.2882,[68]6.3044,[69]6.3303,[70]6.3634,[71]6.3840,[72]6.4145,[73]6.4762,[74]6.4788,[75]6.4927,[76]6.5085,[77]6.5235,[78]6.5093,[79]6.5382,[80]6.5304,[81]6.5408,[82]6.5461,[83]6.4918,[84]6.4744,[85]6.4620,[86]6.4400,[87]6.3742,[88]6.3461,[89]6.3259,[90]6.3113,[91]6.3355,[92]6.3325,[93]6.3358,[94]6.3329,[95]6.3608,[96]6.3607,[97]6.3555,[98]6.3516,[99]6.3391,[100]6.3405,[101]6.3666,[102]6.3600,[103]6.3813,[104]6.3887,[105]6.3873,[106]6.4048,[107]6.4035,[108]6.4167,[109]6.4117,[110]6.4084,[111]6.4315,[112]6.4512,[113]6.4536,[114]6.4506,[115]6.4587,[116]6.4512,[117]6.4564,[118]6.4858,[119]6.5073,[120]6.5439,[121]6.5603,[122]6.5848,[123]6.6226,[124]6.6394,[125]6.6295,[126]6.6675,[127]6.7036,[128]6.7324,[129]6.7146,[130]6.7252,[131]6.7211,[132]6.7121,[133]6.6990,[134]6.7099,[135]6.7051,[136]6.6927,[137]6.6850,[138]6.6674,[139]6.6557,[140]6.6514,[141]6.6211,[142]6.6161,[143]6.5874,[144]6.5673,[145]6.5568,[146]6.5444,[147]6.5513,[148]6.5527,[149]6.5469,[150]6.5420,[151]6.5444,[152]6.5355,[153]6.5186,[154]6.5097,[155]6.5163,[156]6.5112,[157]6.5291,[158]6.5339,[159]6.5376,[160]6.5404,[161]6.5519,[162]6.5221,[163]6.5095,[164]6.4847,[165]6.4534,[166]6.4245,[167]6.3864,[168]6.3544,[169]6.3409,[170]6.3296,[171]6.3012,[172]6.2827,[173]6.2639,[174]6.2327,[175]6.2113,[176]6.2011,[177]6.1802,[178]6.1567,[179]6.1399,[180]6.1300,[181]6.1078,[182]6.0897,[183]6.0757,[184]6.0758,[185]6.0683,[186]6.0694,[187]6.0751,[188]6.0713,[189]6.0894,[190]6.0902,[191]6.1112,[192]6.1281,[193]6.1455,[194]6.1569,[195]6.1783,[196]6.1941,[197]6.2159,[198]6.2312,[199]6.2341,[200]6.2387,[201]6.2343,[202]6.2541,[203]6.2615,[204]6.2595,[205]6.2704,[206]6.2775,[207]6.2734,[208]6.2827,[209]6.2871,[210]6.2926,[211]6.3020,[212]6.3096,[213]6.3204,[214]6.3238,[215]6.3274,[216]6.3426,[217]6.3601,[218]6.3734,[219]6.3729,[220]6.3695,[221]6.3638,[222]6.3607,[223]6.3497,[224]6.3434,[225]6.3394,[226]6.3606,[227]6.3693,[228]6.3740,[229]6.3807,[230]6.3768,[231]6.3936,[232]6.3812,[233]6.3639,[234]6.3489,[235]6.3316,[236]6.3238,[237]6.3136,[238]6.3169,[239]6.3012,[240]6.2912,[241]6.2945,[242]6.2983,[243]6.2967,[244]6.2850,[245]6.2824,[246]6.2709,[247]6.2587,[248]6.2516,[249]6.2497,[250]6.2547,[251]6.2470,[252]6.2433,[253]6.2333,[254]6.2292,[255]6.2176,[256]6.1992,[257]6.1879,[258]6.1793,[259]6.1778,[260]6.1704,[261]6.1664,[262]6.1606,[263]6.1558,[264]6.1351,[265]6.1349,[266]6.1339,[267]6.1271,[268]6.1367,[269]6.1346,[270]6.1355,[271]6.1430,[272]6.1471,[273]6.1472,[274]6.1487,[275]6.1577,[276]6.1630,[277]6.1790,[278]6.1906,[279]6.1994,[280]6.2028,[281]6.2122,[282]6.2187,[283]6.2338,[284]6.2412,[285]6.2505,[286]6.2641,[287]6.2634,[288]6.2693,[289]6.2597,[290]6.2442,[291]6.2288,[292]6.2133,[293]6.2000,[294]6.2016,[295]6.2014,[296]6.2060,[297]6.2042,[298]6.2078,[299]6.2048,[300]6.1938,[301]6.1939,[302]6.1864,[303]6.1782,[304]6.1703,[305]6.1676,[30
6]6.1548,[307]6.1577,[308]6.1618,[309]6.1454,[310]6.1393,[311]6.1327,[312]6.1351,[313]6.1299,[314]6.1286,[315]6.1121,[316]6.1072,[317]6.0908,[318]6.0694,[319]6.0818,[320]6.0946,[321]6.0987,[322]6.0942,[323]6.0878,[324]6.0853,[325]6.0956,[326]6.0956,[327]6.0975,[328]6.1017,[329]6.1074,[330]6.1101,[331]6.1229,[332]6.1195,[333]6.1262,[334]6.1203,[335]6.1139,[336]6.1176,[337]6.1149,[338]6.1139,[339]6.1086,[340]6.1045,[341]6.1126,[342]6.1152,[343]6.1206,[344]6.1206,[345]6.1204,[346]6.1174,[347]6.1225,[348]6.1261,[349]6.1278,[350]6.1244,[351]6.1253,[352]6.1258,[353]6.1203,[354]6.1204,[355]6.1256,[356]6.1284,[357]6.1251,[358]6.1344,[359]6.1376,[360]6.1336,[361]6.1329,[362]6.1394,[363]6.1508,[364]6.1572,[365]6.1630,[366]6.1641,[367]6.1729,[368]6.1704,[369]6.1713,[370]6.1725,[371]6.1665,[372]6.1713,[373]6.1769,[374]6.1755,[375]6.1752,[376]6.1827,[377]6.1776,[378]6.1801,[379]6.1858,[380]6.1776,[381]6.1734,[382]6.1681,[383]6.1671,[384]6.1663,[385]6.1649,[386]6.1645,[387]6.1639,[388]6.1596,[389]6.1541,[390]6.1469,[391]6.1391,[392]6.1349,[393]6.1330,[394]6.1355,[395]6.1339,[396]6.1262,[397]6.1337,[398]6.1374,[399]6.1455,[400]6.1448,[401]6.1464,[402]6.1471,[403]6.1490,[404]6.1554,[405]6.1458,[406]6.1424,[407]6.1420,[408]6.1434,[409]6.1553,[410]6.1666,[411]6.1783,[412]6.1942,[413]6.2055,[414]6.2132,[415]6.2186,[416]6.2265,[417]6.2391,[418]6.2427,[419]6.2497,[420]6.2585,[421]6.2706,[422]6.2757,[423]6.2826,[424]6.2942,[425]6.3033,[426]6.3100,[427]6.3145,[428]6.3228,[429]6.3279,[430]6.3367,[431]6.3512,[432]6.3551,[433]6.3543,[434]6.3495,[435]6.3502,[436]6.3524,[437]6.3618,[438]6.3695,[439]6.3663,[440]6.3655,[441]6.3602,[442]6.3591,[443]6.3604,[444]6.3608,[445]6.3588,[446]6.3612,[447]6.3640,[448]6.3688,[449]6.3665,[450]6.3670,[451]6.3626,[452]6.3504,[453]6.3417,[454]6.3357,[455]6.3366,[456]6.3416,[457]6.3436,[458]6.3416,[459]6.3422,[460]6.3508,[461]6.3482,[462]6.3469,[463]6.3514,[464]6.3505,[465]6.3476,[466]6.3396,[467]6.3399,[468]6.3396,[469]6.3421,[470]6.3426,[471]6.3379,[472]6.3425,[473]6.3369,[474]6.3382,[475]6.3322,[476]6.3340,[477]6.3265,[478]6.3254,[479]6.3317,[480]6.3367,[481]6.3389,[482]6.3344,[483]6.3303,[484]6.3323,[485]6.3301,[486]6.3247,[487]6.3246,[488]6.3226,[489]6.3177,[490]6.3152,[491]6.3124,[492]6.3066,[493]6.3035,[494]6.3017,[495]6.3015,[496]6.2978,[497]6.2921,[498]6.2904,[499]6.2856,[500]6.2759,[501]6.2691,[502]6.2693,[503]6.2687,[504]6.2595,[505]6.2619,[506]6.2627,[507]6.2569,[508]6.2527,[509]6.2517,[510]6.2555,[511]6.2600,[512]6.2639,[513]6.2660,[514]6.2723,[515]6.2669,[516]6.2658,[517]6.2668,[518]6.2670,[519]6.2699,[520]6.2725,[521]6.2741,[522]6.2770,[523]6.2779,[524]6.2837,[525]6.2873,[526]6.2884,[527]6.2905,[528]6.2856,[529]6.2861,[530]6.2812,[531]6.2799,[532]6.2850,[533]6.2872,[534]6.2857,[535]6.2882,[536]6.2829,[537]6.2805,[538]6.2853,[539]6.2862,[540]6.2902,[541]6.2909,[542]6.2920,[543]6.2933,[544]6.2946,[545]6.2924,[546]6.2932,[547]6.2887,[548]6.2836,[549]6.2835,[550]6.2808,[551]6.2770,[552]6.2753,[553]6.2711,[554]6.2686,[555]6.2660,[556]6.2654,[557]6.2675,[558]6.2635,[559]6.2633,[560]6.2628,[561]6.2630,[562]6.2609,[563]6.2612,[564]6.2657,[565]6.2677,[566]6.2674,[567]6.2652,[568]6.2658,[569]6.2640,[570]6.2666,[571]6.2671,[572]6.2681,[573]6.2681,[574]6.2647,[575]6.2642,[576]6.2643,[577]6.2631,[578]6.2612,[579]6.2620,[580]6.2552,[581]6.2514,[582]6.2502,[583]6.2510,[584]6.2514,[585]6.2438,[586]6.2371,[587]6.2374,[588]6.2425,[589]6.2481,[590]6.2512,[591]6.2534,[592]6.2518,[593]6.2484,[594]6.2494,[595]6.2470,[596]6.2505,[597]6.2482,[598]6.2448,[599]6.2469,[600]6.2465,[601]6.2450,[602]6
.2464,[603]6.2495,[604]6.2505,[605]6.2539,[606]6.2558,[607]6.2540,[608]6.2506,[609]6.2513,[610]6.2548,[611]6.2527,[612]6.2552,[613]6.2516,[614]6.2466,[615]6.2390,[616]6.2418,[617]6.2354,[618]6.2299,[619]6.2244,[620]6.2101,[621]6.2027,[622]6.2009,[623]6.2024,[624]6.2027,[625]6.2028,[626]6.2013,[627]6.2034,[628]6.2036,[629]6.2032,[630]6.2067,[631]6.2127,[632]6.2182,[633]6.2165,[634]6.2197,[635]6.2202,[636]6.2174,[637]6.2141,[638]6.2170,[639]6.2139,[640]6.2149,[641]6.2151,[642]6.2218,[643]6.2240,[644]6.2250,[645]6.2229,[646]6.2271,[647]6.2234,[648]6.2243,[649]6.2243,[650]6.2280,[651]6.2336,[652]6.2345,[653]6.2389,[654]6.2324,[655]6.2319,

llama_print_timings:        load time = 22958.01 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time = 14822479.15 ms / 335360 tokens (   44.20 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings:       total time = 14855793.77 ms

real	247m35.909s
user	1971m45.730s
sys	2m32.664s
With BLAS:
25 iters: 6.5146
$  make clean && make -j perplexity && time ./perplexity -m ./models/7B/ggml-model-q4_0-new.bin -f ./build/wiki.test.raw -t 8
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

rm -vf *.o main quantize quantize-stats perplexity embedding benchmark-q4_0-matmult
common.o
ggml.o
llama.o
perplexity
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

cc  -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -DGGML_USE_ACCELERATE   -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -c llama.cpp -o llama.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -c examples/common.cpp -o common.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread examples/perplexity/perplexity.cpp ggml.o llama.o common.o -o perplexity  -framework Accelerate
main: seed = 1681742938
llama.cpp: loading model from ./models/7B/ggml-model-q4_0-new.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
perplexity : calculating perplexity over 655 chunks, batch_size=512
8.29 seconds per pass - ETA 1.51 hours
[1]4.3868,[2]4.9427,[3]5.8069,[4]6.4258,[5]6.5128,[6]6.4886,[7]6.6714,[8]6.7577,[9]7.1337,[10]7.3815,[11]7.6024,[12]7.6349,[13]7.5573,[14]7.6293,[15]7.8738,[16]7.4744,[17]7.3524,[18]7.3061,[19]6.9353,[20]6.9242,[21]6.8210,[22]6.6417,[23]6.6042,[24]6.5112,[25]6.5146
655 iters: 6.2316
$  make clean && make -j perplexity && time ./perplexity -m ./models/7B/ggml-model-q4_0-new.bin -f ./build/wiki.test.raw -t 8
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

rm -vf *.o main quantize quantize-stats perplexity embedding benchmark-q4_0-matmult
common.o
ggml.o
llama.o
perplexity
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

cc  -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -DGGML_USE_ACCELERATE   -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -c llama.cpp -o llama.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -c examples/common.cpp -o common.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread examples/perplexity/perplexity.cpp ggml.o llama.o common.o -o perplexity  -framework Accelerate
main: seed = 1681742938
llama.cpp: loading model from ./models/7B/ggml-model-q4_0-new.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
perplexity : calculating perplexity over 655 chunks, batch_size=512
8.29 seconds per pass - ETA 1.51 hours
[1]4.3868,[2]4.9427,[3]5.8069,[4]6.4258,[5]6.5128,[6]6.4886,[7]6.6714,[8]6.7577,[9]7.1337,[10]7.3815,[11]7.6024,[12]7.6349,[13]7.5573,[14]7.6293,[15]7.8738,[16]7.4744,[17]7.3524,[18]7.3061,[19]6.9353,[20]6.9242,[21]6.8210,[22]6.6417,[23]6.6042,[24]6.5112,[25]6.5146,[26]6.3455,[27]6.1702,[28]6.0707,[29]5.9841,[30]5.8252,[31]5.7927,[32]5.8130,[33]5.7557,[34]5.7925,[35]5.8209,[36]5.8621,[37]5.8633,[38]5.8810,[39]5.9175,[40]5.9753,[41]5.9870,[42]6.0252,[43]5.9860,[44]6.0400,[45]6.0459,[46]6.0185,[47]6.0396,[48]6.0117,[49]6.0160,[50]5.9756,[51]5.9725,[52]5.9638,[53]6.0067,[54]5.9900,[55]5.9673,[56]6.0024,[57]6.0263,[58]6.0514,[59]6.0671,[60]6.1116,[61]6.1053,[62]6.1657,[63]6.1989,[64]6.2161,[65]6.2625,[66]6.2718,[67]6.2902,[68]6.3064,[69]6.3323,[70]6.3654,[71]6.3862,[72]6.4166,[73]6.4783,[74]6.4809,[75]6.4948,[76]6.5103,[77]6.5254,[78]6.5113,[79]6.5401,[80]6.5323,[81]6.5428,[82]6.5482,[83]6.4938,[84]6.4764,[85]6.4641,[86]6.4420,[87]6.3761,[88]6.3478,[89]6.3277,[90]6.3130,[91]6.3371,[92]6.3342,[93]6.3375,[94]6.3345,[95]6.3625,[96]6.3623,[97]6.3573,[98]6.3533,[99]6.3408,[100]6.3421,[101]6.3681,[102]6.3615,[103]6.3826,[104]6.3899,[105]6.3886,[106]6.4062,[107]6.4050,[108]6.4183,[109]6.4131,[110]6.4098,[111]6.4330,[112]6.4526,[113]6.4550,[114]6.4521,[115]6.4602,[116]6.4528,[117]6.4579,[118]6.4873,[119]6.5089,[120]6.5456,[121]6.5619,[122]6.5864,[123]6.6243,[124]6.6413,[125]6.6315,[126]6.6695,[127]6.7057,[128]6.7347,[129]6.7170,[130]6.7274,[131]6.7234,[132]6.7143,[133]6.7012,[134]6.7123,[135]6.7076,[136]6.6951,[137]6.6873,[138]6.6698,[139]6.6582,[140]6.6540,[141]6.6235,[142]6.6185,[143]6.5900,[144]6.5697,[145]6.5594,[146]6.5468,[147]6.5536,[148]6.5548,[149]6.5489,[150]6.5440,[151]6.5464,[152]6.5375,[153]6.5206,[154]6.5118,[155]6.5186,[156]6.5134,[157]6.5312,[158]6.5359,[159]6.5396,[160]6.5423,[161]6.5539,[162]6.5241,[163]6.5114,[164]6.4866,[165]6.4552,[166]6.4263,[167]6.3882,[168]6.3562,[169]6.3427,[170]6.3315,[171]6.3030,[172]6.2845,[173]6.2657,[174]6.2346,[175]6.2133,[176]6.2031,[177]6.1821,[178]6.1585,[179]6.1417,[180]6.1317,[181]6.1095,[182]6.0915,[183]6.0775,[184]6.0777,[185]6.0702,[186]6.0713,[187]6.0770,[188]6.0731,[189]6.0911,[190]6.0920,[191]6.1130,[192]6.1300,[193]6.1475,[194]6.1589,[195]6.1803,[196]6.1960,[197]6.2177,[198]6.2331,[199]6.2360,[200]6.2406,[201]6.2362,[202]6.2559,[203]6.2632,[204]6.2611,[205]6.2721,[206]6.2791,[207]6.2750,[208]6.2843,[209]6.2888,[210]6.2942,[211]6.3036,[212]6.3112,[213]6.3220,[214]6.3253,[215]6.3289,[216]6.3440,[217]6.3616,[218]6.3749,[219]6.3743,[220]6.3709,[221]6.3653,[222]6.3622,[223]6.3511,[224]6.3449,[225]6.3407,[226]6.3619,[227]6.3707,[228]6.3753,[229]6.3820,[230]6.3782,[231]6.3948,[232]6.3824,[233]6.3652,[234]6.3503,[235]6.3329,[236]6.3251,[237]6.3149,[238]6.3181,[239]6.3024,[240]6.2924,[241]6.2956,[242]6.2994,[243]6.2978,[244]6.2861,[245]6.2834,[246]6.2719,[247]6.2596,[248]6.2524,[249]6.2505,[250]6.2555,[251]6.2478,[252]6.2441,[253]6.2341,[254]6.2299,[255]6.2183,[256]6.2000,[257]6.1887,[258]6.1800,[259]6.1786,[260]6.1711,[261]6.1671,[262]6.1613,[263]6.1565,[264]6.1356,[265]6.1354,[266]6.1345,[267]6.1277,[268]6.1373,[269]6.1352,[270]6.1361,[271]6.1436,[272]6.1477,[273]6.1478,[274]6.1493,[275]6.1582,[276]6.1635,[277]6.1796,[278]6.1911,[279]6.1999,[280]6.2034,[281]6.2128,[282]6.2192,[283]6.2343,[284]6.2417,[285]6.2510,[286]6.2646,[287]6.2639,[288]6.2698,[289]6.2603,[290]6.2448,[291]6.2293,[292]6.2138,[293]6.2005,[294]6.2021,[295]6.2018,[296]6.2064,[297]6.2047,[298]6.2081,[299]6.2052,[300]6.1941,[301]6.1943,[302]6.1868,[303]6.1786,[304]6.1706,[305]6.1679,[30
6]6.1552,[307]6.1580,[308]6.1621,[309]6.1457,[310]6.1397,[311]6.1331,[312]6.1355,[313]6.1303,[314]6.1289,[315]6.1124,[316]6.1075,[317]6.0911,[318]6.0697,[319]6.0821,[320]6.0949,[321]6.0990,[322]6.0946,[323]6.0881,[324]6.0855,[325]6.0958,[326]6.0958,[327]6.0977,[328]6.1019,[329]6.1076,[330]6.1102,[331]6.1231,[332]6.1197,[333]6.1264,[334]6.1204,[335]6.1141,[336]6.1177,[337]6.1151,[338]6.1141,[339]6.1088,[340]6.1046,[341]6.1128,[342]6.1154,[343]6.1208,[344]6.1208,[345]6.1206,[346]6.1176,[347]6.1226,[348]6.1262,[349]6.1279,[350]6.1245,[351]6.1253,[352]6.1258,[353]6.1204,[354]6.1204,[355]6.1257,[356]6.1284,[357]6.1251,[358]6.1344,[359]6.1376,[360]6.1336,[361]6.1329,[362]6.1394,[363]6.1508,[364]6.1572,[365]6.1631,[366]6.1641,[367]6.1730,[368]6.1705,[369]6.1714,[370]6.1726,[371]6.1666,[372]6.1714,[373]6.1769,[374]6.1755,[375]6.1751,[376]6.1827,[377]6.1776,[378]6.1801,[379]6.1858,[380]6.1776,[381]6.1734,[382]6.1681,[383]6.1671,[384]6.1663,[385]6.1650,[386]6.1646,[387]6.1640,[388]6.1597,[389]6.1542,[390]6.1470,[391]6.1392,[392]6.1350,[393]6.1330,[394]6.1355,[395]6.1339,[396]6.1262,[397]6.1337,[398]6.1374,[399]6.1455,[400]6.1448,[401]6.1464,[402]6.1471,[403]6.1489,[404]6.1553,[405]6.1458,[406]6.1424,[407]6.1420,[408]6.1433,[409]6.1553,[410]6.1664,[411]6.1782,[412]6.1941,[413]6.2054,[414]6.2131,[415]6.2185,[416]6.2264,[417]6.2390,[418]6.2427,[419]6.2497,[420]6.2586,[421]6.2705,[422]6.2757,[423]6.2825,[424]6.2942,[425]6.3032,[426]6.3099,[427]6.3144,[428]6.3227,[429]6.3277,[430]6.3366,[431]6.3511,[432]6.3551,[433]6.3542,[434]6.3494,[435]6.3501,[436]6.3524,[437]6.3618,[438]6.3695,[439]6.3662,[440]6.3654,[441]6.3602,[442]6.3590,[443]6.3603,[444]6.3607,[445]6.3587,[446]6.3611,[447]6.3640,[448]6.3688,[449]6.3665,[450]6.3669,[451]6.3626,[452]6.3503,[453]6.3416,[454]6.3357,[455]6.3366,[456]6.3416,[457]6.3435,[458]6.3416,[459]6.3422,[460]6.3508,[461]6.3481,[462]6.3468,[463]6.3513,[464]6.3505,[465]6.3475,[466]6.3396,[467]6.3399,[468]6.3396,[469]6.3420,[470]6.3426,[471]6.3378,[472]6.3424,[473]6.3369,[474]6.3382,[475]6.3322,[476]6.3339,[477]6.3264,[478]6.3254,[479]6.3317,[480]6.3368,[481]6.3389,[482]6.3344,[483]6.3303,[484]6.3324,[485]6.3302,[486]6.3248,[487]6.3247,[488]6.3227,[489]6.3178,[490]6.3153,[491]6.3124,[492]6.3067,[493]6.3036,[494]6.3018,[495]6.3016,[496]6.2979,[497]6.2922,[498]6.2904,[499]6.2856,[500]6.2760,[501]6.2692,[502]6.2694,[503]6.2688,[504]6.2595,[505]6.2619,[506]6.2628,[507]6.2570,[508]6.2528,[509]6.2518,[510]6.2556,[511]6.2601,[512]6.2640,[513]6.2661,[514]6.2724,[515]6.2669,[516]6.2659,[517]6.2668,[518]6.2671,[519]6.2699,[520]6.2725,[521]6.2741,[522]6.2770,[523]6.2779,[524]6.2836,[525]6.2872,[526]6.2883,[527]6.2905,[528]6.2855,[529]6.2861,[530]6.2812,[531]6.2798,[532]6.2849,[533]6.2871,[534]6.2856,[535]6.2881,[536]6.2828,[537]6.2804,[538]6.2852,[539]6.2861,[540]6.2901,[541]6.2908,[542]6.2918,[543]6.2931,[544]6.2945,[545]6.2922,[546]6.2930,[547]6.2886,[548]6.2834,[549]6.2834,[550]6.2806,[551]6.2769,[552]6.2751,[553]6.2710,[554]6.2684,[555]6.2658,[556]6.2653,[557]6.2674,[558]6.2634,[559]6.2632,[560]6.2627,[561]6.2629,[562]6.2608,[563]6.2611,[564]6.2656,[565]6.2675,[566]6.2673,[567]6.2650,[568]6.2656,[569]6.2638,[570]6.2665,[571]6.2670,[572]6.2681,[573]6.2680,[574]6.2646,[575]6.2641,[576]6.2643,[577]6.2630,[578]6.2611,[579]6.2619,[580]6.2551,[581]6.2513,[582]6.2502,[583]6.2510,[584]6.2514,[585]6.2438,[586]6.2370,[587]6.2373,[588]6.2424,[589]6.2480,[590]6.2511,[591]6.2532,[592]6.2516,[593]6.2482,[594]6.2492,[595]6.2468,[596]6.2503,[597]6.2479,[598]6.2445,[599]6.2467,[600]6.2463,[601]6.2447,[602]6
.2462,[603]6.2492,[604]6.2502,[605]6.2535,[606]6.2554,[607]6.2536,[608]6.2503,[609]6.2509,[610]6.2545,[611]6.2524,[612]6.2549,[613]6.2513,[614]6.2462,[615]6.2387,[616]6.2415,[617]6.2351,[618]6.2296,[619]6.2241,[620]6.2099,[621]6.2025,[622]6.2006,[623]6.2022,[624]6.2024,[625]6.2026,[626]6.2011,[627]6.2031,[628]6.2033,[629]6.2029,[630]6.2064,[631]6.2124,[632]6.2178,[633]6.2162,[634]6.2194,[635]6.2199,[636]6.2171,[637]6.2138,[638]6.2166,[639]6.2135,[640]6.2145,[641]6.2147,[642]6.2215,[643]6.2236,[644]6.2246,[645]6.2226,[646]6.2267,[647]6.2230,[648]6.2239,[649]6.2240,[650]6.2277,[651]6.2333,[652]6.2342,[653]6.2386,[654]6.2322,[655]6.2316,

llama_print_timings:        load time =  8857.77 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time = 5035069.54 ms / 335360 tokens (   15.01 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings:       total time = 5069457.11 ms

real	84m29.811s
user	128m20.366s
sys	4m12.799s

The new 7B perplexity on this branch with BLAS enabled is: 6.2316
We can expect a similar value without BLAS thanks to #951

The perplexity on master for the same setup is: 6.2897

Therefore we observe a delta of -0.0581 (6.2316 vs 6.2897) thanks to the 2x F16 scale factors in Q4_0


I was somewhat hoping for a value closer to the Q4_1 perplexity of 6.0863 reported in #896.
The current RMSE is:

q4_0 : rmse 0.00194636, maxerr 0.18359375, 95pct<0.0038, median<0.0016

This is much higher than the value reported in #896 for this approach:

# this value is after RMSE optimization
rmse 0.00159265, maxerr 0.17480469, 95pct<0.0030, median<0.0012

Either the claim in #896 that RMSE optimization brings only a 0.02 ppl improvement is not entirely correct, or my expectation that 2x F16 Q4_0 would be similar to Q4_1 on master was wrong.


I took the 2x F16 model from #896 and ran perplexity on the current branch. The result is:

655 iters: 6.2039
$  make clean && make -j perplexity && time ./perplexity -m ./models/7B/ggml-model-q4_0-ik.bin -f ./build/wiki.test.raw -t 8
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

rm -vf *.o main quantize quantize-stats perplexity embedding benchmark-q4_0-matmult
common.o
ggml.o
ggml_extra.o
llama.o
main
quantize
quantize-stats
perplexity
embedding
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

cc  -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -DGGML_USE_ACCELERATE   -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -c llama.cpp -o llama.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -c examples/common.cpp -o common.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread examples/perplexity/perplexity.cpp ggml.o llama.o common.o -o perplexity  -framework Accelerate
main: seed = 1681751064
llama.cpp: loading model from ./models/7B/ggml-model-q4_0-ik.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 4 (mostly Q4_1, some F16)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
perplexity : calculating perplexity over 655 chunks, batch_size=512
9.84 seconds per pass - ETA 1.79 hours
[1]4.4431,[2]4.8785,[3]5.7704,[4]6.3945,[5]6.5144,[6]6.4809,[7]6.6729,[8]6.7852,[9]7.1349,[10]7.3725,[11]7.5874,[12]7.6104,[13]7.5305,[14]7.6091,[15]7.8651,[16]7.4754,[17]7.3523,[18]7.3179,[19]6.9540,[20]6.9445,[21]6.8514,[22]6.6759,[23]6.6396,[24]6.5458,[25]6.5466,[26]6.3792,[27]6.1966,[28]6.0971,[29]6.0075,[30]5.8487,[31]5.8170,[32]5.8392,[33]5.7809,[34]5.8179,[35]5.8429,[36]5.8853,[37]5.8890,[38]5.9038,[39]5.9398,[40]5.9989,[41]6.0119,[42]6.0517,[43]6.0092,[44]6.0643,[45]6.0674,[46]6.0436,[47]6.0647,[48]6.0361,[49]6.0390,[50]5.9974,[51]5.9926,[52]5.9816,[53]6.0245,[54]6.0068,[55]5.9818,[56]6.0093,[57]6.0292,[58]6.0502,[59]6.0678,[60]6.1125,[61]6.1039,[62]6.1622,[63]6.1986,[64]6.2147,[65]6.2615,[66]6.2691,[67]6.2873,[68]6.3050,[69]6.3318,[70]6.3648,[71]6.3871,[72]6.4174,[73]6.4797,[74]6.4852,[75]6.4994,[76]6.5124,[77]6.5248,[78]6.5100,[79]6.5375,[80]6.5285,[81]6.5380,[82]6.5416,[83]6.4870,[84]6.4696,[85]6.4588,[86]6.4359,[87]6.3708,[88]6.3421,[89]6.3222,[90]6.3079,[91]6.3315,[92]6.3251,[93]6.3275,[94]6.3242,[95]6.3516,[96]6.3496,[97]6.3442,[98]6.3370,[99]6.3219,[100]6.3228,[101]6.3487,[102]6.3433,[103]6.3635,[104]6.3710,[105]6.3701,[106]6.3853,[107]6.3827,[108]6.3958,[109]6.3890,[110]6.3849,[111]6.4074,[112]6.4279,[113]6.4307,[114]6.4274,[115]6.4346,[116]6.4257,[117]6.4314,[118]6.4607,[119]6.4811,[120]6.5165,[121]6.5342,[122]6.5598,[123]6.5977,[124]6.6162,[125]6.6058,[126]6.6455,[127]6.6822,[128]6.7127,[129]6.6963,[130]6.7065,[131]6.7028,[132]6.6942,[133]6.6814,[134]6.6921,[135]6.6877,[136]6.6750,[137]6.6664,[138]6.6512,[139]6.6407,[140]6.6354,[141]6.6050,[142]6.6013,[143]6.5713,[144]6.5505,[145]6.5415,[146]6.5282,[147]6.5346,[148]6.5347,[149]6.5284,[150]6.5233,[151]6.5249,[152]6.5128,[153]6.4960,[154]6.4871,[155]6.4938,[156]6.4887,[157]6.5066,[158]6.5098,[159]6.5152,[160]6.5171,[161]6.5299,[162]6.4997,[163]6.4865,[164]6.4615,[165]6.4297,[166]6.4018,[167]6.3636,[168]6.3318,[169]6.3185,[170]6.3072,[171]6.2792,[172]6.2620,[173]6.2446,[174]6.2141,[175]6.1922,[176]6.1822,[177]6.1617,[178]6.1379,[179]6.1210,[180]6.1120,[181]6.0904,[182]6.0726,[183]6.0586,[184]6.0583,[185]6.0509,[186]6.0516,[187]6.0579,[188]6.0534,[189]6.0706,[190]6.0717,[191]6.0935,[192]6.1097,[193]6.1271,[194]6.1384,[195]6.1595,[196]6.1754,[197]6.1966,[198]6.2116,[199]6.2157,[200]6.2203,[201]6.2156,[202]6.2364,[203]6.2445,[204]6.2432,[205]6.2538,[206]6.2604,[207]6.2567,[208]6.2652,[209]6.2696,[210]6.2751,[211]6.2850,[212]6.2924,[213]6.3028,[214]6.3051,[215]6.3089,[216]6.3242,[217]6.3422,[218]6.3554,[219]6.3561,[220]6.3516,[221]6.3467,[222]6.3440,[223]6.3334,[224]6.3260,[225]6.3220,[226]6.3430,[227]6.3517,[228]6.3564,[229]6.3624,[230]6.3588,[231]6.3756,[232]6.3627,[233]6.3457,[234]6.3305,[235]6.3137,[236]6.3064,[237]6.2961,[238]6.2990,[239]6.2834,[240]6.2733,[241]6.2758,[242]6.2795,[243]6.2775,[244]6.2659,[245]6.2630,[246]6.2511,[247]6.2385,[248]6.2311,[249]6.2291,[250]6.2330,[251]6.2261,[252]6.2226,[253]6.2127,[254]6.2086,[255]6.1976,[256]6.1796,[257]6.1675,[258]6.1591,[259]6.1573,[260]6.1500,[261]6.1459,[262]6.1406,[263]6.1354,[264]6.1159,[265]6.1149,[266]6.1135,[267]6.1070,[268]6.1165,[269]6.1146,[270]6.1156,[271]6.1235,[272]6.1265,[273]6.1263,[274]6.1282,[275]6.1360,[276]6.1418,[277]6.1573,[278]6.1674,[279]6.1761,[280]6.1790,[281]6.1883,[282]6.1945,[283]6.2092,[284]6.2168,[285]6.2255,[286]6.2393,[287]6.2393,[288]6.2449,[289]6.2360,[290]6.2202,[291]6.2047,[292]6.1894,[293]6.1755,[294]6.1774,[295]6.1771,[296]6.1811,[297]6.1795,[298]6.1823,[299]6.1793,[300]6.1679,[301]6.1682,[302]6.1605,[303]6.1524,[304]6.1442,[305]6.1418,[30
6]6.1290,[307]6.1314,[308]6.1349,[309]6.1191,[310]6.1130,[311]6.1068,[312]6.1097,[313]6.1039,[314]6.1023,[315]6.0859,[316]6.0810,[317]6.0647,[318]6.0435,[319]6.0557,[320]6.0683,[321]6.0726,[322]6.0683,[323]6.0615,[324]6.0589,[325]6.0693,[326]6.0690,[327]6.0707,[328]6.0744,[329]6.0804,[330]6.0830,[331]6.0955,[332]6.0926,[333]6.0997,[334]6.0940,[335]6.0869,[336]6.0902,[337]6.0875,[338]6.0873,[339]6.0817,[340]6.0775,[341]6.0853,[342]6.0876,[343]6.0925,[344]6.0923,[345]6.0922,[346]6.0892,[347]6.0935,[348]6.0968,[349]6.0988,[350]6.0953,[351]6.0958,[352]6.0960,[353]6.0900,[354]6.0905,[355]6.0959,[356]6.0986,[357]6.0953,[358]6.1045,[359]6.1075,[360]6.1038,[361]6.1035,[362]6.1103,[363]6.1219,[364]6.1282,[365]6.1340,[366]6.1351,[367]6.1442,[368]6.1416,[369]6.1420,[370]6.1434,[371]6.1376,[372]6.1425,[373]6.1477,[374]6.1462,[375]6.1461,[376]6.1533,[377]6.1485,[378]6.1510,[379]6.1568,[380]6.1487,[381]6.1449,[382]6.1395,[383]6.1386,[384]6.1381,[385]6.1374,[386]6.1372,[387]6.1367,[388]6.1326,[389]6.1275,[390]6.1205,[391]6.1128,[392]6.1086,[393]6.1069,[394]6.1094,[395]6.1080,[396]6.1003,[397]6.1081,[398]6.1122,[399]6.1205,[400]6.1203,[401]6.1220,[402]6.1228,[403]6.1250,[404]6.1316,[405]6.1215,[406]6.1180,[407]6.1173,[408]6.1186,[409]6.1305,[410]6.1414,[411]6.1528,[412]6.1688,[413]6.1806,[414]6.1881,[415]6.1932,[416]6.2010,[417]6.2134,[418]6.2172,[419]6.2244,[420]6.2332,[421]6.2448,[422]6.2497,[423]6.2566,[424]6.2682,[425]6.2771,[426]6.2838,[427]6.2882,[428]6.2966,[429]6.3017,[430]6.3102,[431]6.3244,[432]6.3286,[433]6.3275,[434]6.3230,[435]6.3238,[436]6.3260,[437]6.3356,[438]6.3431,[439]6.3398,[440]6.3395,[441]6.3344,[442]6.3330,[443]6.3343,[444]6.3345,[445]6.3327,[446]6.3354,[447]6.3384,[448]6.3430,[449]6.3405,[450]6.3417,[451]6.3375,[452]6.3247,[453]6.3159,[454]6.3102,[455]6.3114,[456]6.3160,[457]6.3180,[458]6.3159,[459]6.3162,[460]6.3247,[461]6.3218,[462]6.3201,[463]6.3249,[464]6.3239,[465]6.3207,[466]6.3128,[467]6.3126,[468]6.3123,[469]6.3143,[470]6.3146,[471]6.3097,[472]6.3146,[473]6.3090,[474]6.3098,[475]6.3035,[476]6.3055,[477]6.2983,[478]6.2971,[479]6.3031,[480]6.3076,[481]6.3094,[482]6.3048,[483]6.3006,[484]6.3029,[485]6.3014,[486]6.2959,[487]6.2960,[488]6.2937,[489]6.2889,[490]6.2866,[491]6.2837,[492]6.2777,[493]6.2747,[494]6.2731,[495]6.2734,[496]6.2699,[497]6.2643,[498]6.2624,[499]6.2577,[500]6.2480,[501]6.2413,[502]6.2415,[503]6.2409,[504]6.2320,[505]6.2345,[506]6.2355,[507]6.2299,[508]6.2260,[509]6.2252,[510]6.2290,[511]6.2338,[512]6.2371,[513]6.2391,[514]6.2454,[515]6.2398,[516]6.2389,[517]6.2398,[518]6.2398,[519]6.2429,[520]6.2454,[521]6.2471,[522]6.2501,[523]6.2510,[524]6.2565,[525]6.2602,[526]6.2614,[527]6.2632,[528]6.2582,[529]6.2585,[530]6.2537,[531]6.2526,[532]6.2576,[533]6.2599,[534]6.2583,[535]6.2608,[536]6.2552,[537]6.2529,[538]6.2575,[539]6.2585,[540]6.2624,[541]6.2626,[542]6.2637,[543]6.2651,[544]6.2663,[545]6.2639,[546]6.2647,[547]6.2603,[548]6.2554,[549]6.2552,[550]6.2521,[551]6.2486,[552]6.2464,[553]6.2425,[554]6.2401,[555]6.2371,[556]6.2366,[557]6.2388,[558]6.2351,[559]6.2345,[560]6.2343,[561]6.2342,[562]6.2321,[563]6.2321,[564]6.2366,[565]6.2387,[566]6.2384,[567]6.2364,[568]6.2369,[569]6.2352,[570]6.2378,[571]6.2384,[572]6.2394,[573]6.2396,[574]6.2364,[575]6.2358,[576]6.2357,[577]6.2344,[578]6.2324,[579]6.2330,[580]6.2261,[581]6.2222,[582]6.2210,[583]6.2219,[584]6.2221,[585]6.2147,[586]6.2080,[587]6.2084,[588]6.2132,[589]6.2186,[590]6.2213,[591]6.2235,[592]6.2221,[593]6.2186,[594]6.2194,[595]6.2171,[596]6.2206,[597]6.2184,[598]6.2154,[599]6.2175,[600]6.2170,[601]6.2154,[602]6
.2171,[603]6.2204,[604]6.2214,[605]6.2248,[606]6.2267,[607]6.2250,[608]6.2217,[609]6.2224,[610]6.2259,[611]6.2240,[612]6.2267,[613]6.2229,[614]6.2177,[615]6.2104,[616]6.2132,[617]6.2070,[618]6.2018,[619]6.1962,[620]6.1820,[621]6.1749,[622]6.1732,[623]6.1747,[624]6.1753,[625]6.1755,[626]6.1742,[627]6.1762,[628]6.1763,[629]6.1757,[630]6.1789,[631]6.1847,[632]6.1902,[633]6.1885,[634]6.1918,[635]6.1925,[636]6.1895,[637]6.1862,[638]6.1890,[639]6.1860,[640]6.1869,[641]6.1871,[642]6.1937,[643]6.1958,[644]6.1969,[645]6.1950,[646]6.1990,[647]6.1952,[648]6.1962,[649]6.1964,[650]6.2005,[651]6.2060,[652]6.2069,[653]6.2108,[654]6.2045,[655]6.2039,

llama_print_timings:        load time = 10403.47 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time = 6083050.18 ms / 335360 tokens (   18.14 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings:       total time = 6118917.59 ms

real	101m59.169s
user	145m24.421s
sys	4m22.143s

So indeed, RMSE optimization leads to only about a 0.03 perplexity improvement (6.2316 vs 6.2039).

Given these results, I don't feel so confident about dropping Q4_1.


Conclusions

The new 2x F16 Q4_0 format is viable: it improves the 7B perplexity by 0.0581 (6.2897 → 6.2316) and has almost the same inference speed as the original format (~55 ms per token vs ~50 ms per token on M1)

Next steps

  • Reimplement this new format as Q4_2
#define QK4_2 16
typedef struct {
    ggml_fp16_t d;          // delta
    uint8_t qs[QK4_2 / 2];  // nibbles / quants
} block_q4_2;
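
For reference, sizeof(block_q4_2) would be 2 + 8 = 10 bytes per 16 weights, i.e. the same 5.0 bits per weight as the current Q4_0 (4 + 16 = 20 bytes per 32 weights). A minimal scalar quantizer for this layout could look roughly like the sketch below (illustrative only: the function name is hypothetical and the exact scale / rounding convention will be settled in the Q4_2 PR; ggml_fp32_to_fp16() comes from ggml.h):

#include <math.h>
#include <stdint.h>

static void quantize_block_q4_2_ref(const float * x, block_q4_2 * y) {
    // find the absolute maximum of the 16 input values
    float amax = 0.0f;
    for (int i = 0; i < QK4_2; i++) {
        const float v = fabsf(x[i]);
        if (v > amax) amax = v;
    }

    // scale so quantized values land in [-7, 7], stored with a +8 offset
    const float d  = amax / 7.0f;
    const float id = d ? 1.0f/d : 0.0f;

    y->d = ggml_fp32_to_fp16(d); // the F16 scale factor

    // pack two 4-bit quants per byte, low nibble first
    for (int i = 0; i < QK4_2/2; i++) {
        const uint8_t vi0 = (uint8_t)((int8_t)roundf(x[2*i + 0]*id) + 8);
        const uint8_t vi1 = (uint8_t)((int8_t)roundf(x[2*i + 1]*id) + 8);
        y->qs[i] = vi0 | (vi1 << 4);
    }
}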

@ggerganov ggerganov added generation quality Quality of model output breaking change Changes that break ABIs, APIs, file formats, or other forms of backwards compatibility. labels Apr 17, 2023
@ggerganov ggerganov force-pushed the q4_0-f16 branch 3 times, most recently from ec870de to b4c74b7 on April 17, 2023 14:34
@ggerganov ggerganov linked an issue Apr 17, 2023 that may be closed by this pull request
@ggerganov ggerganov self-assigned this Apr 17, 2023
@ggerganov ggerganov closed this Apr 17, 2023
@ggerganov ggerganov reopened this Apr 17, 2023
@sw
Contributor

sw commented Apr 17, 2023

I would have hoped the new format would be defined like this:

#define QK4_2 16
typedef struct {
    ggml_fp16_t d;          // delta
    uint8_t qs[QK4_2 / 2];  // nibbles / quants
} block_q4_2;

That is, don't force what are essentially two blocks into one struct, and also define a new version number.

Pros:

  • simplifies the scalar/reference implementation
  • doesn't break anything for Q4_0 users
  • SIMD instructions can still operate on two blocks at a time, if it makes sense

Cons:

  • needs two loads of the quants for SIMD
  • larger code base?

Of course, in the long run, we might decide to stop supporting Q4_0 or Q4_1.

What do you think? Would that really be slower?

@ggerganov
Owner Author

ggerganov commented Apr 17, 2023

@sw
Yes, let's do that. I will reimplement this using Q4_2 as suggested.

Btw, I'm again reconsidering keeping the SIMD quantize / dequantize. Too much code for little gains.
In #951 we considered keeping support because of LoRA, but I think we have to measure how long applying a LoRA takes with and without SIMD quantization. It might turn out that the difference is very small compared to the overall computation. Also, integrating RMSE-optimized quantization similar to #896 would be much simpler. cc @slaren

Dropping SIMD quantization support would make changes as in this PR much simpler.

@ggerganov ggerganov removed the breaking change Changes that break ABIs, APIs, file formats, or other forms of backwards compatibility. label Apr 17, 2023
@slaren
Collaborator

slaren commented Apr 17, 2023

Here is a quick test of the performance impact of quantization on LoRA:

perf_total_per_op_us[             ADD] =  12.776 ms
perf_total_per_op_us[         MUL_MAT] =  47.818 ms
perf_total_per_op_us[           SCALE] =   9.319 ms
perf_total_per_op_us[             CPY] =  45.780 ms

The quantization happens on the CPY operation, and currently this represents about 40% of the time to apply a layer.
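
For context, a rough sketch of how a LoRA delta maps onto those four ops (a simplification; tensor names like lora_a, lora_b, base_w_f32 and dest_quantized_w are placeholders, not the actual llama.cpp variables):

// W' = W + scaling * (B x A), with the result copied back into the quantized tensor
struct ggml_tensor * BA = ggml_mul_mat(ctx, lora_b, lora_a);           // MUL_MAT
BA = ggml_scale(ctx, BA, ggml_new_f32(ctx, scaling));                  // SCALE
struct ggml_tensor * W = ggml_add(ctx, base_w_f32, BA);                // ADD
// the quantization cost shows up here: copying F32 data into a Q4_x tensor
// goes through the quantize_row_q path inside the CPY op
struct ggml_tensor * r = ggml_cpy(ctx, W, dest_quantized_w);           // CPY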

Changing ggml_cpy to use quantize_row_q_reference instead:

perf_total_per_op_us[             ADD] =  13.698 ms
perf_total_per_op_us[         MUL_MAT] =  46.508 ms
perf_total_per_op_us[           SCALE] =   9.733 ms
perf_total_per_op_us[             CPY] = 384.500 ms

Without SIMD, the CPY takes 80% of the time.

Caveat: ggml_cpy is still not multithreaded, so we could expect significant gains from doing that. However, I think the difference is still going to be significant.

@slaren
Collaborator

slaren commented Apr 17, 2023

That said, I absolutely agree that ggml.c is too big and any simplification would be good. If you are not opposed to splitting ggml into multiple files, we could look into moving all the SIMD implementations into a separate file per platform. So we would have ggml.c with the core code and the C implementations, ggml-avx.c with the AVX implementations, ggml-neon.c for NEON, and so on.

@ggerganov
Owner Author

ggerganov commented Apr 17, 2023

If we change the roundf to:

            const uint8_t vi0 = (uint8_t)(v0 + 8.5f);
            const uint8_t vi1 = (uint8_t)(v1 + 8.5f);

The timing should be much closer I think.
But in any case, 0.5s for applying a LoRA adapter does not sound bad at all.
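
To spell out the trick (a sketch of a Q4_0-style inner loop, not the exact diff): v0 is the scaled value before the +8 offset, so v0 + 8 is non-negative, the float-to-integer conversion truncates, and adding 0.5 beforehand turns that truncation into round-half-up. It matches roundf(v0) + 8 in all non-tie cases while avoiding the libm call:

// current reference: round, then shift into the unsigned nibble range
const uint8_t vi0_ref  = (uint8_t)((int8_t)roundf(v0) + 8);
// proposed: fold the rounding into the offset; truncation of v0 + 8.5
// rounds half-up instead of half-away-from-zero, which is acceptable here
const uint8_t vi0_fast = (uint8_t)(v0 + 8.5f);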

We can reduce ggml.c significantly with a few macros. For example:

#define GGML_TENSOR_SIZES_SRC0(x) // defines ne00, ne01, ..., nb00, nb01, etc..
#define GGML_TENSOR_SIZES_SRC1(x) // defines ne10, ne11, ..., nb10, nb11, etc..
#define GGML_TENSOR_SIZES_DST(x)  // defines ne0, ne1, ..., nb0, nb1, etc..

#define GGML_GET_PTR_ROW(x, i) // get ptr to ith row using strides

etc..
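
As an illustration, the first macro could expand to something like this (hypothetical expansion; the names simply mirror the ne00/nb00 locals that are currently written out by hand at the top of each compute function):

#define GGML_TENSOR_SIZES_SRC0(t)        \
    const int64_t ne00 = (t)->ne[0];     \
    const int64_t ne01 = (t)->ne[1];     \
    const int64_t ne02 = (t)->ne[2];     \
    const int64_t ne03 = (t)->ne[3];     \
    const size_t  nb00 = (t)->nb[0];     \
    const size_t  nb01 = (t)->nb[1];     \
    const size_t  nb02 = (t)->nb[2];     \
    const size_t  nb03 = (t)->nb[3];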

We can do this refactoring soon, let's say after the quantization work is done.
It should make ggml.c much more compact.

But the SIMD remains a problem: even if you had multiple small files, or ggml.c wasn't so big, you would still need to implement all the missing routines for each new experiment. And the quantize / dequantize routines are no longer needed at inference run-time, only for model generation and LoRA.

Edit:

you still need to implement all the missing routines for each new experiment

Hm, actually that's not true. I guess because I was editing over the original Q4_0, I felt like I had to reimplement everything.
With the new Q4_2 I can implement just the dot product and have the reference implementations for the quantization for now.
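
As a rough illustration of how little is strictly needed, a scalar Q4_2 dot product is just a dequantize-and-accumulate loop. The sketch below dots against plain F32 activations for simplicity (the real kernel would use the quantized intermediate from #951, and the function name here is hypothetical); ggml_fp16_to_fp32() comes from ggml.h:

static float vec_dot_q4_2_f32_ref(int n, const block_q4_2 * x, const float * y) {
    float sum = 0.0f;
    for (int b = 0; b < n/QK4_2; b++) {
        const float d = ggml_fp16_to_fp32(x[b].d);    // the F16 scale
        for (int j = 0; j < QK4_2/2; j++) {
            const int q0 = (x[b].qs[j] & 0x0F) - 8;   // low nibble
            const int q1 = (x[b].qs[j] >>   4) - 8;   // high nibble
            sum += d*q0 * y[b*QK4_2 + 2*j + 0];
            sum += d*q1 * y[b*QK4_2 + 2*j + 1];
        }
    }
    return sum;
}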

@slaren
Collaborator

slaren commented Apr 17, 2023

But in any case, 0.5s for applying a LoRA adapter does not sound bad at all.

This is just for a single tensor; applying it to the entire model can take a lot of time, especially with the larger models. Applying the entire LoRA (this is baize-lora-7B) takes 30 seconds on my machine. This is a bit of a pathological case, since it also modifies the feed-forward tensors (usually LoRAs only modify the attention tensors), but still, this is very slow for something that has to be done every time.

But, the SIMD remains a problem because even if you had multiple small files or ggml.c wasn't so big, you still need to implement all the missing routines for each new experiment.

We can treat the SIMD implementations of functions like quantize and dequantize as very low priority and simply ignore them in experiments, and only if an experiment is successful, allow other people to implement them later in separate PRs. I think this is what we are already doing in practice.

Edit: removing the roundf from quantize is indeed much faster:

quantize fn          overall time
AVX2                 32148.85 ms
ref without roundf   41659.51 ms
ref                  76086.71 ms

@howard0su
Collaborator

@sw Yes, let's do that. I will reimplement this using Q4_2 as suggested.

Btw, I'm again reconsidering keeping the SIMD quantize / dequantize. Too much code for little gains. In #951 we considered keeping support due to LoRA, but I think we have to measure how much it takes to do LoRA with and without SIMD quantization. It might turn out that the difference is very small compared to overall computation. Also, integration of RMSE-optimized quantization similar to #896 would be much simpler. cc @slaren

Dropping SIMD quantization support would make changes as in this PR much simpler.

One thing to consider is OpenMP, which supports both multithreading and SIMD today. It would keep the code simple and portable while leveraging the latest hardware features (multiple cores and AVX).
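
For example, a portable element-wise kernel can get both threading and vectorization from a single pragma (a generic sketch, not code from this PR; build with -fopenmp):

// the pragma asks the compiler to split the loop across threads
// and vectorize each thread's chunk
void vec_scale_f32_omp(float * dst, const float * src, float s, long n) {
    #pragma omp parallel for simd
    for (long i = 0; i < n; i++) {
        dst[i] = s * src[i];
    }
}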

@ggerganov
Owner Author

Wow - we handle running out of disk space so gracefully!


😄

@ggerganov ggerganov mentioned this pull request Apr 18, 2023
@ggerganov
Owner Author

Reimplementation continues in #1046

@ggerganov ggerganov closed this Apr 18, 2023
@ggerganov ggerganov deleted the q4_0-f16 branch April 24, 2023 19:19
Successfully merging this pull request may close these issues.

Investigate the performance (speed and perplexity) of Q4_0 with 2x F16 factors