New Q4_0 implementation using 2x F16 instead of 1x F32 #1026
Conversation
Force-pushed from ec870de to b4c74b7
I would have hoped the new format would be defined like this:
That is, don't force what are essentially two blocks into one struct, and also define a new version number. Pros:
Cons:
Of course, in the long run, we might decide to stop supporting the old format. What do you think? Would that really be slower?
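For illustration, here is a rough sketch of the two layouts under discussion; the struct and field names are hypothetical, contrasting the PR's packed two-scale block with the suggested smaller block plus a format-version bump:

```c
#include <stdint.h>

typedef uint16_t ggml_fp16_t;   // stand-in for ggml's half-precision storage type

// Layout of this PR (hypothetical name): one 32-element block carrying two
// F16 scales, i.e. two 16-element sub-blocks packed into a single struct.
typedef struct {
    ggml_fp16_t d0;      // scale for elements  0..15
    ggml_fp16_t d1;      // scale for elements 16..31
    uint8_t     qs[16];  // 32 x 4-bit quants
} block_q4_0_2xf16;      // 20 bytes per 32 weights

// Alternative suggested above (hypothetical name): keep 16-element blocks with
// a single F16 scale each and bump the file-format version instead.
typedef struct {
    ggml_fp16_t d;       // scale for elements 0..15
    uint8_t     qs[8];   // 16 x 4-bit quants
} block_q4_0_v2;         // 10 bytes per 16 weights, i.e. the same bits per weight
```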
@sw Btw, I'm again reconsidering keeping the SIMD quantization routines. Dropping SIMD quantization support would make changes as in this PR much simpler.
Here is a quick test of the performance impact of quantization on LoRA:
The quantization happens on the CPY operation, and currently this represents about 40% of the time to apply a layer. Changing the quantization to the non-SIMD reference implementation, the CPY takes 80% of the time instead. Caveat:
That said, I absolutely agree.
If we change the quantization implementation, the timing should be much closer I think.

We can reduce the boilerplate in the operator implementations with something like:

```c
#define GGML_TENSOR_SIZES_SRC0(x) // defines ne00, ne01, ..., nb00, nb01, etc..
#define GGML_TENSOR_SIZES_SRC1(x) // defines ne10, ne11, ..., nb10, nb11, etc..
#define GGML_TENSOR_SIZES_DST(x)  // defines ne0,  ne1,  ..., nb0,  nb1,  etc..
#define GGML_GET_PTR_ROW(x, i)    // get ptr to ith row using strides
```
etc.. We can do this refactoring soon, let's say after the quantization work is done. But, the SIMD remains a problem because even if you had multiple small files or macros, changing the format would still mean updating every SIMD variant.

Edit: Hm, actually that's not true. I guess because I was editing over the original implementation.
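As an illustration of the macro idea above, a sketch only (not actual ggml code) of how such helpers could expand to local shape/stride variables and a strided row lookup, assuming the usual `ne`/`nb`/`data` fields of `ggml_tensor`:

```c
// Sketch: one possible expansion of the proposed helpers, assuming
// ggml_tensor's ne (element counts), nb (byte strides) and data fields.
#define GGML_TENSOR_SIZES_SRC0(t)                                  \
    const int64_t ne00 = (t)->ne[0], ne01 = (t)->ne[1],            \
                  ne02 = (t)->ne[2], ne03 = (t)->ne[3];            \
    const size_t  nb00 = (t)->nb[0], nb01 = (t)->nb[1],            \
                  nb02 = (t)->nb[2], nb03 = (t)->nb[3];

#define GGML_GET_PTR_ROW(t, i) ((char *)(t)->data + (size_t)(i)*(t)->nb[1])

// Usage inside an operator:
//   GGML_TENSOR_SIZES_SRC0(src0);
//   const float * row = (const float *) GGML_GET_PTR_ROW(src0, ir);
```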
This is just for a single tensor; applying it to the entire model can take a lot of time, especially with the larger models. Applying the entire LoRA (this is baize-lora-7B) takes 30 seconds on my machine. This is a bit of a pathological case since it also modifies the feed-forward tensors (usually LoRAs only modify the attention tensors), but still, this is very slow for something that has to be done every time.

We can consider the SIMD implementations of functions like quantize and dequantize very low priority, simply ignore them in experiments, and only if the experiment is successful allow other people to implement them later in separate PRs. I think this is what we are already doing in practice.

Edit: removing the roundf from quantize is indeed much faster:
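A minimal sketch of what that change could look like in the scalar path (simplified, not the exact `quantize_row_q4_0` code; function names hypothetical): the per-element `roundf` call can be replaced by adding the offset plus 0.5 and truncating.

```c
#include <math.h>
#include <stdint.h>

// Simplified scalar Q4_0 step: map x to a 4-bit value with offset 8,
// where id = 1/d is the inverse block scale and x*id lies in [-8, 7].

// With roundf (one libm-style rounding per element):
static inline uint8_t q4_with_roundf(float x, float id) {
    return (uint8_t)((int8_t)roundf(x*id) + 8);
}

// Without roundf: x*id + 8.5f lies in [0.5, 15.5], so the truncating cast
// still rounds to nearest (only exact .5 ties are handled differently).
static inline uint8_t q4_without_roundf(float x, float id) {
    return (uint8_t)(x*id + 8.5f);
}
```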
One thing to consider is OpenMP, which supports both multi-threading and SIMD today. It would keep the code simple and portable while leveraging the latest hardware features (multi-core and AVX).
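For example, a loop parallelized and vectorized with OpenMP pragmas alone might look like this (a generic sketch, not tied to any particular ggml kernel; build with `-fopenmp`):

```c
#include <stddef.h>

// Generic sketch: OpenMP handles both threading (parallel for) and
// vectorization (simd) without hand-written pthreads or intrinsics.
void scale_rows(float * dst, const float * src, int nrows, int ncols, float s) {
    #pragma omp parallel for
    for (int r = 0; r < nrows; ++r) {
        #pragma omp simd
        for (int c = 0; c < ncols; ++c) {
            dst[(size_t)r*ncols + c] = src[(size_t)r*ncols + c] * s;
        }
    }
}
```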
Reimplementation continues in #1046
ref #959
ARM NEON-only implementation

Timing

Time per token: ~55 ms, up from ~50 ms for `Q4_0` on `master`
Perplexity

Without BLAS:

- 25 iters: 6.5251
- 655 iters: 6.2319

With BLAS:

- 25 iters: 6.5146
- 655 iters: 6.2316
The new 7B perplexity on this branch with BLAS enabled is: `6.2316`. We can expect a similar value without BLAS thanks to #951.
The perplexity on `master` for the same setup is: `6.2897`. Therefore we observe a delta of `-0.0581` thanks to the 2x F16 scale factors in `Q4_0`.
Somehow I was hoping for a value closer to the `Q4_1` perplexity of `6.0863` reported in #896.

The current RMSE is much higher than the one reported in #896 for this approach.
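For context, the RMSE figures compare the original weights with their values after a quantize/dequantize round trip; a minimal sketch of that computation (helper name hypothetical):

```c
#include <math.h>
#include <stddef.h>

// Root-mean-square error between the original weights x and their
// quantized-then-dequantized reconstruction y, both of length n.
static double weight_rmse(const float * x, const float * y, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        const double d = (double)x[i] - (double)y[i];
        sum += d*d;
    }
    return sqrt(sum / (double)n);
}
```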
Either the claim in #896 that RMSE optimization brings only `0.02` ppl is not entirely correct, or my expectation that 2x F16 `Q4_0` would be similar to `Q4_1` on `master` was not correct.

Took the 2x F16 model from #896 and ran perplexity with the current branch. The result is:

- 655 iters: 6.2039
So indeed, RMSE optimization leads to just about `-0.03` perplexity gain. I guess I don't feel so confident about dropping `Q4_1` given these results.

Conclusions
The new 2x F16 `Q4_0` format is viable: it improves 7B perplexity by `-0.0581` and has almost the same inference speed as the original format (`50 ms per token` vs `55 ms per token` on M1).

Next steps
- Merge `Q4_2` into `master` and have the other arches merged as well
- Improve `Q4_1` by adding 8-bit intermediate results as in Add Q8_0 quantization for intermediate results #951 and potentially implementing `Q4_3` - similar to the approach in this PR
- Evaluate the `output` tensor using `Q4_2` and `Q4_3` (see Measure perplexity delta between Q4_0 and F16 "output" tensor #1003) and decide if more quantization improvements should be pursued