The MLX Challenge #6539

ggerganov · 2024-04-08T09:55:49Z

ref https://twitter.com/awnihannun/status/1777072588633882741

This branch starts from the flash-attention branch (#5021, #6508).

To perform a benchmark for the challenge, run:

# generate pure 4-bit model
./quantize --pure models/mistral-7b/ggml-model-f16.gguf models/mistral-7b/ggml-model-q4_0-pure.gguf q4_0

make -j llama-bench
./llama-bench -m ./models/mistral-7b/ggml-model-q4_0-pure.gguf -p 0 -t 4 -n 128 -r 10 -fa 1

Current numbers on M2 Ultra:

model	size	params	backend	ngl	threads	test	t/s
llama 7B Q4_0	3.79 GiB	7.24 B	Metal	99	4	tg 128	102.29 ± 0.07

build: 22df85f (2707)
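
For a quick sanity check of the number above, the throughput can be turned into per-token latency and a rough weight-read rate. This is only a back-of-the-envelope sketch: it assumes that generating each token streams the full model weights from memory exactly once, which the benchmark output does not state, and it ignores KV cache traffic.

```python
# Figures copied from the llama-bench row above.
model_size_gib = 3.79        # "size" column
tokens_per_second = 102.29   # "t/s" column for the tg 128 test

# Average time spent generating one token.
ms_per_token = 1000.0 / tokens_per_second

# Assumption (not stated in the benchmark): each generated token reads
# every weight once, so the read rate is model size times tokens per second.
effective_gib_per_s = model_size_gib * tokens_per_second

print(f"{ms_per_token:.2f} ms/token")
print(f"{effective_gib_per_s:.1f} GiB/s effective weight read rate")
```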

ggerganov · 2024-11-17T09:29:55Z

We don't currently support a group size of 64 (which is what I think MLX uses), so we can't make an apples-to-apples comparison with MLX.
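
To illustrate why the group size matters for the comparison, here is a minimal sketch of group-wise 4-bit quantization with a configurable group size. It is a simplified symmetric absmax scheme written for illustration only, not the actual Q4_0 kernel or anything taken from MLX. The helper names `quantize_groups` and `bits_per_weight` are hypothetical, and the 16-bit scale per group is an assumption.

```python
def quantize_groups(weights, group_size, bits=4):
    """Simplified symmetric absmax quantization with one scale per group.

    Illustrative only: this is not the exact Q4_0 or MLX scheme.
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit signed codes
    groups = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        amax = max(abs(w) for w in group)
        scale = amax / qmax if amax > 0 else 1.0
        codes = [max(-qmax, min(qmax, round(w / scale))) for w in group]
        groups.append((scale, codes))
    return groups


def bits_per_weight(group_size, bits=4, scale_bits=16):
    """Storage cost per weight, assuming one scale of scale_bits per group."""
    return bits + scale_bits / group_size


# Deterministic toy weights in [-1, 1].
weights = [((i * 37) % 101 - 50) / 50.0 for i in range(128)]

groups_32 = quantize_groups(weights, 32)
groups_64 = quantize_groups(weights, 64)

print(len(groups_32), "groups at size 32,", bits_per_weight(32), "bits/weight")
print(len(groups_64), "groups at size 64,", bits_per_weight(64), "bits/weight")
```

Under these assumptions, a larger group stores fewer scales per weight, so two models quantized with different group sizes differ in both size and accuracy, and their speeds are not directly comparable.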

Base automatically changed from gg/flash-attn-vec to gg/flash-attn April 18, 2024 11:33

ggerganov force-pushed the gg/flash-attn branch 4 times, most recently from 82b282c to ce281b9, on April 24, 2024 14:54

mofosyne added the labels "Review Complexity : High" (generally requires in-depth knowledge of LLMs or GPUs) and "performance" (speed-related topics) on May 10, 2024

llama : more metal-friendly KV cache PAD

33a004e

ggerganov force-pushed the mlx-challenge branch from 22df85f to 33a004e on May 13, 2024 07:40

ggerganov changed the base branch from gg/flash-attn to master May 13, 2024 07:40

atelepov approved these changes Jul 24, 2024

ggerganov closed this Nov 17, 2024
