
The MLX Challenge #6539

Closed · wants to merge 1 commit

Conversation

ggerganov (Owner) commented on Apr 8, 2024

ref https://twitter.com/awnihannun/status/1777072588633882741

This branch starts from the flash-attention branch (#5021, #6508).
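
For context on the `-fa 1` flag used below: the flash-attention kernels avoid materializing the full QK^T matrix by streaming over the KV cache with an online softmax, rescaling partial results whenever a new running maximum is found. A minimal scalar sketch of that update rule (an illustration of the technique only, not the actual Metal kernel):

```cpp
// Streaming softmax-weighted accumulation: one pass over the scores,
// keeping a running maximum for numerical stability. The real kernels
// apply the same rescaling trick over tiles of the KV cache in parallel.
#include <cmath>
#include <cstdio>
#include <vector>

// out = sum_i softmax(s)_i * v_i, computed without storing softmax(s).
float online_softmax_dot(const std::vector<float> & s, const std::vector<float> & v) {
    float m   = -INFINITY; // running max of the scores seen so far
    float l   = 0.0f;      // running sum of exp(s_i - m)
    float acc = 0.0f;      // running weighted sum of v_i

    for (size_t i = 0; i < s.size(); ++i) {
        const float m_new = std::fmax(m, s[i]);
        const float scale = std::exp(m - m_new);   // rescale previous state
        const float p     = std::exp(s[i] - m_new);
        l   = l  *scale + p;
        acc = acc*scale + p*v[i];
        m   = m_new;
    }
    return acc/l;
}

int main() {
    const std::vector<float> s = {0.5f, 2.0f, -1.0f};
    const std::vector<float> v = {1.0f, 2.0f, 3.0f};
    printf("%f\n", online_softmax_dot(s, v)); // matches a two-pass softmax
}
```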

To perform a benchmark for the challenge, run:

```bash
# generate a pure 4-bit model (--pure disables k-quant mixtures and
# quantizes all tensors to the same type)
./quantize --pure models/mistral-7b/ggml-model-f16.gguf models/mistral-7b/ggml-model-q4_0-pure.gguf q4_0

# build and run the benchmark (-fa 1 enables the flash attention kernels)
make -j llama-bench
./llama-bench -m ./models/mistral-7b/ggml-model-q4_0-pure.gguf -p 0 -t 4 -n 128 -r 10 -fa 1
```
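
For reference, "pure" Q4_0 means every tensor uses the same block format: 32 consecutive weights share one fp16 scale and each weight takes 4 bits, i.e. roughly 4.5 bits per weight. A simplified sketch of the encode/decode idea (the layout matches ggml's block_q4_0, but the rounding details here are an approximation, not the exact implementation):

```cpp
// Simplified Q4_0-style block quantization: 32 weights share one scale,
// each weight is a 4-bit unsigned value with an implicit -8 offset.
#include <algorithm>
#include <cmath>
#include <cstdint>

constexpr int QK = 32; // block size (QK4_0 in ggml)

struct BlockQ4 {
    float   d;          // per-block scale (stored as fp16 in the real format)
    uint8_t qs[QK/2];   // 32 x 4-bit quants, packed two per byte
};

BlockQ4 quantize_block(const float * x) {
    // the element with the largest magnitude determines the scale;
    // keeping its sign maps it exactly onto one end of the 4-bit range
    float amax = 0.0f, vmax = 0.0f;
    for (int i = 0; i < QK; ++i) {
        if (std::fabs(x[i]) > amax) { amax = std::fabs(x[i]); vmax = x[i]; }
    }
    BlockQ4 b = {};
    b.d = vmax/-8.0f;
    const float id = b.d != 0.0f ? 1.0f/b.d : 0.0f;
    for (int i = 0; i < QK/2; ++i) {
        const int q0 = std::clamp((int)(x[i       ]*id + 8.5f), 0, 15);
        const int q1 = std::clamp((int)(x[i + QK/2]*id + 8.5f), 0, 15);
        b.qs[i] = (uint8_t)(q0 | (q1 << 4));
    }
    return b;
}

void dequantize_block(const BlockQ4 & b, float * y) {
    for (int i = 0; i < QK/2; ++i) {
        y[i       ] = ((b.qs[i] & 0x0F) - 8)*b.d; // low nibbles: first half
        y[i + QK/2] = ((b.qs[i] >>   4) - 8)*b.d; // high nibbles: second half
    }
}
```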

Current numbers on M2 Ultra:

| model         |     size |  params | backend | ngl | threads | test   |           t/s |
| ------------- | -------: | ------: | ------- | --: | ------: | ------ | ------------: |
| llama 7B Q4_0 | 3.79 GiB |  7.24 B | Metal   |  99 |       4 | tg 128 | 102.29 ± 0.07 |

build: 22df85f (2707)

Base automatically changed from gg/flash-attn-vec to gg/flash-attn April 18, 2024 11:33
@ggerganov ggerganov force-pushed the gg/flash-attn branch 4 times, most recently from 82b282c to ce281b9 on April 24, 2024 14:54
@mofosyne mofosyne added the labels Review Complexity : High and performance on May 10, 2024
@ggerganov ggerganov changed the base branch from gg/flash-attn to master May 13, 2024 07:40
ggerganov (Owner, Author) commented:

We don't support a group size of 64 atm (which is what I think MLX uses), so we can't make an apples-to-apples comparison with MLX.
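
For anyone comparing the two formats: the group (block) size is the number of consecutive weights that share one set of quantization parameters, so it affects both the effective bits per weight and the quantization error. An illustrative calculation, assuming 4-bit quants plus fp16 side data per group (what side data MLX stores per group is an assumption here):

```cpp
// Illustrative only: effective bits-per-weight for 4-bit group quantization,
// where each group of weights carries some number of fp16 side values
// (a scale, optionally also a bias/minimum).
#include <cstdio>

double bits_per_weight(int group_size, int fp16_side_values) {
    return 4.0 + 16.0*fp16_side_values/group_size;
}

int main() {
    printf("group 32, scale only   : %.3f bpw\n", bits_per_weight(32, 1)); // 4.500 (Q4_0)
    printf("group 64, scale only   : %.3f bpw\n", bits_per_weight(64, 1)); // 4.250
    printf("group 64, scale + bias : %.3f bpw\n", bits_per_weight(64, 2)); // 4.500
}
```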

@ggerganov ggerganov closed this Nov 17, 2024