
Introduction of gemm4xN and gemmMx4 for Q4_0 and Q8_0 for better performance results #8908

Merged

Conversation

Srihari-mcw (Contributor):

  • The PR introduces templated gemm4xN and gemmMx4 functions covering the GEMM shapes of the relevant dimensions for Q4_0 and Q8_0.
  • The functions use _mm_cvtph_ps to convert the block delta values to FP32 precision and _mm_mul_ps to multiply the deltas. The loops are unrolled so that the resulting delta products can be extracted and reused.
  • These changes improve Q4_0 and Q8_0 quantization performance, especially during prompt processing.

GCC Linux :

Meta Llama2 7B model:

Q4_0 Model :

| model | size | params | backend | threads | test | t/s | speedup | commit id | notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 6 | pp 512 | 43.79 ± 0.08 | | 7e72aa74 | Base commit |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 6 | pp 512 | 59.37 ± 0.08 | 35.58% | cdf3a251 | Commit with PR changes |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 6 | tg 128 | 14.65 ± 0.01 | | 7e72aa74 | Base commit |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 6 | tg 128 | 14.51 ± 0.00 | -0.96% | cdf3a251 | Commit with PR changes |

Q8_0 Model :

| model | size | params | backend | threads | test | t/s | speedup | commit id | notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | CPU | 6 | pp 512 | 56.87 ± 0.06 | | 7e72aa74 | Base commit |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | CPU | 6 | pp 512 | 68.03 ± 0.13 | 19.69% | cdf3a251 | Commit with PR changes |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | CPU | 6 | tg 128 | 8.12 ± 0.00 | | 7e72aa74 | Base commit |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | CPU | 6 | tg 128 | 8.12 ± 0.00 | 0.00% | cdf3a251 | Commit with PR changes |

Mistral-7B-Instruct-v0.3 model:

Q4_0 Model :

| model | size | params | backend | threads | test | t/s | speedup | commit id | notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.83 GiB | 7.25 B | CPU | 6 | pp 512 | 40.96 ± 0.05 | | 7e72aa74 | Base commit |
| llama 7B Q4_0 | 3.83 GiB | 7.25 B | CPU | 6 | pp 512 | 55.71 ± 0.11 | 36.01% | cdf3a251 | Commit with PR changes |
| llama 7B Q4_0 | 3.83 GiB | 7.25 B | CPU | 6 | tg 128 | 13.81 ± 0.01 | | 7e72aa74 | Base commit |
| llama 7B Q4_0 | 3.83 GiB | 7.25 B | CPU | 6 | tg 128 | 13.66 ± 0.00 | -1.09% | cdf3a251 | Commit with PR changes |

Q8_0 Model :

| model | size | params | backend | threads | test | t/s | speedup | commit id | notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | CPU | 6 | pp 512 | 53.34 ± 0.04 | | 7e72aa74 | Base commit |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | CPU | 6 | pp 512 | 63.64 ± 0.07 | 19.31% | cdf3a251 | Commit with PR changes |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | CPU | 6 | tg 128 | 7.59 ± 0.00 | | 7e72aa74 | Base commit |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | CPU | 6 | tg 128 | 7.60 ± 0.00 | 0.13% | cdf3a251 | Commit with PR changes |

GCC Version = 12.3

The PR was tested on an AMD Ryzen 5 7600X (Raphael), which supports the following flags by default:

| AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

Original Unquantized Models :

Llama2 7B : https://huggingface.co/meta-llama/Llama-2-7b
Mistral 7B Instruct v0.3 : https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3

Srihari-mcw (Contributor, Author):

The PR #8908 was also tested on an AMD Ryzen Threadripper PRO 5995WX machine. Test results are attached below, along with the supported flags and other details.

Performance Results in AMD Ryzen Threadripper PRO 5995WX

GCC Linux :

Mistral-7B-Instruct-v0.3 model:

Q4_0 Model :

| model | size | params | backend | threads | test | t/s | speedup | commit id | notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.83 GiB | 7.25 B | CPU | 64 | pp 512 | 189.30 ± 0.31 | | 7e72aa74 | Base commit |
| llama 7B Q4_0 | 3.83 GiB | 7.25 B | CPU | 64 | pp 512 | 210.26 ± 0.32 | 11.07% | cdf3a251 | Commit with PR changes |
| llama 7B Q4_0 | 3.83 GiB | 7.25 B | CPU | 64 | tg 128 | 33.74 ± 0.04 | | 7e72aa74 | Base commit |
| llama 7B Q4_0 | 3.83 GiB | 7.25 B | CPU | 64 | tg 128 | 33.77 ± 0.05 | 0.09% | cdf3a251 | Commit with PR changes |

Q8_0 Model :

| model | size | params | backend | threads | test | t/s | speedup | commit id | notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | CPU | 64 | pp 512 | 214.93 ± 0.25 | | 7e72aa74 | Base commit |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | CPU | 64 | pp 512 | 241.85 ± 0.47 | 12.53% | cdf3a251 | Commit with PR changes |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | CPU | 64 | tg 128 | 19.83 ± 0.01 | | 7e72aa74 | Base commit |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | CPU | 64 | tg 128 | 19.74 ± 0.00 | -0.45% | cdf3a251 | Commit with PR changes |

GCC Version = 12.3

The machine supports the following flags by default:

| AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

Original Unquantized Models :

Llama2 7B : https://huggingface.co/meta-llama/Llama-2-7b
Mistral 7B Instruct v0.3 : https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3

mofosyne added the "Review Complexity : Medium" label (generally requires more time to grok, but manageable at beginner to medium expertise level) on Aug 8, 2024.
ggerganov (Owner) left a comment:

I observe a 10%-15% PP speed improvement on a Ryzen 9 5950X using Gemma 2 2B models. Perplexity is the same.

ggerganov merged commit ea5d747 into ggerganov:master on Aug 31, 2024.
53 checks passed
dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024