
gfx908 optimizations #8082

Closed · wants to merge 1 commit
Conversation

@IMbackK (Contributor) commented Jun 23, 2024

This minor optimization work increases CDNA prompt-processing performance by around 10x.

Current master:

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | ROCm | 99 | pp1024 | 56.73 ± 0.34 |
| llama 7B Q4_K - Medium | 4.07 GiB | 7.24 B | ROCm | 99 | pp1024 | 585.58 ± 0.86 |

This PR:

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | ROCm | 99 | pp1024 | 462.30 ± 0.95 |
| llama 7B Q4_K - Medium | 4.07 GiB | 7.24 B | ROCm | 99 | pp1024 | 3196.68 ± 1.97 |
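For reference, the per-model speedups implied by the two benchmark runs can be checked with a quick script (the throughput numbers are copied straight from the tables above):

```python
# pp1024 throughput (t/s) copied from the benchmark tables above.
master = {"llama 70B Q4_K": 56.73, "llama 7B Q4_K": 585.58}
this_pr = {"llama 70B Q4_K": 462.30, "llama 7B Q4_K": 3196.68}

for model in master:
    speedup = this_pr[model] / master[model]
    print(f"{model}: {speedup:.2f}x")
# → llama 70B Q4_K: 8.15x
# → llama 7B Q4_K: 5.46x
```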

As most of the remaining time is now spent in the attention kernels, merging #7011 further increases performance by about 2x.

@github-actions bot added the Nvidia GPU (issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Jun 23, 2024
@JohannesGaessler (Collaborator) commented:

Which specific GPU are you using?

@IMbackK (Contributor, Author) commented Jun 23, 2024

gfx908, aka the MI100; gfx90a, aka the MI200 family, should have essentially identical performance characteristics.

@ccbadd commented Jun 24, 2024

If this gets merged I'm going to have to fire up my server with the 2X MI100s and give it another try. I never understood why they were pretty much the same speed as my W6800 Pro's.

@daniandtheweb mentioned this pull request Jun 24, 2024
@IMbackK (Contributor, Author) commented Jun 24, 2024

> If this gets merged I'm going to have to fire up my server with the 2X MI100s and give it another try. I never understood why they were pretty much the same speed as my W6800 Pro's.

Well, this of course doesn't help at all with token generation, since GEMV is the most time-consuming kernel there. In general, looking at Omniperf, there is still quite some distance to go for decent performance there.
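The GEMV bottleneck in token generation is a bandwidth problem rather than a compute problem. A rough back-of-the-envelope sketch (the peak figures are approximate published MI100 specs, not measurements from this PR, and the fp16 weight assumption is for illustration only):

```python
# Rough arithmetic-intensity estimate for a GEMV y = W @ x, with W an
# n x n fp16 weight matrix. Peak figures are approximate MI100 specs.
peak_fp16_tflops = 184.6        # MI100 fp16 matrix peak, approximate
peak_bandwidth_tbs = 1.2        # MI100 HBM2 bandwidth, approximate

n = 8192
flops = 2 * n * n               # one multiply-add per weight
bytes_moved = 2 * n * n         # each fp16 weight is read once (2 bytes)
intensity = flops / bytes_moved # 1 FLOP per byte

# FLOP/byte needed to keep the ALUs busy at peak bandwidth:
balance = peak_fp16_tflops / peak_bandwidth_tbs
print(f"GEMV intensity: {intensity:.1f} FLOP/B, machine balance: {balance:.0f} FLOP/B")
```

Since roughly 1 FLOP/byte is far below the machine balance of about 154 FLOP/byte, GEMV throughput is capped by memory bandwidth rather than compute, which is consistent with a profiler showing plenty of headroom left.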

@mofosyne added the Review Complexity : Low label (trivial changes to code that most beginner devs, or those who want a break, can tackle, e.g. UI fix) on Jun 24, 2024
@JohannesGaessler (Collaborator) commented:

There is some parallel work by me on the matrix multiplication code in #8062 and #8075 that touch the same code as this PR. After these two PRs have been merged, please adapt your changes and confirm that the performance improvement is still there. Sorry for the inconvenience.

@IMbackK (Contributor, Author) commented Jul 15, 2024

Closed as obsolete for now; I will reopen with a rebased version at some point in the future.

@IMbackK closed this on Jul 15, 2024