
gfx1010 optimizations #8085

Merged

Conversation

daniandtheweb
Contributor

@daniandtheweb daniandtheweb commented Jun 24, 2024

Reading @IMbackK's PR #8082, I noticed that RDNA1 cards can also get a small performance gain just by adjusting the same values as in that PR.

| model | size | params | backend | ngl | test | t/s master | t/s PR | Speedup |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | ROCm | 99 | pp512 | 163.86 ± 0.18 | 276.60 ± 0.41 | 1.69 |

This is still far from the pre-#7716 performance (RDNA1 cards suffered a 50% performance drop with that change), but it's still a good improvement.

Thanks again to @IMbackK for his PR; I wouldn't have noticed this without it.

@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Jun 24, 2024
@daniandtheweb daniandtheweb changed the title Gfx1010 optimizations gfx1010 optimizations Jun 24, 2024
@mofosyne mofosyne added the Review Complexity : Low Trivial changes to code that most beginner devs (or those who want a break) can tackle. e.g. UI fix label Jun 24, 2024
@daniandtheweb daniandtheweb force-pushed the gfx1010_optimizations branch 3 times, most recently from 1d1754f to 46923c6 Compare July 3, 2024 02:23
@slaren slaren requested a review from JohannesGaessler July 3, 2024 12:42
@JohannesGaessler
Collaborator

JohannesGaessler commented Jul 3, 2024

> I've only managed to get this far; setting the return of get_mmq_y_device to 32 boosts performance to 800 t/s but completely breaks inference, outputting only gibberish.

get_mmq_y_host and get_mmq_y_device need to return consistent values. If you only reduce get_mmq_y_device, then only part of the matrix multiplication will be done, which of course leads to a much faster (but wrong) kernel. As of right now both functions sit next to each other in mmq.cuh. If you want, you can try tuning the values again; otherwise I would just merge this PR as-is.

@daniandtheweb
Contributor Author

Thanks for the information. I'll test it more thoroughly in a couple of hours, but for now, using a value of 64 instead of master's default of 128, I get 275 t/s for prompt processing.

@daniandtheweb
Contributor Author

daniandtheweb commented Jul 3, 2024

The performance boost is consistent at 275 t/s and the output works fine; however, I'm having trouble adding a check in get_mmq_y_host, since checking whether RDNA1 is defined fails.

```cpp
static constexpr int get_mmq_y_host(const int cc) {
#if defined(RDNA1)
    return 64;
#else
    return int8_mma_available(cc) || cc >= CC_VOLTA ? 128 : 64;
#endif // defined(RDNA1)
}

static constexpr __device__ int get_mmq_y_device() {
#if defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)
#if defined(RDNA1)
    return 64;
#else
    return 128;
#endif // defined(RDNA1)
#else
#if __CUDA_ARCH__ >= CC_VOLTA
    return 128;
#else
    return 64;
#endif // __CUDA_ARCH__ >= CC_VOLTA
#endif // defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)
}
```

@JohannesGaessler any idea about how to solve this?

@sorasoras

Does this pr affect RDNA3? I really can use some optimizations.

@daniandtheweb
Contributor Author

> Does this pr affect RDNA3? I really can use some optimizations.

Not at all; this PR just tunes some parameters on Navi 10 that are already tuned for the 7000 series.

@JohannesGaessler
Collaborator

I would patch it like this:

```cpp
static constexpr int get_mmq_y_host(const int cc) {
    return cc >= CC_OFFSET_AMD ? (cc >= CC_RDNA2 ? 128 : 64) : (cc >= CC_VOLTA ? 128 : 64);
}
```

@daniandtheweb
Contributor Author

daniandtheweb commented Jul 3, 2024

@JohannesGaessler I've applied the change. I still think this isn't the best way to do it: if different values are needed for different cards, this can get messy. Maybe a normal if statement would keep the line from getting too long, but I haven't been able to make it work that way:

```cpp
#if cc >= CC_OFFSET_AMD
#if cc == CC_RDNA1
    return 64;
#else
    return 128;
#endif // cc == CC_RDNA1
#else
#if int8_mma_available(cc) || cc >= CC_VOLTA // checking int8_mma_available(cc) causes the whole if statement to fail.
    return 128;
#else
    return 64;
#endif // int8_mma_available(cc) || cc >= CC_VOLTA
#endif // cc >= CC_OFFSET_AMD
```

EDIT: Apparently all the issues I've been having are caused by the check on int8_mma_available not working as intended; just removing it from the if check makes everything work again.
Is that check needed here, or is cc >= CC_VOLTA the only necessary check?

@JohannesGaessler
Collaborator

> I still think this isn't the best way to do it because if different values are needed for different cards this can result in something messy, maybe using a normal if statement so that the row isn't too long, however I haven't been able to make it work that way.

Yes, constexpr functions do not allow conditional statements in the C++11 standard.

> Apparently all the issues I've been having are caused by the check on int8_mma_available not working as intended.

It is working as intended; you are just not using it as intended. The lowercase constexpr functions are intended for host code, to determine feature availability at runtime. The uppercase preprocessor macros are intended for device code, to determine feature availability at compile time. But if you have a separate check for AMD, you don't need INT8_MMA_AVAILABLE anyway. On NVIDIA, INT8_MMA_AVAILABLE is effectively __CUDA_ARCH__ >= CC_TURING and therefore implies __CUDA_ARCH__ >= CC_VOLTA.

@daniandtheweb
Contributor Author

I see. In this case, would it be better to keep it like this:

```cpp
return cc >= CC_OFFSET_AMD ? (cc == CC_RDNA1 ? 64 : 128) : (cc >= CC_VOLTA ? 128 : 64);
```

or like this?

```cpp
#if cc >= CC_OFFSET_AMD
#if cc == CC_RDNA1
    return 64;
#else
    return 128;
#endif // cc == CC_RDNA1
#else
#if cc >= CC_VOLTA
    return 128;
#else
    return 64;
#endif // cc >= CC_VOLTA
#endif // cc >= CC_OFFSET_AMD
```

@JohannesGaessler
Collaborator

You cannot do the second one: cc is a function parameter and is not defined in the preprocessor phase where the macros are resolved. The first one would be correct for your particular case, but the version I posted earlier would be better, because AMD architectures older than RDNA1 are closer to the hardware you are tuning for than the hardware I used (an RX 6800).

@daniandtheweb
Contributor Author

Okay. I actually wanted to be specific about RDNA1 because I wasn't sure of the effects it could have on PR #8082.

@daniandtheweb daniandtheweb force-pushed the gfx1010_optimizations branch from c4005a9 to e4accb8 Compare July 3, 2024 22:06
@JohannesGaessler
Collaborator

Sorry, it actually has to be done the way you had it, with an RDNA1 check. On AMD you cannot do a simple check against a number because there is no sensible value for __CUDA_ARCH__.

@daniandtheweb daniandtheweb force-pushed the gfx1010_optimizations branch from e4accb8 to 68b57ed Compare July 3, 2024 22:08
@daniandtheweb
Contributor Author

daniandtheweb commented Jul 3, 2024

I think this may be ready to merge, once all the checks are completed. Thanks for the tips on how to improve it.

@JohannesGaessler JohannesGaessler merged commit d23287f into ggerganov:master Jul 3, 2024
49 checks passed
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Jul 4, 2024
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Jul 5, 2024
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Jul 6, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jul 7, 2024
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Jul 11, 2024