
[QST] How do I import the cutlass::gemm::kernel::GemmUniversal #1814

Open
mrakgr opened this issue Sep 13, 2024 · 9 comments

Comments

@mrakgr

mrakgr commented Sep 13, 2024

Following up on #1291, I am finally making the time to properly understand Cutlass. I can build and run the examples in the repo, but I am having a lot of difficulty understanding the type errors I am getting when I try to import the GemmUniversal classes. For example...

```cpp
#include <iostream>

#include "cutlass/cutlass.h"
#include "cute/tensor.hpp"
#include "cute/atom/mma_atom.hpp"

#include "cutlass/numeric_types.h"

#include "cutlass/gemm/device/gemm_universal_adapter.h"
#include "default_gemm_configuration.hpp"  // note: this header lives in CUTLASS's test tree (test/unit/gemm/device/), not in cutlass/include

using namespace cute;

/////////////////////////////////////////////////////////////////////////////////////////////////

using Config = cutlass::gemm::device::DefaultGemmConfigurationToCutlass3Types<
    cutlass::arch::OpClassSimt, cutlass::arch::Sm50,
    double, cutlass::layout::ColumnMajor,
    double, cutlass::layout::ColumnMajor,
    double, cutlass::layout::ColumnMajor,
    double>;

using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
    Shape<int, int, int, int>,
    Config::CollectiveMainloop,
    Config::CollectiveEpilogue
>;

using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
```

I've isolated this piece of code from one of the tests and am trying to bring it into another project without success. Even when I seemingly have all the necessary includes, I see something like this...

[screenshot: IDE type errors]

Probably I shouldn't be trying to import test harness code, but the trouble is that there are no clean examples of how to use the universal kernel.

https://github.com/NVIDIA/cutlass/blob/main/media/docs/quickstart.md#launching-a-gemm-kernel-using-cutlass-30-or-newer

The quickstart one is clean but doesn't typecheck for me.

There are some examples, like 48_hopper_warp_specialized_gemm, which do show how to use it, but when I get rid of the device arch checking macros, I just get a ton of type errors. Examples like that aren't usable.

I am going to try recompiling the Cutlass library for 9.0 devices just to see if something changes.

But otherwise, could I get some help to start me off with Cutlass? Thanks.

@mrakgr
Author

mrakgr commented Sep 13, 2024

Of course, just #include "cutlass/gemm/kernel/gemm_universal.h" doesn't work for me. The IDE doesn't see the class inside the file.

@mrakgr
Author

mrakgr commented Sep 13, 2024

I just recompiled Cutlass with 90 support. Here is how it looks in the IDE.

[screenshot: type errors in the IDE after recompiling with SM90 support]

GroupProblemShape has a type error, and these kinds of issues aren't present in the non-90 examples. More problematically, everything below #if defined(CUTLASS_ARCH_MMA_MODIFIABLE_TMA_SM90_SUPPORTED) is grayed out.

If I were to remove those macro checks...

[screenshots: type errors after removing the macro checks]

Please have mercy on me here.

@thakkarV
Collaborator

@ANIKET-SHIVAM CC

@d-k-b
Collaborator

d-k-b commented Sep 19, 2024

@mrakgr, just to understand your setup: are you using the CMake configuration flow built into the IDE, or are you only using the IDE to view/edit the files while building elsewhere, e.g. from PowerShell? I ask because the CUTLASS device-side code relies on many definitions that are computed and generated by CMake. To ensure things work properly, I would start by copying an existing example and getting it working with the custom kernel you need, then modifying that example's CMakeLists.txt to produce a library containing the kernel for you to import into your other, larger project (or something along those lines).

@mrakgr
Author

mrakgr commented Sep 20, 2024

My actual setup is that I am using my own language, Spiral, which compiles to Python + CUDA. You can see some examples in that tutorial to get a sense of how it looks. My stance towards complex C++ build setups is that I don't like them and want nothing to do with them if possible. But that isn't possible here, because Cutlass is foisting CMake on me, leaving me no choice but to interact with it. I've only been using VS because I am on Windows and am trying to get a handle on the library, which would be a lot harder to do from Spiral directly.

I ask because the CUTLASS device-side code relies on many definitions that are computed and generated in CMake.

If that is true, that might make Cutlass unusable for me. I have a fully fused ML library and all the matmults need to be called on the device directly instead of being called from host. I've done my own, but it's not as good as the Cutlass one, so my goal is to replace it with that. I want to find a way to use Cutlass as a header-only library, but if that is impossible, I'll give up and wait for the next gen consumer NVidia cards to come out. They have warp group matrix multiplication instructions operating on shared memory, which I expect should make it easier to make a fully performant matmult kernel.

Even if it's somehow possible to get it to compile by fitting all the right compiler options, I have a limited complexity budget. I am willing to do one or two extra compile time options to set up the Cutlass library when including it in Spiral code, but absolutely nothing like those enormous CMake files that I see in its repo.

@mrakgr
Author

mrakgr commented Sep 20, 2024

As an aside, I did try out cuBLASDx and got it to work, but the performance was absolutely horrible. I am just hoping that I can either find a way to bring in Cutlass, or that on next gen cards, matmults will be easier to implement.

@mrakgr
Author

mrakgr commented Sep 20, 2024

Also, I should mention one aspect of my setup. All the tensor/matrix dimensions, layouts and their strides are known at compile time, so in theory, Cutlass should have everything it needs to select the optimal kernel at compile time without the need for build options.

@mrakgr
Author

mrakgr commented Sep 20, 2024

In the ML library, I have poker games that run directly on the GPU and call the ML library functions. Since the games are register hungry, I've been running one block per SM, and the ML library and the matmult need to be performant under that condition.


This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

3 participants