Add CUDA wrapper capability. #714
base: main
Conversation
We should look at what CI options we have to keep this working long term.
CUDA Toolkit can be installed without a GPU on the device, so we can test whether the generated CUDA code compiles without needing a GPU in CI.
In terms of testing infrastructure, we have https://github.com/FEniCS/ffcx/blob/main/demo/test_demos.py, but this is aimed at ahead-of-time compilers. Can we do something similar with NVRTC?
We should be able to do something similar with NVRTC; we'll just need an extra step where we compile with the regular C++ compiler and then call the generated wrapper functions to get the input for NVRTC.
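A rough sketch of what that extra step could look like (not code from this PR; the accessor name tabulate_tensor_cuda_source is purely illustrative): a small test driver is built with the host C++ compiler and linked against the generated wrapper, and the returned source string is handed to NVRTC for a compile-only check, which needs libnvrtc but no GPU.

// Sketch only, not part of this PR. Assumes the generated wrapper exposes the
// CUDA kernel source as a string via a hypothetical accessor
// tabulate_tensor_cuda_source(); the real interface in the PR may differ.
#include <cstdio>
#include <cstdlib>
#include <nvrtc.h>

extern "C" const char* tabulate_tensor_cuda_source(); // hypothetical, from generated code

int main()
{
  // Step 1: this driver is compiled with the regular C++ compiler and linked
  // against the generated wrapper, so the CUDA source string is available.
  const char* src = tabulate_tensor_cuda_source();

  // Step 2: check that the source compiles under NVRTC (no GPU required).
  nvrtcProgram prog;
  nvrtcResult err = nvrtcCreateProgram(&prog, src, "tabulate_tensor.cu", 0, nullptr, nullptr);
  if (err != NVRTC_SUCCESS)
  {
    std::fprintf(stderr, "nvrtcCreateProgram: %s\n", nvrtcGetErrorString(err));
    return EXIT_FAILURE;
  }

  err = nvrtcCompileProgram(prog, 0, nullptr);
  if (err != NVRTC_SUCCESS)
    std::fprintf(stderr, "nvrtcCompileProgram: %s\n", nvrtcGetErrorString(err));

  nvrtcDestroyProgram(&prog);
  return err == NVRTC_SUCCESS ? EXIT_SUCCESS : EXIT_FAILURE;
}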
…ts clarifying the use of NVRTC and the need for typedefs in generated CUDA source code.
I've updated the PR with the requested renaming and explanatory comments. How should I go about adding CI for this? A separate PR after this is merged, or as part of this one?
More general question here, perhaps to motivate another PR later: CUDA tends to be NVIDIA-only and closed source, unless that's changed recently. How much work would it take to enable other GPUs with open-source tools (SYCL, ROCm)?
On the ffcx side, with the current wrapper approach, honestly not too much. Downstream, in the GPU assembly routines, more refactoring effort will be required. This is my next development priority, along with support for multi-GPU assembly.
I have reservations about this change. We know from our own work that the standard FFCx tabulate kernels are not performant on GPUs. I don't want a change that encourages slow/sub-optimal code. We should look at how FFCx can be made extensible to allow user-modified generated code.
Garth, thanks for weighing in on this. I would love to see FFCx have easier options for customization. I agree that assembly with high-order FFCx tabulate kernels is not efficient on GPUs. However, in my testing, it's quite performant for order-1 and even order-2 elements, which cover a large percentage of use cases. The rationale behind this change is to start getting people to use FEniCS with GPU acceleration, which will hopefully spur development and result in delivering a mature and optimized GPU capability. That said, in the short term, all I really need is a way to map the generated body of an FFCx tabulate tensor to the corresponding ufcx_form object in C++. Thoughts about how to best implement that?
To provide some specific examples of performance, using Lagrange elements for assembly of a Poisson mass matrix results in 70% memory throughput on an NVIDIA GH200 with a cubic tetrahedral mesh (6M elements). If the form has enough terms, it can also get a good compute throughput. This is the case for applying the Streamline Upwind Petrov-Galerkin (SUPG) method to the shallow water equations, which resulted in a 60% compute throughput for Jacobian matrix assembly on an NVIDIA A100 (also Lagrange, but triangular elements). These represent pretty high levels of GPU resource utilization. The speedups relative to parallel CPU assembly are correspondingly high.
I've taken another look at the PR and the comments from @garth-wells and @bpachev. I think @garth-wells is right in that we should be careful not to encourage the early use of features that are potentially sub-optimal. However, this patch adds an optional low-level interface to a component of FEniCS (FFCx) that is essentially not user facing. Additionally, there is little scope to implement this differently. So I see this as orthogonal to the path suggested by @garth-wells, which is adding the capability to customise the generated code - after all, we would still want a well-defined interface in order to define how that customised code should be executed.
My remaining major concern with this PR is that we need a basic unit test to ensure that it continues working. We have this already for the standard C kernels; can this be extended/adapted? https://github.com/FEniCS/ffcx/blob/main/demo/test_demos.py
@drew-parsons HIP has an equivalent RTC infrastructure: https://rocm.docs.amd.com/projects/HIP/en/docs-6.0.0/user_guide/hip_rtc.html
@jhale Appreciate you taking another look at this. While I don't think we will be able to execute generated CUDA kernels without a GPU in the CI, we should certainly be able to test that the generated code compiles under NVRTC without errors. I'll update the PR with a test.
OK. I've added a simple unit test to test_demos.py that tests compilation of a sample CUDA-wrapped tabulate tensor function with NVRTC.
ffcx/codegeneration/C/integrals.py (outdated)
@@ -60,6 +60,7 @@ def generator(ir: IntegralIR, options):
    code["enabled_coefficients"] = "NULL"

    code["tabulate_tensor"] = body
    code["tabulate_tensor_quoted"] = body.replace("\n", '\\n"\n "')
Is this always necessary to output? Will increase the code size a lot for large kernels.
This only becomes part of the generated code if --cuda is set. Otherwise the format parameter is not used. However, I've updated the code to make the format parameter an empty string if --cuda is not set.
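To make the trade-off concrete, here is an illustration (not the actual generated output; the kernel name and signature are invented) of what the quoted form ends up as when --cuda is set: the kernel body is embedded in the generated C code as a string literal, with each newline closing and reopening the literal, which is why the generated file grows with kernel size (the body appears once as code and once as a string).

/* Illustrative only: the generated source embeds the tabulate_tensor body as a
   string literal so it can be handed to NVRTC at runtime. The kernel name and
   signature here are made up for the example. */
static const char tabulate_tensor_cuda_src[] =
    "void tabulate_tensor_cell(double* A, const double* w,\n"
    "                          const double* c, const double* coords)\n"
    "{\n"
    "  A[0] = 0.0;\n"
    "}\n";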
demo/test_demos.py (outdated)
assert (
    os.system(
        f"cd {demo_dir} && "
        f"{cxx} -I../ffcx/codegeneration -I{nvrtc_dir}/include -L{nvrtc_dir}/lib"
Can we build with a higher C++ standard, e.g. C++20?
And also add -Werror.
Added -Werror, but had issues in CI with C++20.
if not sys.platform.startswith("linux"):
    pytest.skip(reason="NVRTC support only tested on Linux")
else:
    from nvidia import cuda_nvrtc
We need an import skip on this test if cuda_nvrtc doesn't exist
Remove platform restrictions - I see no reason to artificially restrict this to Linux.
So currently the cuda_nvrtc package won't install on macOS, which is the rationale for restricting this to Linux. This is also why I put the package under optional dependencies, because those are only installed in the CI for Linux.
It is possible to get NVRTC on any platform by installing the entire CUDA Toolkit, but that is far too much overhead in this context.
nvrtcDestroyProgram(&program);

std::stringstream ss;
ss << "nvrtcCompileProgram() failed with "
You can use std::format if you can switch to C++20; it removes the need for C++'s terrible string formatting.
I tried switching this code to use std::format, but the C++ compiler in GitHub's CI environment does not appear to support C++20 (tests worked locally but broke in CI on GitHub).
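For reference, if the CI compilers eventually do support C++20, the message construction above could collapse to a single call along these lines (a sketch only, assuming nvrtc_err and program_log as in the surrounding code):

// Sketch only: requires C++20 (<format>) and a recent standard library.
#include <format>
#include <string>
#include <nvrtc.h>

std::string compile_error_message(nvrtcResult nvrtc_err, const std::string& program_log)
{
  // Equivalent of the std::stringstream construction, in one expression.
  return std::format("nvrtcCompileProgram() failed with {}\n{}",
                     nvrtcGetErrorString(nvrtc_err), program_log);
}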
size_t log_size;
nvrtc_err = nvrtcGetProgramLogSize(program, &log_size);
if (nvrtc_err != NVRTC_SUCCESS) {
  program_log = std::string(
std::format?
nvrtc_err = nvrtcGetProgramLog(
    program, const_cast<char*>(program_log.c_str()));
if (nvrtc_err != NVRTC_SUCCESS) {
  program_log = std::string(
std::format?
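As a side note on the snippet above: the const_cast can be avoided by sizing the string first and writing through its non-const data() pointer (valid since C++17). A minimal sketch of that pattern, independent of the formatting question:

// Sketch: read the NVRTC program log into a correctly sized std::string
// without const_cast; std::string::data() is non-const since C++17.
#include <cstddef>
#include <string>
#include <nvrtc.h>

std::string get_program_log(nvrtcProgram program)
{
  std::size_t log_size = 0;
  if (nvrtcGetProgramLogSize(program, &log_size) != NVRTC_SUCCESS)
    return "unable to retrieve NVRTC program log size";

  std::string log(log_size, '\0');  // log_size includes the terminating NUL
  if (nvrtcGetProgramLog(program, log.data()) != NVRTC_SUCCESS)
    return "unable to retrieve NVRTC program log";
  return log;
}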
pyproject.toml
@@ -32,7 +32,7 @@ ffcx = "ffcx:__main__.main"
[project.optional-dependencies]
lint = ["ruff"]
docs = ["sphinx", "sphinx_rtd_theme"]
-optional = ["numba", "pygraphviz==1.7"]
+optional = ["numba", "pygraphviz==1.7", "nvidia-cuda-nvrtc-cu12"]
test = ["pytest >= 6.0", "sympy", "numba"]
Also add the NVIDIA RTC package here - the full set of tests should be able to execute after installing this optional set.
In the current CI, the optional dependencies are installed only on Linux, while the others are installed on all platforms. Unfortunately, putting the NVIDIA RTC package in the test dependencies breaks the macOS CI. Ideally we'd support all platforms, but I don't want to add the CUDA Toolkit as a dependency, and for the time being testing on Linux should suffice to ensure correctness of the generated CUDA wrapper code.
OK, @jhale, I've updated the PR to test with a wider range of forms, addressed the issues with tabulate_tensor_quoted, and renamed the --cuda option to --cuda_nvrtc.
The nvrtc pip package does install on Windows (but not macOS), although I haven't been able to confirm whether it works the same (I don't have a Windows dev env and have been testing via GitHub's CI). Furthermore, the C++ compiler on Windows has a fairly strict limit on the maximum character count for a string literal, which is a problem when wrapping the tensor for the elasticity demo, as it has a line over that limit. The current PR only explicitly skips Windows in the test, but the import skip will effectively skip both macOS and Windows.
Thanks again for giving this PR a second chance and for your feedback!
This pull request adds the optional capability to generate CUDA-wrapped versions of the tabulate tensor functions. The wrappers are called at runtime to get the modified source code for each tabulate tensor, which can then be passed to the NVIDIA runtime compiler. This was originally developed by James Trotter (https://www.sciencedirect.com/science/article/pii/S0167819123000571) and then modified to work with the current version of ffcx.
This capability supports assembly on GPUs, implemented here as an add-on package (https://github.com/bpachev/cuda-dolfinx).
Would love any feedback or suggestions on how to do this in a better way. Thanks!
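To round out the picture of the runtime flow described above, a rough sketch (not code from this PR or from cuda-dolfinx) of what happens after NVRTC accepts the wrapper-provided source: the PTX is extracted and can then be loaded and launched through the CUDA driver API by the GPU assembly routines.

// Sketch only: extract PTX from an nvrtcProgram that was created from the
// wrapper-provided tabulate_tensor source and successfully compiled with
// nvrtcCompileProgram(). The resulting PTX can be loaded with
// cuModuleLoadData() and launched by downstream GPU assembly code.
#include <cstddef>
#include <string>
#include <nvrtc.h>

std::string get_ptx(nvrtcProgram prog)
{
  std::size_t ptx_size = 0;
  if (nvrtcGetPTXSize(prog, &ptx_size) != NVRTC_SUCCESS)
    return {};
  std::string ptx(ptx_size, '\0');
  if (nvrtcGetPTX(prog, ptx.data()) != NVRTC_SUCCESS)
    return {};
  return ptx;
}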