Add CUDA wrapper capability. #714
base: main
Conversation
We should look at what CI options we have to keep this working long term.
CUDA Toolkit can be installed without a GPU on the device, so we can test whether the generated CUDA code compiles without needing a GPU in CI.
In terms of testing infrastructure, we have https://github.com/FEniCS/ffcx/blob/main/demo/test_demos.py, but this is aimed at ahead-of-time compilers. Can we do something similar with NVRTC?
We should be able to do something similar with NVRTC; we'll just need an extra step where we compile with the regular C++ compiler and then call the generated wrapper functions to get the input for NVRTC.
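A rough sketch of what that extra step could look like (not code from this PR; the accessor name tabulate_tensor_cuda_source is purely illustrative): a small test driver is built with the host C++ compiler and linked against the generated wrapper, and the returned source string is handed to NVRTC for a compile-only check, which needs libnvrtc but no GPU.

// Sketch only, not part of this PR. Assumes the generated wrapper exposes the
// CUDA kernel source as a string via a hypothetical accessor
// tabulate_tensor_cuda_source(); the real interface in the PR may differ.
#include <cstdio>
#include <cstdlib>
#include <nvrtc.h>

extern "C" const char* tabulate_tensor_cuda_source(); // hypothetical, from generated code

int main()
{
  // Step 1: this driver is compiled with the regular C++ compiler and linked
  // against the generated wrapper, so the CUDA source string is available.
  const char* src = tabulate_tensor_cuda_source();

  // Step 2: check that the source compiles under NVRTC (no GPU required).
  nvrtcProgram prog;
  nvrtcResult err = nvrtcCreateProgram(&prog, src, "tabulate_tensor.cu", 0, nullptr, nullptr);
  if (err != NVRTC_SUCCESS)
  {
    std::fprintf(stderr, "nvrtcCreateProgram: %s\n", nvrtcGetErrorString(err));
    return EXIT_FAILURE;
  }

  err = nvrtcCompileProgram(prog, 0, nullptr);
  if (err != NVRTC_SUCCESS)
    std::fprintf(stderr, "nvrtcCompileProgram: %s\n", nvrtcGetErrorString(err));

  nvrtcDestroyProgram(&prog);
  return err == NVRTC_SUCCESS ? EXIT_SUCCESS : EXIT_FAILURE;
}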
…ts clarifying the use of NVRTC and the need for typedefs in generated CUDA source code.
I've updated the PR with the requested renaming and explanatory comments. How should I go about adding CI for this? A separate PR after this is merged, or as part of this one?
More general question here, perhaps to motivate another PR later: CUDA tends to be NVIDIA-only and closed source, unless that's changed recently. How much work would it take to enable other GPUs with open-source tools (SYCL, ROCm)?
On the ffcx side, with the current wrapper approach, honestly not too much. Downstream, in the GPU assembly routines, more refactoring effort will be required. This is my next development priority, along with support for multi-GPU assembly.
I have reservations about this change. We know from our own work that the standard FFCx tabulate kernels are not performant on GPUs. I don't want a change that encourages slow/sub-optimal code. We should look at how FFCx can be made extensible to allow user-modified generated code.
Garth, thanks for weighing in on this. I would love to see FFCx have easier options for customization. I agree that assembly with high-order FFCx tabulate kernels is not efficient on GPUs. However, in my testing, it's quite performant for order-1 and even order-2 elements, which cover a large percentage of use cases. The rationale behind this change is to start getting people to use FEniCS with GPU acceleration, which will hopefully spur development and result in delivering a mature and optimized GPU capability. That said, in the short term, all I really need is a way to map the generated body of an FFCx tabulate tensor to the corresponding ufcx_form object in C++. Thoughts about how to best implement that?
To provide some specific examples of performance, using Lagrange elements for assembly of a Poisson mass matrix results in 70% memory throughput on an NVIDIA GH200 with a cubic tetrahedral mesh (6M elements). If the form has enough terms, it can also get a good compute throughput. This is the case for applying the Streamline Upwind Petrov-Galerkin (SUPG) method to the shallow water equations, which resulted in a 60% compute throughput for Jacobian matrix assembly on an NVIDIA A100 (also Lagrange, but triangular elements). These represent pretty high levels of GPU resource utilization. The speedups relative to parallel CPU assembly are correspondingly high.
I've taken another look at the PR and the comments from @garth-wells and @bpachev. I think @garth-wells is right in that we should be careful not to encourage the early use of features that are potentially sub-optimal. However, this patch adds an optional low-level interface to a component of FEniCS (FFCx) that is essentially not user facing. Additionally, there is little scope to implement this differently. So I see this as orthogonal to the path suggested by @garth-wells, which is adding the capability to customise the generated code - after all, we would still want a well-defined interface in order to define how that customised code should be executed.
My remaining major concern with this PR is that we need a basic unit test to ensure that it continues working. We have this already for the standard C kernels; can this be extended/adapted? https://github.com/FEniCS/ffcx/blob/main/demo/test_demos.py
@drew-parsons HIP has an equivalent RTC infrastructure: https://rocm.docs.amd.com/projects/HIP/en/docs-6.0.0/user_guide/hip_rtc.html
@jhale Appreciate you taking another look at this. While I don't think we will be able to execute generated CUDA kernels without a GPU in the CI, we should certainly be able to test that the generated code compiles under NVRTC without errors. I'll update the PR with a test.
OK. I've added a simple unit test to test_demos.py that tests compilation of a sample CUDA-wrapped tabulate tensor function with NVRTC.
ffcx/codegeneration/C/integrals.py (outdated)
@@ -60,6 +60,7 @@ def generator(ir: IntegralIR, options):
    code["enabled_coefficients"] = "NULL"

    code["tabulate_tensor"] = body
    code["tabulate_tensor_quoted"] = body.replace("\n", '\\n"\n "')
Is this always necessary to output? Will increase the code size a lot for large kernels.
This only becomes part of the generated code if --cuda is set. Otherwise the format parameter is not used. However, I've updated the code to make the format parameter an empty string if --cuda is not set.
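To make the trade-off concrete, here is an illustration (not the actual generated output; the kernel name and signature are invented) of what the quoted form ends up as when --cuda is set: the kernel body is embedded in the generated C code as a string literal, with each newline closing and reopening the literal, which is why the generated file grows with kernel size (the body appears once as code and once as a string).

/* Illustrative only: the generated source embeds the tabulate_tensor body as a
   string literal so it can be handed to NVRTC at runtime. The kernel name and
   signature here are made up for the example. */
static const char tabulate_tensor_cuda_src[] =
    "void tabulate_tensor_cell(double* A, const double* w,\n"
    "                          const double* c, const double* coords)\n"
    "{\n"
    "  A[0] = 0.0;\n"
    "}\n";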
demo/test_demos.py (outdated)
assert (
    os.system(
        f"cd {demo_dir} && "
        f"{cxx} -I../ffcx/codegeneration -I{nvrtc_dir}/include -L{nvrtc_dir}/lib"
Can we build with a higher C++ standard, e.g. C++20?
And also add -Werror.
Added -Werror, but had issues in CI with C++20.
if not sys.platform.startswith("linux"):
    pytest.skip(reason="NVRTC support only tested on Linux")
else:
    from nvidia import cuda_nvrtc
We need an import skip on this test if cuda_nvrtc doesn't exist
Remove platform restrictions - I see no reason to artificially restrict this to Linux.
So currently the cuda_nvrtc package won't install on macOS, which is the rationale for restricting this to Linux. This is also why I put the package under optional dependencies, because those are only installed in the CI for Linux.
It is possible to get NVRTC on any platform by installing the entire CUDA Toolkit, but that is far too much overhead in this context.
nvrtcDestroyProgram(&program);

std::stringstream ss;
ss << "nvrtcCompileProgram() failed with "
You can use std::format if you can switch to C++20; it removes the need for C++'s terrible string formatting.
I tried switching this code to use std::format, but the C++ compiler in GitHub's CI environment does not appear to support C++20 (tests worked locally but broke in CI on GitHub).
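For reference, if the CI compilers eventually do support C++20, the message construction above could collapse to a single call along these lines (a sketch only, assuming nvrtc_err and program_log as in the surrounding code):

// Sketch only: requires C++20 (<format>) and a recent standard library.
#include <format>
#include <string>
#include <nvrtc.h>

std::string compile_error_message(nvrtcResult nvrtc_err, const std::string& program_log)
{
  // Equivalent of the std::stringstream construction, in one expression.
  return std::format("nvrtcCompileProgram() failed with {}\n{}",
                     nvrtcGetErrorString(nvrtc_err), program_log);
}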
size_t log_size;
nvrtc_err = nvrtcGetProgramLogSize(program, &log_size);
if (nvrtc_err != NVRTC_SUCCESS) {
  program_log = std::string(
std::format?
nvrtc_err = nvrtcGetProgramLog(
    program, const_cast<char*>(program_log.c_str()));
if (nvrtc_err != NVRTC_SUCCESS) {
  program_log = std::string(
std::format?
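As a side note on the snippet above: the const_cast can be avoided by sizing the string first and writing through its non-const data() pointer (valid since C++17). A minimal sketch of that pattern, independent of the formatting question:

// Sketch: read the NVRTC program log into a correctly sized std::string
// without const_cast; std::string::data() is non-const since C++17.
#include <cstddef>
#include <string>
#include <nvrtc.h>

std::string get_program_log(nvrtcProgram program)
{
  std::size_t log_size = 0;
  if (nvrtcGetProgramLogSize(program, &log_size) != NVRTC_SUCCESS)
    return "unable to retrieve NVRTC program log size";

  std::string log(log_size, '\0');  // log_size includes the terminating NUL
  if (nvrtcGetProgramLog(program, log.data()) != NVRTC_SUCCESS)
    return "unable to retrieve NVRTC program log";
  return log;
}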
pyproject.toml
@@ -32,7 +32,7 @@ ffcx = "ffcx:__main__.main"
[project.optional-dependencies]
lint = ["ruff"]
docs = ["sphinx", "sphinx_rtd_theme"]
-optional = ["numba", "pygraphviz==1.7"]
+optional = ["numba", "pygraphviz==1.7", "nvidia-cuda-nvrtc-cu12"]
test = ["pytest >= 6.0", "sympy", "numba"]
Also add the NVIDIA RTC package here - the full set of tests should be able to execute after installing this optional set.
In the current CI, the optional dependencies are installed only on Linux, while the others are installed on all platforms. Unfortunately, putting the NVIDIA RTC package in the test dependencies breaks the macOS CI. Ideally we'd support all platforms, but I don't want to add the CUDA Toolkit as a dependency, and for the time being testing on Linux should suffice to ensure correctness of the generated CUDA wrapper code.
OK, @jhale, I've updated the PR to test with a wider range of forms, addressed the issues with tabulate_tensor_quoted, and renamed the --cuda option to --cuda_nvrtc.
The nvrtc pip package does install on Windows (but not macOS), although I haven't been able to confirm whether it works the same (I don't have a Windows dev env and have been testing via GitHub's CI). Furthermore, the C++ compiler on Windows has a fairly strict limit on the maximum character count for a string literal, which is a problem when wrapping the tensor for the elasticity demo, as it has a line over that limit. The current PR only explicitly skips Windows in the test, but the import skip will effectively skip both macOS and Windows.
Thanks again for giving this PR a second chance and for your feedback!
This pull request adds the optional capability to generate CUDA-wrapped versions of the tabulate tensor functions. The wrappers are called at runtime to get the modified source code for each tabulate tensor, which can then be passed to the NVIDIA runtime compiler. This was originally developed by James Trotter (https://www.sciencedirect.com/science/article/pii/S0167819123000571) and then modified to work with the current version of ffcx.
This capability supports assembly on GPUs, implemented here as an add-on package (https://github.com/bpachev/cuda-dolfinx).
Would love any feedback or suggestions on how to do this in a better way. Thanks!
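To round out the picture of the runtime flow described above, a rough sketch (not code from this PR or from cuda-dolfinx) of what happens after NVRTC accepts the wrapper-provided source: the PTX is extracted and can then be loaded and launched through the CUDA driver API by the GPU assembly routines.

// Sketch only: extract PTX from an nvrtcProgram that was created from the
// wrapper-provided tabulate_tensor source and successfully compiled with
// nvrtcCompileProgram(). The resulting PTX can be loaded with
// cuModuleLoadData() and launched by downstream GPU assembly code.
#include <cstddef>
#include <string>
#include <nvrtc.h>

std::string get_ptx(nvrtcProgram prog)
{
  std::size_t ptx_size = 0;
  if (nvrtcGetPTXSize(prog, &ptx_size) != NVRTC_SUCCESS)
    return {};
  std::string ptx(ptx_size, '\0');
  if (nvrtcGetPTX(prog, ptx.data()) != NVRTC_SUCCESS)
    return {};
  return ptx;
}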