[Backend] Add support for Nvidia GPUs #210
Merged
This PR adds initial support for codegen on Nvidia GPUs.
It includes an end-to-end generic field addition kernel for 𝔽p and 𝔽r.
This closes #92.
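For a concrete picture of what such a kernel computes, here is a minimal CUDA C++ sketch of a 256-bit modular addition, one field element per thread. This is an illustration only: Constantine generates the equivalent code as LLVM IR (see below), and the names, limb count and memory layout here are assumptions, not the PR's actual output.

```cuda
#include <cstdint>

#define LIMBS 4  // 4 x 64-bit limbs = 256 bits (assumption for illustration)

// r = a + b (mod p), assuming a, b < p.
// Compute sum = a + b and tmp = sum - p, then select tmp if the sum
// overflowed 2^256 or if sum >= p (no final borrow) -- selection instead of
// branching, in the spirit of constant-time code.
__device__ void fp_add(uint64_t r[LIMBS], const uint64_t a[LIMBS],
                       const uint64_t b[LIMBS], const uint64_t p[LIMBS]) {
  uint64_t sum[LIMBS], tmp[LIMBS];
  uint64_t carry = 0, borrow = 0;

  for (int i = 0; i < LIMBS; i++) {          // limb-wise addition with carry
    uint64_t t = a[i] + carry;
    uint64_t c = (t < carry) ? 1u : 0u;
    sum[i] = t + b[i];
    carry = c + ((sum[i] < b[i]) ? 1u : 0u);
  }
  for (int i = 0; i < LIMBS; i++) {          // limb-wise subtraction of p with borrow
    uint64_t t = sum[i] - borrow;
    uint64_t bo = (sum[i] < borrow) ? 1u : 0u;
    tmp[i] = t - p[i];
    borrow = bo + ((t < p[i]) ? 1u : 0u);
  }
  // Reduce iff a+b overflowed 2^256 (carry == 1) or a+b >= p (borrow == 0).
  uint64_t mask = (carry | (1u - borrow)) ? ~0ULL : 0ULL;
  for (int i = 0; i < LIMBS; i++)
    r[i] = (tmp[i] & mask) | (sum[i] & ~mask);
}

// One field element per thread: rs[i] = as[i] + bs[i] (mod p).
__global__ void fp_add_kernel(uint64_t* rs, const uint64_t* as,
                              const uint64_t* bs, const uint64_t* p, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    fp_add(&rs[i * LIMBS], &as[i * LIMBS], &bs[i * LIMBS], p);
}
```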
Overview
As evidenced by the $8M in prizes at https://zprize.io and the many IACR preprints on implementing cryptography on GPUs, there is growing demand for GPU-accelerated cryptography.
Currently there are two main areas of cryptography that benefit from GPUs:
Design considerations
See: https://forum.nim-lang.org/t/9794
As a reminder, we aim for security, performance and compactness. Constantine also strives to keep its dependencies to a strict minimum: basically the Nim compiler and a C compiler, not even the Nim standard library. Other runtime libraries would not share our focus on security nor our concerns about hidden control flow (exceptions, defects even on array indexing).
There are multiple strategies to generate code for GPUs in a Nim project:

1. Writing the Cuda kernels directly. Due to the lack of generics, we would need extensive use of C macros.
2. Embedding Cuda C++ in Nim. Embedding can use string interpolation like Arraymancer https://github.com/mratsim/Arraymancer/blob/v0.7.19/src/arraymancer/tensor/private/incl_higher_order_cuda.nim. Embedding will run into compiler flag issues: we would need to gate all flags, for example `-fpermissive` must become `-Xcompiler -fpermissive`, and we also need to hack the build system with `nim c --cincludes:/opt/cuda/include --cc:clang --clang.exe:/opt/cuda/bin/nvcc --clang.linkerexe:/opt/cuda/bin/nvcc --clang.cpp.exe:/opt/cuda/bin/nvcc --clang.cpp.linkerexe:/opt/cuda/bin/nvcc`. Alternatively we could compile the Nim and C code to object files with separate compilers and link them together afterwards, but that makes the library hard to use the Nim way (nimble install) instead of as a DLL.
3. Generating Cuda C++ at runtime. With NVRTC we could compile it at runtime, specialized per curve. Generation can use string interpolation akin to Arraymancer (but at runtime) or a code generator like exprgrad does for OpenCL: https://github.com/can-lehmann/exprgrad/blob/v0.1.0/exprgrad/clgen.nim (see the NVRTC sketch after this list).
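For reference, the NVRTC flow for option 3 would look roughly like the following minimal sketch. The kernel source, names and compile options are placeholders, not code from this PR (which ultimately does not take this route).

```cpp
#include <nvrtc.h>
#include <cstdio>
#include <vector>

// A tiny Cuda kernel, "specialized" at runtime (here via a hardcoded string;
// in practice the source would be generated per curve/field).
static const char* kernelSrc = R"(
extern "C" __global__ void dummy_add(unsigned long long* r,
                                     const unsigned long long* a,
                                     const unsigned long long* b) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  r[i] = a[i] + b[i];
}
)";

int main() {
  nvrtcProgram prog;
  nvrtcCreateProgram(&prog, kernelSrc, "dummy_add.cu", 0, nullptr, nullptr);

  const char* opts[] = {"--gpu-architecture=compute_70"};
  nvrtcResult rc = nvrtcCompileProgram(prog, 1, opts);

  // Always retrieve the log: it holds errors and warnings from the JIT compile.
  size_t logSize;
  nvrtcGetProgramLogSize(prog, &logSize);
  std::vector<char> log(logSize);
  nvrtcGetProgramLog(prog, log.data());
  if (rc != NVRTC_SUCCESS) { printf("%s\n", log.data()); return 1; }

  // The result is PTX text, to be loaded with cuModuleLoadData / cuModuleGetFunction.
  size_t ptxSize;
  nvrtcGetPTXSize(prog, &ptxSize);
  std::vector<char> ptx(ptxSize);
  nvrtcGetPTX(prog, ptx.data());
  nvrtcDestroyProgram(&prog);

  printf("Generated %zu bytes of PTX\n", ptxSize);
  return 0;
}
```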
In cases 1 and 2, this would force downstream projects to use C++ compilation. In case 2, mixing embedded Cuda C++ and Nim will not compile without jumping through hoops, because there is no way to strip the `-std=gnu++14` flag that Nim adds but NVCC doesn't support. In case 3, there is no reason to generate C++ at runtime when we could generate C, which is simpler and faster to compile.
This can be done with the cudanim approach, which cleans up the Nim AST so that the generated C code is Cuda-compatible. It is, however, tricky to remove stackframes, array-indexing exceptions and the GC's random stack scanning, and it means keeping up with Nim devel changes to the GC, destructors, exceptions and control flow.
It can also be done using nlvm, as axel does (with a fork). This however introduces an extra dependency with few eyes / tests / fuzzing on it, and it also requires compiling the device code first and then staticRead-ing it (https://github.com/guibar64/axel/blob/5b7cefb/src/axel/build_device_code.nim#L35). Shelling out to a CLI opens up a can of worms of vulnerabilities, from an error that deletes data to a "script-in-the-middle" attack that hijacks the expected binary.
Generating PTX directly would require learning calling conventions, registers, data and parameter declarations, and writing a full code generator. Compared to CPU ISAs, documentation is limited, and testing and ensuring correctness would be a large undertaking.
Generating LLVM IR frees us from these building/compilation woes. The LLVM instruction set is simple, has verification tools, and can be generated from the Nim AST.
Furthermore, it makes it easier to target other backends like AMD and Intel GPUs or WASM, or even to produce assembly that MSVC can ingest (as we can't use our inline assembly codegen with MSVC). Most of the infrastructure can be reused, and there are documentation, examples and a community. LLVM is extensively tested, well maintained and widely used, for example to build everything Apple ships. Given how high profile it is, that LLVM itself has no dependencies, and its distribution model, supply-chain attacks are unlikely.
Implementation
So we choose to generate LLVM IR, extended with Nvidia PTX inline assembly for the add-with-carry, subtract-with-borrow and multiply-add instructions, which have no intrinsics.
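For illustration, the same add-with-carry chain expressed as CUDA C++ inline PTX looks like the sketch below; the PR emits the equivalent instructions as inline assembly inside the generated LLVM IR, so the function name and limb count here are assumptions.

```cuda
#include <cstdint>

// 256-bit addition as a single chain of PTX add-with-carry instructions.
// A full Fp addition would also capture the final carry and follow with a
// sub.cc.u32 / subc.cc.u32 chain to conditionally subtract the modulus.
__device__ __forceinline__ void add256(uint32_t r[8], const uint32_t a[8],
                                       const uint32_t b[8]) {
  asm("add.cc.u32  %0, %8,  %16;\n\t"   // lowest limb: set the carry flag
      "addc.cc.u32 %1, %9,  %17;\n\t"   // middle limbs: consume and produce carry
      "addc.cc.u32 %2, %10, %18;\n\t"
      "addc.cc.u32 %3, %11, %19;\n\t"
      "addc.cc.u32 %4, %12, %20;\n\t"
      "addc.cc.u32 %5, %13, %21;\n\t"
      "addc.cc.u32 %6, %14, %22;\n\t"
      "addc.u32    %7, %15, %23;\n\t"   // top limb: consume carry, discard carry-out
      : "=r"(r[0]), "=r"(r[1]), "=r"(r[2]), "=r"(r[3]),
        "=r"(r[4]), "=r"(r[5]), "=r"(r[6]), "=r"(r[7])
      : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
        "r"(a[4]), "r"(a[5]), "r"(a[6]), "r"(a[7]),
        "r"(b[0]), "r"(b[1]), "r"(b[2]), "r"(b[3]),
        "r"(b[4]), "r"(b[5]), "r"(b[6]), "r"(b[7]));
}
```

Keeping the whole chain in one asm statement matters: the carry flag is not guaranteed to survive across separate inline-assembly statements, since the compiler may schedule other instructions between them.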
In the future, we can add support for AMD GPUs via the AMDGPU backend, Intel GPUs via OpenCL/SPIR-V/Vulkan (requires LLVM 15), Apple GPUs (when they decide to make their pipeline public), and Qualcomm Hexagon.
We compile the LLVM IR to PTX via the LLVM NVPTX backend. Nvidia provides its own NVVM backend with extra optimizations, but it uses the LLVM 7.0.1 IR format and it seems the inline assembly encoding has changed since then.
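For context, driving the NVPTX backend through the LLVM-C API to turn a textual IR module into PTX looks roughly like this. It is a sketch: Constantine drives the same API through its own Nim bindings and builds the module in memory, and the input file name, target CPU (sm_80) and error handling here are assumptions.

```cpp
#include <llvm-c/Core.h>
#include <llvm-c/IRReader.h>
#include <llvm-c/Target.h>
#include <llvm-c/TargetMachine.h>
#include <cstdio>

int main() {
  // NVPTX is an LLVM target like any other; initialize its codegen components.
  LLVMInitializeNVPTXTargetInfo();
  LLVMInitializeNVPTXTarget();
  LLVMInitializeNVPTXTargetMC();
  LLVMInitializeNVPTXAsmPrinter();

  // Parse a textual IR module from disk (hypothetical file name).
  LLVMContextRef ctx = LLVMContextCreate();
  LLVMMemoryBufferRef irBuf;
  char* err = nullptr;
  if (LLVMCreateMemoryBufferWithContentsOfFile("fp_add.ll", &irBuf, &err)) {
    fprintf(stderr, "read error: %s\n", err); return 1;
  }
  LLVMModuleRef mod;
  if (LLVMParseIRInContext(ctx, irBuf, &mod, &err)) {
    fprintf(stderr, "parse error: %s\n", err); return 1;
  }

  // Create a target machine for 64-bit PTX, targeting e.g. sm_80.
  const char* triple = "nvptx64-nvidia-cuda";
  LLVMTargetRef target;
  if (LLVMGetTargetFromTriple(triple, &target, &err)) {
    fprintf(stderr, "target error: %s\n", err); return 1;
  }
  LLVMTargetMachineRef tm = LLVMCreateTargetMachine(
      target, triple, "sm_80", "",
      LLVMCodeGenLevelDefault, LLVMRelocDefault, LLVMCodeModelDefault);

  // "Assembly" output from the NVPTX backend is the PTX text itself,
  // ready to be loaded with cuModuleLoadData at runtime.
  LLVMMemoryBufferRef ptxBuf;
  if (LLVMTargetMachineEmitToMemoryBuffer(tm, mod, LLVMAssemblyFile, &err, &ptxBuf)) {
    fprintf(stderr, "codegen error: %s\n", err); return 1;
  }
  fwrite(LLVMGetBufferStart(ptxBuf), 1, LLVMGetBufferSize(ptxBuf), stdout);
  return 0;
}
```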