[Backend] Add support for Nvidia GPUs #210

Merged
merged 6 commits into master on Jan 12, 2023

Conversation

mratsim (Owner) commented on Jan 11, 2023

This PR adds initial support for codegen on Nvidia GPUs.

It includes an end-to-end generic field addition kernel for 𝔽p and 𝔽r.

This closes #92.

Overview

As evidenced by the $8M in prizes at https://zprize.io and the many IACR preprints on implementing cryptography on GPUs, there is growing demand for GPU-accelerated cryptography.

Currently there are two main areas of cryptography that benefit from GPUs:

  1. Fully Homomorphic Encryption, which uses lattices as a basis and enables privacy-preserving machine learning. Besides the obvious need for accelerating machine learning, lattice-based cryptography uses matrices extensively, which are an excellent fit for GPUs.
  2. Zero-knowledge primitives for blockchains.

Design considerations

See: https://forum.nim-lang.org/t/9794

As a reminder, we aim for security, performance and compactness. Constantine also strives to keep dependencies to a bare minimum: basically the Nim compiler and a C compiler, and not even the Nim standard library. Other runtime libraries will not share our focus on security, nor our concerns about hidden control flow (exceptions, and defects even on array indexing).

There are multiple strategies to generate code for GPUs in a Nim project:

  1. Write Cuda C as a separate file
    Due to the lack of generics, we would need extensive use of C macros.
  2. Embed Cuda C at compile-time
    Embedding can use string interpolation, like Arraymancer does: https://github.com/mratsim/Arraymancer/blob/v0.7.19/src/arraymancer/tensor/private/incl_higher_order_cuda.nim
    Embedding runs into compiler-flag issues: every flag would need to be gated, for example
    -fpermissive must become -Xcompiler -fpermissive, and we would need to
    hack the build system with
    nim c --cincludes:/opt/cuda/include --cc:clang --clang.exe:/opt/cuda/bin/nvcc --clang.linkerexe:/opt/cuda/bin/nvcc --clang.cpp.exe:/opt/cuda/bin/nvcc --clang.cpp.linkerexe:/opt/cuda/bin/nvcc
    Alternatively, we could compile the Nim and C parts to object files with separate compilers and link them together afterwards. That makes the library hard to use the Nim way (nimble install) instead of as a DLL.
  3. Generate Cuda C at runtime
    With NVRTC we could compile it at runtime, specialized per curve (a rough sketch of this approach is shown after this list).
    Generation can use string interpolation akin to Arraymancer (but at runtime),
    or a code generator like exprgrad's for OpenCL: https://github.com/can-lehmann/exprgrad/blob/v0.1.0/exprgrad/clgen.nim
  4. Strategies 1, 2 or 3 with C++ instead of C
    In cases 1 and 2, that would force downstream projects to use C++ compilation.
    In case 2, embedding Cuda C++ alongside Nim will not compile without jumping through hoops, because there is no way to strip the -std=gnu++14 flag that Nim adds but NVCC doesn't support.
    In case 3, there is no reason to generate C++ at runtime when we could generate C, which is simpler and faster to compile.
  5. Generate GPU code directly from Nim
    This can be done with the cudanim approach, which cleans the Nim AST so that the generated C code is Cuda-compatible. It is, however, tricky to remove stack frames, array-indexing exceptions and the GC's random stack scanning, and it means keeping up with Nim devel changes to the GC, destructors, exceptions and other control-flow machinery.
    It can also be done using nlvm, as axel does (with a fork). This, however, introduces an extra dependency with few eyes / tests / fuzzing on it, and it requires compiling the device code first and then staticRead-ing it (https://github.com/guibar64/axel/blob/5b7cefb/src/axel/build_device_code.nim#L35). Shelling out to a CLI opens up a can of worms of vulnerabilities, from an error that deletes data to a "script-in-the-middle" that hijacks the expected binary.
  6. Generate PTX (Nvidia's virtual assembly for GPUs)
    This requires learning calling conventions, registers, and data and parameter declarations, and writing a code generator. Compared to CPU ISAs, documentation is limited, and testing and ensuring correctness would be a large undertaking.
  7. Generate LLVM IR and compile it to PTX
    This frees us from build and compilation woes. The LLVM instruction set is simple, has verification tools, and can be generated from the Nim AST.
    Furthermore, it makes it easier to target other backends like AMD and Intel GPUs or WASM, or even to produce assembly that MSVC can ingest (as we can't use our inline-assembly codegen with MSVC). Most of the infrastructure can be reused, there are documentation, examples and a community, and LLVM is extensively tested, well maintained and widely used, for example to build everything Apple ships. Given how high-profile it is, that LLVM itself has no dependencies, and its distribution model, supply-chain attacks are unlikely.
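For concreteness, here is a rough sketch of what strategy 3 (runtime generation + NVRTC) could look like. This is illustrative only, not what this PR implements: the kernel template, the function names (compileFieldAddToPtx, fp_add) and the compile options are assumptions.

```cuda
#include <nvrtc.h>
#include <stdio.h>
#include <stdlib.h>

// CUDA C template for a field-addition kernel; the field parameters are
// interpolated into the source at runtime (akin to Arraymancer, but at runtime).
static const char* kernelTemplate =
  "#define NUM_LIMBS %d\n"
  "extern \"C\" __global__ void fp_add(unsigned long long* r,\n"
  "                                    const unsigned long long* a,\n"
  "                                    const unsigned long long* b) {\n"
  "  int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
  "  // ... NUM_LIMBS-limb addition and modular reduction elided ...\n"
  "}\n";

// Generate the curve-specialized source and JIT-compile it to PTX with NVRTC.
// Returns a malloc'ed PTX string, or NULL on compilation failure.
char* compileFieldAddToPtx(int numLimbs) {
  char src[4096];
  snprintf(src, sizeof src, kernelTemplate, numLimbs);

  nvrtcProgram prog;
  nvrtcCreateProgram(&prog, src, "fp_add.cu", 0, NULL, NULL);

  const char* opts[] = {"--gpu-architecture=compute_70"};
  if (nvrtcCompileProgram(prog, 1, opts) != NVRTC_SUCCESS) {
    size_t logSize;
    nvrtcGetProgramLogSize(prog, &logSize);
    char* log = (char*)malloc(logSize);
    nvrtcGetProgramLog(prog, log);
    fprintf(stderr, "NVRTC error:\n%s\n", log);
    free(log);
    nvrtcDestroyProgram(&prog);
    return NULL;
  }

  size_t ptxSize;
  nvrtcGetPTXSize(prog, &ptxSize);
  char* ptx = (char*)malloc(ptxSize);
  nvrtcGetPTX(prog, ptx);
  nvrtcDestroyProgram(&prog);
  return ptx;  // load with the CUDA driver API (cuModuleLoadDataEx), then free
}
```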

Implementation

So we choose to generate LLVM IR, extended with Nvidia PTX inline assembly for the add-with-carry, subtract-with-borrow and multiply-add instructions that have no intrinsics.
In the future, we can add support for AMD GPUs via the AMDGPU backend, Intel GPUs via OpenCL/SPIR-V/Vulkan (requires LLVM 15), Apple GPUs (whenever they make their pipeline public) and Qualcomm Hexagon.
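To make the inline-assembly requirement concrete, below is a minimal sketch of such an add-with-carry chain, written here as a CUDA C device function rather than as the LLVM IR + inline PTX that this PR actually emits. The function name and the fixed 4x64-bit limb count are illustrative assumptions.

```cuda
#include <stdint.h>

// Add-with-carry chain over 4 x 64-bit limbs: add.cc.u64 sets the carry flag,
// addc.cc.u64 consumes and propagates it. There is no C-level intrinsic for
// this, hence the inline PTX (shown from CUDA C for illustration; the PR emits
// the equivalent from LLVM IR). The final carry-out and the conditional
// subtraction of the modulus that a full 𝔽p addition needs are omitted.
__device__ void addcarry_4limbs(uint64_t r[4], const uint64_t a[4], const uint64_t b[4]) {
  asm("add.cc.u64  %0, %4,  %8;\n\t"
      "addc.cc.u64 %1, %5,  %9;\n\t"
      "addc.cc.u64 %2, %6, %10;\n\t"
      "addc.u64    %3, %7, %11;"
      : "=l"(r[0]), "=l"(r[1]), "=l"(r[2]), "=l"(r[3])
      : "l"(a[0]), "l"(a[1]), "l"(a[2]), "l"(a[3]),
        "l"(b[0]), "l"(b[1]), "l"(b[2]), "l"(b[3]));
}
```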

We compile the LLVM IR to PTX via the LLVM NVPTX backend. Nvidia provides its own NVVM backend with extra optimizations, but it uses the LLVM 7.0.1 IR format and it seems the inline-assembly encoding has changed since then.
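Once PTX has been emitted, it can be loaded and launched at runtime through the CUDA driver API. A hedged sketch follows; the function and kernel names are illustrative, not the ones used in this PR, and error handling is omitted.

```cuda
#include <cuda.h>

// Load PTX (e.g. emitted by the NVPTX backend from our LLVM IR) and launch a
// kernel from it. Error checking is omitted for brevity.
void launchFpAdd(const char* ptx, CUdeviceptr r, CUdeviceptr a, CUdeviceptr b, int nElems) {
  CUdevice dev;
  CUcontext ctx;
  CUmodule mod;
  CUfunction fn;

  cuInit(0);
  cuDeviceGet(&dev, 0);
  cuCtxCreate(&ctx, 0, dev);

  // JIT/load the PTX text and look up the kernel entry point by name.
  cuModuleLoadDataEx(&mod, ptx, 0, NULL, NULL);
  cuModuleGetFunction(&fn, mod, "fp_add");

  void* params[] = { &r, &a, &b, &nElems };
  int threads = 256;
  int blocks  = (nElems + threads - 1) / threads;
  cuLaunchKernel(fn, blocks, 1, 1, threads, 1, 1,
                 0 /*shared mem*/, NULL /*stream*/, params, NULL);
  cuCtxSynchronize();

  cuModuleUnload(mod);
  cuCtxDestroy(ctx);
}
```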

@mratsim mratsim merged commit 1f4bb17 into master Jan 12, 2023
@mratsim mratsim deleted the nvidia branch January 12, 2023 00:28
@mratsim mratsim mentioned this pull request Apr 8, 2023