[Backend] Add support for Nvidia GPUs #210
Merged
This PR adds initial support for codegen on Nvidia GPUs.
It includes an end-to-end generic field addition kernel for 𝔽p and 𝔽r.
This closes #92.
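For a concrete picture of what such a kernel computes, here is a minimal CUDA C++ sketch of a 256-bit modular addition, one field element per thread. This is an illustration only: Constantine generates the equivalent code as LLVM IR (see below), and the names, limb count and memory layout here are assumptions, not the PR's actual output.

```cuda
#include <cstdint>

#define LIMBS 4  // 4 x 64-bit limbs = 256 bits (assumption for illustration)

// r = a + b (mod p), assuming a, b < p.
// Compute sum = a + b and tmp = sum - p, then select tmp if the sum
// overflowed 2^256 or if sum >= p (no final borrow) -- selection instead of
// branching, in the spirit of constant-time code.
__device__ void fp_add(uint64_t r[LIMBS], const uint64_t a[LIMBS],
                       const uint64_t b[LIMBS], const uint64_t p[LIMBS]) {
  uint64_t sum[LIMBS], tmp[LIMBS];
  uint64_t carry = 0, borrow = 0;

  for (int i = 0; i < LIMBS; i++) {          // limb-wise addition with carry
    uint64_t t = a[i] + carry;
    uint64_t c = (t < carry) ? 1u : 0u;
    sum[i] = t + b[i];
    carry = c + ((sum[i] < b[i]) ? 1u : 0u);
  }
  for (int i = 0; i < LIMBS; i++) {          // limb-wise subtraction of p with borrow
    uint64_t t = sum[i] - borrow;
    uint64_t bo = (sum[i] < borrow) ? 1u : 0u;
    tmp[i] = t - p[i];
    borrow = bo + ((t < p[i]) ? 1u : 0u);
  }
  // Reduce iff a+b overflowed 2^256 (carry == 1) or a+b >= p (borrow == 0).
  uint64_t mask = (carry | (1u - borrow)) ? ~0ULL : 0ULL;
  for (int i = 0; i < LIMBS; i++)
    r[i] = (tmp[i] & mask) | (sum[i] & ~mask);
}

// One field element per thread: rs[i] = as[i] + bs[i] (mod p).
__global__ void fp_add_kernel(uint64_t* rs, const uint64_t* as,
                              const uint64_t* bs, const uint64_t* p, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    fp_add(&rs[i * LIMBS], &as[i * LIMBS], &bs[i * LIMBS], p);
}
```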
Overview
As evidenced by the $8M in prizes at https://zprize.io and the many IACR preprints on implementing cryptography on GPUs, there is growing demand for GPU-accelerated cryptography.
Currently there are two main areas of cryptography that benefit from GPUs:
Design considerations
See: https://forum.nim-lang.org/t/9794
As a reminder, we aim for security, performance and compactness. Constantine also strives to keep its dependencies to a strict minimum: basically the Nim compiler and a C compiler, not even the Nim standard library. Other runtime libraries would not share our focus on security nor our concerns about hidden control flow (exceptions, defects even on array indexing).
There are multiple strategies to generate code for GPUs in a Nim project:

1. Writing the Cuda kernels directly. Due to the lack of generics, we would need extensive use of C macros.
2. Embedding Cuda C++ in Nim. Embedding can use string interpolation like Arraymancer https://github.com/mratsim/Arraymancer/blob/v0.7.19/src/arraymancer/tensor/private/incl_higher_order_cuda.nim. Embedding will run into compiler flag issues: we would need to gate all flags, for example `-fpermissive` must become `-Xcompiler -fpermissive`, and we also need to hack the build system with `nim c --cincludes:/opt/cuda/include --cc:clang --clang.exe:/opt/cuda/bin/nvcc --clang.linkerexe:/opt/cuda/bin/nvcc --clang.cpp.exe:/opt/cuda/bin/nvcc --clang.cpp.linkerexe:/opt/cuda/bin/nvcc`. Alternatively we could compile the Nim and C code to object files with separate compilers and link them together afterwards, but that makes the library hard to use the Nim way (nimble install) instead of as a DLL.
3. Generating Cuda C++ at runtime. With NVRTC we could compile it at runtime, specialized per curve. Generation can use string interpolation akin to Arraymancer (but at runtime) or a code generator like exprgrad does for OpenCL: https://github.com/can-lehmann/exprgrad/blob/v0.1.0/exprgrad/clgen.nim (see the NVRTC sketch after this list).
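For reference, the NVRTC flow for option 3 would look roughly like the following minimal sketch. The kernel source, names and compile options are placeholders, not code from this PR (which ultimately does not take this route).

```cpp
#include <nvrtc.h>
#include <cstdio>
#include <vector>

// A tiny Cuda kernel, "specialized" at runtime (here via a hardcoded string;
// in practice the source would be generated per curve/field).
static const char* kernelSrc = R"(
extern "C" __global__ void dummy_add(unsigned long long* r,
                                     const unsigned long long* a,
                                     const unsigned long long* b) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  r[i] = a[i] + b[i];
}
)";

int main() {
  nvrtcProgram prog;
  nvrtcCreateProgram(&prog, kernelSrc, "dummy_add.cu", 0, nullptr, nullptr);

  const char* opts[] = {"--gpu-architecture=compute_70"};
  nvrtcResult rc = nvrtcCompileProgram(prog, 1, opts);

  // Always retrieve the log: it holds errors and warnings from the JIT compile.
  size_t logSize;
  nvrtcGetProgramLogSize(prog, &logSize);
  std::vector<char> log(logSize);
  nvrtcGetProgramLog(prog, log.data());
  if (rc != NVRTC_SUCCESS) { printf("%s\n", log.data()); return 1; }

  // The result is PTX text, to be loaded with cuModuleLoadData / cuModuleGetFunction.
  size_t ptxSize;
  nvrtcGetPTXSize(prog, &ptxSize);
  std::vector<char> ptx(ptxSize);
  nvrtcGetPTX(prog, ptx.data());
  nvrtcDestroyProgram(&prog);

  printf("Generated %zu bytes of PTX\n", ptxSize);
  return 0;
}
```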
In cases 1 and 2, this would force downstream projects to use C++ compilation. In case 2, mixing embedded Cuda C++ and Nim will not compile without jumping through hoops, because there is no way to strip the `-std=gnu++14` flag that Nim adds but NVCC doesn't support. In case 3, there is no reason to generate C++ at runtime when we could generate C, which is simpler and faster to compile.
This can be done with the cudanim approach, which cleans up the Nim AST so that the generated C code is Cuda-compatible. It is, however, tricky to remove stackframes, array-indexing exceptions and the GC's random stack scanning, and it means keeping up with Nim devel changes to the GC, destructors, exceptions and control flow.
It can also be done using nlvm, as axel does (with a fork). This however introduces an extra dependency with few eyes / tests / fuzzing on it, and it also requires compiling the device code first and then staticRead-ing it (https://github.com/guibar64/axel/blob/5b7cefb/src/axel/build_device_code.nim#L35). Shelling out to a CLI opens up a can of worms of vulnerabilities, from an error that deletes data to a "script-in-the-middle" attack that hijacks the expected binary.
Generating PTX directly would require learning calling conventions, registers, data and parameter declarations, and writing a full code generator. Compared to CPU ISAs, documentation is limited, and testing and ensuring correctness would be a large undertaking.
Generating LLVM IR frees us from these building/compilation woes. The LLVM instruction set is simple, has verification tools, and can be generated from the Nim AST.
Furthermore, it makes it easier to target other backends like AMD and Intel GPUs or WASM, or even to produce assembly that MSVC can ingest (as we can't use our inline assembly codegen with MSVC). Most of the infrastructure can be reused, and there are documentation, examples and a community. LLVM is extensively tested, well maintained and widely used, for example to build everything Apple ships. Given how high profile it is, that LLVM itself has no dependencies, and its distribution model, supply-chain attacks are unlikely.
Implementation
So we choose to generate LLVM IR, extended with Nvidia PTX inline assembly for the add-with-carry, subtract-with-borrow and multiply-add instructions, which have no intrinsics.
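For illustration, the same add-with-carry chain expressed as CUDA C++ inline PTX looks like the sketch below; the PR emits the equivalent instructions as inline assembly inside the generated LLVM IR, so the function name and limb count here are assumptions.

```cuda
#include <cstdint>

// 256-bit addition as a single chain of PTX add-with-carry instructions.
// A full Fp addition would also capture the final carry and follow with a
// sub.cc.u32 / subc.cc.u32 chain to conditionally subtract the modulus.
__device__ __forceinline__ void add256(uint32_t r[8], const uint32_t a[8],
                                       const uint32_t b[8]) {
  asm("add.cc.u32  %0, %8,  %16;\n\t"   // lowest limb: set the carry flag
      "addc.cc.u32 %1, %9,  %17;\n\t"   // middle limbs: consume and produce carry
      "addc.cc.u32 %2, %10, %18;\n\t"
      "addc.cc.u32 %3, %11, %19;\n\t"
      "addc.cc.u32 %4, %12, %20;\n\t"
      "addc.cc.u32 %5, %13, %21;\n\t"
      "addc.cc.u32 %6, %14, %22;\n\t"
      "addc.u32    %7, %15, %23;\n\t"   // top limb: consume carry, discard carry-out
      : "=r"(r[0]), "=r"(r[1]), "=r"(r[2]), "=r"(r[3]),
        "=r"(r[4]), "=r"(r[5]), "=r"(r[6]), "=r"(r[7])
      : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
        "r"(a[4]), "r"(a[5]), "r"(a[6]), "r"(a[7]),
        "r"(b[0]), "r"(b[1]), "r"(b[2]), "r"(b[3]),
        "r"(b[4]), "r"(b[5]), "r"(b[6]), "r"(b[7]));
}
```

Keeping the whole chain in one asm statement matters: the carry flag is not guaranteed to survive across separate inline-assembly statements, since the compiler may schedule other instructions between them.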
In the future, we can add support for AMD GPUs via the AMDGPU backend, Intel GPUs via OpenCL/SPIR-V/Vulkan (requires LLVM 15), Apple GPUs (when they decide to make their pipeline public), and Qualcomm Hexagon.
We compile the LLVM IR to PTX via the LLVM NVPTX backend. Nvidia provides its own NVVM backend with extra optimizations, but it uses the LLVM 7.0.1 IR format and it seems the inline assembly encoding has changed since then.
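For context, driving the NVPTX backend through the LLVM-C API to turn a textual IR module into PTX looks roughly like this. It is a sketch: Constantine drives the same API through its own Nim bindings and builds the module in memory, and the input file name, target CPU (sm_80) and error handling here are assumptions.

```cpp
#include <llvm-c/Core.h>
#include <llvm-c/IRReader.h>
#include <llvm-c/Target.h>
#include <llvm-c/TargetMachine.h>
#include <cstdio>

int main() {
  // NVPTX is an LLVM target like any other; initialize its codegen components.
  LLVMInitializeNVPTXTargetInfo();
  LLVMInitializeNVPTXTarget();
  LLVMInitializeNVPTXTargetMC();
  LLVMInitializeNVPTXAsmPrinter();

  // Parse a textual IR module from disk (hypothetical file name).
  LLVMContextRef ctx = LLVMContextCreate();
  LLVMMemoryBufferRef irBuf;
  char* err = nullptr;
  if (LLVMCreateMemoryBufferWithContentsOfFile("fp_add.ll", &irBuf, &err)) {
    fprintf(stderr, "read error: %s\n", err); return 1;
  }
  LLVMModuleRef mod;
  if (LLVMParseIRInContext(ctx, irBuf, &mod, &err)) {
    fprintf(stderr, "parse error: %s\n", err); return 1;
  }

  // Create a target machine for 64-bit PTX, targeting e.g. sm_80.
  const char* triple = "nvptx64-nvidia-cuda";
  LLVMTargetRef target;
  if (LLVMGetTargetFromTriple(triple, &target, &err)) {
    fprintf(stderr, "target error: %s\n", err); return 1;
  }
  LLVMTargetMachineRef tm = LLVMCreateTargetMachine(
      target, triple, "sm_80", "",
      LLVMCodeGenLevelDefault, LLVMRelocDefault, LLVMCodeModelDefault);

  // "Assembly" output from the NVPTX backend is the PTX text itself,
  // ready to be loaded with cuModuleLoadData at runtime.
  LLVMMemoryBufferRef ptxBuf;
  if (LLVMTargetMachineEmitToMemoryBuffer(tm, mod, LLVMAssemblyFile, &err, &ptxBuf)) {
    fprintf(stderr, "codegen error: %s\n", err); return 1;
  }
  fwrite(LLVMGetBufferStart(ptxBuf), 1, LLVMGetBufferSize(ptxBuf), stdout);
  return 0;
}
```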