GPU backends #92

Closed
mratsim opened this issue Sep 27, 2020 · 5 comments · Fixed by #210

mratsim commented Sep 27, 2020

Zero Knowledge Proofs work by handling constraint circuits with millions of gates, each corresponding to a field operation.

Those operations can be executed in parallel, and Constantine's fully constant-time, branch-free design actually helps avoid divergence at the GPU warp level.
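
To illustrate why branch-free code suits a warp, here is a minimal Python sketch of a mask-based conditional select on 32-bit words. The names `ct_select` and `select_lanes` are hypothetical and are not Constantine's API; the point is only that every lane runs the same instruction sequence whatever its condition bit is.

```python
MASK32 = 0xFFFFFFFF

def ct_select(cond: int, a: int, b: int) -> int:
    """Return a if cond == 1 else b, without branching on cond."""
    mask = (-cond) & MASK32          # all ones if cond == 1, zero if cond == 0
    return (b ^ (mask & (a ^ b))) & MASK32

def select_lanes(conds, xs, ys):
    """Apply the same select to every lane; no lane takes a different path."""
    return [ct_select(c, x, y) for c, x, y in zip(conds, xs, ys)]
```

Since the condition only feeds arithmetic and never a jump, lanes with different condition bits do not diverge.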

Resources:

mratsim added the enhancement :shipit: and Zero Knowledge 🤫 labels on Sep 27, 2020
mratsim changed the title from "GPU backend for Zero Knowledge Proofs" to "GPU backends" on Apr 8, 2023
mratsim reopened this on Apr 8, 2023

mratsim commented Apr 8, 2023

Reopening this to track potential GPU backends.

Overview

For now we want to limit ourselves to backends supported by LLVM.

Another approach would be to write a source code generator and use the corresponding runtime compiler, for example for OpenCL or Apple Metal.

AMD GPUs

We can use the LLVM AMDGPU backend, which is considered stable and is included in all recent LLVM builds by default. See https://www.llvm.org/docs/AMDGPUUsage.html

Relevant inline assembly:

  • S_ADDC_U32: add-with-carry
  • S_SUBB_U32: sub-with-borrow
  • S_MUL_HI_U32: extended precision multiplication high limb
  • S_CSELECT_B32(dst, a, b): conditional select a if SCC flag set, b otherwise
  • S_CMOV_B32(dst, a): conditional move a into dst if SCC flag set.
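
As a rough illustration of how these primitives combine into multi-limb arithmetic, the following Python sketch models a 32-bit add-with-carry and a high-limb multiplication on plain integers. The function names are hypothetical and this only models the semantics; it is not inline assembly and not the code used in the backend.

```python
MASK32 = 0xFFFFFFFF

def addc_u32(a: int, b: int, carry_in: int):
    """Model of a 32-bit add-with-carry: returns (sum, carry_out)."""
    total = a + b + carry_in
    return total & MASK32, total >> 32

def mul_hi_u32(a: int, b: int) -> int:
    """Model of the high 32 bits of a 32x32 -> 64-bit product."""
    return (a * b) >> 32

def add_limbs(x, y):
    """Add two equal-length little-endian limb lists.

    Returns (result limbs, final carry). The carry is threaded from
    limb to limb, which is what a chain of add-with-carry does.
    """
    out, carry = [], 0
    for xi, yi in zip(x, y):
        s, carry = addc_u32(xi, yi, carry)
        out.append(s)
    return out, carry
```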

See RDNA 3 ISA doc: https://www.amd.com/system/files/TechDocs/rdna3-shader-instruction-set-architecture-feb-2023_0.pdf

Apple Metal

There is no official LLVM IR to Apple Metal backend, but Apple uses a fork of LLVM.
By linking against it, it might be possible to generate Metal shaders using a target triple of the form:

  • air64-apple-macos13.0
  • air64-apple-ios16.0-macabi

(see https://developer.apple.com/forums/thread/707695)

Metal doesn't seem to allow assembly for add-with-carry and extended precision multiplication.

Nvidia CUDA

Backend configured and added in #210

OpenCL

Generating OpenCL code through LLVM requires going through SPIR-V and loading the resulting kernel through clCreateProgramWithIL.

SPIR-V is an experimental backend starting from LLVM 15 and likely needs to be enabled through LLVM_EXPERIMENTAL_TARGETS_TO_BUILD (see https://stackoverflow.com/questions/46905464/how-to-enable-a-llvm-backend and https://reviews.llvm.org/D115009).

Alternatively, there is https://github.com/KhronosGroup/SPIRV-LLVM-Translator, but it would require compiling Nim in C++ mode.

Intel GPUs inline assembly:

  • ADDC: add with carry
  • SUBB: sub with borrow
  • MULH: extended precision multiplication high limb
  • SEL(dst, a, b): conditional select a if predicate is set else b
  • MAD(dst, a, b, c): multiply-add dst = a*b+c or multiply-accumulate dst += a*b+c
  • MADW(dst, a, b, c): extended precision MAD, stores the full 64-bit result
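
To show where a widening multiply-add fits, here is a small Python model of multiplying a multi-limb number by one 32-bit word. `mad_wide` is a hypothetical helper that approximates the "full 64-bit result" behaviour described for MADW above; the exact operand conventions of the Intel instruction are not modelled.

```python
MASK32 = 0xFFFFFFFF

def mad_wide(a: int, b: int, c: int):
    """Model of a widening multiply-add: returns (hi, lo) of a*b + c.

    With 32-bit inputs, a*b + c always fits in 64 bits.
    """
    t = a * b + c
    return t >> 32, t & MASK32

def mul_limbs_by_word(x, w: int):
    """Multiply a little-endian limb list x by a single 32-bit word w.

    The high half of each step becomes the addend of the next step.
    """
    out, carry = [], 0
    for xi in x:
        carry, lo = mad_wide(xi, w, carry)
        out.append(lo)
    out.append(carry)
    return out
```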

See also:

ARM GPUs inline assembly:
TODO

Other backends

Backends superseded by vendor-specific backends, in particular because they are not available in https://github.com/llvm/llvm-project/tree/main/llvm/lib/Target or do not allow inline assembly for add-with-carry and extended precision multiplication:


mratsim commented Aug 2, 2024

Some more investigation on the AMD backend: https://www.amd.com/content/dam/amd/en/documents/radeon-tech-docs/instruction-set-architectures/rdna3-shader-instruction-set-architecture-feb-2023_0.pdf

There are scalar and vector execution units, with 8x more vector units in this example.
[screenshot: spectacle_20240802_111949]
There are up to 2 scalar units per wave (32 units), and vector add-with-carry does exist.
[screenshot: spectacle_20240802_112424]

The S_ADDC_U32 instruction mentioned earlier is actually for scalar code,
[screenshot: spectacle_20240802_112658]

but we in fact need vector code, which does exist:
[screenshot: spectacle_20240802_112933]
[screenshot: spectacle_20240802_113017]

However, AMD doesn't seem to provide an auto-vectorizer like the one we get with the Nvidia PTX virtual ISA, so we'll have to vectorize the code ourselves, i.e. implement fp_add_x32, fp_mul_x32, ...
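
A possible shape for such a hand-vectorized routine, sketched in Python under stated assumptions: one field element per lane, limbs stored limb-major (`limbs[i][lane]`), inputs already reduced below the modulus. Only the name `fp_add_x32` comes from the comment above; the layout, the helpers `to_lanes` and `from_lanes`, and the algorithm are illustrative guesses, not the actual implementation.

```python
MASK32 = 0xFFFFFFFF

def to_lanes(values, n_limbs):
    """Pack integers into limb-major layout: limbs[i][lane]."""
    return [[(v >> (32 * i)) & MASK32 for v in values] for i in range(n_limbs)]

def from_lanes(limbs):
    """Unpack limb-major layout back into one integer per lane."""
    return [sum(limbs[i][j] << (32 * i) for i in range(len(limbs)))
            for j in range(len(limbs[0]))]

def fp_add_x32(a, b, modulus):
    """Lane-wise (a + b) mod p; modulus is a little-endian limb list."""
    n = len(modulus)
    lanes = range(len(a[0]))
    # Step 1: per-lane add-with-carry chain over the limbs.
    total, carry = [], [0 for _ in lanes]
    for i in range(n):
        t = [a[i][j] + b[i][j] + carry[j] for j in lanes]
        total.append([x & MASK32 for x in t])
        carry = [x >> 32 for x in t]
    # Step 2: per-lane sub-with-borrow chain subtracting the modulus.
    diff, borrow = [], [0 for _ in lanes]
    for i in range(n):
        t = [total[i][j] - modulus[i] - borrow[j] for j in lanes]
        diff.append([x & MASK32 for x in t])
        borrow = [(x >> 32) & 1 for x in t]
    # Step 3: branch-free select. Keep the difference when the addition
    # carried out or the subtraction did not borrow (i.e. sum >= p).
    mask = [(-(carry[j] | (borrow[j] ^ 1))) & MASK32 for j in lanes]
    return [[total[i][j] ^ (mask[j] & (total[i][j] ^ diff[i][j])) for j in lanes]
            for i in range(n)]
```

Every lane executes the same three steps, so the only per-lane state is the carry, borrow and mask vectors.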


mratsim commented Aug 2, 2024

Looking at some of the AMD codegen, it might be that we can stay within LLVM IR as there were some LLVM improvements related to add with carry:

Though we'll likely have the same issue with sub-with-borrow (what's the canonical IR?)
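
One candidate pattern, by analogy with add-with-carry written as two chained overflow-checked additions, is two chained unsigned subtractions whose overflow flags are OR-ed together (in LLVM IR this would presumably use `llvm.usub.with.overflow`). The Python model below only checks that the pattern computes the right borrow; whether the AMDGPU backend folds it into a single sub-with-borrow instruction is an open question here, and the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def usub_overflow(a: int, b: int):
    """Model of an unsigned 32-bit subtract with overflow flag."""
    return (a - b) & MASK32, int(a < b)

def subb_u32(a: int, b: int, borrow_in: int):
    """Sub-with-borrow built from two overflow-checked subtractions.

    The two flags can never both be set, so OR-ing them is exact.
    """
    d1, b1 = usub_overflow(a, b)
    d2, b2 = usub_overflow(d1, borrow_in)
    return d2, b1 | b2

def sub_limbs(x, y):
    """Subtract little-endian limb lists; returns (limbs, final borrow)."""
    out, borrow = [], 0
    for xi, yi in zip(x, y):
        d, borrow = subb_u32(xi, yi, borrow)
        out.append(d)
    return out, borrow
```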


mratsim commented Aug 5, 2024

For OpenCL / SPIR-V on Intel GPUs, 2 extensions are of particular interest:

  • SPV_INTEL_arbitrary_precision_integers
  • SPV_INTEL_inline_assembly

as inline assembly would let us guarantee that the add-with-carry instruction from the Intel virtual ISA is emitted: https://github.com/intel/intel-graphics-compiler/blob/master/documentation/visa/6_instructions.md


mratsim commented Aug 27, 2024

Closed by #465

@mratsim mratsim closed this as completed Aug 27, 2024