Implement finite field ccopy, neg, cneg, nsqr, ... for CUDA target #466

Open · wants to merge 63 commits into base: master
Conversation

@Vindaar (Collaborator) commented Sep 10, 2024

This extends the existing add, mul and sub operations on finite fields for the Nvidia target with ccopy, neg, cneg and nsqr. For nsqr we currently have two different implementations (see the sketch after the list):

  1. generate N unrolled squaring operations from a value known at Nim runtime, i.e. at JIT compile time
  2. generate a loop on the GPU, using a phi node and branches, for values only known at JIT runtime
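
A rough sketch of the difference between the two variants. This is hypothetical: `Assembler_LLVM`, `ValueRef` and `asy.store` are names used in this PR, but `emitSquare`, `emitLoop` and these exact signatures are placeholders for illustration, not the actual API.

  # Hypothetical sketch; `emitSquare`, `emitLoop` and the signatures are invented.

  # 1. N is a Nim runtime value, i.e. known at JIT compile time:
  #    simply unroll N field squarings into the kernel body.
  proc genNsqrUnrolled(asy: Assembler_LLVM, r, a: ValueRef, N: int) =
    asy.store(r, a)
    for _ in 0 ..< N:
      asy.emitSquare(r)        # one squaring instruction sequence per iteration

  # 2. N is only known at JIT runtime (it is a kernel argument): build an
  #    explicit loop with a counter phi node and a conditional branch.
  proc genNsqrRuntime(asy: Assembler_LLVM, r, a, n: ValueRef) =
    asy.store(r, a)
    asy.emitLoop(n):           # lowers to: phi(counter), loop body with one
      asy.emitSquare(r)        # squaring, icmp + branch back to the body

The runtime variant is why the `addIncoming` note in the commits below matters: the φ node for the loop counter needs incoming values from both the entry block and the loop body block.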

I added a simple test case for each operation, which compares the result with the equivalent operation on the CPU. Currently I just cloned the test file for each case; it would probably be wise to unify the tests and abstract away some of the boilerplate.

Update 2024/09/19

I have now added the following additional finite field operations (a word-level sketch of the conditional variants' semantics follows the list):

  • setZero
  • cadd
  • csub
  • double
  • isZero
  • isOdd
  • shiftRight
  • div2
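
For reference, here is a word-level sketch of the semantics the conditional variants (ccopy, cadd, csub, cneg) are meant to have, written in plain Nim rather than as generated GPU code. The mask-based formulation is just one way to express "select without a data-dependent branch"; it is not a claim about how the kernels are implemented.

  # Plain-Nim, single-word sketch of the conditional-copy semantics; the GPU
  # versions work on full multi-word field elements via the LLVM builder.
  proc ccopyWord(r: var uint64, a: uint64, condition: bool) =
    ## r <- a if condition else r, without branching on `condition`
    let mask = 0'u64 - uint64(condition)   # all ones if true, all zeros if false
    r = (r and not mask) or (a and mask)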

In addition, I split the implementation of each finite field op into an internal and a public part, so that the operations can be reused inside other ops.
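
To illustrate the pattern (a hypothetical sketch: `FieldDescriptor`, `add_internal` and these signatures are invented for illustration, not the PR's actual code): the `_internal` procs operate on raw ValueRefs so other ops can call them directly, while the public procs only define the kernel entry point and forward to the internal implementation.

  # Invented names and signatures, for illustration of the pattern only.
  proc add_internal(asy: Assembler_LLVM, fd: FieldDescriptor, r, a, b: ValueRef) =
    ## emits the modular addition instructions; reusable from other field/EC ops
    discard

  proc double_internal(asy: Assembler_LLVM, fd: FieldDescriptor, r, a: ValueRef) =
    ## another internal op can simply reuse the one above: 2a = a + a
    asy.add_internal(fd, r, a, a)

  proc add(asy: Assembler_LLVM, fd: FieldDescriptor, name: string) =
    ## public part: declares the kernel via `llvmPublicFnDef`, unpacks the
    ## kernel parameters and forwards them to `add_internal`
    discard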

Further, I added distinct Array types to distinguish finite field points from elliptic curve points. With those, I then added a 'test case' of sorts, in which I ported the EC short Weierstrass Jacobian sumImpl logic line by line. With the help of some templates the implementation is now essentially the same as on the CPU, and the code produces the same result.
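
Roughly, the idea behind the distinct types (a hypothetical sketch; the actual definitions in the PR may differ):

  # Hypothetical sketch: wrapping the raw `Array` in distinct types lets
  # overload resolution tell field elements and EC points apart, instead of
  # every proc taking a bare Array / ValueRef.
  type
    FieldPoint = distinct Array     # a finite field element on the GPU
    EcPoint = distinct Array        # an elliptic curve point built from field points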

Next I'll clean up that test case and add the EC point addition to the pub_curves.nim module.

Update 2024/09/23

I've now added the EC addition to a new pub_curves.nim module. In addition, I added a helper type NvidiaAssembler to simplify the Nvidia setup & compilation process. It can now be done with 2 lines:

  let nv = initNvAsm(EC_ShortW_Jac[field, G1], wordSize)  # either takes an EC_ShortW_Jac type or a Fp / Fr type
  let kernel = nv.compile(genEcSum) # pass the proc defining an `llvmPublicFnDef` 

(alternatively one can use the Assembler_LLVM part of the NvidiaAssembler object to build instructions / call a function and then just pass the name of the kernel to compile)

Using this, the boilerplate in the tests is reduced massively.

Note: The t_ec_sum_port.nim file is the one I used to write the port. I think it can be useful to give ideas on how to port some of our existing CPU logic to the LLVM target.

Next: I'll add other EC operations.

Update 2024/09/26

Over the last few days I have:

  • split the EcPoint type into EcPointAff, EcPointJac and EcPointProj, which each have their own set of internal / public procedures
  • added more misc. field and EC operations
  • added templates fieldOps, ellipticOps, ellipticAffOps that inject multiple templates for the operations, allowing us to write code equivalent to the CPU code (see the sketch after this list)
  • ported mixedSum for Jacobian + Affine EC points. Thanks to the templates, all that was needed to port the code was to replace variable declarations and `=` assignments by var X = asy.newEcPointJac(ed) and store(X, Y). Aside from a minor issue caused by an isNeutral name collision (internal procs still take ValueRef arguments, so overload resolution is a problem; the affine version has since been renamed), it worked immediately.
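
As mentioned in the list above, here is a hypothetical sketch of what such an op-injecting template could look like (the helper names `sum_internal`, `double_internal`, `ccopy_internal` and the exact shape are placeholders, not necessarily the PR's actual ellipticOps):

  # Placeholder sketch of an op-injecting template; details invented.
  template ellipticOps(asy, ed: untyped): untyped {.dirty.} =
    template `+=`(a, b: EcPointJac) = asy.sum_internal(ed, a, a, b)
    template double(r, a: EcPointJac) = asy.double_internal(ed, r, a)
    template ccopy(a, b: EcPointJac, c: ValueRef) = asy.ccopy_internal(ed, a, b, c)

With these injected at the top of a kernel body, the ported sumImpl / mixedSum code can keep the CPU code's shape (X += Y, X.double(Y), ...), with only variable declarations and assignments replaced as described above.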

E.g. when calling `addIncoming` for φ nodes, one needs to pass the
current function.
`nsqr` exists both as a 'compile time' and a 'runtime' version, where
compile time here refers to the JIT compilation of the CUDA code.
Adds a very basic load / store example (useful to understand how data
is passed to the GPU depending on the type)
The extra special out-of-place `r` argument was a bit useless
Because we might not always want to load directly, but rather keep the
pointer to some arbitrary (nested) array.
This allows for a bit more sanity wrt differentiating between field
points and EC points
We will need 'prime plus 1 div 2' later to implement `div2` for finite
field points.
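
The reason, sketched on a single toy word in plain Nim (not the multi-word GPU code): halving modulo an odd prime p maps an odd element a to (a shr 1) + (p+1)/2, so the constant (p+1)/2 is worth precomputing. The proc name below is invented for illustration.

  # Plain-Nim, single-word sketch of the arithmetic behind `div2`
  # (assumes a < p and p odd; Constantine works on multi-word field elements):
  #   a/2 ≡ (a shr 1) + isOdd(a) * ((p+1) div 2)   (mod p)
  # because for odd a, a + p is even and (a + p) div 2 = (a shr 1) + (p+1) div 2.
  proc div2Mod(a, p: uint64): uint64 =
    let halfPplus1 = (p + 1) div 2   # the precomputed "prime plus 1 div 2"
    result = a shr 1
    if (a and 1) == 1:               # a constant-time version would use a
      result += halfPplus1           # conditional add instead of this branch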

The code is directly ported from the `precompute.nim` logic. Ideally
we could avoid the duplication of logic, but so be it.
I mainly added it to debug a bug I saw
I started with the CPU `sumImpl` template and line by line added each
operation for the GPU. With a bunch of helper templates the code
essentially looks identical.

I checked every line to see if they match. Hence all the commented out
`asy.store()` instructions and different proc signatures.
@Vindaar changed the title from "Implement finite field ccopy, neg, cneg, nsqr for CUDA target" to "Implement finite field ccopy, neg, cneg, nsqr, ... for CUDA target" on Sep 19, 2024
Makes it clearer what the "test" does and adds a doc comment at the top
explaining how it was used
Deals with deciding whether to allocate or just pass by value, though in a
very simple manner!
EcPoint is now EcPointJac.

We will have separate files for different coordinates, like for the
CPU code. The distinct Array types will be defined in their respective files.
A slightly more type-safe version of the templates previously included in
some of the procs.
Raises at Nim runtime, i.e. at LLVM code construction time. The exception
should never be raised under normal conditions, only if the code
construction is wrong.
TODO: We still need to differentiate between Nim 2.0 and devel due to the
`var object` requirement for destroy there, IIRC.
We could consider making all the `_internal` procedures take `EcPoint`
etc. types. That way overload resolution would not be a problem and we'd
avoid more nasty bugs.