Implement finite field ccopy, neg, cneg, nsqr, ... for CUDA target #466

Open · wants to merge 63 commits into base: master
Conversation

@Vindaar (Collaborator) commented Sep 10, 2024

This extends the existing add, mul and sub operations on finite fields for the Nvidia target with ccopy, neg, cneg and nsqr. For nsqr we currently have two different implementations (see the sketch after the list):

  1. generate N unrolled squaring operations from a value known at Nim runtime, i.e. at JIT compile time
  2. generate a loop on the GPU, using a phi node and branches, for values only known at JIT runtime
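
A rough sketch of the difference between the two variants. This is hypothetical: `Assembler_LLVM`, `ValueRef` and `asy.store` are names used in this PR, but `emitSquare`, `emitLoop` and these exact signatures are placeholders for illustration, not the actual API.

  # Hypothetical sketch; `emitSquare`, `emitLoop` and the signatures are invented.

  # 1. N is a Nim runtime value, i.e. known at JIT compile time:
  #    simply unroll N field squarings into the kernel body.
  proc genNsqrUnrolled(asy: Assembler_LLVM, r, a: ValueRef, N: int) =
    asy.store(r, a)
    for _ in 0 ..< N:
      asy.emitSquare(r)        # one squaring instruction sequence per iteration

  # 2. N is only known at JIT runtime (it is a kernel argument): build an
  #    explicit loop with a counter phi node and a conditional branch.
  proc genNsqrRuntime(asy: Assembler_LLVM, r, a, n: ValueRef) =
    asy.store(r, a)
    asy.emitLoop(n):           # lowers to: phi(counter), loop body with one
      asy.emitSquare(r)        # squaring, icmp + branch back to the body

The runtime variant is why the `addIncoming` note in the commits below matters: the φ node for the loop counter needs incoming values from both the entry block and the loop body block.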

I added a simple test case for each operation, which compares the result with the equivalent operation on the CPU. Currently I just cloned the test file for each case; it would probably be wise to unify the tests and abstract away some of the boilerplate.

Update 2024/09/19

I have now added the following additional finite field operations (a word-level sketch of the conditional variants' semantics follows the list):

  • setZero
  • cadd
  • csub
  • double
  • isZero
  • isOdd
  • shiftRight
  • div2
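
For reference, here is a word-level sketch of the semantics the conditional variants (ccopy, cadd, csub, cneg) are meant to have, written in plain Nim rather than as generated GPU code. The mask-based formulation is just one way to express "select without a data-dependent branch"; it is not a claim about how the kernels are implemented.

  # Plain-Nim, single-word sketch of the conditional-copy semantics; the GPU
  # versions work on full multi-word field elements via the LLVM builder.
  proc ccopyWord(r: var uint64, a: uint64, condition: bool) =
    ## r <- a if condition else r, without branching on `condition`
    let mask = 0'u64 - uint64(condition)   # all ones if true, all zeros if false
    r = (r and not mask) or (a and mask)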

In addition, I split the implementation of each finite field op into an internal and a public part, so that the operations can be reused inside other ops.
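
To illustrate the pattern (a hypothetical sketch: `FieldDescriptor`, `add_internal` and these signatures are invented for illustration, not the PR's actual code): the `_internal` procs operate on raw ValueRefs so other ops can call them directly, while the public procs only define the kernel entry point and forward to the internal implementation.

  # Invented names and signatures, for illustration of the pattern only.
  proc add_internal(asy: Assembler_LLVM, fd: FieldDescriptor, r, a, b: ValueRef) =
    ## emits the modular addition instructions; reusable from other field/EC ops
    discard

  proc double_internal(asy: Assembler_LLVM, fd: FieldDescriptor, r, a: ValueRef) =
    ## another internal op can simply reuse the one above: 2a = a + a
    asy.add_internal(fd, r, a, a)

  proc add(asy: Assembler_LLVM, fd: FieldDescriptor, name: string) =
    ## public part: declares the kernel via `llvmPublicFnDef`, unpacks the
    ## kernel parameters and forwards them to `add_internal`
    discard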

Further, I added distinct Array types to distinguish finite field points from elliptic curve points. With those, I then added a 'test case' of sorts, in which I ported the EC short Weierstrass Jacobian sumImpl logic line by line. With the help of some templates the implementation is now essentially the same as on the CPU, and the code produces the same result.
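
Roughly, the idea behind the distinct types (a hypothetical sketch; the actual definitions in the PR may differ):

  # Hypothetical sketch: wrapping the raw `Array` in distinct types lets
  # overload resolution tell field elements and EC points apart, instead of
  # every proc taking a bare Array / ValueRef.
  type
    FieldPoint = distinct Array     # a finite field element on the GPU
    EcPoint = distinct Array        # an elliptic curve point built from field points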

Next I'll clean up that test case and add the EC point addition to the pub_curves.nim module.

Update 2024/09/23

I've now added the EC addition to a new pub_curves.nim module. In addition, I added a helper type NvidiaAssembler to simplify the Nvidia setup & compilation process. It can now be done with 2 lines:

  let nv = initNvAsm(EC_ShortW_Jac[field, G1], wordSize)  # either takes an EC_ShortW_Jac type or a Fp / Fr type
  let kernel = nv.compile(genEcSum) # pass the proc defining an `llvmPublicFnDef` 

(alternatively one can use the Assembler_LLVM part of the NvidiaAssembler object to build instructions / call a function and then just pass the name of the kernel to compile)

Using this, the boilerplate in the tests is reduced massively.

Note: The t_ec_sum_port.nim file is the one I used to write the port. I think it can be useful to give ideas on how to port some of our existing CPU logic to the LLVM target.

Next: I'll add other EC operations.

Update 2024/09/26

Over the last few days I have:

  • split the EcPoint type into EcPointAff, EcPointJac and EcPointProj, which each have their own set of internal / public procedures
  • added more misc. field and EC operations
  • added templates fieldOps, ellipticOps, ellipticAffOps that inject multiple templates for the operations, allowing us to write code equivalent to the CPU code (see the sketch after this list)
  • ported mixedSum for Jacobian + Affine EC points. Thanks to the templates, all that was needed to port the code was to replace variable declarations and `=` assignments by var X = asy.newEcPointJac(ed) and store(X, Y). Aside from a minor issue caused by an isNeutral name collision (internal procs still take ValueRef arguments, so overload resolution is a problem; the affine version has since been renamed), it worked immediately.
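
As mentioned in the list above, here is a hypothetical sketch of what such an op-injecting template could look like (the helper names `sum_internal`, `double_internal`, `ccopy_internal` and the exact shape are placeholders, not necessarily the PR's actual ellipticOps):

  # Placeholder sketch of an op-injecting template; details invented.
  template ellipticOps(asy, ed: untyped): untyped {.dirty.} =
    template `+=`(a, b: EcPointJac) = asy.sum_internal(ed, a, a, b)
    template double(r, a: EcPointJac) = asy.double_internal(ed, r, a)
    template ccopy(a, b: EcPointJac, c: ValueRef) = asy.ccopy_internal(ed, a, b, c)

With these injected at the top of a kernel body, the ported sumImpl / mixedSum code can keep the CPU code's shape (X += Y, X.double(Y), ...), with only variable declarations and assignments replaced as described above.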

E.g. when calling `addIncoming` for φ nodes, one needs to pass the
current function.
`nsqr` exists both as a 'compile time' and a 'runtime' version, where
compile time here refers to the JIT compilation of the CUDA code.
Adds a very basic load / store example (useful to understand how data
is passed to the GPU depending on the type)
The extra special out-of-place `r` argument was a bit useless
Because we might not always want to load directly, but rather keep the
pointer to some arbitrary (nested) array.
This allows for a bit more sanity wrt differentiating between field
points and EC points
We will need 'prime plus 1 div 2' later to implement `div2` for finite
field points.
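
The reason, sketched on a single toy word in plain Nim (not the multi-word GPU code): halving modulo an odd prime p maps an odd element a to (a shr 1) + (p+1)/2, so the constant (p+1)/2 is worth precomputing. The proc name below is invented for illustration.

  # Plain-Nim, single-word sketch of the arithmetic behind `div2`
  # (assumes a < p and p odd; Constantine works on multi-word field elements):
  #   a/2 ≡ (a shr 1) + isOdd(a) * ((p+1) div 2)   (mod p)
  # because for odd a, a + p is even and (a + p) div 2 = (a shr 1) + (p+1) div 2.
  proc div2Mod(a, p: uint64): uint64 =
    let halfPplus1 = (p + 1) div 2   # the precomputed "prime plus 1 div 2"
    result = a shr 1
    if (a and 1) == 1:               # a constant-time version would use a
      result += halfPplus1           # conditional add instead of this branch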

The code is directly ported from the `precompute.nim` logic. Ideally
we could avoid the duplication of logic, but so be it.
I mainly added it to debug a bug I saw
I started with the CPU `sumImpl` template and line by line added each
operation for the GPU. With a bunch of helper templates the code
essentially looks identical.

I checked every line to see if they match. Hence all the commented out
`asy.store()` instructions and different proc signatures.
@Vindaar changed the title from "Implement finite field ccopy, neg, cneg, nsqr for CUDA target" to "Implement finite field ccopy, neg, cneg, nsqr, ... for CUDA target" on Sep 19, 2024
Makes it clearer what the "test" does and adds a doc comment at the top
explaining how it was used
Deals with deciding whether to allocate or just pass by value, though in a
very simple manner!
EcPoint is now EcPointJac.

We will have separate files for different coordinates, like for the
CPU code. The distinct Array types will be defined in their respective files.
A slightly more type-safe version of the templates previously included in
some of the procs.
Raises at Nim runtime, i.e. at LLVM code construction time. The exception
should never be raised under normal conditions, only if the code
construction is wrong.
TODO: We still need to differentiate between Nim 2.0 and devel due to the
`var object` requirement for destroy there, IIRC.
We could consider making all the `_internal` procedures take `EcPoint`
etc. types. That way overload resolution would not be a problem and we'd
avoid more nasty bugs.