Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Internal API: in-place vs result #21

Closed
mratsim opened this issue Apr 11, 2020 · 1 comment
Closed

Internal API: in-place vs result #21

mratsim opened this issue Apr 11, 2020 · 1 comment

Comments

@mratsim
Copy link
Owner

mratsim commented Apr 11, 2020

Currently Constantine is using the following internal API for add, sub, mul:

func `+=`*(a: var Fp, b: Fp)
func sum*(r: var Fp, a, b: Fp)
func `-=`*(a: var Fp, b: Fp)
func diff*(r: var Fp, a, b: Fp)
func prod*(r: var Fp, a, b: Fp)

However, should we instead use the following?

func `+=`*(a: var Fp, b: Fp)
func `+`*(a, b: Fp): Fp
func `-=`*(a: var Fp, b: Fp)
func `-`*(a, b: Fp): Fp
func `*`*(r: var Fp, a, b: Fp)

Tradeoffs

Ergonomics

While those APIs are internal, they are used in some quite long formulas and it's easier to implement them without mistakes if we closely follow the mathematical notation, for example for Algorithm 7 below
image

Performance

There are a couple potential performance issue.
Note that Fp is often Fp12 in pairing cases, which on the BLS12-381 curve (6 64-bit limbs) is of size 12 * 6 * 8 bytes = 576 bytes and so a huge parameter to pass by value semantics.

Parameter-passing of arrays

To be investigated, if we have a static array as result values, are they passed by stack? Or are they preallocated in the caller context and passed by EAX/RAX register.

Register vs memory

Key operations like add-with-carry behave differently if the destination is a memory location or a register. The latency can be up to 6 cycles instead of 1 cycle, see Agner Fog tables: https://www.agner.org/optimize/instruction_tables.pdf

Nehalem (2008)
image
Haswell (2013)
image
Skylake (2015)
image
Skylake-X (2015 - latest for consumer because Intel cannot produce new architectures)
image

Left highlighted column is latency: how many cycles do you have to wait if you depend on this result for further operation (like chained in-place additions or depending on a carry).

Right highlighted column is reciprocal throughput, how many cycles do you need to issue independent instructions. A value of 0.25 means 4 independent additions per cycle while a value of 2 means 1 every 2 cycles.

The last case ADC SBB m,r/i which is add-with-carry / sub-with-borrow: memory <- register, must absolutely be avoided due to a low throughput (once every 2 cycles) and high latency (5 cycles to wait) while the other order register <- memory has thoughput and latency of 1 cycle.

(Note: we can avoid the carry case all together with lazy carry but even though throughput is 4 per cycles instead of 1 per cycle, overall it is very likely to be slower due to more memory used (and multiplication costs growing in n² with n the number of limbs) and costly reduction operations in the general case see #15 (comment))

If the API is using var parameter that means we need a temporary variable in register anyway to store the result and then copy it to the destination.

In the second case, the compiler knows that the result is already a temporary variable on the stack.

The cost is potentially more stack usage due to those temporaries as the compiler does not directly construct the result in the destination buffer (i.e. we don't have copy-elision)

{.noInit.}

Nim zero-initializes every variable before use. The C compiler can usually detect and discard redundant initializations, for example the following:

let a = 5

gets compiled into

NI a;
a = (NI)0;
a = (NI)5;

While this avoids a complete bug class, this when working with value object of size over 100 bytes (or even 576 bytes for FP12 elements in pairing) we have the following problems:

  • Does the compiler recognize and elide the zero-initialization?
  • If not, can we elide it manually?

For manual elision this can be done with {.noInit.}:
With in-place sum

var r{.noInit.}: FP12[BLS12_381]
r.sum(a, b)

However with a result value

proc `+`(a, b: Fp12) Fp12 {.noInit.} =
  discard "implementation"

let r = a + b

The {.noInit.} applies to the function implicit result value, but does the final assignation to r, zero-initialize r under-the-hood and so do we need

proc `+`(a, b: Fp12) Fp12 {.noInit.} =
  discard "implementation"

let r {.noInit.} = a + b
@mratsim
Copy link
Owner Author

mratsim commented Apr 13, 2020

Note: this requires concepts to work for result:

either the natural way

func `+`*(a, b: CubicExtAddGroup): CubicExtAddGroup =
  result.sum(a, b)

func `-`*(a, b: CubicExtAddGroup): CubicExtAddGroup =
  result.sum(a, b)

or with a workaround in the vein of

func `+`*(a, b: CubicExtAddGroup): typeof(a) =
  result.sum(a, b)

func `-`*(a, b: CubicExtAddGroup): typeof(a) =
  result.sum(a, b)

Or alternatively implement the add/neg/sub/sum/diff proc for each of the extension fields ...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant