Optimised modular arithmetic targeting elliptic curve operations #86

daosvik · 2021-08-13T16:18:46Z

In the context of verkle trees we need to optimise modular arithmetic for elliptic curve calculations.

This PR introduces speed optimisations for moduli from 193 to 256 bits in length:

it does not use division when adding already-reduced operands, which is typical in applications of modular arithmetic
it uses Barrett reduction, caching the mu value (the reciprocal of the modulus) for each modulus encountered, removing the need to perform a division for each reduction

gballet · 2021-08-19T07:32:23Z

Before:

goos: freebsd
goarch: amd64
pkg: github.com/holiman/uint256
cpu: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
BenchmarkMulMod
BenchmarkMulMod/small/uint256
BenchmarkMulMod/small/uint256-8         	24091966	        51.95 ns/op
BenchmarkMulMod/small/big
BenchmarkMulMod/small/big-8             	13057272	        93.92 ns/op
BenchmarkMulMod/mod64/uint256
BenchmarkMulMod/mod64/uint256-8         	 6775998	       162.3 ns/op
BenchmarkMulMod/mod64/big
BenchmarkMulMod/mod64/big-8             	 3723324	       324.2 ns/op
BenchmarkMulMod/mod128/uint256
BenchmarkMulMod/mod128/uint256-8        	 4329858	       247.9 ns/op
BenchmarkMulMod/mod128/big
BenchmarkMulMod/mod128/big-8            	 2085337	       693.0 ns/op
BenchmarkMulMod/mod192/uint256
BenchmarkMulMod/mod192/uint256-8        	 3751779	       318.3 ns/op
BenchmarkMulMod/mod192/big
BenchmarkMulMod/mod192/big-8            	 1637002	       760.2 ns/op
BenchmarkMulMod/mod256/uint256
BenchmarkMulMod/mod256/uint256-8        	 3333724	       343.7 ns/op
BenchmarkMulMod/mod256/big
BenchmarkMulMod/mod256/big-8            	 1535760	       763.2 ns/op
PASS
ok  	github.com/holiman/uint256	15.822s

After:

goos: freebsd
goarch: amd64
pkg: github.com/holiman/uint256
cpu: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
BenchmarkMulMod
BenchmarkMulMod/small/uint256
BenchmarkMulMod/small/uint256-8         	25852512	        46.44 ns/op
BenchmarkMulMod/small/big
BenchmarkMulMod/small/big-8             	13739356	        88.47 ns/op
BenchmarkMulMod/mod64/uint256
BenchmarkMulMod/mod64/uint256-8         	 6403490	       196.1 ns/op
BenchmarkMulMod/mod64/big
BenchmarkMulMod/mod64/big-8             	 3167760	       394.4 ns/op
BenchmarkMulMod/mod128/uint256
BenchmarkMulMod/mod128/uint256-8        	 3329642	       338.7 ns/op
BenchmarkMulMod/mod128/big
BenchmarkMulMod/mod128/big-8            	 1510978	       829.5 ns/op
BenchmarkMulMod/mod192/uint256
BenchmarkMulMod/mod192/uint256-8        	 3204769	       357.8 ns/op
BenchmarkMulMod/mod192/big
BenchmarkMulMod/mod192/big-8            	 1512756	       778.7 ns/op
BenchmarkMulMod/mod256/uint256
BenchmarkMulMod/mod256/uint256-8        	 5124606	       233.6 ns/op
BenchmarkMulMod/mod256/big
BenchmarkMulMod/mod256/big-8            	 1604929	       720.6 ns/op
    benchmarks_test.go:712: Hits/Misses	4788461/1349390
PASS
ok  	github.com/holiman/uint256	17.042s

gballet · 2021-08-19T10:23:05Z

Running benchstat:

(master)> go test -v -run none -bench MulMod -count 5  -benchmem > master.txt
(master)> git checkout opt_modarith
(opt_modarith)> go test -v -run none -bench MulMod -count 5  -benchmem > daosvik.txt
> benchstat master.txt daosvik.txt 
name                     old time/op    new time/op    delta
MulMod/small/uint256-8     46.0ns ±28%    38.7ns ± 2%     ~     (p=0.063 n=5+4)
MulMod/small/big-8          115ns ± 8%     118ns ± 6%     ~     (p=0.460 n=5+5)
MulMod/mod64/uint256-8      174ns ± 8%     172ns ± 6%     ~     (p=1.000 n=5+5)
MulMod/mod64/big-8          332ns ±11%     331ns ± 4%     ~     (p=0.548 n=5+5)
MulMod/mod128/uint256-8     280ns ± 6%     282ns ± 0%     ~     (p=0.730 n=5+4)
MulMod/mod128/big-8         699ns ± 3%     755ns ±12%     ~     (p=0.056 n=5+5)
MulMod/mod192/uint256-8     287ns ± 8%     377ns ±17%  +31.49%  (p=0.008 n=5+5)
MulMod/mod192/big-8         675ns ± 3%     743ns ± 7%  +10.17%  (p=0.016 n=5+5)
MulMod/mod256/uint256-8     288ns ±10%     215ns ±12%  -25.47%  (p=0.008 n=5+5)
MulMod/mod256/big-8         659ns ± 8%     682ns ± 1%     ~     (p=0.190 n=5+4)

There seems to be a significant performance hit. These are on a very specific platform, though, I wonder what the results would be on more standard platforms?

daosvik · 2021-08-19T13:16:06Z

Here are my MulMod benchmark results on amd64 and arm64, with go1.13.8 on Ubuntu 20.04.3 LTS.

6-core Ryzen 5 3600 running at 3.6 GHz (boost is off):

MulMod/small/uint256-12     30.7ns ± 0%    31.4ns ± 0%   +2.41%  (p=0.016 n=5+4)
MulMod/small/big-12         69.9ns ± 2%    70.5ns ± 1%     ~     (p=0.254 n=5+5)
MulMod/mod64/uint256-12     93.7ns ± 1%    95.6ns ± 0%   +1.99%  (p=0.008 n=5+5)
MulMod/mod64/big-12          247ns ± 1%     248ns ± 1%     ~     (p=0.833 n=5+5)
MulMod/mod128/uint256-12     167ns ± 0%     164ns ± 0%   -1.80%  (p=0.029 n=4+4)
MulMod/mod128/big-12         453ns ± 2%     467ns ± 2%   +2.91%  (p=0.016 n=5+5)
MulMod/mod192/uint256-12     160ns ± 0%     159ns ± 1%   -0.88%  (p=0.048 n=5+5)
MulMod/mod192/big-12         414ns ± 1%     420ns ± 1%     ~     (p=0.063 n=5+5)
MulMod/mod256/uint256-12     155ns ± 0%     108ns ± 0%  -30.32%  (p=0.008 n=5+5)
MulMod/mod256/big-12         397ns ± 2%     406ns ± 2%     ~     (p=0.087 n=5+5)

4-core Core i7-1065G7 running at 1.3 GHz:

MulMod/small/uint256-8     91.3ns ± 0%    90.2ns ± 6%     ~     (p=0.690 n=5+5)
MulMod/small/big-8          132ns ± 0%     130ns ± 0%   -1.52%  (p=0.000 n=5+4)
MulMod/mod64/uint256-8      212ns ± 0%     205ns ± 0%   -3.67%  (p=0.008 n=5+5)
MulMod/mod64/big-8          337ns ± 0%     337ns ± 0%     ~     (p=1.000 n=5+5)
MulMod/mod128/uint256-8     433ns ± 0%     424ns ± 0%   -2.12%  (p=0.016 n=5+4)
MulMod/mod128/big-8         805ns ± 0%     799ns ± 0%   -0.77%  (p=0.008 n=5+5)
MulMod/mod192/uint256-8     418ns ± 1%     412ns ± 0%   -1.44%  (p=0.016 n=5+4)
MulMod/mod192/big-8         785ns ±14%     752ns ± 0%     ~     (p=0.056 n=5+5)
MulMod/mod256/uint256-8     412ns ± 0%     320ns ±11%  -22.23%  (p=0.016 n=4+5)
MulMod/mod256/big-8         719ns ± 0%     715ns ± 1%     ~     (p=0.127 n=5+5)

4-core Cortex-A72 running at 1.5 GHz (Raspberry Pi 4):

MulMod/small/uint256-4      186ns ± 0%     185ns ± 0%   -0.54%  (p=0.008 n=5+5)
MulMod/small/big-4          375ns ± 0%     375ns ± 1%     ~     (p=0.841 n=5+5)
MulMod/mod64/uint256-4      416ns ± 0%     413ns ± 0%   -0.86%  (p=0.008 n=5+5)
MulMod/mod64/big-4         1.07µs ± 1%    1.08µs ± 1%     ~     (p=0.095 n=5+5)
MulMod/mod128/uint256-4     741ns ± 0%     736ns ± 0%   -0.78%  (p=0.008 n=5+5)
MulMod/mod128/big-4        2.01µs ± 2%    2.01µs ± 2%     ~     (p=0.690 n=5+5)
MulMod/mod192/uint256-4     746ns ± 0%     742ns ± 0%   -0.56%  (p=0.016 n=5+5)
MulMod/mod192/big-4        1.89µs ± 1%    1.91µs ± 1%   +0.95%  (p=0.008 n=5+5)
MulMod/mod256/uint256-4     721ns ± 0%     612ns ± 0%  -15.17%  (p=0.008 n=5+5)
MulMod/mod256/big-4        1.78µs ± 1%    1.79µs ± 1%     ~     (p=0.690 n=5+5)

holiman · 2021-08-19T13:52:38Z

on cpu: Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz

go version
go version go1.16.4 linux/amd64

name                     old time/op    new time/op    delta
MulMod/small/uint256-6     36.8ns ± 3%    36.3ns ± 1%     ~     (p=0.206 n=5+5)
MulMod/small/big-6         70.1ns ±13%    65.9ns ± 4%     ~     (p=0.238 n=5+5)
MulMod/mod64/uint256-6      117ns ± 9%     105ns ± 3%  -10.22%  (p=0.008 n=5+5)
MulMod/mod64/big-6          238ns ± 1%     216ns ± 1%   -9.30%  (p=0.008 n=5+5)
MulMod/mod128/uint256-6     211ns ± 9%     173ns ± 0%  -18.18%  (p=0.016 n=5+4)
MulMod/mod128/big-6         572ns ±11%     504ns ± 3%  -11.96%  (p=0.032 n=5+5)
MulMod/mod192/uint256-6     195ns ± 4%     194ns ± 1%     ~     (p=0.841 n=5+5)
MulMod/mod192/big-6         563ns ± 6%     474ns ± 2%  -15.91%  (p=0.008 n=5+5)
MulMod/mod256/uint256-6     226ns ± 7%     124ns ± 3%  -45.02%  (p=0.008 n=5+5)
MulMod/mod256/big-6         518ns ± 5%     451ns ± 5%  -12.94%  (p=0.008 n=5+5)

So that also looks better than @gballet 's numbers

holiman · 2021-08-19T13:55:08Z

with go1.13.8 on Ubuntu 20.04.3 LTS.

@daosvik you should definitely update that, it's very old, and your benchmarks may be totally off.

gballet · 2021-08-19T13:57:33Z

On an AWS x5.large instance, I get:

name                     old time/op    new time/op    delta
MulMod/small/uint256-4     44.0ns ± 0%    44.1ns ± 0%   +0.35%  (p=0.008 n=5+5)
MulMod/small/big-4         74.0ns ± 0%    74.8ns ± 0%   +1.06%  (p=0.008 n=5+5)
MulMod/mod64/uint256-4      127ns ± 0%     127ns ± 0%   -0.21%  (p=0.024 n=5+5)
MulMod/mod64/big-4          243ns ± 0%     245ns ± 0%   +0.75%  (p=0.008 n=5+5)
MulMod/mod128/uint256-4     207ns ± 0%     208ns ± 0%   +0.45%  (p=0.016 n=4+5)
MulMod/mod128/big-4         516ns ± 0%     520ns ± 0%   +0.73%  (p=0.008 n=5+5)
MulMod/mod192/uint256-4     212ns ± 0%     212ns ± 0%     ~     (p=0.087 n=5+5)
MulMod/mod192/big-4         493ns ± 0%     497ns ± 0%   +0.82%  (p=0.008 n=5+5)
MulMod/mod256/uint256-4     212ns ± 0%     132ns ± 0%  -37.80%  (p=0.000 n=5+4)
MulMod/mod256/big-4         466ns ± 0%     470ns ± 0%   +0.79%  (p=0.008 n=5+5)

holiman · 2021-08-19T14:00:05Z

@daosvik another thing -- you're adding/changing some benchmarks. It would help if you moved those changes so that your very first commit is the benchmark-change, and after that, the commits that modify the code.

That way, it's possible to compare benchmarks for "master" (master + new/modified benches) against PR, by just jumping between commits. Which is pretty difficult right now, when the bench-changes are mixed with code changes.

daosvik · 2021-08-19T14:04:28Z

@daosvik another thing -- you're adding/changing some benchmarks. It would help if you moved those changes so that your very first commit is the benchmark-change, and after that, the commits that modify the code.

That way, it's possible to compare benchmarks for "master" (master + new/modified benches) against PR, by just jumping between commits. Which is pretty difficult right now, when the bench-changes are mixed with code changes.

Sorry about that - I almost got it right. The first commit adds an unnecessary printout of cache stats, which you can delete or comment out to generate the baseline benchmark.

gballet · 2021-08-19T14:06:41Z

Right, so I hadn't seen that the resource-heavy go language server was running in the background. Without it, I do get results that are closer to expectation:

name                     old time/op    new time/op    delta
MulMod/small/uint256-8     52.2ns ±24%    52.5ns ±43%     ~     (p=1.000 n=5+5)
MulMod/small/big-8          113ns ± 5%     113ns ± 8%     ~     (p=1.000 n=5+5)
MulMod/mod64/uint256-8      168ns ± 9%     172ns ±13%     ~     (p=0.841 n=5+5)
MulMod/mod64/big-8          335ns ± 7%     364ns ±23%     ~     (p=0.222 n=5+5)
MulMod/mod128/uint256-8     278ns ± 4%     275ns ±18%     ~     (p=0.310 n=5+5)
MulMod/mod128/big-8         722ns ± 7%     738ns ±11%     ~     (p=0.690 n=5+5)
MulMod/mod192/uint256-8     291ns ± 6%     310ns ± 4%     ~     (p=0.095 n=5+5)
MulMod/mod192/big-8         685ns ± 7%     659ns ± 4%     ~     (p=0.548 n=5+5)
MulMod/mod256/uint256-8     295ns ± 8%     192ns ± 2%  -35.01%  (p=0.008 n=5+5)
MulMod/mod256/big-8         644ns ±10%     657ns ±13%     ~     (p=0.548 n=5+5)

mod.go

holiman · 2021-08-19T14:21:25Z

mod.go

+//	return
+//}
+
+func shiftleft(x Int, s uint) (z Int) {


Do we need both this method and the existing function

// Lsh sets z = x << n and returns z. func (z *Int) Lsh(x *Int, n uint) *Int {

?

We don't need both, but there was a performance improvement over Lsh on Ryzen, so I want to do some more benchmarking before removing one in favour of the other.

mod.go

holiman · 2021-08-19T14:31:38Z

mod.go

+// - otherwise, the result is normalised to have non-zero most significant word
+// - starts with a 32-bit division, refines with newton-raphson iterations
+
+func reciprocal(m Int) (mu [5]uint64) {


As this implementation is .. fairly nontrivial ... it would be good if you can add some links to the origin of the algorithm, so the astute reader can have something to verify against

I created this algorithm for this specific purpose, so there is no paper or website to refer to as the origin of the algorithm.
It's really only computing the reciprocal of the modulus with 320-bit precision, taking care to avoid overflow.

holiman · 2021-08-19T14:34:20Z

mod.go

+		return
+	}
+
+	cache[cacheIndex].miss++


Isn't this an unsafe concurrent modification, since you're only holding the readlock ? x++ is not atomic

(and if so, same a few lines above with hit++)

Yes, it is unsafe, except it is not very important to have an exact result for these cache statistics.

holiman · 2021-08-19T14:45:40Z

Another benchmark -- if the cache is disabled:

name                     old time/op    new time/op    delta
MulMod/small/uint256-6     36.8ns ± 3%    36.6ns ± 3%     ~     (p=0.310 n=5+5)
MulMod/small/big-6         70.1ns ±13%    65.7ns ± 1%     ~     (p=0.310 n=5+5)
MulMod/mod64/uint256-6      117ns ± 9%     106ns ± 2%  -10.12%  (p=0.008 n=5+5)
MulMod/mod64/big-6          238ns ± 1%     220ns ± 5%   -7.70%  (p=0.008 n=5+5)
MulMod/mod128/uint256-6     211ns ± 9%     171ns ± 0%  -18.93%  (p=0.016 n=5+4)
MulMod/mod128/big-6         572ns ±11%     504ns ± 2%  -11.85%  (p=0.016 n=5+5)
MulMod/mod192/uint256-6     195ns ± 4%     196ns ± 2%     ~     (p=0.548 n=5+5)
MulMod/mod192/big-6         563ns ± 6%     483ns ± 5%  -14.20%  (p=0.008 n=5+5)
MulMod/mod256/uint256-6     226ns ± 7%     182ns ± 2%  -19.40%  (p=0.008 n=5+5)
MulMod/mod256/big-6         518ns ± 5%     451ns ± 3%  -12.81%  (p=0.008 n=5+5)

And the difference between with-cache and without-cache:

[user@work uint256]$ benchstat daosvik.txt daosvik_nocache.txt 
name                     old time/op    new time/op    delta
MulMod/small/uint256-6     36.3ns ± 1%    36.6ns ± 3%     ~     (p=0.286 n=5+5)
MulMod/small/big-6         65.9ns ± 4%    65.7ns ± 1%     ~     (p=1.000 n=5+5)
MulMod/mod64/uint256-6      105ns ± 3%     106ns ± 2%     ~     (p=0.452 n=5+5)
MulMod/mod64/big-6          216ns ± 1%     220ns ± 5%     ~     (p=0.167 n=5+5)
MulMod/mod128/uint256-6     173ns ± 0%     171ns ± 0%   -0.92%  (p=0.029 n=4+4)
MulMod/mod128/big-6         504ns ± 3%     504ns ± 2%     ~     (p=0.690 n=5+5)
MulMod/mod192/uint256-6     194ns ± 1%     196ns ± 2%     ~     (p=0.548 n=5+5)
MulMod/mod192/big-6         474ns ± 2%     483ns ± 5%     ~     (p=0.310 n=5+5)
MulMod/mod256/uint256-6     124ns ± 3%     182ns ± 2%  +46.58%  (p=0.008 n=5+5)
MulMod/mod256/big-6         451ns ± 5%     451ns ± 3%     ~     (p=1.000 n=5+5)

mod.go

holiman · 2021-08-19T15:10:01Z

mod.go

+
+	// Check for reciprocal in the cache
+
+	cacheIndex := (m[3]^m[2]^m[1]^m[0]) & cacheMask


I'm thinking that the entirety of the caching can be moved out of this function. So we have

func reciprocal(...){ ... if cached, ok := cache.has(m); ok{ return cached } ... // do calc cache.store(m, mu) return mu }

Then you'd also have to define a new type:

type reciprocalCache [cacheSets]cacheSet var cache reciprocalCache func( c reciprocalCache) has(...) bool{ ... }

That de-clutters things a bit. It should not be a loss of speed, if things are inlined properly by the compiler.

Then you could also move cacheStats and make it cache.stats(), which is nicer

I tried this, but it absolutely killed performance no matter my attempts at massaging the code

Really? I just pushed a change, which so far just makes the cachesstats separate. I'll look into if I can massage the second change so it doesn't affect performance

At least, my first change doesn't affect performance (I'm using BenchmarkMulMod/mod256/uint256-6 to check it)

holiman · 2021-08-20T13:02:28Z

@chfast would you mind taking a peek at this too?

chfast

I only reviewed AddMod. I prefer to handle AddMod first as it is much simpler. Also benchmark changes can be merged first before functional changes.

I tried to optimize AddMod for similar cases but it made the worst case up to 2x slower. However, here the change is different so I will revisit this. https://github.com/chfast/intx/pull/206/files

I also don't understand the motivation fully. Are ADDMOD and MULMOD so slow for the verkle-tree loads so slow that the node is not able to process blocks in time? If not, is gas cost reduction also planned?

chfast · 2021-08-23T07:02:28Z

uint256.go

+
+	if (m[3] != 0) && (m[3] >= x[3]) && (m[3] >= y[3]) {
+		if z, overflow := z.AddOverflow(x, y); overflow {
+			z.Sub(z, m)


Is this equivalent of https://github.com/chfast/intx/pull/206/files#diff-56937962f5980bbff23d21ab8da34df2b7df07b21546691156a2e7a79d763f15R1111-R1113?

Not an exact equivalent, no. This code handles addition for m >= 2^192, with x,y <= m+2^192-1. This includes already-reduced values for x and y, that is 0 <= x,y < m.

chfast · 2021-08-23T07:02:48Z

uint256.go

+		for {
+			t := *z
+			if _, overflow := t.SubOverflow(&t, m); overflow {
+				break


How many times this loop can run?

The worst case here is m = 2^192, x = y = 2m-1, so x+y = 4m-2. Hence the loop is exited at the latest in the fourth iteration. For already-reduced inputs, this happens no later than the second iteration.

This comment should be added to the code.

Does this procedure has any name?

Comment added.
This procedure doesn't have any particular name, it's just repeated subtraction of the modulus until we have the smallest non-negative residue.

chfast · 2021-08-23T07:04:57Z

uint256.go

+	// Fast path for m >= 2^192, with x and y at most slightly bigger than m.
+	// This is always the case when x and y are already reduced modulo such m.
+
+	if (m[3] != 0) && (m[3] >= x[3]) && (m[3] >= y[3]) {


This only handles happy case, but maybe if m[3] < x[3] maybe it would be better to perform x mod m and then also use the more efficient implementation. Same for y mod m.

Happy case indeed, and also the normal case for modular arithmetic. :)

The code falls back to the slower division-based code for other cases, and will always return a fully-reduced result.

Ok, I can see now we cannot efficiently use SubOverflow to reduce x here.

uint256.go

holiman · 2021-08-23T07:52:57Z

Also benchmark changes can be merged first before functional changes.

Yeah, I noted the same. @daosvik can you make a separate PR with the benchmark changes, and then we can rebase this one on top of that one (or on top of master after we merge it). If not, I can do it myself too.

gballet · 2021-08-23T16:18:54Z

I also don't understand the motivation fully. Are ADDMOD and MULMOD so slow for the verkle-tree loads so slow that the node is not able to process blocks in time? If not, is gas cost reduction also planned?

So this PR seeks to accelerate the specific case of doing modulo operations for a specific case, namely prime field operations used for elliptic curves. This isn't connected to the speed ADDMOD and MULMOD EVM instructions, but the calculations of the root commitment (equivalent to the root hash), so it can not really be addressed with gas costs.

chfast · 2021-08-23T18:14:36Z

I also don't understand the motivation fully. Are ADDMOD and MULMOD so slow for the verkle-tree loads so slow that the node is not able to process blocks in time? If not, is gas cost reduction also planned?

So this PR seeks to accelerate the specific case of doing modulo operations for a specific case, namely prime field operations used for elliptic curves. This isn't connected to the speed ADDMOD and MULMOD EVM instructions, but the calculations of the root commitment (equivalent to the root hash), so it can not really be addressed with gas costs.

To confirm: you want to use these methods outside of EVM.

chfast

There is a bug in AddMod(). You can find failing tests in tests: more AddMod() test cases #89.
The test TestRandomMulMod is running forever what indicates there may be also bug in MulMod().
@holiman, the CI testing does not run for external PR, this should be enabled.

holiman · 2021-08-24T12:00:02Z

@holiman, the CI testing does not run for external PR, this should be enabled.

I think this is addressed now.

holiman · 2021-08-24T12:30:31Z

Some remaining tasks for this PR:

Rebase on top of latest master (benchmarks are now in master)
Move the caching implementation so it's not part of reciprocal
Look into the test failures which @chfast reported

codecov · 2021-08-26T01:03:41Z

Codecov Report

Merging #86 (295482d) into master (71c9c37) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##            master       #86    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files            3         4     +1     
  Lines         1047      1364   +317     
==========================================
+ Hits          1047      1364   +317

chfast · 2021-08-26T13:14:24Z

uint256.go

+		)
+
+		s = *x
+		if _, overflow = s.SubOverflow(&s, m); overflow {


Suggested change

if _, overflow = s.SubOverflow(&s, m); overflow {

if _, overflow = s.SubOverflow(x, m); overflow {

The idea is good, but somehow this causes the compiler (1.17) to slow down AddMod/mod192 by more than 17%

chfast · 2021-08-26T13:14:44Z

uint256.go

+		}
+
+		t = *y
+		if _, overflow = t.SubOverflow(&t, m); overflow {


Suggested change

if _, overflow = t.SubOverflow(&t, m); overflow {

if _, overflow = t.SubOverflow(y, m); overflow {

uint256.go

chfast · 2021-08-26T13:51:36Z

uint256.go

+		}
+
+		t = s
+		if _, overflow = s.SubOverflow(&s, m); overflow {


You can also AddOverflow and SubOverflow one after another and then select the result based on the two overflow flags. As in https://github.com/chfast/intx/pull/206/files#diff-56937962f5980bbff23d21ab8da34df2b7df07b21546691156a2e7a79d763f15R1026-R1028.

Good suggestion again, but the time goes up about 30% with this

gballet · 2021-08-26T14:10:18Z

To confirm: you want to use these methods outside of EVM.

Our primary target is indeed to compute the state root faster. If the EVM can benefit from that, then all the better.

holiman · 2021-08-27T09:02:35Z

I go-formatted the files. I'm a bit surprised that the CI linter didn't complain about this earlier.

holiman · 2021-08-27T09:36:54Z

This PR is now rebased on master. @daosvik , you'll need to git pull, which will tell you it can't be applied, then do a git reset --hard origin/opt_modarith .

If you want to keep your old branch, you can first stash it away, like git checkout -b old_op_modarith; git checkout opt_modarith

daosvik · 2021-08-27T09:57:14Z

I go-formatted the files. I'm a bit surprised that the CI linter didn't complain about this earlier.

Uh-oh. I'll have to postpone that till the code is ready, as this reduces my productivity.

holiman · 2021-09-21T07:20:14Z

Even without the cache, I still see a speed up:

name                     old time/op  new time/op  delta
MulMod/small/uint256-6   32.0ns ± 1%  32.4ns ± 4%     ~     (p=0.064 n=8+10)
MulMod/mod64/uint256-6   74.3ns ± 1%  75.7ns ± 1%   +1.87%  (p=0.000 n=8+9)
MulMod/mod128/uint256-6   122ns ± 2%   121ns ± 3%     ~     (p=0.092 n=9+8)
MulMod/mod192/uint256-6   149ns ± 4%   148ns ± 0%     ~     (p=0.813 n=9+9)
MulMod/mod256/uint256-6   187ns ± 5%   160ns ± 9%  -14.26%  (p=0.000 n=9+10)
MulMod/small/big-6       61.3ns ±17%  57.2ns ± 1%     ~     (p=0.780 n=10+9)
MulMod/mod64/big-6        112ns ± 9%   113ns ± 6%     ~     (p=0.725 n=10+10)
MulMod/mod128/big-6       365ns ± 7%   357ns ± 2%     ~     (p=0.182 n=10+9)
MulMod/mod192/big-6       408ns ± 1%   416ns ± 6%     ~     (p=0.779 n=9+9)
MulMod/mod256/big-6       473ns ± 4%   485ns ± 6%     ~     (p=0.280 n=10+10)

holiman · 2021-09-21T07:29:52Z

And also for addmod:

[user@work uint256]$ benchstat addmod_before.txt addmod_after.txt 
name                     old time/op  new time/op  delta
AddMod/small/uint256-6   12.0ns ± 2%  12.2ns ± 2%   +1.65%  (p=0.016 n=10+9)
AddMod/mod64/uint256-6   19.3ns ± 3%  19.4ns ± 0%     ~     (p=0.055 n=9+8)
AddMod/mod128/uint256-6  43.3ns ± 2%  42.0ns ± 6%     ~     (p=0.661 n=9+10)
AddMod/mod192/uint256-6  46.6ns ± 1%  46.9ns ± 1%   +0.59%  (p=0.023 n=9+9)
AddMod/mod256/uint256-6  49.8ns ± 1%  19.8ns ± 4%  -60.19%  (p=0.000 n=9+10)
AddMod/small/big-6       44.7ns ± 1%  43.4ns ± 1%   -2.92%  (p=0.000 n=9+9)
AddMod/mod64/big-6       54.8ns ± 1%  53.6ns ± 1%   -2.14%  (p=0.000 n=9+9)
AddMod/mod128/big-6       143ns ± 3%   141ns ± 1%   -1.87%  (p=0.001 n=10+9)
AddMod/mod192/big-6       144ns ± 2%   143ns ± 2%     ~     (p=0.305 n=10+9)
AddMod/mod256/big-6       144ns ± 1%   143ns ± 1%     ~     (p=0.868 n=9+8)

holiman

I think this is OK now. Some follow-up tasks:

Adding the MulmodWithReciprocal and Reciprocal methods.
We should integrate uint256 with oss-fuzz, and let it run for a while before cutting a new release.
go fmt eventually

chfast · 2021-10-22T15:32:09Z

mod.go

+	return z
+}
+
+//func onesCount(x *Int) (z int) {


chfast · 2021-10-22T15:33:20Z

mod.go

+
+	if m[0] | m[1] | m[2] | (m[3] & (m[3]-1)) == 0 { // replace with commented code below for general case
+
+//	if onesCount(m) <= 1 {


chfast · 2021-10-22T16:07:42Z

Haswell 4.4 GHz, go 1.16.

name                     old time/op  new time/op  delta
Mod/small/uint256-8      12.8ns ± 0%  12.7ns ± 0%   -0.49%  (p=0.000 n=8+9)
Mod/small/big-8          42.3ns ± 1%  42.9ns ± 1%   +1.35%  (p=0.000 n=10+9)
Mod/mod64/uint256-8      59.8ns ± 1%  59.8ns ± 0%     ~     (p=0.796 n=9+10)
Mod/mod64/big-8           110ns ± 1%   109ns ± 1%   -0.46%  (p=0.048 n=10+10)
Mod/mod128/uint256-8     85.7ns ± 1%  85.9ns ± 1%     ~     (p=0.243 n=9+10)
Mod/mod128/big-8          242ns ± 1%   242ns ± 1%     ~     (p=0.643 n=10+10)
Mod/mod192/uint256-8     80.2ns ± 0%  80.1ns ± 0%     ~     (p=0.651 n=9+9)
Mod/mod192/big-8          220ns ± 1%   219ns ± 1%     ~     (p=0.306 n=10+10)
Mod/mod256/uint256-8     68.4ns ± 0%  68.3ns ± 0%   -0.15%  (p=0.037 n=9+9)
Mod/mod256/big-8          189ns ± 1%   188ns ± 0%     ~     (p=0.237 n=10+10)
AddMod/small/uint256-8   13.4ns ± 0%  13.6ns ± 0%   +1.68%  (p=0.000 n=9+8)
AddMod/small/big-8       42.1ns ± 3%  41.9ns ± 2%     ~     (p=0.897 n=10+10)
AddMod/mod64/uint256-8   20.5ns ± 0%  20.7ns ± 0%   +1.09%  (p=0.000 n=9+9)
AddMod/mod64/big-8       50.9ns ± 3%  50.5ns ± 1%     ~     (p=0.643 n=10+10)
AddMod/mod128/uint256-8  39.3ns ± 0%  39.7ns ± 0%   +0.84%  (p=0.000 n=8+9)
AddMod/mod128/big-8       120ns ± 1%   119ns ± 1%   -0.63%  (p=0.012 n=10+10)
AddMod/mod192/uint256-8  42.4ns ± 0%  42.8ns ± 0%   +0.89%  (p=0.000 n=9+10)
AddMod/mod192/big-8       122ns ± 0%   122ns ± 1%     ~     (p=0.524 n=8+10)
AddMod/mod256/uint256-8  45.1ns ± 0%  15.8ns ± 0%  -65.01%  (p=0.000 n=8+9)
AddMod/mod256/big-8       124ns ± 0%   124ns ± 1%     ~     (p=0.614 n=10+10)
MulMod/small/uint256-8   33.0ns ± 0%  33.1ns ± 0%   +0.29%  (p=0.000 n=8+8)
MulMod/small/big-8       55.9ns ± 1%  55.7ns ± 1%     ~     (p=0.065 n=10+10)
MulMod/mod64/uint256-8   70.1ns ± 0%  70.3ns ± 1%   +0.38%  (p=0.006 n=8+10)
MulMod/mod64/big-8        101ns ± 1%   101ns ± 1%     ~     (p=0.231 n=9+10)
MulMod/mod128/uint256-8   104ns ± 0%   105ns ± 1%   +0.81%  (p=0.000 n=9+10)
MulMod/mod128/big-8       297ns ± 1%   296ns ± 1%     ~     (p=0.403 n=10+10)
MulMod/mod192/uint256-8   127ns ± 0%   127ns ± 0%     ~     (p=0.297 n=8+9)
MulMod/mod192/big-8       342ns ± 1%   341ns ± 1%     ~     (p=0.285 n=10+9)
MulMod/mod256/uint256-8   160ns ± 0%   149ns ± 0%   -6.46%  (p=0.001 n=8+6)
MulMod/mod256/big-8       390ns ± 1%   388ns ± 1%   -0.44%  (p=0.008 n=10+10)

chfast · 2021-10-22T16:10:06Z

Skylake+ laptop, go 1.16

name                     old time/op  new time/op  delta
Mod/small/uint256-8      12.0ns ± 1%  12.1ns ± 0%   +0.69%  (p=0.005 n=9+9)
Mod/small/big-8          41.1ns ± 1%  42.5ns ± 4%   +3.40%  (p=0.001 n=9+10)
Mod/mod64/uint256-8      56.6ns ± 0%  57.2ns ± 2%     ~     (p=0.106 n=8+10)
Mod/mod64/big-8           109ns ± 6%   107ns ± 3%     ~     (p=0.109 n=10+10)
Mod/mod128/uint256-8     85.9ns ± 1%  86.2ns ± 1%     ~     (p=0.315 n=10+10)
Mod/mod128/big-8          236ns ± 2%   233ns ± 1%     ~     (p=0.093 n=10+10)
Mod/mod192/uint256-8     81.5ns ± 2%  82.9ns ± 4%     ~     (p=0.138 n=10+10)
Mod/mod192/big-8          214ns ± 2%   215ns ± 2%     ~     (p=0.060 n=10+10)
Mod/mod256/uint256-8     68.7ns ± 0%  69.0ns ± 1%     ~     (p=0.305 n=10+8)
Mod/mod256/big-8          189ns ± 2%   193ns ± 2%   +1.98%  (p=0.002 n=10+10)
AddMod/small/uint256-8   13.1ns ± 1%  14.1ns ± 1%   +7.86%  (p=0.000 n=10+9)
AddMod/small/big-8       41.6ns ± 2%  41.9ns ± 2%     ~     (p=0.148 n=10+10)
AddMod/mod64/uint256-8   19.9ns ± 1%  20.5ns ± 1%   +2.65%  (p=0.000 n=10+10)
AddMod/mod64/big-8       49.5ns ± 1%  49.8ns ± 1%     ~     (p=0.128 n=10+10)
AddMod/mod128/uint256-8  39.1ns ± 0%  39.3ns ± 0%   +0.47%  (p=0.000 n=8+10)
AddMod/mod128/big-8       117ns ± 1%   124ns ± 5%   +5.49%  (p=0.000 n=10+10)
AddMod/mod192/uint256-8  42.1ns ± 0%  42.3ns ± 1%     ~     (p=0.062 n=9+10)
AddMod/mod192/big-8       120ns ± 3%   122ns ± 2%   +2.04%  (p=0.002 n=10+10)
AddMod/mod256/uint256-8  44.9ns ± 0%  16.9ns ± 1%  -62.41%  (p=0.000 n=10+10)
AddMod/mod256/big-8       121ns ± 1%   123ns ± 2%   +1.68%  (p=0.000 n=8+10)
MulMod/small/uint256-8   32.5ns ± 1%  32.8ns ± 1%   +0.88%  (p=0.006 n=10+10)
MulMod/small/big-8       54.9ns ± 2%  54.8ns ± 1%     ~     (p=0.971 n=10+10)
MulMod/mod64/uint256-8   69.0ns ± 1%  69.4ns ± 1%   +0.59%  (p=0.035 n=10+10)
MulMod/mod64/big-8        100ns ± 2%   102ns ± 5%     ~     (p=0.055 n=8+10)
MulMod/mod128/uint256-8   105ns ± 1%   106ns ± 0%   +0.31%  (p=0.048 n=10+10)
MulMod/mod128/big-8       291ns ± 2%   290ns ± 3%     ~     (p=0.251 n=9+10)
MulMod/mod192/uint256-8   128ns ± 0%   127ns ± 0%   -0.50%  (p=0.001 n=9+9)
MulMod/mod192/big-8       334ns ± 1%   332ns ± 2%     ~     (p=0.474 n=9+9)
MulMod/mod256/uint256-8   156ns ± 0%   131ns ± 0%  -16.04%  (p=0.000 n=7+10)
MulMod/mod256/big-8       378ns ± 1%   381ns ± 3%     ~     (p=0.605 n=9+9)

chfast · 2021-10-22T16:17:55Z

uint256.go

+// and returns z, using the reciprocal of m provided as the mu parameter.
+// Use uint256.Reciprocal to calculate mu from m.
+// If m == 0, z is set to 0 (OBS: differs from the big.Int)
+func (z *Int) MulModWithReciprocal(x, y, m *Int, mu *[5]uint64) *Int {


This function is not tested at all. I think it should be enough to add it to any existing mulmod test case.

chfast · 2021-10-23T14:06:26Z

uint256.go

+		return z.Set(&r)
+	}
+
+	var (


Sorry, but I think I have found another issue. It looks like the Reciprocal() actually requires m[3] != 0 although the documentation says otherwise. E.g. try Reciprocal(2). This means that this code cannot be ever reached and is useless currently.

Not sure what @holiman thinks about it, but it may be simpler to just put if m[3] == 0 { panic } in Reciprocal().

It has been modified so that it is limited to m[3] != 0. Hence the first comment inside the function is "Note: specialized for m[3] != 0".
It's not so clear this way, so I'll adjust the comments preceding the function.

…s m[3] != 0

chfast

I still has some concerns about the API design but this can be easily changed separately later.

Thanks for good job and patience for reviewers.

chfast · 2021-10-24T12:59:57Z

mod.go

 // - starts with a 32-bit division, refines with newton-raphson iterations
 func Reciprocal(m *Int) (mu [5]uint64) {

-	// Note: specialized for m[3] != 0
+	if m[3] == 0 {
+		return mu


I'm not sure this is the best API design to fail silently instead of panic but it depends how this is going to be used.

I agree it's not so very elegant, but this way anyone using it can choose to use Reciprocal and MulModWithReciprocal for any modulus with upto 256 bits. Support for fast operations for shorter moduli can also be added later without any change required in client code.

holiman · 2021-10-25T06:58:46Z

Thanks all!

In elliptic curve context the x and y arguments are already reduced modulo mod. Here we can pick up similar condition and provide optimized implementation for such case. This case is 2x faster with little overhead for other cases. Based on holiman/uint256#86.

holiman reviewed Aug 19, 2021

View reviewed changes

daosvik marked this pull request as ready for review August 19, 2021 14:58

holiman reviewed Aug 19, 2021

View reviewed changes

chfast reviewed Aug 23, 2021

View reviewed changes

daosvik mentioned this pull request Aug 23, 2021

benchmarks: use least non negative residue of the addend for each modulus when testing AddMod #88

Merged

chfast requested changes Aug 24, 2021

View reviewed changes

chfast reviewed Aug 26, 2021

View reviewed changes

holiman force-pushed the opt_modarith branch from e051979 to 8607f4a Compare August 27, 2021 09:34

daosvik added 7 commits September 20, 2021 01:17

Remove automatic cache

6ee98f7

Pass modulus and reciprocal by reference instead of value

6c954e8

Make reciprocal() specialized for m[3] != 0 (modulus >= 2^193)

3002c1b

Add two corner case tests for reciprocal

7fdcd68

Remove unreachable safeguard code in reciprocal

74b3e8b

Merge branch 'holiman:master' into opt_modarith

54cdbb6

Comment out unused onesCount

15758be

holiman approved these changes Sep 21, 2021

View reviewed changes

daosvik added 5 commits September 21, 2021 21:06

Make leadingZeros faster

0c7b623

Simplify detection of whether a modulus is a power of 2

154f050

Speed up the end of reciprocal

b15b515

Add MulModWithReciprocal, export Reciprocal

0c7b52c

Pass reciprocal as pointer parameter to MulModWithReciprocal

17235df

chfast reviewed Oct 22, 2021

View reviewed changes

chfast requested changes Oct 22, 2021

View reviewed changes

daosvik added 2 commits October 22, 2021 18:48

Remove unused func onesCount

81c7df1

Add testing of MulModWithReciprocal

a8c1be9

chfast reviewed Oct 23, 2021

View reviewed changes

daosvik and others added 2 commits October 24, 2021 02:22

Adjust comments and tests to reflect that Reciprocal now only support…

baf4cd6

…s m[3] != 0

Also test MulModWithReciprocal on regular test cases

295482d

chfast approved these changes Oct 24, 2021

View reviewed changes

holiman merged commit 38e2c56 into holiman:master Oct 25, 2021


		// Check for reciprocal in the cache

		cacheIndex := (m[3]^m[2]^m[1]^m[0]) & cacheMask

	if _, overflow = s.SubOverflow(&s, m); overflow {
	if _, overflow = s.SubOverflow(x, m); overflow {

	if _, overflow = t.SubOverflow(&t, m); overflow {
	if _, overflow = t.SubOverflow(y, m); overflow {


		if m[0] \| m[1] \| m[2] \| (m[3] & (m[3]-1)) == 0 { // replace with commented code below for general case

		// if onesCount(m) <= 1 {

Optimised modular arithmetic targeting elliptic curve operations #86

Optimised modular arithmetic targeting elliptic curve operations #86

Conversation

daosvik commented Aug 13, 2021 • edited Loading

gballet commented Aug 19, 2021

gballet commented Aug 19, 2021

daosvik commented Aug 19, 2021

holiman commented Aug 19, 2021 • edited Loading

holiman commented Aug 19, 2021

gballet commented Aug 19, 2021

holiman commented Aug 19, 2021

daosvik commented Aug 19, 2021

gballet commented Aug 19, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

holiman commented Aug 19, 2021

holiman Aug 19, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

holiman commented Aug 20, 2021

chfast left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

holiman commented Aug 23, 2021

gballet commented Aug 23, 2021

chfast commented Aug 23, 2021

chfast left a comment

Choose a reason for hiding this comment

holiman commented Aug 24, 2021

holiman commented Aug 24, 2021

codecov bot commented Aug 26, 2021 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gballet commented Aug 26, 2021

holiman commented Aug 27, 2021

holiman commented Aug 27, 2021

daosvik commented Aug 27, 2021

holiman commented Sep 21, 2021

holiman commented Sep 21, 2021

holiman left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

chfast commented Oct 22, 2021

chfast commented Oct 22, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

chfast left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

holiman commented Oct 25, 2021

daosvik commented Aug 13, 2021 •

edited

Loading

holiman commented Aug 19, 2021 •

edited

Loading

holiman Aug 19, 2021 •

edited

Loading

chfast left a comment •

edited

Loading

codecov bot commented Aug 26, 2021 •

edited

Loading