-
Notifications
You must be signed in to change notification settings - Fork 56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimised modular arithmetic targeting elliptic curve operations #86
Conversation
Before:
After:
|
Running benchstat:
There seems to be a significant performance hit. These are on a very specific platform, though, I wonder what the results would be on more standard platforms? |
Here are my MulMod benchmark results on amd64 and arm64, with go1.13.8 on Ubuntu 20.04.3 LTS. 6-core Ryzen 5 3600 running at 3.6 GHz (boost is off):
4-core Core i7-1065G7 running at 1.3 GHz:
4-core Cortex-A72 running at 1.5 GHz (Raspberry Pi 4):
|
on
So that also looks better than @gballet 's numbers |
@daosvik you should definitely update that, it's very old, and your benchmarks may be totally off. |
On an AWS
|
@daosvik another thing -- you're adding/changing some benchmarks. It would help if you moved those changes so that your very first commit is the benchmark-change, and after that, the commits that modify the code. That way, it's possible to compare benchmarks for "master" (master + new/modified benches) against PR, by just jumping between commits. Which is pretty difficult right now, when the bench-changes are mixed with code changes. |
Sorry about that - I almost got it right. The first commit adds an unnecessary printout of cache stats, which you can delete or comment out to generate the baseline benchmark. |
Right, so I hadn't seen that the resource-heavy go language server was running in the background. Without it, I do get results that are closer to expectation:
|
mod.go
Outdated
// return | ||
//} | ||
|
||
func shiftleft(x Int, s uint) (z Int) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need both this method and the existing function
// Lsh sets z = x << n and returns z.
func (z *Int) Lsh(x *Int, n uint) *Int {
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need both, but there was a performance improvement over Lsh on Ryzen, so I want to do some more benchmarking before removing one in favour of the other.
mod.go
Outdated
// - otherwise, the result is normalised to have non-zero most significant word | ||
// - starts with a 32-bit division, refines with newton-raphson iterations | ||
|
||
func reciprocal(m Int) (mu [5]uint64) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As this implementation is .. fairly nontrivial ... it would be good if you can add some links to the origin of the algorithm, so the astute reader can have something to verify against
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created this algorithm for this specific purpose, so there is no paper or website to refer to as the origin of the algorithm.
It's really only computing the reciprocal of the modulus with 320-bit precision, taking care to avoid overflow.
mod.go
Outdated
return | ||
} | ||
|
||
cache[cacheIndex].miss++ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this an unsafe concurrent modification, since you're only holding the readlock ? x++
is not atomic
(and if so, same a few lines above with hit++
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is unsafe, except it is not very important to have an exact result for these cache statistics.
Another benchmark -- if the cache is disabled:
And the difference between with-cache and without-cache:
|
mod.go
Outdated
|
||
// Check for reciprocal in the cache | ||
|
||
cacheIndex := (m[3]^m[2]^m[1]^m[0]) & cacheMask |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking that the entirety of the caching can be moved out of this function. So we have
func reciprocal(...){
...
if cached, ok := cache.has(m); ok{
return cached
}
...
// do calc
cache.store(m, mu)
return mu
}
Then you'd also have to define a new type:
type reciprocalCache [cacheSets]cacheSet
var cache reciprocalCache
func( c reciprocalCache) has(...) bool{
...
}
That de-clutters things a bit. It should not be a loss of speed, if things are inlined properly by the compiler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then you could also move cacheStats
and make it cache.stats()
, which is nicer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried this, but it absolutely killed performance no matter my attempts at massaging the code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really? I just pushed a change, which so far just makes the cachesstats separate. I'll look into if I can massage the second change so it doesn't affect performance
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least, my first change doesn't affect performance (I'm using BenchmarkMulMod/mod256/uint256-6
to check it)
@chfast would you mind taking a peek at this too? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only reviewed AddMod. I prefer to handle AddMod first as it is much simpler. Also benchmark changes can be merged first before functional changes.
I tried to optimize AddMod for similar cases but it made the worst case up to 2x slower. However, here the change is different so I will revisit this. https://github.com/chfast/intx/pull/206/files
I also don't understand the motivation fully. Are ADDMOD and MULMOD so slow for the verkle-tree loads so slow that the node is not able to process blocks in time? If not, is gas cost reduction also planned?
uint256.go
Outdated
|
||
if (m[3] != 0) && (m[3] >= x[3]) && (m[3] >= y[3]) { | ||
if z, overflow := z.AddOverflow(x, y); overflow { | ||
z.Sub(z, m) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not an exact equivalent, no. This code handles addition for m >= 2^192, with x,y <= m+2^192-1. This includes already-reduced values for x and y, that is 0 <= x,y < m.
uint256.go
Outdated
for { | ||
t := *z | ||
if _, overflow := t.SubOverflow(&t, m); overflow { | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How many times this loop can run?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The worst case here is m = 2^192, x = y = 2m-1, so x+y = 4m-2. Hence the loop is exited at the latest in the fourth iteration. For already-reduced inputs, this happens no later than the second iteration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment should be added to the code.
Does this procedure has any name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment added.
This procedure doesn't have any particular name, it's just repeated subtraction of the modulus until we have the smallest non-negative residue.
uint256.go
Outdated
// Fast path for m >= 2^192, with x and y at most slightly bigger than m. | ||
// This is always the case when x and y are already reduced modulo such m. | ||
|
||
if (m[3] != 0) && (m[3] >= x[3]) && (m[3] >= y[3]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This only handles happy case, but maybe if m[3] < x[3]
maybe it would be better to perform x mod m
and then also use the more efficient implementation. Same for y mod m
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy case indeed, and also the normal case for modular arithmetic. :)
The code falls back to the slower division-based code for other cases, and will always return a fully-reduced result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I can see now we cannot efficiently use SubOverflow to reduce x
here.
Yeah, I noted the same. @daosvik can you make a separate PR with the benchmark changes, and then we can rebase this one on top of that one (or on top of master after we merge it). If not, I can do it myself too. |
So this PR seeks to accelerate the specific case of doing modulo operations for a specific case, namely prime field operations used for elliptic curves. This isn't connected to the speed |
To confirm: you want to use these methods outside of EVM. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- There is a bug in
AddMod()
. You can find failing tests in tests: more AddMod() test cases #89. - The test
TestRandomMulMod
is running forever what indicates there may be also bug inMulMod()
. - @holiman, the CI testing does not run for external PR, this should be enabled.
I think this is addressed now. |
Some remaining tasks for this PR:
|
Codecov Report
@@ Coverage Diff @@
## master #86 +/- ##
==========================================
Coverage 100.00% 100.00%
==========================================
Files 3 4 +1
Lines 1047 1364 +317
==========================================
+ Hits 1047 1364 +317 |
) | ||
|
||
s = *x | ||
if _, overflow = s.SubOverflow(&s, m); overflow { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if _, overflow = s.SubOverflow(&s, m); overflow { | |
if _, overflow = s.SubOverflow(x, m); overflow { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is good, but somehow this causes the compiler (1.17) to slow down AddMod/mod192 by more than 17%
} | ||
|
||
t = *y | ||
if _, overflow = t.SubOverflow(&t, m); overflow { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if _, overflow = t.SubOverflow(&t, m); overflow { | |
if _, overflow = t.SubOverflow(y, m); overflow { |
} | ||
|
||
t = s | ||
if _, overflow = s.SubOverflow(&s, m); overflow { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can also AddOverflow and SubOverflow one after another and then select the result based on the two overflow flags. As in https://github.com/chfast/intx/pull/206/files#diff-56937962f5980bbff23d21ab8da34df2b7df07b21546691156a2e7a79d763f15R1026-R1028.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion again, but the time goes up about 30% with this
Our primary target is indeed to compute the state root faster. If the EVM can benefit from that, then all the better. |
I go-formatted the files. I'm a bit surprised that the CI linter didn't complain about this earlier. |
This PR is now rebased on If you want to keep your old branch, you can first stash it away, like |
Uh-oh. I'll have to postpone that till the code is ready, as this reduces my productivity. |
Even without the cache, I still see a speed up:
|
And also for addmod:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is OK now. Some follow-up tasks:
- Adding the
MulmodWithReciprocal
andReciprocal
methods. - We should integrate
uint256
with oss-fuzz, and let it run for a while before cutting a new release. go fmt
eventually
mod.go
Outdated
return z | ||
} | ||
|
||
//func onesCount(x *Int) (z int) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove.
mod.go
Outdated
|
||
if m[0] | m[1] | m[2] | (m[3] & (m[3]-1)) == 0 { // replace with commented code below for general case | ||
|
||
// if onesCount(m) <= 1 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove.
Haswell 4.4 GHz, go 1.16.
|
Skylake+ laptop, go 1.16
|
// and returns z, using the reciprocal of m provided as the mu parameter. | ||
// Use uint256.Reciprocal to calculate mu from m. | ||
// If m == 0, z is set to 0 (OBS: differs from the big.Int) | ||
func (z *Int) MulModWithReciprocal(x, y, m *Int, mu *[5]uint64) *Int { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is not tested at all. I think it should be enough to add it to any existing mulmod test case.
return z.Set(&r) | ||
} | ||
|
||
var ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, but I think I have found another issue. It looks like the Reciprocal()
actually requires m[3] != 0
although the documentation says otherwise. E.g. try Reciprocal(2)
. This means that this code cannot be ever reached and is useless currently.
Not sure what @holiman thinks about it, but it may be simpler to just put if m[3] == 0 { panic }
in Reciprocal()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It has been modified so that it is limited to m[3] != 0. Hence the first comment inside the function is "Note: specialized for m[3] != 0".
It's not so clear this way, so I'll adjust the comments preceding the function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still has some concerns about the API design but this can be easily changed separately later.
Thanks for good job and patience for reviewers.
// - starts with a 32-bit division, refines with newton-raphson iterations | ||
func Reciprocal(m *Int) (mu [5]uint64) { | ||
|
||
// Note: specialized for m[3] != 0 | ||
if m[3] == 0 { | ||
return mu |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is the best API design to fail silently instead of panic but it depends how this is going to be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it's not so very elegant, but this way anyone using it can choose to use Reciprocal and MulModWithReciprocal for any modulus with upto 256 bits. Support for fast operations for shorter moduli can also be added later without any change required in client code.
Thanks all! |
In elliptic curve context the x and y arguments are already reduced modulo mod. Here we can pick up similar condition and provide optimized implementation for such case. This case is 2x faster with little overhead for other cases. Based on holiman/uint256#86.
In elliptic curve context the x and y arguments are already reduced modulo mod. Here we can pick up similar condition and provide optimized implementation for such case. This case is 2x faster with little overhead for other cases. Based on holiman/uint256#86.
In elliptic curve context the x and y arguments are already reduced modulo mod. Here we can pick up similar condition and provide optimized implementation for such case. This case is 2x faster with little overhead for other cases. Based on holiman/uint256#86.
In the context of verkle trees we need to optimise modular arithmetic for elliptic curve calculations.
This PR introduces speed optimisations for moduli from 193 to 256 bits in length: