[RFC] Some performance improvements #133
Conversation
CI failed on rust 1.0... We'll want to keep that working if possible. I actually have been working on something similar. I'll try to clean that up and push to a branch soon so we can compare/merge approaches. Do you have comparative benchmark results to share?
On Mon, Nov 16, 2015 at 01:50:04AM -0800, Josh Stone wrote:
Yeah, I just saw that...
Figures :) I'm definitely interested in seeing what you did. This is my "screw I'm also toying with inline assembly for the inner loops - mainly for the
Yeah, in the patch messages.
Just got it working with inline asm and BigDigit size == usize, if anyone wants to take a look. https://github.com/koverstreet/num/commits/asm Very hacky but it does give a nice performance boost on the multiply benchmark.
Found some new performance improvements and rolled them into the patch. The current numbers are:
OK, I pushed my stuff to cuviper:wip. I think I've got the sort of val/ref forwarding changes you were thinking of. For the most part, my changes are more incremental tweaks than what you have. My performance improvement to Add/Sub is similar to yours, I think, but I definitely didn't improve Mul as much as you did.

Take a look -- if it's not too hard for you to refactor your improvements on top of mine, that would be awesome... I think my forwarding simplifies it a bit, but let me know what you think.

As for asm stuff, that's a tricky door to open. It's a feature only for rust nightly, of course, and only for archs you implement, so there still has to be a pure-Rust fallback. I think if you want to explore this route, you'd be better off working on ramp.
On Mon, Nov 16, 2015 at 11:23:50AM -0800, Josh Stone wrote:
Yeah, I can do that. Any idea when your changes will go in?
Oh, I'm not in a rush to get the asm stuff in. But I do like my work to be used.
Is it possible to use #[cfg] to switch off the rust version? Or would we be
☔ The latest upstream changes (presumably 658637d) made this pull request unmergeable. Please resolve the merge conflicts.
There you go, I pushed my changes. Thanks homu. :) I don't know any way to
@cuviper there is none. Only way to support asm in
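For reference, the workaround being discussed is a cargo feature gate rather than a compiler-version `#[cfg]` (which did not exist). A minimal sketch, where the feature name `nightly-asm` and the helper `add_with_carry` are hypothetical illustrations, not code from this PR:

```rust
// Sketch: gate an asm fast path behind a cargo feature, keep a pure-Rust
// fallback. "nightly-asm" and add_with_carry are illustrative names only.

#[cfg(feature = "nightly-asm")]
fn add_with_carry(a: u64, b: u64, carry: &mut bool) -> u64 {
    // the nightly-only inline-asm implementation would live here
    unimplemented!()
}

#[cfg(not(feature = "nightly-asm"))]
fn add_with_carry(a: u64, b: u64, carry: &mut bool) -> u64 {
    // Pure-Rust fallback: two overflowing adds chain the carry through.
    let (s1, o1) = a.overflowing_add(b);
    let (s2, o2) = s1.overflowing_add(*carry as u64);
    *carry = o1 || o2;
    s2
}

fn main() {
    let mut carry = false;
    let s = add_with_carry(u64::MAX, 1, &mut carry);
    assert_eq!((s, carry), (0, true)); // wrapped around, carry set
    let s2 = add_with_carry(0, 0, &mut carry);
    assert_eq!((s2, carry), (1, false)); // previous carry consumed
}
```

The downside, as noted above, is that users must opt in with the right feature on the right compiler; nothing selects the asm path automatically.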
On Wed, Nov 18, 2015 at 10:06:03PM -0800, Josh Stone wrote:
Damn, that sounds like a major oversight. If I'm feeling particularly ambitious
That said - I also wrote a patch that implements all the low level arithmetic
I pushed that patch out, and also rebased everything on top of your stuff.
Let's step back a little and keep focus in this PR, and especially do what we can without breaking API. We can do more in followup PRs. So let's keep the basic add/sub/mul optimizations (first four commits). The ops against primitives can be their own PR, a lot like what #125 did for complex numbers.

The BigDigit changes constitute an API break -- which I am open to, but we'd need to bump semver, and if we do this I'd like to hide BigDigit entirely as an implementation detail. (But perhaps some of your optimizations avoiding DoubleBigDigit are still usable right now, if they still help with u32 BigDigit.)

I'll add a few more inline comments too...
#[bench]
fn divide_2(b: &mut Bencher) {
    divide_bench(b, 1 << 16, 1 << 12);
}
This last one takes painfully long to benchmark, even after all of your optimizations. Let's leave it out or #[ignore] it for now. (Then use --ignored to run it anyway.)
The new performance is impressive enough that we can keep this after all.
Thanks for the review feedback. On Sun, Nov 22, 2015 at 05:31:23PM -0800, Josh Stone wrote:
Alright, sounds good.
I do have a patch to improve divide performance - do you mind including that, to
The only thing I haven't changed per your comments is add/add2 and the way carry
Reason being in the other places where add2 is used (multiply/divide), if they
We could handle that with more wrapper functions, but... it just seems not worth
So I wouldn't mind if it's changed later but I lean towards leaving it with the
Aside from that - I incorporated all your review feedback, and pushed with just
*nod* Not used to having to consider API breakages.
BTW, what I'd really like is to come up with a good way of separating out the
And then ideally keep the DoubleBigDigit implementation, along with my BigDigit
So maybe something to consider if/when we do an API break, if you know a nice
(force-pushed 17293fd to 59f33b0)
"The benchmark is too slow? Let me improve it by 500x..." That's just scary. I feel I now need to pick over that with a fine-toothed comb, but who am I to question Knuth's methods. 😄

Regarding the carry push/pop, it's a small point, but I'm afraid I still disagree. I think we'd be better served by having two variants of

Regarding API and implementations -- technically we could do whatever we want in implementation even now, as long as the public interfaces remain.

Patch comments inbound. (Some of these may not even be for new changes, just things I didn't notice before.)
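The "two variants" idea could be sketched roughly like this (names and signatures are illustrative, not this PR's actual code): a core routine over a fixed-size slice that reports the final carry, plus a growing wrapper for the public ops.

```rust
type BigDigit = u32;

/// Core: a += b in place, assuming a.len() >= b.len();
/// returns the carry out of the top digit instead of growing `a`.
fn add2(a: &mut [BigDigit], b: &[BigDigit]) -> BigDigit {
    let mut carry: u64 = 0;
    for (i, d) in a.iter_mut().enumerate() {
        let t = *d as u64 + *b.get(i).unwrap_or(&0) as u64 + carry;
        *d = t as u32;
        carry = t >> 32;
    }
    carry as BigDigit
}

/// Wrapper: grows the vector only when the top digit overflows, so
/// callers like multiply/divide can use the core version carry-free.
fn add2_grow(a: &mut Vec<BigDigit>, b: &[BigDigit]) {
    if a.len() < b.len() {
        a.resize(b.len(), 0);
    }
    let carry = add2(a, b);
    if carry != 0 {
        a.push(carry);
    }
}

fn main() {
    let mut a = vec![u32::MAX, u32::MAX];
    add2_grow(&mut a, &[1]);
    assert_eq!(a, vec![0, 0, 1]); // carry rippled all the way up
}
```

The design point in dispute is exactly this split: whether the push/pop belongs inside the core routine or in a wrapper like `add2_grow`.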
    let y = rng.gen_bigint(ybits);
    b.iter(|| &x / &y);
}
How about some pre-iter assertions that these huge divisions are actually correct? e.g. check the round trip to multiply and add the remainder back.
On Mon, Nov 30, 2015 at 03:38:38PM -0800, Josh Stone wrote:
    let x = rng.gen_bigint(xbits);
    let y = rng.gen_bigint(ybits);
    b.iter(|| &x * &y);
}

fn divide_bench(b: &mut Bencher, xbits: usize, ybits: usize) {
    let seed: &[_] = &[1, 2, 3, 4];
    let mut rng: StdRng = SeedableRng::from_seed(seed);
    let x = rng.gen_bigint(xbits);
    let y = rng.gen_bigint(ybits);
    b.iter(|| &x / &y);
}

How about some pre-iter assertions that these huge divisions are actually correct? e.g. check the round trip to multiply and add the remainder back.
Well, then it wouldn't be just benchmarking division - I do have a test that does
that though, I didn't push it because it takes way too long to run if you don't
do a release build.
I don't know a good way of putting in these kinds of torture tests where you
want it to run "as long as you have time for", not any fixed amount of time. I
guess I could just make it a #[ignore] test, like you suggested for the other
division test.
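One way such a torture test might be structured -- in the real suite it would be marked `#[test] #[ignore]` and run on demand via `cargo test -- --ignored`; in this sketch plain `u64` stands in for `BigUint` and the iteration count is a hypothetical knob:

```rust
// Sketch: an iteration-bounded division torture test. In the test suite
// this would carry #[test] #[ignore]; u64 stands in for BigUint here.
fn divide_torture(iters: u64) {
    let mut x: u64 = 0x1234_5678_9ABC_DEF0;
    for i in 1..=iters {
        let y = (x % 1_000_000).max(1); // keep the divisor nonzero
        let (q, r) = (x / y, x % y);
        assert_eq!(q * y + r, x); // division round-trips through mul + add
        assert!(r < y);
        x = x.rotate_left(7).wrapping_add(i); // cheap pseudo-random walk
    }
}

fn main() {
    divide_torture(1_000); // a small bound keeps debug test runs fast
}
```

The "as long as you have time for" problem then reduces to picking the bound: a small default for the always-on test, and a much larger one in the `#[ignore]`d variant.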
By "pre-iter" I meant before the b.iter() call, which loops itself. Just like the gen_bigint calls are not included in the benchmark results -- this is just setup, and I'm only suggesting a quick sanity check before hammering at it. :)
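Concretely, the suggestion amounts to something like this (a sketch with fixed values and plain `u64` in place of the seeded-RNG BigInts):

```rust
// Sketch: assert the division round-trips once, outside the timed loop.
fn main() {
    let x: u64 = 0x00DE_AD0B_EEF0_CAFE;
    let y: u64 = 0x1234_5678;

    // Pre-iter sanity check -- runs once, so it costs nothing in the
    // benchmark numbers, but catches a broken division outright.
    let (q, r) = (x / y, x % y);
    assert_eq!(q * y + r, x);
    assert!(r < y);

    // The benchmark body would then follow: b.iter(|| &x / &y);
    let _ = x / y;
}
```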
    }
}

fn mul3(x: &[BigDigit], y: &[BigDigit]) -> BigUint {
Just curious, what is the 3 in mul3? But anyway, this function is only used by Mul::mul, which itself isn't doing anything else, which raises the question why this should be separate at all.
On Mon, Nov 30, 2015 at 03:42:11PM -0800, Josh Stone wrote:
        p.data.extend(repeat(0).take(len));
        mac3(&mut p.data[..], &j0.data.data[..], &j1.data.data[..]);
        p = p.normalize();
        sub2(&mut acc[b..], &p.data[..]);
    },
    Minus => {
        mac3(&mut acc[b..], &j0.data.data[..], &j1.data.data[..]);
    },
    NoSign => (),
}
}

fn mul3(x: &[BigDigit], y: &[BigDigit]) -> BigUint {

Just curious, what is the 3 in mul3? But anyway, this function is only used by Mul::mul, which itself isn't doing anything else, which raises the question why this should be separate at all.
Back when I still had an add3 and needed to distinguish, I was just using the
same convention you use in assembly for talking about ops that overwrite one of
the arguments for the result, or have another argument for the destination (e.g.
two argument add/mul that overwrites an argument for the dest, or 3 argument
that doesn't - or mac3 vs. mac4 (multiply accumulate)). It probably doesn't make
much sense anymore but it still aesthetically pleases me somehow :)
Though if we do ever (hopefully) get alloca() or some other way of moving heap
allocations to the stack, it could make sense to add a mul2() - avoid doing any
heap allocations by putting all the temporaries (including the result) on the
stack, and then return the result by copying over one of the (heap allocated)
arguments.
It does also get used in other patches I have (e.g. divide and conquer division
is able to make use of multiplying slices).
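The counting convention reads like this in practice -- a self-contained schoolbook sketch of `mac3`/`mul3` (illustrative, not this PR's actual implementation):

```rust
type BigDigit = u32;
type DoubleBigDigit = u64;

/// mac3 -- multiply-accumulate with 3 operands: acc += x * y.
/// `acc` must have room for the full product (x.len() + y.len() digits).
fn mac3(acc: &mut [BigDigit], x: &[BigDigit], y: &[BigDigit]) {
    for (i, &xi) in x.iter().enumerate() {
        let mut carry: DoubleBigDigit = 0;
        for (j, &yj) in y.iter().enumerate() {
            let t = acc[i + j] as DoubleBigDigit
                + xi as DoubleBigDigit * yj as DoubleBigDigit
                + carry;
            acc[i + j] = t as BigDigit;
            carry = t >> 32;
        }
        // ripple any remaining carry into the higher digits
        let mut k = i + y.len();
        while carry != 0 {
            let t = acc[k] as DoubleBigDigit + carry;
            acc[k] = t as BigDigit;
            carry = t >> 32;
            k += 1;
        }
    }
}

/// mul3 -- 3 "slots" counting the fresh result: returns x * y.
fn mul3(x: &[BigDigit], y: &[BigDigit]) -> Vec<BigDigit> {
    let mut acc = vec![0; x.len() + y.len()];
    mac3(&mut acc, x, y);
    acc
}

fn main() {
    // (2^32 - 1) * 2 = [0xFFFF_FFFE, 1] in little-endian base-2^32 digits
    assert_eq!(mul3(&[u32::MAX], &[2]), vec![0xFFFF_FFFE, 1]);
}
```

Note the accumulator expression cannot overflow a DoubleBigDigit: digit + digit*digit + carry tops out at exactly 2^64 - 1.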
I see, you're counting number of arguments and the result (when present). OK then.

I see your divide-and-conquer branch, thanks for not stacking that on just yet. :) We'll get this PR wrapped up first and I'll happily push a new num release for it.
On Tue, Dec 01, 2015 at 04:50:28PM -0800, Josh Stone wrote:
I see your divide-and-conquer branch, thanks for not stacking that on just yet. :)
We'll get this PR wrapped up first and I'll happily push a new num release for it.
BTW - I haven't been able to get divide-and-conquer division to even be as fast as Knuth, so I doubt I'll be pushing that out any time soon.
It turns out that division of really large numbers is in general (with any algorithm I've looked at) pretty dependent on multiplication performance - it looks like well before you get to the point where it's worth doing something fancier than Knuth's algorithm, our current bottleneck is that Karatsuba multiplication doesn't scale that high.

So probably the next thing to do, algorithm-wise, is FFT multiplication (or maybe Toom-3, but from what I was reading, if you have a good, well-optimized implementation of FFT multiplication it's debatable whether Toom-3 is worth it). But I didn't study any of the math for that, so I'm not sure I'm feeling that ambitious :)
On Mon, Nov 30, 2015 at 03:38:14PM -0800, Josh Stone wrote:
:D The code isn't fresh in my head anymore but I had a suspicion as I was working
Alright, I can do the wrapper thing.
Good point. I went and split out the implementations of the low level arithmetic
btw - the operations I have the low level arithmetic module exposing right now
Although for add with carry/subtract with borrow, my language nerd friend (who's

Re: the vector constructor, if there's actual outside demand for that kind of
We could do fast in-place conversion of vectors of arbitrary size digits to
So if you want to keep that functionality and hide BigDigit I could do that.
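The digit-size conversion being discussed might look something like this (an allocating sketch with a hypothetical name; the in-place variant hinted at above would reuse the source buffer, but the packing logic is the same):

```rust
// Sketch: pack little-endian u8 digits into little-endian u32 digits.
fn u8_to_u32_digits(src: &[u8]) -> Vec<u32> {
    let mut out = Vec::with_capacity((src.len() + 3) / 4);
    for chunk in src.chunks(4) {
        let mut d: u32 = 0;
        for (i, &byte) in chunk.iter().enumerate() {
            d |= (byte as u32) << (8 * i); // byte i sits 8*i bits up
        }
        out.push(d);
    }
    // normalize: strip high zero digits
    while out.last() == Some(&0) {
        out.pop();
    }
    out
}

fn main() {
    assert_eq!(
        u8_to_u32_digits(&[0x78, 0x56, 0x34, 0x12, 0x01]),
        vec![0x1234_5678, 1]
    );
}
```

Going the other direction (u32 to u8, or any size up to u64) is the mirror image: shift each digit apart byte by byte, then normalize.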
let (s_len, o_len) = (a.len(), b.len());
if s_len < o_len { return Less; }
if s_len > o_len { return Greater; }
Make these a/b_len too please.
The new CI failure is in num-macros vs nightly rust -- I'll take a look at that.
I wouldn't even change the size of the public
Adding vector conversions for any size u8..u64 sounds intriguing. The example in
Ideas for new PRs, not here. ;)
A few review items haven't been addressed yet:

They're all simple things, so I don't want to hold back this PR too long. If you're already on holiday, maybe I'll go ahead and tweak these myself so we can merge. :)
On Thu, Dec 10, 2015 at 10:11:57AM -0800, Josh Stone wrote:
Whoops, sorry, I got distracted with other code. I'll fix those up today :)
Add benchmarks that test multiplies/divides of different sizes
On Thu, Dec 10, 2015 at 10:11:57AM -0800, Josh Stone wrote:
Did you want something besides just testing multiply/divide against each other?
This is needed for the multiply optimizations in the next patch.
That's fine. I just want a little confidence in what we're benchmarking. :)
Right, but that torture test is slow and ignored, and it's not even testing
On Thu, Dec 10, 2015 at 01:45:41PM -0800, Josh Stone wrote:
The torture test is doing exactly what I thought you meant for what to test in
But it's ignored, for good reason, and not using the same input values as
the benchmark.
Let's not belabor this small point. If it's still unclear, leave it to me
to add later.
On Thu, Dec 10, 2015 at 02:07:47PM -0800, Josh Stone wrote:
Ok. Change it however you feel like :)
let mut ret = BigUint::from_slice(b);
sub2(&mut ret.data[..], a);
BigInt::from_biguint(Minus, ret.normalize())
},
Surely one of these branches should be Plus? If so, we have insufficient test coverage to notice this, so please add a new test for it too. That might be best as a direct sub_sign unit test, so it's not relying on particular multiplication details to reach this point.

But you had this right before, and multiplication tests still passed, so I wonder how much that calling code is being exercised. In fact, I just put a panic! in sub_sign, and it wasn't reached at all until I enabled the torture test. With that test we can also see the plain failure here.

If we reduce the iterations of that torture test, we can leave it enabled even in debug testing. Just 1000 iterations still triggers this particular issue.
On Thu, Dec 10, 2015 at 04:06:49PM -0800, Josh Stone wrote:
return ll + mm.shl_unit(half_len) + hh.shl_unit(half_len * 2);

match cmp_slice(a, b) {
    Greater => {
        let mut ret = BigUint::from_slice(a);
        sub2(&mut ret.data[..], b);
        BigInt::from_biguint(Minus, ret.normalize())
    },
    Less => {
        let mut ret = BigUint::from_slice(b);
        sub2(&mut ret.data[..], a);
        BigInt::from_biguint(Minus, ret.normalize())
    },

Surely one of these branches should be Plus? If so, we have insufficient test coverage to notice this, so please add a new test for it too. That might be best as a direct sub_sign unit test, so it's not relying on particular multiplication details to reach this point.

But you had this right before, and multiplication tests still passed, so I wonder how much that calling code is being exercised. In fact, I just put a panic! in sub_sign, and it wasn't reached at all until I enabled the torture test. With that test we can also see the plain failure here.
*swear* A bug exactly there is why I overloaded add/subtract for slices... pretty much reintroduced the same bug when I took it back out. Augh.
Yeah, part of the problem is that karatsuba multiplication doesn't become
worthwhile until the numbers are decently sized.
Does rust support gprof yet? Code coverage would be really good to have for code like this.
If we reduce the iterations of that torture test, we can leave it enabled even in debug testing. Just 1000 iterations still triggers this particular issue.
Ok yeah, good idea. Maybe I'll keep a second version that's #[ignore]d that runs lots more iterations.
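A direct unit test along the lines suggested might look like this -- a self-contained sketch with stand-ins for `sub_sign` and its helpers (the real code operates on BigUint internals), exercising the branch that the review says must return Plus:

```rust
use std::cmp::Ordering::{self, Equal, Greater, Less};

type BigDigit = u32;

#[derive(Debug, PartialEq)]
enum Sign { Minus, NoSign, Plus }

// Stand-in for cmp_slice: digits are little-endian with no high zeros.
fn cmp_slice(a: &[BigDigit], b: &[BigDigit]) -> Ordering {
    match a.len().cmp(&b.len()) {
        Equal => a.iter().rev().cmp(b.iter().rev()),
        other => other,
    }
}

// Stand-in for sub2 + normalize: a - b on magnitudes, assuming a >= b.
fn sub_mag(a: &[BigDigit], b: &[BigDigit]) -> Vec<BigDigit> {
    let mut out = a.to_vec();
    let mut borrow: i64 = 0;
    for (i, d) in out.iter_mut().enumerate() {
        let mut t = *d as i64 - *b.get(i).unwrap_or(&0) as i64 - borrow;
        borrow = 0;
        if t < 0 {
            t += 1 << 32;
            borrow = 1;
        }
        *d = t as BigDigit;
    }
    while out.last() == Some(&0) {
        out.pop();
    }
    out
}

// The point of the review comment: a > b must yield Plus, not Minus.
fn sub_sign(a: &[BigDigit], b: &[BigDigit]) -> (Sign, Vec<BigDigit>) {
    match cmp_slice(a, b) {
        Greater => (Sign::Plus, sub_mag(a, b)),
        Less => (Sign::Minus, sub_mag(b, a)),
        Equal => (Sign::NoSign, vec![]),
    }
}

fn main() {
    // Direct tests hit both sign branches without relying on
    // multiplication internals to reach this code.
    assert_eq!(sub_sign(&[5], &[3]), (Sign::Plus, vec![2]));
    assert_eq!(sub_sign(&[3], &[5]), (Sign::Minus, vec![2]));
    assert_eq!(sub_sign(&[7], &[7]), (Sign::NoSign, vec![]));
}
```

With the bug from the diff above (both branches returning Minus), the first assertion fails immediately, which is exactly the coverage the multiplication tests were missing.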
The main idea here is to do as much as possible with slices, instead of allocating new BigUints (= heap allocations). Current performance:

multiply_0: 10,507 ns/iter (+/- 987)
multiply_1: 2,788,734 ns/iter (+/- 100,079)
multiply_2: 69,923,515 ns/iter (+/- 4,550,902)

After this patch, we get:

multiply_0: 364 ns/iter (+/- 62)
multiply_1: 34,085 ns/iter (+/- 1,179)
multiply_2: 3,753,883 ns/iter (+/- 46,876)
Before:

test divide_0 ... bench: 4,058 ns/iter (+/- 255)
test divide_1 ... bench: 304,507 ns/iter (+/- 28,063)
test divide_2 ... bench: 668,293,969 ns/iter (+/- 25,383,239)

After:

test divide_0 ... bench: 874 ns/iter (+/- 71)
test divide_1 ... bench: 16,641 ns/iter (+/- 1,205)
test divide_2 ... bench: 1,336,888 ns/iter (+/- 77,450)
[RFC] Some performance improvements

I added add and subtract methods that take an operand by value, instead of by reference, so they can work in place instead of always allocating new BigInts for the result.

I'd like to hear if anyone knows a better way of doing this, though - with the various wrappers/forwardings, doing this for all the operations is going to be a mess and add a ton of boilerplate. I haven't even tried to do a complete job of it yet.

I'm starting to think the sanest thing to do would be to only have in-place implementations of most of the operations - and have the wrappers clone() if necessary. That way, the only extra work we're doing is a memcpy(), which is a hell of a lot faster than a heap allocation.
Thanks for your hard work!
Not directly, see rust-lang/rfcs#646. Apparently kcov works though, like in this tutorial.
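The "in-place implementations plus clone() in the wrappers" idea from the PR description can be sketched like so (`BigIntish` is a toy stand-in; the real forwarding in the PR is more involved):

```rust
use std::ops::Add;

#[derive(Clone, Debug, PartialEq)]
struct BigIntish(Vec<u32>); // little-endian magnitude, toy stand-in

// Single in-place implementation: consumes self and reuses its buffer.
impl Add<&BigIntish> for BigIntish {
    type Output = BigIntish;
    fn add(mut self, rhs: &BigIntish) -> BigIntish {
        if self.0.len() < rhs.0.len() {
            self.0.resize(rhs.0.len(), 0);
        }
        let mut carry: u64 = 0;
        for (i, d) in self.0.iter_mut().enumerate() {
            let t = *d as u64 + *rhs.0.get(i).unwrap_or(&0) as u64 + carry;
            *d = t as u32;
            carry = t >> 32;
        }
        if carry != 0 {
            self.0.push(carry as u32);
        }
        self
    }
}

// By-reference wrapper just clones and forwards: one memcpy, and no
// second implementation of the arithmetic to maintain.
impl Add<&BigIntish> for &BigIntish {
    type Output = BigIntish;
    fn add(self, rhs: &BigIntish) -> BigIntish {
        self.clone() + rhs
    }
}

fn main() {
    let a = BigIntish(vec![u32::MAX]);
    let b = BigIntish(vec![1]);
    assert_eq!(&a + &b, BigIntish(vec![0, 1])); // carry into a new digit
    assert_eq!(a + &b, BigIntish(vec![0, 1]));  // by-value path, in place
}
```

This is the tradeoff the description argues for: the by-reference path pays one clone (a memcpy), while the by-value path reuses the existing allocation outright.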