Use the _addcarry and _subborrow intrinsics when available #141

Merged: 7 commits, Nov 2, 2020


ejmahler
Contributor

When compiling for x86_64 with "u64_digit" enabled, some benchmarks are improved by using _addcarry_u64 instead of the custom-written adc function, and _subborrow_u64 instead of the custom-written sbb function.

The fib and fib2 benchmarks improved the most, most benchmarks improved a little, and a few were worse within the margin of error.

The only benchmark that did legitimately worse was the gcd_euclid family, but there's a comment after those benchmarks saying "// Integer for BigUint now uses Stein for gcd". The Stein benchmarks showed improvements with this change.

Looking at the generated assembly, the compiler was generating adcq instructions both before and after the change, but post-change the code using adc is a little shorter. It's possible that the intrinsic provided just enough of a hint to the compiler that it was able to optimize some things away. The compiler wasn't generating sbb instructions at all, so this adds them, and one nice thing is that this change eliminates signed->unsigned conversions.
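
For readers who have not seen the two approaches side by side, here is a minimal sketch of what replacing a hand-written adc/sbb with the intrinsics might look like. The wrapper signatures and the portable fallback are illustrative assumptions, not the crate's actual code. Only `_addcarry_u64` and `_subborrow_u64` from `core::arch::x86_64` are real APIs.

```rust
// Hypothetical wrappers. Carry/borrow values are always 0 or 1.

/// Computes a + b + carry, stores the low 64 bits in `out`, returns the carry-out.
#[cfg(target_arch = "x86_64")]
#[inline]
#[allow(unused_unsafe)] // in case a toolchain exposes these intrinsics as safe functions
pub fn adc(carry: u8, a: u64, b: u64, out: &mut u64) -> u8 {
    // The intrinsic maps onto the add-with-carry instruction.
    unsafe { core::arch::x86_64::_addcarry_u64(carry, a, b, out) }
}

/// Computes a - b - borrow, stores the low 64 bits in `out`, returns the borrow-out.
#[cfg(target_arch = "x86_64")]
#[inline]
#[allow(unused_unsafe)]
pub fn sbb(borrow: u8, a: u64, b: u64, out: &mut u64) -> u8 {
    unsafe { core::arch::x86_64::_subborrow_u64(borrow, a, b, out) }
}

/// Portable fallback (an assumption for this sketch): two overflowing adds.
#[cfg(not(target_arch = "x86_64"))]
#[inline]
pub fn adc(carry: u8, a: u64, b: u64, out: &mut u64) -> u8 {
    let (s1, c1) = a.overflowing_add(b);
    let (s2, c2) = s1.overflowing_add(carry as u64);
    *out = s2;
    (c1 | c2) as u8
}

/// Portable fallback (an assumption for this sketch): two overflowing subtractions.
#[cfg(not(target_arch = "x86_64"))]
#[inline]
pub fn sbb(borrow: u8, a: u64, b: u64, out: &mut u64) -> u8 {
    let (d1, b1) = a.overflowing_sub(b);
    let (d2, b2) = d1.overflowing_sub(borrow as u64);
    *out = d2;
    (b1 | b2) as u8
}
```

Both variants stay entirely in unsigned arithmetic, which is one way to read the remark about no longer needing signed->unsigned conversions in the subtraction path.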

Let me know if you'd prefer a different way to organize the platform-specific code.

@ejmahler
Contributor Author

I also tried applying _addcarry_u32 and _subborrow_u32 for 32-bit digits, but it didn't improve any benchmarks, and made many worse, so I backed it out.

I also experimented with the _mulx_u64 intrinsic in the mac_with_carry function, but it didn't even generate the mulx instruction and made benchmarks significantly worse.
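
As context for the mulx experiment, a multiply-accumulate-with-carry step can be written portably with a 128-bit intermediate, roughly as below. This is a simplified sketch. The signature is hypothetical and differs from the crate's own mac_with_carry, and it deliberately avoids `_mulx_u64` since that experiment was dropped.

```rust
/// Hypothetical helper: returns the low 64 bits of a + b * c + carry
/// and leaves the high 64 bits in `carry` for the next limb.
#[inline]
pub fn mac_with_carry(a: u64, b: u64, c: u64, carry: &mut u64) -> u64 {
    // a + b*c + carry is at most 2^128 - 1, so it cannot overflow a u128.
    let wide = a as u128 + (b as u128) * (c as u128) + *carry as u128;
    *carry = (wide >> 64) as u64;
    wide as u64
}
```

With a widening multiply like this the compiler is free to pick whichever multiply instruction it prefers, so an explicit intrinsic may not be needed.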

@ejmahler
Contributor Author

ejmahler commented Mar 24, 2020

It's worth noting that _addcarry_u64 was stabilized with rustc 1.33, so this PR would require either an MSRV increase, or some extra stuff in the build.rs file.

@ejmahler
Contributor Author

The build failures for rustc 1.32 and 1.31 are expected, given that these intrinsics were stabilized with rustc 1.33.

@cuviper
Member

cuviper commented Mar 24, 2020

I would like to keep 1.31 compatibility for now -- nice as the start of 2018 edition, and that's already an increase from num-bigint's compatibility. If nothing else, that will also make sure we have a u64_digit target in CI that still uses the plain code. So yeah, another autocfg test in the build script would work.
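
A build-script check of the kind suggested here could be structured roughly as follows. This is a sketch under assumptions: the cfg name `use_addcarry`, both function names, and the rule that `u64_digit` follows a 64-bit pointer width are made up for illustration. The `has_intrinsic` flag stands in for a compiler probe (for example, one done through the autocfg crate), which is left out so the snippet stays std-only.

```rust
use std::env;

/// Pure decision logic, kept apart from I/O so it is easy to test.
/// `has_intrinsic` is the result of probing the compiler for the intrinsic.
pub fn cfg_lines(target_arch: &str, pointer_width: &str, has_intrinsic: bool) -> Vec<String> {
    let mut lines = Vec::new();
    let u64_digit = pointer_width == "64";
    if u64_digit {
        lines.push("cargo:rustc-cfg=u64_digit".to_string());
    }
    // Only opt in where the benchmarks showed a win and the compiler supports it.
    if u64_digit && target_arch == "x86_64" && has_intrinsic {
        lines.push("cargo:rustc-cfg=use_addcarry".to_string());
    }
    lines
}

/// What a build.rs `main` could call: read Cargo's target info and print the cfgs.
pub fn emit_cfgs(has_intrinsic: bool) {
    let arch = env::var("CARGO_CFG_TARGET_ARCH").unwrap_or_default();
    let width = env::var("CARGO_CFG_TARGET_POINTER_WIDTH").unwrap_or_default();
    for line in cfg_lines(&arch, &width, has_intrinsic) {
        println!("{}", line);
    }
}
```

On a 1.31 or 1.32 toolchain the probe would fail, `use_addcarry` would not be emitted, and the plain code path would keep getting built and tested in CI.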

> I also tried applying _addcarry_u32 and _subborrow_u32 for 32-bit digits, but it didn't improve any benchmarks, and made many worse, so I backed it out.

When you tested that, were you running on i686, or x86_64 patched back to 32-bit digits?

> Let me know if you'd prefer a different way to organize the platform-specific code.

It's not bad for now, but we'll need a new approach if we scale out a lot of arch intrinsics.

Maybe it would duplicate less if we abstracted this closer to adc/sbb? I think we could work with arguments in the same order, just different carry/borrow types, and let inference deal with that difference in the callers.

@ejmahler
Contributor Author

ejmahler commented Mar 25, 2020

> I would like to keep 1.31 compatibility for now -- nice as the start of 2018 edition, and that's already an increase from num-bigint's compatibility. If nothing else, that will also make sure we have a u64_digit target in CI that still uses the plain code. So yeah, another autocfg test in the build script would work.

I've never done one of these before, but I'll look into it and amend the pull request.

> > I also tried applying _addcarry_u32 and _subborrow_u32 for 32-bit digits, but it didn't improve any benchmarks, and made many worse, so I backed it out.

> When you tested that, were you running on i686, or x86_64 patched back to 32-bit digits?

I tried both just editing the build.rs script to stop emitting u64_digit on x86_64, and compiling with the i686 msvc target, comparing before/after in each case.

> Maybe it would duplicate less if we abstracted this closer to adc/sbb? I think we could work with arguments in the same order, just different carry/borrow types, and let inference deal with that difference in the callers.

I considered this, and I decided not to because I was worried about creating surprising situations where a contributor writes code that overflows on one platform but not another. Or they try to pass the borrow/carry field along to another function, which affects the type inference, but only on some platforms.

I'm not too deeply against it if that's the direction you want to go.

@cuviper
Member

cuviper commented Mar 25, 2020

I think we already need CI to make sure all of this works both in generic code and arch-specialized, so I'm not too worried about surprises in the carry/borrow type. We can document that variability, and also that they only expect to be 0/1 regardless of type.
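
To make the idea concrete, here is one way the "same argument order, different carry type" abstraction could look. It is a sketch only: the `imp` module, the `Carry` alias, and `add_assign_limbs` are hypothetical names, and the non-x86_64 branch uses a plain 128-bit sum as a stand-in for the generic code.

```rust
#[cfg(target_arch = "x86_64")]
mod imp {
    /// With the intrinsic, the carry is a byte that is only ever 0 or 1.
    pub type Carry = u8;

    #[inline]
    #[allow(unused_unsafe)] // in case a toolchain exposes the intrinsic as safe
    pub fn adc(carry: Carry, a: u64, b: u64, out: &mut u64) -> Carry {
        unsafe { core::arch::x86_64::_addcarry_u64(carry, a, b, out) }
    }
}

#[cfg(not(target_arch = "x86_64"))]
mod imp {
    /// In the generic path the carry lives in a double-width integer,
    /// but it is still only ever 0 or 1.
    pub type Carry = u128;

    #[inline]
    pub fn adc(carry: Carry, a: u64, b: u64, out: &mut u64) -> Carry {
        let sum = carry + a as u128 + b as u128;
        *out = sum as u64;
        sum >> 64
    }
}

use imp::adc;

/// Adds `rhs` into `acc` limb by limb (equal lengths assumed) and
/// reports whether a carry fell off the top.
pub fn add_assign_limbs(acc: &mut [u64], rhs: &[u64]) -> bool {
    debug_assert_eq!(acc.len(), rhs.len());
    let mut carry = 0; // type inferred from `adc`, so it differs per platform
    for (x, y) in acc.iter_mut().zip(rhs) {
        let xv = *x;
        carry = adc(carry, xv, *y, x);
    }
    carry != 0
}
```

The caller never names the carry type, which is the inference trick being proposed. The concern about platform-dependent surprises applies as soon as a caller does anything with `carry` other than feed it back in or compare it with zero.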

@ejmahler
Contributor Author

I decided to do a more thorough benchmark compilation, and found that for x86_64 forced to 32-bit digits, using _addcarry_u32 actually did make a difference. But on genuine x86, it makes things the same or worse.

Is 32-bit digits on x86_64 something worth worrying about? Seems like the only way it would happen is if someone like me forced it off to test something.

@cuviper
Member

cuviper commented Apr 2, 2020

> Is 32-bit digits on x86_64 something worth worrying about?

Not really. I was hoping the result would go the other way, showing benefit on native 32-bit. But now I feel wary that even in the 64-bit case, the benefits you've seen might be very fickle depending on specific CPUs, etc. How much was the actual improvement you saw?

@ejmahler
Contributor Author

ejmahler commented Apr 10, 2020

I probably should have shared benchmarks in the first place. Here they are:

x86_64-pc-windows-msvc

master:
test fib2_100             ... bench:       1,068 ns/iter (+/- 6)
test fib2_1000            ... bench:      13,776 ns/iter (+/- 226)
test fib2_10000           ... bench:     715,450 ns/iter (+/- 22,143)
test fib_100              ... bench:         723 ns/iter (+/- 15)
test fib_1000             ... bench:       8,026 ns/iter (+/- 407)
test fib_10000            ... bench:     401,122 ns/iter (+/- 10,114)
test fib_to_string        ... bench:         217 ns/iter (+/- 2)

addcarry_instrinsic:
test fib2_100             ... bench:       1,040 ns/iter (+/- 16)
test fib2_1000            ... bench:      13,087 ns/iter (+/- 519)
test fib2_10000           ... bench:     599,950 ns/iter (+/- 49,125)
test fib_100              ... bench:         732 ns/iter (+/- 13)
test fib_1000             ... bench:       8,084 ns/iter (+/- 273)
test fib_10000            ... bench:     347,810 ns/iter (+/- 30,973)
test fib_to_string        ... bench:         225 ns/iter (+/- 14)

x86_64-pc-windows-gnu

master:
test fib2_100             ... bench:         982 ns/iter (+/- 14)
test fib2_1000            ... bench:      13,680 ns/iter (+/- 786)
test fib2_10000           ... bench:     717,770 ns/iter (+/- 52,712)
test fib_100              ... bench:         749 ns/iter (+/- 12)
test fib_1000             ... bench:       7,327 ns/iter (+/- 138)
test fib_10000            ... bench:     384,290 ns/iter (+/- 18,288)
test fib_to_string        ... bench:         225 ns/iter (+/- 1)

addcarry_instrinsic:
test fib2_100             ... bench:         968 ns/iter (+/- 8)
test fib2_1000            ... bench:      12,187 ns/iter (+/- 547)
test fib2_10000           ... bench:     583,310 ns/iter (+/- 43,455)
test fib_100              ... bench:         782 ns/iter (+/- 43)
test fib_1000             ... bench:       7,300 ns/iter (+/- 104)
test fib_10000            ... bench:     333,940 ns/iter (+/- 9,086)
test fib_to_string        ... bench:         226 ns/iter (+/- 19)

i686-pc-windows-msvc

master:
test fib2_100             ... bench:       2,066 ns/iter (+/- 43)
test fib2_1000            ... bench:      24,621 ns/iter (+/- 968)
test fib2_10000           ... bench:   1,629,165 ns/iter (+/- 85,346)
test fib_100              ... bench:       1,133 ns/iter (+/- 8)
test fib_1000             ... bench:      14,107 ns/iter (+/- 336)
test fib_10000            ... bench:     807,895 ns/iter (+/- 24,295)
test fib_to_string        ... bench:         300 ns/iter (+/- 22)

addcarry_instrinsic:
test fib2_100             ... bench:       2,002 ns/iter (+/- 94)
test fib2_1000            ... bench:      23,803 ns/iter (+/- 4,320)
test fib2_10000           ... bench:   1,674,750 ns/iter (+/- 89,655)
test fib_100              ... bench:       1,083 ns/iter (+/- 18)
test fib_1000             ... bench:      14,480 ns/iter (+/- 1,098)
test fib_10000            ... bench:     935,170 ns/iter (+/- 62,175)
test fib_to_string        ... bench:         298 ns/iter (+/- 16)

i686-pc-windows-gnu

master:
test fib2_100             ... bench:       1,348 ns/iter (+/- 9)
test fib2_1000            ... bench:      24,365 ns/iter (+/- 3,929)
test fib2_10000           ... bench:   1,630,710 ns/iter (+/- 22,994)
test fib_100              ... bench:       1,000 ns/iter (+/- 10)
test fib_1000             ... bench:      13,136 ns/iter (+/- 516)
test fib_10000            ... bench:     799,360 ns/iter (+/- 6,903)
test fib_to_string        ... bench:         304 ns/iter (+/- 6)

addcarry_instrinsic:
test fib2_100             ... bench:       1,231 ns/iter (+/- 7)
test fib2_1000            ... bench:      21,380 ns/iter (+/- 2,944)
test fib2_10000           ... bench:   1,654,720 ns/iter (+/- 58,099)
test fib_100              ... bench:         954 ns/iter (+/- 7)
test fib_1000             ... bench:      12,638 ns/iter (+/- 1,719)
test fib_10000            ... bench:     923,640 ns/iter (+/- 8,906)
test fib_to_string        ... bench:         831 ns/iter (+/- 24)

On both the msvc and gnu toolchains, there's a 10-20% improvement for big problems on x86_64. But on both toolchains, big problems actually get slower on i686: roughly 1-3% for fib2_10000 and about 16% for fib_10000.
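
The percentages can be recomputed from the ns/iter figures above. The snippet below is a small sketch that does this for the largest inputs on the msvc targets. The helper names are made up, and the numbers are copied from the tables in this comment. By this calculation the x86_64 rows come out around 13-16% faster, while fib_10000 on i686 is closer to 16% slower.

```rust
/// Relative change from `before` to `after` in percent; negative means faster.
pub fn percent_change(before_ns: f64, after_ns: f64) -> f64 {
    (after_ns - before_ns) / before_ns * 100.0
}

/// (label, master ns/iter, addcarry_instrinsic ns/iter), msvc targets only.
pub const ROWS: [(&str, f64, f64); 4] = [
    ("x86_64 fib2_10000", 715_450.0, 599_950.0),
    ("x86_64 fib_10000", 401_122.0, 347_810.0),
    ("i686 fib2_10000", 1_629_165.0, 1_674_750.0),
    ("i686 fib_10000", 807_895.0, 935_170.0),
];

/// One formatted line per row, e.g. for printing in a quick comparison.
pub fn report() -> Vec<String> {
    ROWS.iter()
        .map(|(name, before, after)| {
            format!("{}: {:+.1}%", name, percent_change(*before, *after))
        })
        .collect()
}
```

The +/- column is ignored here, so small differences (a few percent) should be read as noise rather than as a real change.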

@cuviper cuviper force-pushed the addcarry_instrinsic branch from a18350c to 029a3df Compare October 30, 2020 21:56
@cuviper cuviper force-pushed the addcarry_instrinsic branch from 029a3df to e03bbc1 Compare October 30, 2020 22:00
@cuviper
Member

cuviper commented Oct 30, 2020

Sorry for leaving this so long! I rebased your branch, and cleaned up the feature conditions a bit (possibly subjective). Then I ran the benchmarks myself on Fedora 33, for both i686 and x86_64-unknown-linux-gnu, and on two different CPUs: Intel i7-7700K and AMD Ryzen 7 3800X. In all cases, the intrinsics look like a clear winner to me!

@ejmahler
Contributor Author

ejmahler commented Nov 2, 2020

I'm glad to hear it was an improvement, and no problem on the delay.

Looking at the changes to cfg stuff, I think it's much more clear and upfront what the intention is.

@cuviper
Member

cuviper commented Nov 2, 2020

bors r+

@bors
Contributor

bors bot commented Nov 2, 2020

@bors bors bot merged commit b3d48f4 into rust-num:master Nov 2, 2020
@ejmahler ejmahler deleted the addcarry_instrinsic branch November 2, 2020 21:29