-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make 64x64->64 bit multiplications constant-time with MSVC on 32bit x86 #711
Conversation
It might be better to use verifybits macros in the scalar code, instead of an unstructured verifycheck. With a fair amount effort previously, I extracted our 32bit field multiply instructions and verified it to be free of overflow and (I think) the verifybits statements in frama-c to be free of overflow. If this used the same macros a future redo would pick up the static analysis for free. I got discouraged before because frama-c didn't support __int128, but presumably it (or another tool) would eventually and it might be reasonable to automated running range analysis on the C-language multiplier functions. My main complaint with making it not conditional is that your fix isn't free. (At least I assume it isn't from a glance, haven't tested, if it's not actually a benchmarkable difference on GCC ignore the rest of this comment). It since it's only a few lines it wouldn't be too messy to make conditional. As far as testing goes, this is something we should be able to have CI test. MSVC is a common enough compiler that it's probably worth doing that regardless. I wish I saw a way to detect any more of these being added in the codebase. I could detect 64x64->64 on secret inputs in modified valgrind... but couldn't automatically exclude ones where the high words will not be zero. It's nice to document that these are the only 64-bit output multiplies that aren't really 32x32->64... as that fact might be really useful for anyone trying to make a SIMD version of these functions. |
I had verified this statement in godbolt:
I haven't run benchmarks yet but I doubt that two shifts are a benchmarkable difference. |
My preference is doing this with conditional compilation for MSVC. It's probably a trivial performance hit anyway, but it's also an unfortunate complication for someone who doesn't care about MSVC trying to audit the code. You're right that this complicates testing, but I think that's simply a side effect of MSVC being undertested as a platform. That is already a problem, and this is exposing it. Bitcoin Core runs the secp256k1 tests on MSVC via AppVeyor; I'm sure people would be willing to help set up an AppVeyor instance for the secp256k1 repo too. |
This is ready for review. Here's a compiler inspector instance to play around with this: You can recreate this by pasting the following files, comment out the now unresolved #includes and remove
|
Hm, I can make it conditional but I'm not super convinced that this is better. Making it conditional increases the burden for reviewers who care about multiple platforms. I think my comment is detailed enough to review audit these 4 multiplications quickly, and this is probably < 1 % of the work you're doing if you're auditing this code. |
@jonasnick figured out that my calculations are off: This only ensures that the upper 48 bits are non-zero (instead of the upper 32 bits). Then approach doesn't work actually, oops. Marking this as WIP for now, I need to think about this. |
Okay, I updated this to use a different approach. Also I made this conditional on MSVC. I'm still not convinced that this is better but I seem to be the only one who argues for making it unconditional. |
rebased |
I discussed this with Thomas Pornin (BearSSL) and he pointed out that
He also mentioned that a different proper way would be to simply provide our own (inlinable) imlementation of 64x64->64 bit multiplication. That's true. So far I haven't chosen this approach because 1) it would touch more of the existing code, and 2) we don't care about MSVC performance. But if we anyway would enable the workaround proposed in this PR conditionally only on MSVC, and with bitcoin/bitcoin@162d003 in mind (which is totally not aware of), maybe we should just fix it the proper way. |
One way to do that is to make a MUL_32_32_64 macro, and on msvc it gets converted to an inlinable function call and on everything else it just gets converted to the existing code. This would aid portability to 32-bit platforms that don't have a 64-bit type. Review would be straightforward in part because the object code on non-MSVC platforms would be completely unchanged. [The related "doesn't have a 128-bit type" problem has prevented me from using various static analysis tools on the 64-bit code.] |
Yeah, and there's the So we could "solve" this one, too. |
ACK but needs rebase. |
I'm leaning towards abandoning this PR and solving the issue using a separate 32x32->64 routine, as discussed above. @sipa What do you think? |
@real-or-random I don't follow, that seems like it's addressing something unrelated. The 32x32->64 multiplications aren't the problem, because (uint64_t)a*b works just fine in that case for MSVC. The problem is the 64x32->64 multiplications, where it doesn't seem we have a solution at all in general, except the trick used in this current PR which is only applicable if we only care about the bottom 63 bits of the output. |
Just a heads-up that I have a rewrite of the 10x26 field mul in progress that contains only 32x32 multiplications, amongst other changes. |
@sipa Ok, things got a little messed up in this thread.
This is where the confusion starts. When I read this for the first time (back in July), I assumed @gmaxwell meant to write "a MUL_64_64_64 macro". But now I apparently adopted this purported "typo"... So there are two different issues. a) 64x64->64 is not constant-time on MSVCWhat I want to do in order to avoid the variable-time issue is to implement a MUL64_64_64 macro or inlinable function for MSVC, This was suggested by Thomas Pornin along with this canonical implementation
The tree multiplications in this routine are all formally 32x32->64 muls. MSVC's optimizer detects that correctly and output normal This is much more straightforward and also much faster than the solution in this PR. b) 32x32->64 and 64x64->128 macrosThis is just an orthogonal different issue. These macros would make our code work on 32 bit implementations where we don't have a 64 bit type, and on 64 bit implementations where we don't have a 128 bit type (such as MSVC). This would also improve performance on MSVC in cases that MSVC is not clever enough to figure out that a ->64 multiplication is in fact 32x32->64.
That sounds great and this rewrite would solve a) and at least the 32x32->64 / field part of b). Can you say more? |
Fast code that uses only 32-bit multiplies would also be a great prototype for a fast N-way SIMD version. |
Nothing too fancy on this score; the new code just avoids multiplying by the 64-bit accumulators.
@gmaxwell My immediate target is an arbitrary-degree Karatsuba (https://eprint.iacr.org/2015/1247) rewrite for 10x26, which already has some potential for vectorisation within a single fe_mul. I hope you'll be able to benchmark it on the real hardware in due course. EDIT: Maybe just to be super-clear, I'm simply using 32x32->64 multiplies. |
@real-or-random Understood now. Concept ACK on both. |
#815 removes the problematic multiplies from the 10x26 field implementation. |
The issue is that MSVC for 32-bit targets implements 64x64->64 bit multiplications using a non-constant subroutine. The subroutine is not constant-time because it shortcuts when the high 32 bits of both multiplicands are all 0.
See https://research.kudelskisecurity.com/2017/01/16/when-constant-time-source-may-not-save-you/ and also https://www.bearssl.org/ctmul.html for a broader view of the issue.
By inspection of our 8x32 scalar and 10x26 field code, I found four places in the field code where the
high bits are not guaranteed to be zero.
This PR inserts VERIFY_CHECKS in the 8x32 scalar code to ensure the high bits are indeed 0. There, all ->64 multiplications are in fact 32x32->64.
Moreover, this PR modifies the four multiplications in the 10x26 such that the right multiplicand, which is a constant, has never all high bits set to zero. This is ensured by shifting that constant to the left. The costs are two additional shift instructions for shifting the product back to right, for each field element multiplication and doubling. The correctness follows from the VERIFY_BITS statements for the other multiplicands preceeding the multiplications, which ensure that we have enough unused high bits such that the multiplication won't overflow even with the left-shifted constant.
I feel that this is the most reasonable thing we can do without too much effort and loss of performance.
Possible alternatives:
In general, what do people think about pointing out in the README that the library is supposed to be portable but tested tested mostly on gcc and clang, and it's therefore recommend to compile it there if possible? This sounds a little bit like "Optimized for Netspace Navigator" but please read https://blog.mozilla.org/nfroyd/2018/05/29/when-implementation-monoculture-right-thing/ for why Mozilla dropped MSVC and how nice clang-cl is as a drop-in replacement for MSVC.
This is WIP because I need to add detailed comments, and I first want to see what people think.
edit: And I had a godbolt environment to play around with this but I lost it. I'll make a new one if people are interested.