Make 64x64->64 bit multiplications constant-time with MSVC on 32bit x86 #711

real-or-random · 2020-01-11T14:32:34Z

The issue is that MSVC for 32-bit targets implements 64x64->64 bit multiplications using a non-constant subroutine. The subroutine is not constant-time because it shortcuts when the high 32 bits of both multiplicands are all 0.

See https://research.kudelskisecurity.com/2017/01/16/when-constant-time-source-may-not-save-you/ and also https://www.bearssl.org/ctmul.html for a broader view of the issue.

By inspection of our 8x32 scalar and 10x26 field code, I found four places in the field code where the
high bits are not guaranteed to be zero.

This PR inserts VERIFY_CHECKS in the 8x32 scalar code to ensure the high bits are indeed 0. There, all ->64 multiplications are in fact 32x32->64.

Moreover, this PR modifies the four multiplications in the 10x26 such that the right multiplicand, which is a constant, has never all high bits set to zero. This is ensured by shifting that constant to the left. The costs are two additional shift instructions for shifting the product back to right, for each field element multiplication and doubling. The correctness follows from the VERIFY_BITS statements for the other multiplicands preceeding the multiplications, which ensure that we have enough unused high bits such that the multiplication won't overflow even with the left-shifted constant.

I feel that this is the most reasonable thing we can do without too much effort and loss of performance.

Possible alternatives:

Do the same but with MSVC conditional compilation: I think that's also okay but conditional compilation makes makes testing harder, in particular because noone here uses Windows.
Write assembly for the multiplication. That may also improve performance for MSVC because their routine is not only variable time but also slow. But this is more work, harder to review and I don't care about performance for MSVC. Moreover, they have a different asm syntax...
Blacklist the 32-bit for MSVC 32-bit. The drawback is that this leaves us with no option there, and people will probably just comment out the blacklisting. And MSVC is indeed used, for example it's a travis-tested target for rust-secp256k1.
Do nothing (and blame the compiler). But I don't think that's clever given that this PR is simple. Of course there are many other non-constant multiplication issues for different platform but I don't think that's a reason to ignore this one.

In general, what do people think about pointing out in the README that the library is supposed to be portable but tested tested mostly on gcc and clang, and it's therefore recommend to compile it there if possible? This sounds a little bit like "Optimized for Netspace Navigator" but please read https://blog.mozilla.org/nfroyd/2018/05/29/when-implementation-monoculture-right-thing/ for why Mozilla dropped MSVC and how nice clang-cl is as a drop-in replacement for MSVC.

This is WIP because I need to add detailed comments, and I first want to see what people think.

edit: And I had a godbolt environment to play around with this but I lost it. I'll make a new one if people are interested.

gmaxwell · 2020-01-11T15:52:28Z

It might be better to use verifybits macros in the scalar code, instead of an unstructured verifycheck.

With a fair amount effort previously, I extracted our 32bit field multiply instructions and verified it to be free of overflow and (I think) the verifybits statements in frama-c to be free of overflow. If this used the same macros a future redo would pick up the static analysis for free. I got discouraged before because frama-c didn't support __int128, but presumably it (or another tool) would eventually and it might be reasonable to automated running range analysis on the C-language multiplier functions.

My main complaint with making it not conditional is that your fix isn't free. (At least I assume it isn't from a glance, haven't tested, if it's not actually a benchmarkable difference on GCC ignore the rest of this comment).

It since it's only a few lines it wouldn't be too messy to make conditional. As far as testing goes, this is something we should be able to have CI test. MSVC is a common enough compiler that it's probably worth doing that regardless.

I wish I saw a way to detect any more of these being added in the codebase. I could detect 64x64->64 on secret inputs in modified valgrind... but couldn't automatically exclude ones where the high words will not be zero. It's nice to document that these are the only 64-bit output multiplies that aren't really 32x32->64... as that fact might be really useful for anyone trying to make a SIMD version of these functions.

real-or-random · 2020-02-12T13:16:02Z

My main complaint with making it not conditional is that your fix isn't free. (At least I assume it isn't from a glance, haven't tested, if it's not actually a benchmarkable difference on GCC ignore the rest of this comment).

I had verified this statement in godbolt:

The costs are two additional shift instructions for shifting the product back to right, for each field element multiplication and doubling.

I haven't run benchmarks yet but I doubt that two shifts are a benchmarkable difference.

sipa · 2020-03-24T17:03:19Z

My preference is doing this with conditional compilation for MSVC. It's probably a trivial performance hit anyway, but it's also an unfortunate complication for someone who doesn't care about MSVC trying to audit the code.

You're right that this complicates testing, but I think that's simply a side effect of MSVC being undertested as a platform. That is already a problem, and this is exposing it. Bitcoin Core runs the secp256k1 tests on MSVC via AppVeyor; I'm sure people would be willing to help set up an AppVeyor instance for the secp256k1 repo too.

real-or-random · 2020-03-24T17:08:24Z

This is ready for review.

Here's a compiler inspector instance to play around with this:
https://godbolt.org/z/58L7rR

You can recreate this by pasting the following files, comment out the now unresolved #includes and remove SECP256K1_INLINE and static for all functions you're interested in.

incude/secp256k1.h
src/util.h
src/field_10x26.h
src/field_10x26_impl.h
src/scalar_8x32.h
src/scalar_8x32_impl.h

real-or-random · 2020-03-24T17:12:53Z

My preference is doing this with conditional compilation for MSVC. It's probably a trivial performance hit anyway, but it's also an unfortunate complication for someone who doesn't care about MSVC trying to audit the code.

Hm, I can make it conditional but I'm not super convinced that this is better. Making it conditional increases the burden for reviewers who care about multiple platforms. I think my comment is detailed enough to review audit these 4 multiplications quickly, and this is probably < 1 % of the work you're doing if you're auditing this code.

real-or-random · 2020-04-02T11:45:12Z

@jonasnick figured out that my calculations are off: This only ensures that the upper 48 bits are non-zero (instead of the upper 32 bits). Then approach doesn't work actually, oops. Marking this as WIP for now, I need to think about this.

real-or-random · 2020-04-22T15:18:39Z

Okay, I updated this to use a different approach.

Also I made this conditional on MSVC. I'm still not convinced that this is better but I seem to be the only one who argues for making it unconditional.

src/field_10x26_impl.h

real-or-random · 2020-07-26T11:29:31Z

rebased

real-or-random · 2020-07-28T19:20:55Z

I discussed this with Thomas Pornin (BearSSL) and he pointed out that

The proof can be made simpler and more "obvious" in two different
ways:

With maths:

"and 2^n-1" is equivalent to "mod 2^n". Thus, given a and b,
you want a*b mod 2^n. "b or 2^n" is really b + 2^n, which is
equal to b modulo 2^n, and thus the result is correct.

By looking at carries:

Multiplication is really a bunch of left-shifts and additions
(that's the "schoolbook method"). When you modify a bit at
rank k, it can impact only bits at rank k or higher, because
carry propagation in additions is only to the left. So, setting
bit 63 in an operand cannot modify bits 0-62 in the result.

He also mentioned that a different proper way would be to simply provide our own (inlinable) imlementation of 64x64->64 bit multiplication. That's true. So far I haven't chosen this approach because 1) it would touch more of the existing code, and 2) we don't care about MSVC performance. But if we anyway would enable the workaround proposed in this PR conditionally only on MSVC, and with bitcoin/bitcoin@162d003 in mind (which is totally not aware of), maybe we should just fix it the proper way.

gmaxwell · 2020-07-28T23:13:35Z

way would be to simply provide our own (inlinable) imlementation of 64x64->64 bit multiplication. That's true. So far I haven't chosen this approach because 1) it would touch more of the existing code

One way to do that is to make a MUL_32_32_64 macro, and on msvc it gets converted to an inlinable function call and on everything else it just gets converted to the existing code. This would aid portability to 32-bit platforms that don't have a 64-bit type. Review would be straightforward in part because the object code on non-MSVC platforms would be completely unchanged.

[The related "doesn't have a 128-bit type" problem has prevented me from using various static analysis tools on the 64-bit code.]

real-or-random · 2020-07-29T06:05:46Z

[The related "doesn't have a 128-bit type" problem has prevented me from using various static analysis tools on the 64-bit code.]

Yeah, and there's the _umul128 intrinsic for this on MSVC (https://docs.microsoft.com/en-us/cpp/intrinsics/umul128?view=vs-2019). It has a highly intuitive syntax. (Why didn't they simply provide a 128 bit type...?)

So we could "solve" this one, too.

sipa · 2020-09-09T03:45:42Z

ACK but needs rebase.

real-or-random · 2020-09-09T09:56:00Z

I'm leaning towards abandoning this PR and solving the issue using a separate 32x32->64 routine, as discussed above. @sipa What do you think?

sipa · 2020-09-10T02:31:48Z

@real-or-random I don't follow, that seems like it's addressing something unrelated. The 32x32->64 multiplications aren't the problem, because (uint64_t)a*b works just fine in that case for MSVC. The problem is the 64x32->64 multiplications, where it doesn't seem we have a solution at all in general, except the trick used in this current PR which is only applicable if we only care about the bottom 63 bits of the output.

peterdettman · 2020-09-10T03:22:49Z

Just a heads-up that I have a rewrite of the 10x26 field mul in progress that contains only 32x32 multiplications, amongst other changes.

real-or-random · 2020-09-10T12:45:09Z

@sipa Ok, things got a little messed up in this thread.

way would be to simply provide our own (inlinable) imlementation of 64x64->64 bit multiplication. That's true. So far I haven't chosen this approach because 1) it would touch more of the existing code

One way to do that is to make a MUL_32_32_64 macro, and on msvc it gets converted to an inlinable function call and on everything else it just gets converted to the existing code.

This is where the confusion starts. When I read this for the first time (back in July), I assumed @gmaxwell meant to write "a MUL_64_64_64 macro". But now I apparently adopted this purported "typo"...

So there are two different issues.

a) 64x64->64 is not constant-time on MSVC

What I want to do in order to avoid the variable-time issue is to implement a MUL64_64_64 macro or inlinable function for MSVC, This was suggested by Thomas Pornin along with this canonical implementation

    static inline uint64_t
    mul64x64to64(uint64_t x, uint64_t y)
    {
        uint32_t xl, xh, yl, yh;

        xl = (uint32_t)x;
        xh = (uint32_t)(x >> 32);
        yl = (uint32_t)y;
        yh = (uint32_t)(y >> 32);
        return (uint64_t)xl * (uint64_t)yl
            + ((uint64_t)(xl * yh + xh * yl) << 32);
    }

The tree multiplications in this routine are all formally 32x32->64 muls. MSVC's optimizer detects that correctly and output normal mul and imul instruction. Even if MSVC implemented the multiplication using its llmul routine, the routine is constant-time over 32x32 inputs.

This is much more straightforward and also much faster than the solution in this PR.

b) 32x32->64 and 64x64->128 macros

This is just an orthogonal different issue. These macros would make our code work on 32 bit implementations where we don't have a 64 bit type, and on 64 bit implementations where we don't have a 128 bit type (such as MSVC). This would also improve performance on MSVC in cases that MSVC is not clever enough to figure out that a ->64 multiplication is in fact 32x32->64.

Just a heads-up that I have a rewrite of the 10x26 field mul in progress that contains only 32x32 multiplications, amongst other changes.

That sounds great and this rewrite would solve a) and at least the 32x32->64 / field part of b). Can you say more?

gmaxwell · 2020-09-10T12:48:00Z

Fast code that uses only 32-bit multiplies would also be a great prototype for a fast N-way SIMD version.

peterdettman · 2020-09-10T13:08:20Z

That sounds great and this rewrite would solve a) and at least the 32x32->64 / field part of b). Can you say more?

Nothing too fancy on this score; the new code just avoids multiplying by the 64-bit accumulators.

Fast code that uses only 32-bit multiplies would also be a great prototype for a fast N-way SIMD version.

@gmaxwell My immediate target is an arbitrary-degree Karatsuba (https://eprint.iacr.org/2015/1247) rewrite for 10x26, which already has some potential for vectorisation within a single fe_mul. I hope you'll be able to benchmark it on the real hardware in due course.

EDIT: Maybe just to be super-clear, I'm simply using 32x32->64 multiplies.

sipa · 2020-09-10T16:22:52Z

@real-or-random Understood now. Concept ACK on both.

peterdettman · 2020-09-13T16:42:25Z

#815 removes the problematic multiplies from the 10x26 field implementation.

real-or-random · 2020-09-13T20:43:15Z

#815 removes the problematic multiplies from the 10x26 field implementation.

Okay, closing here for now. Depending on the the progress in #815, it may still be useful to have a temporary solution but it's clear that I won't be this PR.

real-or-random force-pushed the llmul branch from 180faa9 to ed4c023 Compare March 24, 2020 16:32

real-or-random marked this pull request as ready for review March 24, 2020 16:43

real-or-random changed the title ~~WIP Make 64x64->64 bit multiplications constant-time with MSVC on 32bit x86~~ Make 64x64->64 bit multiplications constant-time with MSVC on 32bit x86 Mar 24, 2020

elichai mentioned this pull request Mar 26, 2020

Constant-time behaviour test using valgrind memtest. #708

Merged

real-or-random changed the title ~~Make 64x64->64 bit multiplications constant-time with MSVC on 32bit x86~~ WIP: Make 64x64->64 bit multiplications constant-time with MSVC on 32bit x86 Apr 2, 2020

real-or-random marked this pull request as draft April 22, 2020 10:15

real-or-random force-pushed the llmul branch from ed4c023 to edc2d92 Compare April 22, 2020 15:15

real-or-random changed the title ~~WIP: Make 64x64->64 bit multiplications constant-time with MSVC on 32bit x86~~ Make 64x64->64 bit multiplications constant-time with MSVC on 32bit x86 Apr 22, 2020

real-or-random marked this pull request as ready for review April 22, 2020 15:17

sipa reviewed Apr 22, 2020

View reviewed changes

src/field_10x26_impl.h Show resolved Hide resolved

real-or-random force-pushed the llmul branch from edc2d92 to 1f18ac7 Compare April 23, 2020 11:16

Make 64x64->64 multiplications constant-time with MSVC on 32bit x86

ad62627

real-or-random force-pushed the llmul branch from 1f18ac7 to ad62627 Compare July 26, 2020 11:26

real-or-random mentioned this pull request Jul 26, 2020

Improve constant-timeness on PowerPC #772

Merged

real-or-random mentioned this pull request Aug 9, 2020

Make scalar/field choice depend on C-detected __int128 availability #793

Merged

real-or-random closed this Sep 13, 2020

real-or-random mentioned this pull request Dec 1, 2022

64x64->64 muls are not constant-time with MSVC on 32bit x86 #1164

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make 64x64->64 bit multiplications constant-time with MSVC on 32bit x86 #711

Make 64x64->64 bit multiplications constant-time with MSVC on 32bit x86 #711

real-or-random commented Jan 11, 2020 •

edited

Loading

gmaxwell commented Jan 11, 2020 •

edited

Loading

real-or-random commented Feb 12, 2020

sipa commented Mar 24, 2020

real-or-random commented Mar 24, 2020

real-or-random commented Mar 24, 2020

real-or-random commented Apr 2, 2020 •

edited

Loading

real-or-random commented Apr 22, 2020

real-or-random commented Jul 26, 2020

real-or-random commented Jul 28, 2020

gmaxwell commented Jul 28, 2020

real-or-random commented Jul 29, 2020

sipa commented Sep 9, 2020

real-or-random commented Sep 9, 2020

sipa commented Sep 10, 2020 •

edited

Loading

peterdettman commented Sep 10, 2020

real-or-random commented Sep 10, 2020 •

edited

Loading

gmaxwell commented Sep 10, 2020

peterdettman commented Sep 10, 2020 •

edited

Loading

sipa commented Sep 10, 2020

peterdettman commented Sep 13, 2020

real-or-random commented Sep 13, 2020

Make 64x64->64 bit multiplications constant-time with MSVC on 32bit x86 #711

Make 64x64->64 bit multiplications constant-time with MSVC on 32bit x86 #711

Conversation

real-or-random commented Jan 11, 2020 • edited Loading

gmaxwell commented Jan 11, 2020 • edited Loading

real-or-random commented Feb 12, 2020

sipa commented Mar 24, 2020

real-or-random commented Mar 24, 2020

real-or-random commented Mar 24, 2020

real-or-random commented Apr 2, 2020 • edited Loading

real-or-random commented Apr 22, 2020

real-or-random commented Jul 26, 2020

real-or-random commented Jul 28, 2020

gmaxwell commented Jul 28, 2020

real-or-random commented Jul 29, 2020

sipa commented Sep 9, 2020

real-or-random commented Sep 9, 2020

sipa commented Sep 10, 2020 • edited Loading

peterdettman commented Sep 10, 2020

real-or-random commented Sep 10, 2020 • edited Loading

a) 64x64->64 is not constant-time on MSVC

b) 32x32->64 and 64x64->128 macros

gmaxwell commented Sep 10, 2020

peterdettman commented Sep 10, 2020 • edited Loading

sipa commented Sep 10, 2020

peterdettman commented Sep 13, 2020

real-or-random commented Sep 13, 2020

real-or-random commented Jan 11, 2020 •

edited

Loading

gmaxwell commented Jan 11, 2020 •

edited

Loading

real-or-random commented Apr 2, 2020 •

edited

Loading

sipa commented Sep 10, 2020 •

edited

Loading

real-or-random commented Sep 10, 2020 •

edited

Loading

peterdettman commented Sep 10, 2020 •

edited

Loading