Make 64x64->64 bit multiplications constant-time with MSVC on 32bit x86 #711

Closed

real-or-random
Contributor

@real-or-random real-or-random commented Jan 11, 2020

The issue is that MSVC for 32-bit targets implements 64x64->64 bit multiplications using a subroutine that is not constant-time: it takes a shortcut when the high 32 bits of both multiplicands are all 0.

See https://research.kudelskisecurity.com/2017/01/16/when-constant-time-source-may-not-save-you/ and also https://www.bearssl.org/ctmul.html for a broader view of the issue.
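To make the shortcut concrete, here is a rough C model of a multiply helper with that behaviour. It is only an illustration: the function name and the `took_fast_path` flag are made up for this sketch, and it is not MSVC's actual runtime routine.

```c
#include <stdint.h>

/* Illustrative model only, NOT MSVC's actual runtime code. It mimics the
 * shortcut described above: a fast path is taken when the high 32 bits of
 * both operands are zero, so the running time depends on the operands. */
static uint64_t model_variable_time_mul64(uint64_t a, uint64_t b, int *took_fast_path)
{
    uint32_t al = (uint32_t)a, ah = (uint32_t)(a >> 32);
    uint32_t bl = (uint32_t)b, bh = (uint32_t)(b >> 32);

    if ((ah | bh) == 0) {
        /* Fast path: a single 32x32->64 multiplication. */
        *took_fast_path = 1;
        return (uint64_t)al * bl;
    }
    /* Slow path: three multiplications. The cross terms are computed
     * modulo 2^32 because they only affect the high half of the result. */
    *took_fast_path = 0;
    return (uint64_t)al * bl + ((uint64_t)(al * bh + ah * bl) << 32);
}
```

Both paths return the same product; only the amount of work differs, and that difference is what can leak through timing.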

By inspection of our 8x32 scalar and 10x26 field code, I found four places in the field code where the high bits are not guaranteed to be zero.

This PR inserts VERIFY_CHECKS in the 8x32 scalar code to ensure the high bits are indeed 0. There, all ->64 multiplications are in fact 32x32->64.
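As a sketch of what such a check amounts to (the helper name is hypothetical, and the `VERIFY_CHECK` definition here is a plain `assert` stand-in rather than the library's real macro):

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the library's VERIFY_CHECK macro; the real definition differs. */
#define VERIFY_CHECK(cond) assert(cond)

/* Hypothetical helper: multiply two values that are expected to fit in
 * 32 bits, checking that the high 32 bits of both operands are zero, so
 * the multiplication is effectively 32x32->64. */
static uint64_t checked_mul_32x32_64(uint64_t a, uint64_t b)
{
    VERIFY_CHECK((a >> 32) == 0);
    VERIFY_CHECK((b >> 32) == 0);
    return a * b;
}
```

If both high halves are always zero, a routine with the shortcut described above always takes the same path, so timing no longer depends on the operand values.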

Moreover, this PR modifies the four multiplications in the 10x26 field code such that the right multiplicand, which is a constant, never has all of its high bits set to zero. This is ensured by shifting that constant to the left. The cost is two additional shift instructions, for shifting the product back to the right, for each field element multiplication and doubling. Correctness follows from the VERIFY_BITS statements for the other multiplicands preceding the multiplications, which ensure that we have enough unused high bits that the multiplication won't overflow even with the left-shifted constant.

I feel that this is the most reasonable thing we can do without too much effort and loss of performance.

Possible alternatives:

  • Do the same but with MSVC conditional compilation: I think that's also okay but conditional compilation makes testing harder, in particular because no one here uses Windows.
  • Write assembly for the multiplication. That may also improve performance for MSVC because their routine is not only variable time but also slow. But this is more work, harder to review and I don't care about performance for MSVC. Moreover, they have a different asm syntax...
  • Blacklist the 32-bit code for 32-bit MSVC. The drawback is that this leaves us with no option there, and people will probably just comment out the blacklisting. And MSVC is indeed used, for example it's a Travis-tested target for rust-secp256k1.
  • Do nothing (and blame the compiler). But I don't think that's clever given that this PR is simple. Of course there are many other non-constant-time multiplication issues on different platforms, but I don't think that's a reason to ignore this one.

In general, what do people think about pointing out in the README that the library is supposed to be portable but is tested mostly on gcc and clang, and that it's therefore recommended to compile it there if possible? This sounds a little bit like "Optimized for Netscape Navigator" but please read https://blog.mozilla.org/nfroyd/2018/05/29/when-implementation-monoculture-right-thing/ for why Mozilla dropped MSVC and how nice clang-cl is as a drop-in replacement for MSVC.

This is WIP because I need to add detailed comments, and I first want to see what people think.

edit: And I had a godbolt environment to play around with this but I lost it. I'll make a new one if people are interested.

@gmaxwell
Contributor

gmaxwell commented Jan 11, 2020

It might be better to use the VERIFY_BITS macros in the scalar code, instead of an unstructured VERIFY_CHECK.

With a fair amount of effort previously, I extracted our 32-bit field multiply instructions and verified them, and (I think) the VERIFY_BITS statements, to be free of overflow in frama-c. If this used the same macros, a future redo would pick up the static analysis for free. I got discouraged before because frama-c didn't support __int128, but presumably it (or another tool) eventually will, and it might be reasonable to automate running range analysis on the C-language multiplier functions.

My main complaint with making it not conditional is that your fix isn't free. (At least I assume it isn't from a glance, haven't tested, if it's not actually a benchmarkable difference on GCC ignore the rest of this comment).

Since it's only a few lines, it wouldn't be too messy to make conditional. As far as testing goes, this is something we should be able to have CI test. MSVC is a common enough compiler that it's probably worth doing that regardless.

I wish I saw a way to detect any more of these being added to the codebase. I could detect 64x64->64 on secret inputs in a modified valgrind... but couldn't automatically exclude ones where the high words will not be zero. It's nice to document that these are the only 64-bit output multiplies that aren't really 32x32->64... as that fact might be really useful for anyone trying to make a SIMD version of these functions.

@real-or-random
Contributor Author

My main complaint with making it not conditional is that your fix isn't free. (At least I assume it isn't from a glance, haven't tested, if it's not actually a benchmarkable difference on GCC ignore the rest of this comment).

I had verified this statement in godbolt:

The costs are two additional shift instructions for shifting the product back to right, for each field element multiplication and doubling.

I haven't run benchmarks yet but I doubt that two shifts are a benchmarkable difference.

@real-or-random real-or-random marked this pull request as ready for review March 24, 2020 16:43
@real-or-random real-or-random changed the title WIP Make 64x64->64 bit multiplications constant-time with MSVC on 32bit x86 Make 64x64->64 bit multiplications constant-time with MSVC on 32bit x86 Mar 24, 2020
@sipa
Contributor

sipa commented Mar 24, 2020

My preference is doing this with conditional compilation for MSVC. It's probably a trivial performance hit anyway, but it's also an unfortunate complication for someone who doesn't care about MSVC trying to audit the code.

You're right that this complicates testing, but I think that's simply a side effect of MSVC being undertested as a platform. That is already a problem, and this is exposing it. Bitcoin Core runs the secp256k1 tests on MSVC via AppVeyor; I'm sure people would be willing to help set up an AppVeyor instance for the secp256k1 repo too.

@real-or-random
Contributor Author

This is ready for review.

Here's a Compiler Explorer instance to play around with this:
https://godbolt.org/z/58L7rR

You can recreate this by pasting the following files, commenting out the now unresolved #includes, and removing SECP256K1_INLINE and static from all functions you're interested in.

  • include/secp256k1.h
  • src/util.h
  • src/field_10x26.h
  • src/field_10x26_impl.h
  • src/scalar_8x32.h
  • src/scalar_8x32_impl.h

@real-or-random
Contributor Author

My preference is doing this with conditional compilation for MSVC. It's probably a trivial performance hit anyway, but it's also an unfortunate complication for someone who doesn't care about MSVC trying to audit the code.

Hm, I can make it conditional but I'm not super convinced that this is better. Making it conditional increases the burden for reviewers who care about multiple platforms. I think my comment is detailed enough to audit these 4 multiplications quickly, and this is probably < 1 % of the work you're doing if you're auditing this code.

@real-or-random
Contributor Author

real-or-random commented Apr 2, 2020

@jonasnick figured out that my calculations are off: This only ensures that the upper 48 bits are non-zero (instead of the upper 32 bits). So the approach doesn't actually work, oops. Marking this as WIP for now, I need to think about this.

@real-or-random real-or-random changed the title Make 64x64->64 bit multiplications constant-time with MSVC on 32bit x86 WIP: Make 64x64->64 bit multiplications constant-time with MSVC on 32bit x86 Apr 2, 2020
@real-or-random real-or-random marked this pull request as draft April 22, 2020 10:15
@real-or-random real-or-random changed the title WIP: Make 64x64->64 bit multiplications constant-time with MSVC on 32bit x86 Make 64x64->64 bit multiplications constant-time with MSVC on 32bit x86 Apr 22, 2020
@real-or-random real-or-random marked this pull request as ready for review April 22, 2020 15:17
@real-or-random
Contributor Author

Okay, I updated this to use a different approach.

Also I made this conditional on MSVC. I'm still not convinced that this is better but I seem to be the only one who argues for making it unconditional.

@real-or-random
Contributor Author

rebased

@real-or-random
Contributor Author

I discussed this with Thomas Pornin (BearSSL) and he pointed out that

The proof can be made simpler and more "obvious" in two different
ways:

  • With maths:

    "and 2^n-1" is equivalent to "mod 2^n". Thus, given a and b,
    you want a*b mod 2^n. "b or 2^n" is really b + 2^n, which is
    equal to b modulo 2^n, and thus the result is correct.

  • By looking at carries:

    Multiplication is really a bunch of left-shifts and additions
    (that's the "schoolbook method"). When you modify a bit at
    rank k, it can impact only bits at rank k or higher, because
    carry propagation in additions is only to the left. So, setting
    bit 63 in an operand cannot modify bits 0-62 in the result.
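Both arguments can be checked with a few lines of C. This is a minimal sketch with hypothetical helper names, and it assumes n = 63, i.e. that only bits 0-62 of the product are used:

```c
#include <stdint.h>

/* Mask selecting bits 0-62, i.e. reduction modulo 2^63. */
#define LOW63_MASK ((UINT64_C(1) << 63) - 1)

/* Hypothetical helper: low 63 bits of a*b, computed with bit 63 of b
 * forced to 1 so that the high half of the right operand is never zero. */
static uint64_t mul_low63_forced_high(uint64_t a, uint64_t b)
{
    return (a * (b | (UINT64_C(1) << 63))) & LOW63_MASK;
}

/* Reference: low 63 bits of the plain product. */
static uint64_t mul_low63_plain(uint64_t a, uint64_t b)
{
    return (a * b) & LOW63_MASK;
}
```

The two functions agree for all inputs, because setting bit 63 of b adds a * 2^63 to the product, which can only change bit 63 and above.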

He also mentioned that a different, proper way would be to simply provide our own (inlinable) implementation of 64x64->64 bit multiplication. That's true. So far I haven't chosen this approach because 1) it would touch more of the existing code, and 2) we don't care about MSVC performance. But if we would anyway enable the workaround proposed in this PR conditionally only on MSVC, and with bitcoin/bitcoin@162d003 in mind (which is totally not aware of this issue), maybe we should just fix it the proper way.

@gmaxwell
Contributor

way would be to simply provide our own (inlinable) implementation of 64x64->64 bit multiplication. That's true. So far I haven't chosen this approach because 1) it would touch more of the existing code

One way to do that is to make a MUL_32_32_64 macro, and on msvc it gets converted to an inlinable function call and on everything else it just gets converted to the existing code. This would aid portability to 32-bit platforms that don't have a 64-bit type. Review would be straightforward in part because the object code on non-MSVC platforms would be completely unchanged.
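A minimal sketch of such a macro follows. The choice of the `__emulu` intrinsic for the MSVC branch and the helper name `mul_32_32_64_impl` are assumptions made for this illustration; the thread does not say what the MSVC-side function would contain.

```c
#include <stdint.h>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
/* MSVC branch (assumption): use the __emulu intrinsic, which performs an
 * unsigned 32x32->64 multiplication. */
static __inline uint64_t mul_32_32_64_impl(uint32_t a, uint32_t b)
{
    return __emulu(a, b);
}
#define MUL_32_32_64(a, b) mul_32_32_64_impl((a), (b))
#else
/* Everything else: expand to the plain expression, so the generated
 * object code stays the same as with the hand-written multiplication. */
#define MUL_32_32_64(a, b) ((uint64_t)(uint32_t)(a) * (uint32_t)(b))
#endif
```

A call site would then read `c += MUL_32_32_64(x, y);` in place of `c += (uint64_t)x * y;`.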

[The related "doesn't have a 128-bit type" problem has prevented me from using various static analysis tools on the 64-bit code.]

@real-or-random
Contributor Author

[The related "doesn't have a 128-bit type" problem has prevented me from using various static analysis tools on the 64-bit code.]

Yeah, and there's the _umul128 intrinsic for this on MSVC (https://docs.microsoft.com/en-us/cpp/intrinsics/umul128?view=vs-2019). It has a highly intuitive syntax. (Why didn't they simply provide a 128 bit type...?)

So we could "solve" this one, too.
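For illustration, a 64x64->128 wrapper along these lines might look as follows. The function name `mul_64_64_128` and the three-way dispatch are assumptions of this sketch, not something proposed in this thread; `_umul128` is only available when targeting x64.

```c
#include <stdint.h>

#if defined(_MSC_VER) && defined(_M_X64)
#include <intrin.h>
/* MSVC on x64: _umul128 returns the low half and stores the high half. */
static __inline void mul_64_64_128(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    unsigned __int64 h;
    *lo = _umul128(a, b, &h);
    *hi = h;
}
#elif defined(__SIZEOF_INT128__)
/* gcc/clang with a native 128-bit integer type. */
static inline void mul_64_64_128(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    unsigned __int128 p = (unsigned __int128)a * b;
    *lo = (uint64_t)p;
    *hi = (uint64_t)(p >> 64);
}
#else
/* Portable fallback: schoolbook multiplication on 32-bit halves. */
static void mul_64_64_128(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t al = (uint32_t)a, ah = a >> 32;
    uint64_t bl = (uint32_t)b, bh = b >> 32;
    uint64_t ll = al * bl, lh = al * bh, hl = ah * bl, hh = ah * bh;
    /* Sum of the three contributions to bits 32..63; it cannot overflow. */
    uint64_t mid = (ll >> 32) + (uint32_t)lh + (uint32_t)hl;
    *lo = (mid << 32) | (uint32_t)ll;
    *hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}
#endif
```

The out-parameter style mirrors the intrinsic and keeps the interface usable on compilers without a 128-bit type.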

@sipa
Contributor

sipa commented Sep 9, 2020

ACK but needs rebase.

@real-or-random
Contributor Author

I'm leaning towards abandoning this PR and solving the issue using a separate 32x32->64 routine, as discussed above. @sipa What do you think?

@sipa
Contributor

sipa commented Sep 10, 2020

@real-or-random I don't follow, that seems like it's addressing something unrelated. The 32x32->64 multiplications aren't the problem, because (uint64_t)a*b works just fine in that case for MSVC. The problem is the 64x32->64 multiplications, where it doesn't seem we have a solution at all in general, except the trick used in this current PR which is only applicable if we only care about the bottom 63 bits of the output.

@peterdettman
Contributor

Just a heads-up that I have a rewrite of the 10x26 field mul in progress that contains only 32x32 multiplications, amongst other changes.

@real-or-random
Contributor Author

real-or-random commented Sep 10, 2020

@sipa Ok, things got a little messed up in this thread.

way would be to simply provide our own (inlinable) implementation of 64x64->64 bit multiplication. That's true. So far I haven't chosen this approach because 1) it would touch more of the existing code

One way to do that is to make a MUL_32_32_64 macro, and on msvc it gets converted to an inlinable function call and on everything else it just gets converted to the existing code.

This is where the confusion starts. When I read this for the first time (back in July), I assumed @gmaxwell meant to write "a MUL_64_64_64 macro". But now I apparently adopted this purported "typo"...

So there are two different issues.

a) 64x64->64 is not constant-time on MSVC

What I want to do in order to avoid the variable-time issue is to implement a MUL_64_64_64 macro or inlinable function for MSVC. This was suggested by Thomas Pornin, along with this canonical implementation:

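    #include <stdint.h>

    /* 64x64->64 multiplication built from 32-bit halves. */
    static inline uint64_t
    mul64x64to64(uint64_t x, uint64_t y)
    {
        uint32_t xl, xh, yl, yh;

        /* Split both operands into low and high 32-bit halves. */
        xl = (uint32_t)x;
        xh = (uint32_t)(x >> 32);
        yl = (uint32_t)y;
        yh = (uint32_t)(y >> 32);
        /* xh * yh only contributes to bits 64 and above, so it is dropped;
           the cross terms are only needed modulo 2^32. */
        return (uint64_t)xl * (uint64_t)yl
            + ((uint64_t)(xl * yh + xh * yl) << 32);
    }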

The three multiplications in this routine are all formally 32x32->64 muls. MSVC's optimizer detects that correctly and outputs normal mul and imul instructions. Even if MSVC implemented the multiplication using its llmul routine, the routine is constant-time over 32x32 inputs.

This is much more straightforward and also much faster than the solution in this PR.
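To see why the routine is correct, write x = xh*2^32 + xl and y = yh*2^32 + yl. Then x*y = xl*yl + 2^32*(xl*yh + xh*yl) + 2^64*xh*yh, and the last term vanishes modulo 2^64. A small self-test sketch (the name `mul64x64to64_selftest` and the chosen test values are mine) compares the routine with the native product on a few edge cases:

```c
#include <stddef.h>
#include <stdint.h>

/* Same routine as quoted above, repeated so this sketch is self-contained. */
static uint64_t mul64x64to64(uint64_t x, uint64_t y)
{
    uint32_t xl = (uint32_t)x, xh = (uint32_t)(x >> 32);
    uint32_t yl = (uint32_t)y, yh = (uint32_t)(y >> 32);
    return (uint64_t)xl * (uint64_t)yl
        + ((uint64_t)(xl * yh + xh * yl) << 32);
}

/* Returns 1 if the routine agrees with the native 64-bit product for all
 * pairs of the edge-case values below, and 0 otherwise. */
static int mul64x64to64_selftest(void)
{
    static const uint64_t v[] = {
        UINT64_C(0), UINT64_C(1), UINT64_C(0x3D10),
        UINT64_C(0xFFFFFFFF), UINT64_C(0x100000000),
        UINT64_C(0x8000000000000000), UINT64_C(0xFFFFFFFFFFFFFFFF)
    };
    const size_t n = sizeof(v) / sizeof(v[0]);
    size_t i, j;

    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            if (mul64x64to64(v[i], v[j]) != v[i] * v[j]) {
                return 0;
            }
        }
    }
    return 1;
}
```

This only checks functional equivalence; whether the generated code is constant-time still has to be confirmed by inspecting the compiler output.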

b) 32x32->64 and 64x64->128 macros

This is just an orthogonal issue. These macros would make our code work on 32-bit implementations where we don't have a 64-bit type, and on 64-bit implementations where we don't have a 128-bit type (such as MSVC). This would also improve performance on MSVC in cases where MSVC is not clever enough to figure out that a ->64 multiplication is in fact 32x32->64.

Just a heads-up that I have a rewrite of the 10x26 field mul in progress that contains only 32x32 multiplications, amongst other changes.

That sounds great and this rewrite would solve a) and at least the 32x32->64 / field part of b). Can you say more?

@gmaxwell
Contributor

Fast code that uses only 32-bit multiplies would also be a great prototype for a fast N-way SIMD version.

@peterdettman
Contributor

peterdettman commented Sep 10, 2020

That sounds great and this rewrite would solve a) and at least the 32x32->64 / field part of b). Can you say more?

Nothing too fancy on this score; the new code just avoids multiplying by the 64-bit accumulators.

Fast code that uses only 32-bit multiplies would also be a great prototype for a fast N-way SIMD version.

@gmaxwell My immediate target is an arbitrary-degree Karatsuba (https://eprint.iacr.org/2015/1247) rewrite for 10x26, which already has some potential for vectorisation within a single fe_mul. I hope you'll be able to benchmark it on the real hardware in due course.

EDIT: Maybe just to be super-clear, I'm simply using 32x32->64 multiplies.

@sipa
Contributor

sipa commented Sep 10, 2020

@real-or-random Understood now. Concept ACK on both.

@peterdettman
Contributor

#815 removes the problematic multiplies from the 10x26 field implementation.

@real-or-random
Contributor Author

#815 removes the problematic multiplies from the 10x26 field implementation.

Okay, closing here for now. Depending on the progress in #815, it may still be useful to have a temporary solution, but it's clear that it won't be this PR.
