-
Notifications
You must be signed in to change notification settings - Fork 147
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Alternative to uint128 #1560
Comments
(Some years ago, before Fiat outputted nice standalone files, Andres and I tried this and we instead crashed MSVC. But that was an older MSVC and Fiat's output has since changed, so I'm hopeful that it'll work now.) |
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
Correction: code generated using |
Assuming we want to match https://boringssl.googlesource.com/boringssl/+/aa31748bc84f0cd499e4b0337bf929b132aab1cc, we could add a fiat-crypto/src/Stringification/C.v Lines 116 to 166 in 457cb5e
to include
and emits the relevant definitions for the primitives (presumably this is easiest to do if I add a new primitive ident for |
Hm, but it's not entirely clear how this should be structured. Maybe we should add support for each backend to describe a collection of "here's the string body of this function", to be used if the users request |
Also not wedded to the current the structure if something else be more convenient on the fiat side. The main nuisance is that, when Andres and I tested it, generating MSVC-style code[*] for u128-capable targets produced dramatically worse code for Clang and GCC. It seems compilers are not very good at switching the u64 pairs back into a u128, and having them separate impedes some kind of analysis. :-( Though it occurs to me we never tried teaching MSVC to fit the u128 APIs. That is, on MSVC, we could define u128 as a [*] That is, we tried implementing the "intrinsics" using u128 and then splitting the u128s into pairs of u64s to use a single abstraction. Naively, one would think this works because a u128 needs to be split into u64s by the compiler anyway! |
This is what we do in libsecp256k1. See files https://github.com/bitcoin-core/secp256k1/blob/master/src/int128_struct_impl.h, https://github.com/bitcoin-core/secp256k1/blob/master/src/int128_struct.h and https://github.com/bitcoin-core/secp256k1/blob/master/src/int128.h. This uint128 struct implementation was formally proven correct in VST by @roconnor-blockstream (at least when it was initially added), see bitcoin-core/secp256k1#1000 (comment). However, we don't have the proofs in our repo, and we don't check them continuously or adapt them to code changes. Perhaps @roconnor-blockstream maintains an up-to-date version? Edit: Maybe adding code that uses a generic interface such as in https://github.com/bitcoin-core/secp256k1/blob/master/src/int128.h will be a good first step, even without including a (formally verified) implementation. As I understand, it would not really worsen the guarantees of fiat-crypto, because the conversion to strings is anyway not covered by the proofs. This would help us move forward with the integration in libsecp256k1. |
Running fiat-crypto with
I would be happy to assist with all required fiat-crypto adaptations and proofs of polyfills, but I would very much appreciate if somebody else did the job of hunting down appropriate incantations to get fast full addition, subtraction, multiplication, and cmov for 64-bit (and 32-bit?) numbers for different targets. I propose we'd then
|
Here's a mashup of fiat-crypto secpk256k1 C code with uint128 only used to implement full 64-bit arithmetic: https://gist.github.com/andres-erbsen/7dd3ce006f7361840225f081bcdd9ad7 This should be a good starting point for experimentation with intrinsics and such. |
Ah that could also be worth trying out. We know that Clang and GCC generate worse code when MSVC's free-floating u64s are implemented in terms of u128, but we didn't try the converse. If u128 is implemented using a struct in just MSVC, presumably that won't impact GCC/Clang and hopefully MSVC can promote the struct to registers. Then we could do it in one file. |
Indeed, polyfilling u128 sounds good as well. In case we do both, |
Restricting this to MSVC is what we did, and it works rather well.
At least the docs are good: https://learn.microsoft.com/en-us/cpp/intrinsics/x64-amd64-intrinsics-list?view=msvc-170 I think we could help here in general, but I don't have the time right now. |
I think these are all relevant intrinsics on x64 MSVC for this code:
If the goal is polyfilling u128, these are interesting, too:
And maybe these are interesting in general, I don't know:
Would it be feasible to generate a mashup of secp256k1 C that uses functions like |
This should be pretty easy on the Fiat Crypto codegen side: add an option |
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
One detail to keep in mind: the natural mul function for u128 is not u128 x u128 -> u128, but u64 x u64 -> u128. That's the operation you get from the CPU and what MSVC gives you intrinsics for. (Indeed we're currently relying on GCC and Clang to recognize the lower bound on the casted inputs to generate fast and constant-time code.) Addition is a bit more flexible, but if there are a few variants where the inputs have tighter bounds, it might help the compiler along slightly, depending on how smart MSVC is. That's probably a place to iterate on. Also, since u128 will be abstracted, the casts will also need to go through functions. |
Hi,
I'd just kindly wanted to bump this and ask what the status of the rewrite pass I feel, if we want to advance on bitcoin-core/secp256k1#1319 (using Fiat-Crypto generated C code in Bitcoin-core), we would need this. Thanks! |
No, I haven't had a chance to work on this, and don't expect to have time soon. @andres-erbsen are you working on this? |
No, and I don't understand your suggestion. What does |
I sketched up a bedrock2-style version that passes uint128 by address at https://godbolt.org/z/qe5veE4dE and it seems that clang allocates a needless stack frame in that style. Passing uint128 by pointer (or probably by value) does not have that issue, but that is not straightforward to support in Bedrock2. Thus the choice we make here potentially has implications for the goal of replacing the unverified C backend with an (improved) Bedrock2 backend. |
Probably |
I am further leaning against designing fiat-crypto around uint128 after learning more about it. Yes, it's currently 10% faster than our next best alternative on clang, but it's also nonstandard without a stable ABI and would require legwork to support in a more rigorous Bedrock2-based C backend. @dderjoel would you be willing to benchmark and investigate how well the approach based on mulx-style intrinsics works for secp256k1?
|
This comment was marked as duplicate.
This comment was marked as duplicate.
It is actually part of the cryptopt generated code. It does not segfault, when I do not specify the git clone https://github.com/dderjoel/secp256k1
cd secp256k1
git checkout only-asm
./autogen.sh
CC=clang CFLAGS='-march=skylake' ./configure --with-asm=x86_64
make clean bench_ecmult
./bench_ecmult It segfaults at the
I believe it is because I use
Using only
|
Oh, you mean our benchmark segfaults, not clang itself! That could be as well be a bug on our side... |
Just updated the comment. That also means that we should not merge the code at the moment. There is no tests in the CI using |
Yep I think that Also does libsecp256k1 support building with |
that should be okay, no? it's wrong it its based off |
Yep, I think this is llvm/llvm-project#58528. When I compile with
which corresponds to the line
in the source. (This line: https://github.com/dderjoel/secp256k1/blob/c010c156424d5c06f9b6d16a839e04bbea5b31c1/src/third_party/field_5x52_asm_impl.h#L35 ) Look at the "addresses" of the variables here:
Indeed, it seems that using (Happy to move this discussion to the PR, it's a bit off-topic here.) |
Okay, thanks for investigating. (extended answer moved to PR) But let me quickly come back on topic of uint128's. Based on Andres'
I feel we want something like the mashup, as the default and if one wants, (Sorry for getting off topic again) |
Yeah that sounds good, we can keep uint128 behind a flag in cases its already implemented, but off by default. The Bedrock2 backend never used uint128 outside its mulx function. In the glorious future where the Bedrock2-based backend becomes the default, we may consider dropping the wholly unverified C backend. I would appreciate performance numbers, but no rush. I'm out for a much-needed vacation starting this weekend so probably I won't look at anything until two weeks from now. If I was doing bitcoin-core/secp256k1#1319 I would rerun fiat-crypto with the command at the top of the benchmark file (here) and diff the output with the benchmark file, then flag any differences for manual review (I hope it's just the |
Hm, I see. I can't speak for the other contributors, but I think this makes fiat-crypto the C output much less attractive for libsecp256k1. The reasons are:
I think one possible way forward for us is to use fiat-crypto's C output (with uint128) and rewrite it manually to use our generic uint128 interface. Since the backend is not formally verified, this would not forego any formal guarantees. The result of the rewriting should be equivalent to our code. In fact, if it turns out to be equivalent, then we don't even need to replace our code. (That sounds a bit silly or even sad then, but it still would give us a lot of confidence that our code implements a correct algorithm.) |
Thank you for the feedback. I am thinking whether there's something we can do along the lines of what you suggested but less manual. The Coq code that prints the C you would be rewriting is here and if only it had types there it could just be changed to print uint128 operations as function calls. For multiplication it would probably suffice to check whether one argument has a cast to uint128, and maybe a similar recursive check would work for uint128 add, but I am not sure. If this sounds like it would do the thing for you, I am happy to give it a try. On what architectures do you care about performance with GCC? |
Okay, yes, this is what I brougt up earlier, and this would, of course, be optimal for us.
This is probably not a very focused answer, but I think the story here is that aim for decent performance on GCC and clang. Official Bitcoin Core builds use GCC for Linux and Windows, and clang for macOS, which is something we keep in mind in libsecp256k1, but we really try to offer a portable library that works well more or less everywhere, and don't want to force our users to use a specific compiler. Architectures where we really care about performance are x86_64 and maybe ARM64. Hardware wallets are also interesting, but the landscape there is a bit more complicated (some are ARM32 for example.) |
Ok I started a prototype to print generic uint128 code at andres-erbsen@3626fec and kicked off a compile. This exact change is not enough, but if it does the right thing for the handled case, more of the same should be straightforward. I forget what the expected compilation time for secp256k1 in fiat-crypto is, but seconds wasn't enough. (The point in comment you link to still misses me, but alas, I think I now understand what you would like to happen here.) The performance answer is along the lines of what I was hoping for, thanks. Just to clarify: is performance with GCC on x86-64 important or not given that there is also an assembly implementation? |
I'm back with more numbers. I've fixed the segfault in
|
Good point. Yeah, if you ask me, I think it's reasonable to care a bit less about the performance of the C code on x86-64 given that we'll have assembly there. |
Yep, looks great. As far as I can understand the diff, this is what I had in mind. |
Hi everyone, just a quick follow up question on what the status is here. |
I am back and I should work on it more. That build never finished, so it'll be a fun debugging session. |
master...andres-erbsen:fiat-crypto:uint128-extern generates C code whose compilation only errors on |
Alright, cool. If I remember correctly, we wanted to use this implementation for field mul in combination with the already verified primitives (u128 \gets u64*u64, etc), right? I'm trying to incorporate that, and then see what works and what does not. I thought I could for now just use search and replace to get the result symbol in there as the first argument, but that is a bit too tricky with all those implicit values in eg. u128_add_u128_u128(&x4, Then I see that in Bitcoin's version we only have the Also I do not see an void u64_and_u128_u64(uint64_t *r, const secp256k1_uint128 *a, u64 b){
uint64_t t1 = secp256k1_u128_to_u64(a);
*r = t1 & b;
} work? |
You can use the assignment operator to copy:
looks correct to me |
How about the following? secp256k1_u128 u128_mul_u64_u64(uint64_t a, uint64_t b) {
secp256k1_u128 r;
secp256k1_u128_mul(&r, a, b);
return r;
} |
AFAICT, libsecp256k1 never returns structures from functions. I'm not entirely sure if this is deliberate or not, but I can add, FWIW, that VST is unable to reason about functions that return structures or take structures as arguments by copy or copying structures by assignment, none of which happen in libsecp256k1 that I am aware of. VST being https://vst.cs.princeton.edu/, the tool I've been using to reason about C code. |
My (fallible) recollection is that the VST limitation originates from a similar limitation CompCert used to have. Given how simple these wrappers are, I think it is fine to not prove anything about them for now. Is there another reason not to pass structs by value? |
The derailing of VST would be more than just the wrappers, Any function that calls these wrappers also couldn't be reasoned about. That said, libsecp256k1 didn't conform to this restriction for the sake of VST; it was like that before I even started. Nor do I necessarily expect that libsecp256k1 maintain these restrictions just for the sake of VST, though it might be worth being at least aware of. |
Yeah, I don't think you'd want to use that wrapper anywhere outside fiat-crypto-generated code. Generating struct-by-pointer from the fiat-crypto C backend seems like it could be a rabbithole, so I'd rather not look into that unless there is a specific reason. As discussed above, we already have uint64s-by-pointer code and now the struct-by-value branch here. Adding struct-by-pointer support would be easier (and have proofs) in the bedrock2-based backend, but that's not mature enough yet for me to want to propose its use here. |
okay, inline secp256k1_uint128 u128_mul_u64_u64(uint64_t a, uint64_t b) {
secp256k1_uint128 r;
secp256k1_u128_mul(&r, a, b);
return r;
}
inline uint64_t u64_shr_u128(secp256k1_uint128 a, unsigned int n) {
secp256k1_uint128 r = a;
secp256k1_u128_rshift(&r, n);
return secp256k1_u128_to_u64(&r);
}
inline uint64_t u64_and_u128_u64(secp256k1_uint128 a, uint64_t b) {
return secp256k1_u128_to_u64(&a) & b;
}
inline secp256k1_uint128 u128_add_u64_u128(uint64_t a, secp256k1_uint128 b) {
secp256k1_uint128 r = b;
secp256k1_u128_accum_u64(&r, a);
return r;
}
inline secp256k1_uint128 u128_add_u128_u128(secp256k1_uint128 a, secp256k1_uint128 b) {
secp256k1_uint128 r = a;
// adding low b
secp256k1_u128_accum_u64(&r, b);
// adding high b
// ???
return r;
}
But I am missing something to add two u128: |
Edit: I stand corrected 👇 |
Here's how libsecp256k1 adds uint128 |
I've done it in https://github.com/dderjoel/secp256k1/blob/only-c/src/field_5x52_int128_impl.h#L60-L82 And updated the PR bitcoin-core/secp256k1#1319 |
The 64-bit fiat-c files rely on uint128, which doesn't work in MSVC. Instead, MSVC has its own intrinsics. (Irritatingly, they're different between x86_64 and aarch64. There's
_umul128
on x86_64 and then__umulh
on aarch64.)@andres-erbsen mentioned the
--split-multiret
exists which would get most of the way there, so if we could regenerate fiat-c with that, then we could probably, at least as a start, just patch out the resulting multiplication intrinsic and see how it works. And then perhaps, from there, figure out how to make the output do it automatically.The text was updated successfully, but these errors were encountered: