
Initial Intel HEXL integration #312

Merged · 3 commits · Apr 5, 2021
Conversation

fboemer
Contributor

@fboemer fboemer commented Mar 29, 2021

Initial integration with Intel HEXL (https://github.com/intel/hexl)

Co-authored-by: Gelila Seifu gelila.seifu@intel.com
Co-authored-by: Jeremy Bottleson jeremy.bottleson@intel.com

@fionser
Contributor

fionser commented Mar 30, 2021

@fboemer Great boost. HEXL saves ~50% computation time for my program :).
However, my project uses SEAL as a submodule, and it failed to build with HEXL as a submodule.
I had to build and install SEAL separately, and then build my own project.

Another question: is it possible to reuse HEXL's twiddle factor table via SEAL's NTTTables object? That seems like it could save a fair amount of memory.

@WeiDaiWD
Contributor

WeiDaiWD commented Mar 30, 2021

Great boost. HEXL saves ~50% computation time for my program :).

Can you share your CPU spec? Does it have AVX512IFMA?

@fionser
Contributor

fionser commented Mar 30, 2021

  • Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
  • Does not have AVX512IFMA
  • gcc version 7.2.1 20180104

CPU-Flags:

fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc eagerfpu pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch arat invpcid_single fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1

@WeiDaiWD
Contributor

Cool! It's remarkably faster even with avx512dq.

@WeiDaiWD
Contributor

@fboemer This is superb. Thanks for making SEAL faster. :)

A few things to add to this PR:

  • reinterpret_cast<const uint64_t *> and reinterpret_cast<uint64_t *> in polyarithsmallmod.h and polyarithsmallmod.cpp are not necessary, since we defined using CoeffIter = PtrIter<std::uint64_t *> in iterator.h.
  • Two CMake options, BUILD_PIC and BUILD_TESTING, are propagated into SEAL from Intel HEXL. It's not obvious to me how they end up there by eyeballing HEXL's CMakeLists.txt. Could you please figure it out and remove or hide them?
  • Could you please provide descriptions (of functionality and important arguments) to the three functions in intel_seal_ext.h?

Questions:

  • SEAL_ALIGNED_ALLOC calls malloc if size is not a multiple of alignment. What's the effect on performance here? My only worry is that a user can be falsely convinced that all allocations are aligned to 64 as long as SEAL_USE_ALIGN_64 is ON.
  • In HEXL's README, it says "Intel HEXL targets integer arithmetic with word-sized primes, typically 40-60 bits." BFV auxiliary primes are 61-bit. Does AVX512IFMA only work on 54-bit or less? What will happen if a prime is larger than the bound (60 or 54)?

Recommendations to Intel HEXL:

  • As fionser said, would it make sense for Intel HEXL to use the powers of root (specifically their allocation) generated by SEAL, or disable SEAL's NTT precomputation if SEAL_USE_INTEL_HEXL=ON? The former choice removes precomputation in the first call to intel::seal_ext::get_ntt.
  • When compiling your branch in debug mode (several warnings enabled), I got a long list of warnings. They fall into the following categories:
    • sign conversion
    • implicit int float conversion
    • shorten 64 to 32
    • c++14 binary literal: maybe Intel HEXL should use C++14 instead of C++11

fboemer added 2 commits April 2, 2021 12:28
Co-authored-by: Gelila Seifu <gelila.seifu@intel.com>
Co-authored-by: Jeremy Bottleson <jeremy.bottleson@intel.com>

Update to new HEXL

Remove unnecessary casts

Log options
@fboemer fboemer force-pushed the fboemer/hexl branch 2 times, most recently from 9ade672 to 3f5f8a9 Compare April 2, 2021 20:16
@fboemer
Contributor Author

fboemer commented Apr 2, 2021

Thanks for the feedback, @WeiDaiWD and @fionser.

A few notes:

  • @fionser , do you mind trying the submodule approach again and reporting the errors? We're happy to support this workflow if possible
  • We've removed the unnecessary reinterpret_casts. Thanks for pointing this out; it's a clean approach!
  • BUILD_PIC and BUILD_TESTING should no longer be leaking (they were stemming from some 3rd-party dependencies)
  • Within Intel HEXL, I've measured the 64-byte aligned allocations to yield ~5-7% speedup on the NTT. It looks like whenever we allocate memory for cipher/plaintexts, the allocation sizes will be a multiple of 64 bytes. So the current implementation should be efficient. If preferred, for other-sized memory allocations, we could use an approach similar to Boost, which will allocate extra memory.
  • We've added documentation for the intel_seal_ext.h
  • AVX512IFMA52 performs 52-bit integer arithmetic. We need a few extra bits in the NTT, so choosing coefficient moduli < 50 bits should suffice for best performance. For large primes, e.g. the auxiliary prime, HEXL will choose an AVX512DQ implementation, which still yields some speedup (as @fionser observed), but less than the IFMA52 approach. See Tables 1-4 in our arXiv paper for more information
  • Regarding NTT pre-computation: Intel HEXL uses two forms of pre-computation. One is based on Barrett factors floor(2^64/modulus) and today happens to have the same bit-scrambled order as SEAL's pre-computation. A second pre-computation vector is based on floor(2^52/modulus). In general, the pre-computation is not part of HEXL's public API. If it changes down the road, we don't want to break the SEAL integration, hence why we didn't use SEAL's pre-computed factors. Yes, we could omit the SEAL NTT tables pre-computation if you like. We thought the current implementation would be cleanest and safest (not breaking any programs that may rely on SEAL's pre-computed NTT tables.)
  • Thanks for pointing out the warnings. We've updated to Intel HEXL v1.0.0, which should resolve them. Note that we plan to update the HEXL v1.0.0 tag a few more times for minor README changes, but will freeze it once this PR is approved.

Also a note that this updated PR makes two more changes:

  • Runs pre-commit on all the files. Maybe my setup differs, but this led to a few minor changes
  • Adds STATUS messages for the CMake options. This makes it easier to tell which options are enabled when building from the command line.

@WeiDaiWD
Contributor

WeiDaiWD commented Apr 3, 2021

@fionser Does your program use BGV or BFV?

@WeiDaiWD
Contributor

WeiDaiWD commented Apr 3, 2021

@fboemer Everything looks good now. Would you let me know when you think v1.0.0 is stable enough to freeze the tag? I'm ready to merge this into SEAL at any time, so I might as well wait for your final commit hashes.

Side note:
I did some experiments with BFV. I set the prime bit limits to 49 and 50 and deleted two default parameters to avoid errors. Compared to your branch, BFV / EvaluateMultCt gets faster in my branch for larger parameters. I'm testing on an Intel® Xeon(R) Silver 4108 @ 1.80 GHz with GCC 7.5.0. You should try an IceLake processor. If it helps make BFV faster, I can try to make those prime size limits more easily configurable (without errors).

@fboemer
Contributor Author

fboemer commented Apr 5, 2021

@WeiDaiWD , we updated the tag to v1.0.1 after all. No more changes will be made to this tag, so feel free to do any final testing and merge.

Thanks for trying out the prime bit-width changes. I'll try them out on an IceLake processor and report my findings here.

Edit: findings below. Still need to investigate whether the AVX512IFMA52 instructions are actually being used with the smaller primes.

| Benchmark | HEXL=OFF | HEXL=ON (default) | HEXL=ON (smaller primes) |
| --- | --- | --- | --- |
| n=1024 / log(q)=27 / BFV / EvaluateMulCt/iterations:1000 | 489 | 327 | 292 |
| n=1024 / log(q)=27 / CKKS / EvaluateMulCt/iterations:1000 | 18.5 | 3.38 | 3.35 |
| n=4096 / log(q)=109 / BFV / EvaluateMulCt/iterations:1000 | 3384 | 2488 | 2322 |
| n=4096 / log(q)=109 / CKKS / EvaluateMulCt/iterations:1000 | 145 | 55.1 | 55.7 |
| n=8192 / log(q)=218 / BFV / EvaluateMulCt/iterations:1000 | 13198 | 9328 | 8601 |
| n=8192 / log(q)=218 / CKKS / EvaluateMulCt/iterations:1000 | 631 | 217 | 211 |
| n=16384 / log(q)=438 / BFV / EvaluateMulCt/iterations:1000 | 56597 | 40629 | 38734 |
| n=16384 / log(q)=438 / CKKS / EvaluateMulCt/iterations:1000 | 2557 | 902 | 901 |

@WeiDaiWD
Contributor

WeiDaiWD commented Apr 5, 2021

Cool. I'll just merge this. Let me know if you think it's important to update the tag/commit before SEAL's next release. Thanks!

@WeiDaiWD WeiDaiWD merged commit aa476c7 into microsoft:contrib Apr 5, 2021
@WeiDaiWD
Contributor

WeiDaiWD commented Apr 5, 2021

One little suggestion: when I build it on a Core i7-10700K, which does not have AVX512, I get the following warning:
warning: AVX512F vector return without AVX512F enabled changes the ABI [-Wpsabi]

Maybe you want to explain to users what it means.

@fionser
Contributor

fionser commented Apr 7, 2021

@fionser Does your program use BGV or BFV?

CKKS only

@fionser
Contributor

fionser commented May 10, 2021

@WeiDaiWD @fboemer

Interestingly, when using HEXL, BFV decryption slows down quite significantly.

/
| Encryption parameters :
|   scheme: BFV
|   poly_modulus_degree: 4096
|   #moduli: 2
|   #special_primes: 1
|   coeff_modulus size: 109 (59 + 50) bits
|   plain_modulus: 4194304 (23) bits
\

BFV decryption took 3.1 ms / 0.4 ms (with and without HEXL) on my machine (gcc version 7.2.1, Red Hat 7.2.0-5, Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz)

I tried turning off HEXL in polyarithsmallmod.cpp, but it did not help.
Do you know what is going on?

@fboemer
Contributor Author

fboemer commented May 11, 2021

@fionser, SEAL performs NTT pre-computations during configuration, while HEXL performs them on the first use of the NTT. So the first run of BFV decryption may be slower, but I would expect repeated runs (e.g. in the benchmark suite using 1000 iterations) with HEXL to be similar to or faster than the SEAL implementation. The default iteration count of 10 seems small enough that a slow first run with HEXL could skew the average runtime.

I just tested on a similar machine (avx512dq, but not avx512ifma) (with gcc-9) and don't see this degradation.

Removing the SEAL_USE_INTEL_HEXL guards selectively is a good way to debug the slowdown. You could try removing just the HEXL NTT integration to nail down where the slowdown is coming from.

As another note, I've intermittently seen some very strange slowdowns in the past, similar to https://stackoverflow.com/questions/42358211/adding-a-print-statement-speeds-up-code-by-an-order-of-magnitude. If this problem still persists, perhaps try compiling SEAL with -march=native?

By the way, you may wish to see if the degradation you observe persists in the latest version of HEXL: #332

@fionser
Contributor

fionser commented May 12, 2021

@fboemer Nice, thank you for the information.

@fboemer fboemer deleted the fboemer/hexl branch November 3, 2021 18:24