Fboemer/faster decrypt #363

Merged: 3 commits, Sep 15, 2021

fboemer (Contributor) commented Jun 28, 2021

Optimize CKKS and (to a lesser degree) BFV Decrypt. Also fixes a BGV compilation error.

On ICX with clang-12:

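  • With HEXL=ON, CKKS decrypt shows:

| N     | Time before (us) | Time after (us) | Speedup |
|-------|------------------|-----------------|---------|
| 1024  | 1.70             | 1.42            | 1.20x   |
| 2048  | 4.09             | 3.69            | 1.10x   |
| 4096  | 12.5             | 8.56            | 1.46x   |
| 8192  | 76.1             | 60.4            | 1.26x   |
| 16384 | 381              | 216             | 1.76x   |

  • With HEXL=OFF, CKKS decrypt shows:

| N     | Time before (us) | Time after (us) | Speedup |
|-------|------------------|-----------------|---------|
| 1024  | 6.16             | 5.29            | 1.16x   |
| 2048  | 11.9             | 10.1            | 1.18x   |
| 4096  | 48.1             | 39.9            | 1.20x   |
| 8192  | 189              | 156             | 1.21x   |
| 16384 | 805              | 635             | 1.26x   |

  • With HEXL=ON, BFV decrypt shows:

| N     | Time before (us) | Time after (us) | Speedup |
|-------|------------------|-----------------|---------|
| 1024  | 27.8             | 26.6            | 1.04x   |
| 2048  | 62.0             | 61.6            | 1.01x   |
| 4096  | 135              | 136             | 0.99x   |
| 8192  | 474              | 451             | 1.05x   |
| 16384 | 1692             | 1561            | 1.08x   |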
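
For readers who want to reproduce numbers of this kind, here is a minimal timing sketch. It is not the benchmark used for the tables above (the PR does not say which harness produced them). The helper names `median_time_us` and `speedup` are hypothetical, and the sketch assumes a plain `std::chrono` wall-clock timer around whatever callable wraps the decrypt call.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: runs `fn` `reps` times and returns the middle
// wall-clock sample in microseconds (the upper median when `reps` is even).
template <typename Fn>
double median_time_us(Fn &&fn, int reps)
{
    assert(reps > 0);
    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(reps));
    for (int r = 0; r < reps; ++r)
    {
        const auto start = std::chrono::steady_clock::now();
        fn();
        const auto stop = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::micro>(stop - start).count());
    }
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];
}

// Speedup as reported in the tables: time before divided by time after.
inline double speedup(double before_us, double after_us)
{
    assert(after_us > 0.0);
    return before_us / after_us;
}
```

As a check against the tables, `speedup(381.0, 216.0)` evaluates to roughly 1.76, which matches the N = 16384 row for CKKS with HEXL=ON.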

fionser (Contributor) commented Jun 29, 2021

  1. A quick question: why would explicitly writing out the loop with tile = 1024 give better performance than a simple for-loop?
    In other words, why doesn't the compiler perform this optimization for us? :(
  2. Another question: did you try running this program in parallel on the same machine?
    I mean, when multiple SEAL programs run at the same time, does their data still get swapped in and out of the L1 cache?

WeiDaiWD (Contributor) commented

Thanks, Fabian! We are developing version 4.0.0 based on the BGV PR from Alibaba this summer. The progress is slow, like continental drift. :) For now, I'm refraining from releasing patches unless there is a serious bug. PRs will be merged with more delay than usual. Please be patient with me.

fboemer force-pushed the fboemer/faster-decrypt branch from deef261 to 75e0dda on August 3, 2021 19:13
fboemer (Contributor, Author) commented Aug 3, 2021

>   1. A quick question: why would explicitly writing out the loop with tile = 1024 give better performance than a simple for-loop?
>     In other words, why doesn't the compiler perform this optimization for us? :(
>   2. Another question: did you try running this program in parallel on the same machine?
>     I mean, when multiple SEAL programs run at the same time, does their data still get swapped in and out of the L1 cache?

@fionser, apologies for the delay on this.

  1. Interestingly, I'm seeing some varying speedup/slowdown with the tiling approach, so I've removed the tiling for now.
    (Screenshot attached, 2021-08-03.)

  2. I've only tested single-threaded. To my knowledge, the L1 cache is private to each core, so there should be no L1 cache interference between different threads of SEAL running on different cores. In general, I would expect that the L2 cache could cause some slowdown due to contention.
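
To make the tiling question concrete, here is a minimal sketch of what "tiled" versus "simple" loops can look like for a fused multiply-add modulo q, the kind of element-wise kernel that decryption relies on when it evaluates c0 + c1·s. This is not the code from this PR (the tiling was removed, as noted above) and it does not use SEAL's API. The function names, the modulus `kModulus`, and the default tile size are illustrative assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical modulus for illustration only: 2^31 - 1. It is small enough
// that the product of two reduced values fits in 64 bits without overflow.
constexpr std::uint64_t kModulus = 2147483647ULL;

using Poly = std::vector<std::uint64_t>;

// Untiled: two full passes over the arrays (multiply, then add).
// Inputs are assumed to be already reduced modulo kModulus.
void fma_mod_untiled(const Poly &c0, const Poly &c1, const Poly &s, Poly &dest)
{
    assert(c0.size() == c1.size() && c1.size() == s.size());
    const std::size_t n = c0.size();
    dest.resize(n);
    for (std::size_t i = 0; i < n; ++i)
        dest[i] = (c1[i] * s[i]) % kModulus;
    for (std::size_t i = 0; i < n; ++i)
        dest[i] = (dest[i] + c0[i]) % kModulus;
}

// Tiled: the same two passes, but run back to back on one tile at a time,
// so dest[begin, end) is touched again by the add pass shortly after the
// multiply pass wrote it.
void fma_mod_tiled(const Poly &c0, const Poly &c1, const Poly &s, Poly &dest,
                   std::size_t tile = 1024)
{
    assert(tile > 0);
    assert(c0.size() == c1.size() && c1.size() == s.size());
    const std::size_t n = c0.size();
    dest.resize(n);
    for (std::size_t begin = 0; begin < n; begin += tile)
    {
        const std::size_t end = std::min(begin + tile, n);
        for (std::size_t i = begin; i < end; ++i)
            dest[i] = (c1[i] * s[i]) % kModulus;
        for (std::size_t i = begin; i < end; ++i)
            dest[i] = (dest[i] + c0[i]) % kModulus;
    }
}
```

Both functions produce identical results; only the memory access order differs. A tile of 1024 64-bit values occupies 8 KiB per array. Whether the tiled order is actually faster depends on the cache sizes, the compiler, and the surrounding code, which is consistent with the varying speedup and slowdown reported above.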

WeiDaiWD merged commit 427c55f into microsoft:contrib on Sep 15, 2021
WeiDaiWD (Contributor) commented

Since this PR was made against the contrib branch, which includes the BGV code, I cannot easily merge it with 3.6.6. I applied your changes and named the commit as a pull request merge. So your commits won't be in the commit history, but your ID and the fork/branch/PR name will. I hope that is fine with you.

fboemer deleted the fboemer/faster-decrypt branch on November 3, 2021 18:24