Threadpool support for matrix multiplication #124

bcebere · 2020-07-29T19:52:02Z

Description

The current flow spawns a number of threads on every multiplication. This can lead to denial of service, so we should have a controlled way of creating new threads.

For that, this PR extends the current approach with:

A blocking queue and a thread pool implementation for executing the jobs.
Move the thread-creation responsibility into the TenSEAL context.
Rework the multiplication, which relied on globals/mutexes, by splitting the input into batches and using C++ futures/promises in the main thread.
Add Google Benchmark for measuring per-operation performance.
TODO: Comment the code.
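
For readers unfamiliar with the pattern, here is a minimal sketch of a blocking job queue feeding a fixed set of worker threads, with results handed back through `std::future`. The class name `ThreadPool` and its layout are assumptions made for illustration and are not taken from the TenSEAL sources; only the `enqueue_task` name mirrors the call seen in the diff.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical minimal pool; not TenSEAL's actual implementation.
class ThreadPool {
public:
    explicit ThreadPool(std::size_t n_threads) {
        assert(n_threads > 0);  // a pool without workers would never run a job
        for (std::size_t i = 0; i < n_threads; i++) {
            workers_.emplace_back([this] { run(); });
        }
    }

    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }

    // Queue a callable with its arguments and return a future for the result.
    template <typename F, typename... Args>
    auto enqueue_task(F&& f, Args&&... args)
        -> std::future<decltype(f(args...))> {
        using R = decltype(f(args...));
        auto task = std::make_shared<std::packaged_task<R()>>(
            std::bind(std::forward<F>(f), std::forward<Args>(args)...));
        std::future<R> result = task->get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push([task] { (*task)(); });
        }
        cv_.notify_one();
        return result;
    }

private:
    // Worker loop: block until a job is available or the pool is shutting down.
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // stopping and queue drained
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // run outside the lock so other workers can dequeue
        }
    }

    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> jobs_;
    std::mutex mutex_;
    std::condition_variable cv_;
    bool stopping_ = false;
};
```

Because the number of threads is fixed when the pool is constructed, a burst of multiplications queues up jobs instead of spawning new threads, which is the controlled behaviour the description asks for.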

fixes #118

Checklist

I have followed the Contribution Guidelines and Code of Conduct
I have commented my code following the OpenMined Styleguide
I have labeled this PR with the relevant Type labels
My changes are covered by tests

youben11

Amazing work Bogdan! Threadpool is gonna serve us a lot, and I love the benchmarks.

README.md

tenseal/cpp/context/tensealcontext.h

tenseal/cpp/tensors/ckksvector.h

youben11 · 2020-07-30T07:37:14Z

tenseal/cpp/tensors/utils/matrix_ops.h

+    if (tenseal_context->get_concurrency() == 1)
+        return worker_func(0, vector_size);
+
+    size_t thread_cnt = tenseal_context->get_concurrency();
+    std::vector<std::future<Ciphertext>> future_results;
+    size_t batch_size = (vector_size + thread_cnt - 1) / thread_cnt;
+
+    for (size_t i = 0; i < thread_cnt; i++) {
+        future_results.push_back(tenseal_context->dispatcher()->enqueue_task(
+            worker_func, i * batch_size,
+            std::min((i + 1) * batch_size, vector_size)));
+    }
+
+    for (size_t i = 0; i < tenseal_context->get_concurrency(); i++) {
+        tenseal_context->evaluator->add_inplace(result,
+                                                future_results[i].get());
+    }
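
A note on the range arithmetic above, as a sketch rather than a proposed change. With ceiling division, the start of a late batch can land past the end of the input: for `vector_size = 5` and 4 threads, `batch_size` is 2 and the last task would receive the range 6 to 5. Whether that matters depends on how `worker_func` treats such a range, which this excerpt does not show. The hypothetical helper `split_ranges` below clamps both ends and drops empty ranges; its name and signature are assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: split [0, vector_size) into at most n_jobs
// half-open ranges [begin, end). Both ends are clamped to vector_size,
// so jobs that would start past the end are dropped.
std::vector<std::pair<std::size_t, std::size_t>> split_ranges(
    std::size_t vector_size, std::size_t n_jobs) {
    assert(n_jobs > 0);
    // Ceiling division, as in the diff above.
    std::size_t batch_size = (vector_size + n_jobs - 1) / n_jobs;
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    for (std::size_t i = 0; i < n_jobs; i++) {
        std::size_t begin = std::min(i * batch_size, vector_size);
        std::size_t end = std::min((i + 1) * batch_size, vector_size);
        if (begin < end) ranges.emplace_back(begin, end);
    }
    return ranges;
}
```

Each returned pair could then be passed to `enqueue_task` in place of the inline `i * batch_size` arithmetic, with one future collected per non-empty range.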


I like this new design without thread synchronization! But I'm wondering about the distribution of ranges: specifying each thread's range in advance vs. letting the threads work through the input collectively until it is exhausted. The former needs no synchronization, but it may not work well when threads don't run equitably: if one thread runs slower than the others, everyone still waits for it to finish its range even though the other threads could have finished it. The latter needs synchronization, but keeps all threads working until the end. This is just my imagination; I don't know how it translates in practice. Do you have any thoughts on this? Of course, without synchronization we can scale to many threads without fear of synchronization becoming a bottleneck.
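
To make the second alternative concrete, here is one possible reading of "computing collectively until the end", sketched with plain integers standing in for ciphertexts. Workers repeatedly claim small chunks from a shared atomic cursor, so a slow worker simply claims fewer chunks. The function name `dynamic_sum` and the `chunk` parameter are assumptions for illustration and do not appear in the PR.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical sketch: sum `data` with workers that claim chunks dynamically.
// The only shared mutable state is the atomic cursor `next`.
long long dynamic_sum(const std::vector<int>& data, std::size_t n_threads,
                      std::size_t chunk) {
    assert(n_threads > 0 && chunk > 0);
    std::atomic<std::size_t> next{0};
    std::vector<long long> partial(n_threads, 0);  // one slot per worker
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < n_threads; t++) {
        workers.emplace_back([&, t] {
            for (;;) {
                // Claim the next unprocessed chunk; stop once the input is exhausted.
                std::size_t begin = next.fetch_add(chunk);
                if (begin >= data.size()) break;
                std::size_t end = std::min(begin + chunk, data.size());
                for (std::size_t i = begin; i < end; i++) partial[t] += data[i];
            }
        });
    }
    for (auto& w : workers) w.join();
    long long total = 0;
    for (long long p : partial) total += p;
    return total;
}
```

The cost is one atomic increment per chunk, so a larger `chunk` means less contention but coarser balancing; the static split in the diff is the limiting case of one chunk per thread.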

So, you're saying that this loop might get stuck at 0, let's say, waiting for a slow thread, while the others finished the job, right?

Windows has a mechanism for waiting on multiple objects efficiently. Unfortunately, I cannot find anything similar in the standard C++ library, but I'll research it. I was looking for a solution here too; it's not ideal to loop over the futures and do a blocking wait.

I could experiment with more designs. One other idea was to add the results to a blocking queue, and notify the main thread using a condition variable. That might schedule the operations more efficiently.
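
A rough sketch of that idea, under stated assumptions: workers push their results into a blocking queue, and the main thread pops them in completion order instead of waiting on futures in index order. `ResultQueue` and `sum_in_completion_order` are hypothetical names, and integers stand in for ciphertexts.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical completion queue: producers push, the consumer blocks in pop().
template <typename T>
class ResultQueue {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(std::move(value));
        }
        cv_.notify_one();  // wake the consumer if it is waiting
    }

    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !items_.empty(); });
        T value = std::move(items_.front());
        items_.pop();
        return value;
    }

private:
    std::queue<T> items_;
    std::mutex mutex_;
    std::condition_variable cv_;
};

// Adds up the squares 1^2 .. n_jobs^2 in whichever order the workers finish.
long long sum_in_completion_order(std::size_t n_jobs) {
    ResultQueue<long long> results;
    std::vector<std::thread> workers;
    for (std::size_t i = 1; i <= n_jobs; i++) {
        workers.emplace_back(
            [&results, i] { results.push(static_cast<long long>(i * i)); });
    }
    long long total = 0;
    // Exactly one result arrives per job, so pop n_jobs times.
    for (std::size_t i = 0; i < n_jobs; i++) total += results.pop();
    for (auto& w : workers) w.join();
    return total;
}
```

With this layout the main thread can start accumulating as soon as any worker finishes, at the price of a mutex acquisition per result.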

The current design is great; I don't think we should jump directly into a different one. I just wanted to give it some thought. But this is definitely something I would merge.

I'll make a test with a condition variable and a mutex, to see how the benchmarks look that way.

I did a test using a mutex and a condition variable to notify the main thread (instead of waiting for a std::future), but the benchmarks look pretty similar.

I think this needs better load testing; I am not sure how to simulate the scenario.

Maybe a random sleep in each worker would show the performance difference (condition variable vs. future.get()).
I have the git stash anyway, so I can try to add a test in another PR.

youben11

LGTM! I just think the API for matmul with n_jobs would be great

vec.matmul(matrix, n_jobs=j) with:

0 being the default (which uses the current number of threads in the threadpool)
j > 0: first verify that the current number of threads is greater than or equal to j, then run only j jobs
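
The rule proposed above could be sketched on the C++ side as follows. `resolve_n_jobs` is a hypothetical helper, not a function from the PR, and throwing on an oversized request is only one option; the merged code may handle that case differently.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical helper: map a requested n_jobs onto a pool of pool_size threads.
std::size_t resolve_n_jobs(std::size_t requested, std::size_t pool_size) {
    assert(pool_size > 0);
    if (requested == 0) return pool_size;  // default: use every thread in the pool
    if (requested > pool_size) {
        // Assumption: reject the request instead of silently clamping it.
        throw std::invalid_argument(
            "n_jobs exceeds the number of threads in the pool");
    }
    return requested;
}
```

Clamping to `pool_size` instead of throwing would also be reasonable, since a pool can queue more jobs than it has threads; the extra jobs just wait their turn.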

tenseal/binding.cpp

youben11

LGTM! Just one last thing: is it okay to create more jobs than there are threads? I thought we must throw an error there... not sure though.

* add benchmarks for mamul_plain operation
* add threadpool implementation

Co-authored-by: Ayoub Benaissa <ayouben9@gmail.com>

bcebere added 12 commits July 29, 2020 12:42

add benchmarks for mamul_plain operation

9bfa6af

add initial threadpool implementation

bf4e5c0

update readme

ea409e2

add benchmark min time

fb846c7

cleanup

3a2647d

update benchmarks

c5c149e

cleanup benchamrks

fb67895

update benchmarks

91c8933

cleanup, small improvement

f0e24d8

lock-free multiplications

c26c767

cleanup

2d2330e

bugfixing

e8e3a0d

bcebere requested review from youben11 and philomath213 July 29, 2020 19:52

youben11 added the Type: Improvement 📈 Performance improvement not introducing a new feature or requiring a major refactor label Jul 29, 2020

update docs

ecc868c

bcebere added the Type: Refactor 🔨 A complete overhaul of a file, feature, or codebase label Jul 29, 2020

youben11 reviewed Jul 30, 2020

bcebere and others added 4 commits July 30, 2020 10:50

Update README.md

2d1762e

Co-authored-by: Ayoub Benaissa <ayouben9@gmail.com>

dispatcher: always fallback on get_concurrency

07e39a6

fix tests

80132d4

Merge branch 'threadpool' of https://github.com/OpenMined/TenSEAL int…

78da80f

…o threadpool

youben11 requested changes Jul 30, 2020

bcebere added 3 commits July 30, 2020 13:05

add configurable number of batches

9ba57b7

batch_count -> n_batches

500f843

add basic docs

97559af

youben11 requested changes Jul 30, 2020

tenseal/binding.cpp (outdated, resolved)

bcebere requested a review from youben11 July 30, 2020 11:22

n_batches -> n_jobs

aac6032

youben11 approved these changes Jul 30, 2020

bcebere added 4 commits July 30, 2020 15:27

update gtest

355ab44

bazel workflow issue

3786587

workflow issue

03ab3cd

workflow timeout

d35639c

bcebere merged commit 5d64a98 into master Jul 30, 2020

delete-merged-branch bot deleted the threadpool branch July 30, 2020 14:56
