
Clarify expected speedup with NormalEstimationOMP #5721

Merged

Conversation

mynickmynick (Contributor)

I think this is a more correct comment for the tutorial (as a consequence of my test mentioned in #5710).

@mvieth (Member) commented May 25, 2023

While I am not against a rephrasing, your suggestion with n/m and 1<=m<n/2 is not very easy to understand and gives an unnecessarily large range (e.g. for 8 cores this would translate to 2-8 times faster), in my opinion.
Of course, if it says 6-8 times faster, one can find situations where the speed-up is worse, but I don't think anyone would take this as a guarantee? It is more meant as a "typical" value, to give the user a rough idea what to expect. And from what I can recall, I usually did get a speed-up between 6 and 8 on 8 cores (will try to verify this again).
I think the main points are: 1) under perfect conditions, the estimation would be as many times faster as there are CPU cores (e.g. 8 times faster with 8 cores); 2) in reality the speed-up will be less, and how much less depends on the CPU architecture, the point cloud, etc.
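For context, a minimal usage sketch of NormalEstimationOMP (the point type, search radius, and thread count below are arbitrary example values, not taken from the tutorial):

```cpp
#include <pcl/features/normal_3d_omp.h>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>

int main()
{
  pcl::PointCloud<pcl::PointXYZ>::Ptr cloud(new pcl::PointCloud<pcl::PointXYZ>);
  // ... fill or load `cloud` here ...

  pcl::NormalEstimationOMP<pcl::PointXYZ, pcl::Normal> ne;
  ne.setInputCloud(cloud);
  ne.setRadiusSearch(0.03); // example radius; nearest-k via setKSearch works too
  ne.setNumberOfThreads(8); // 0 selects the number of threads automatically

  pcl::PointCloud<pcl::Normal> normals;
  ne.compute(normals); // the speed-up discussed here is measured against the
                       // single-threaded pcl::NormalEstimation
}
```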

@mynickmynick (Contributor Author) commented May 25, 2023

the rephrase could be
"On a system with n cores, you should get m times faster computation, with m<=n depending on several factors including CPU architecture, point cloud characteristics, search parameters chosen, CPU global load."

In my experiment m = 3.33 with n = 8 threads, but on 12 cores (if it had been on 8 cores, m would have been less than 3.33).

A fixed value like 6 has basically no meaning and is misleading; it would only have meaning for a specific CPU, in a specific year, in a specific benchmark.
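(For concreteness: the speed-up is m = t_1 / t_n, the single-threaded run time over the n-threaded run time. The measurement above, m = 3.33 with n = 8 threads, corresponds to a parallel efficiency of m/n = 3.33/8 ≈ 0.42, well below the ideal m = n.)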

@mvieth (Member) commented Jul 2, 2023

the rephrase could be "On a system with n cores, you should get m times faster computation, with m<=n depending on several factors including CPU architecture, point cloud characteristics, search parameters chosen, CPU global load."

I am fine with this phrasing. Feel free to update the PR accordingly.

By the way, I extended the benchmark of the normal estimation (in benchmarks/features/normal_3d.cpp); here is what I got (the first column is the point cloud name, the fifth column shows the run times without/with parallelization, and the last column is the speedup, which would optimally be 6 on Laptop 1 and 4 on Laptop 2):

Laptop 1 (6 physical cores, 12 logical cores, number of threads set to 6):
name type        search   prm seq/par   speedup
mug  organized   radius   10  512/146   3.506849315
mug  organized   radius   20  1664/508  3.275590551
mug  organized   nearestk 50  551/129   4.271317829
mug  organized   nearestk 100 1167/268  4.354477612
mug  unorganized radius   10  1108/337  3.287833828
mug  unorganized radius   20  3158/1042 3.030710173
mug  unorganized nearestk 50  902/207   4.357487923
mug  unorganized nearestk 100 1882/434  4.33640553
milk organized   radius   10  345/106   3.254716981
milk organized   radius   20  1089/343  3.174927114
milk organized   nearestk 50  767/180   4.261111111
milk organized   nearestk 100 1651/394  4.19035533
milk unorganized radius   10  879/260   3.380769231
milk unorganized radius   20  2251/703  3.201991465
milk unorganized nearestk 50  1056/245  4.310204082
milk unorganized nearestk 100 2205/512  4.306640625

Laptop 2 (4 physical cores, 8 logical cores, number of threads set to 4):
name type        search    prm seq/par   speedup
mug  organized   radius    10  839/353   2.3783
mug  organized   radius    20  2888/1251 2.3080
mug  organized   nearestk  50  774/214   3.6189
mug  organized   nearestk  100 1632/452  3.6099
mug  unorganized radius    10  1683/696  2.4180
mug  unorganized radius    20  4782/2158 2.2156
mug  unorganized nearestk  50  1278/366  3.4894
mug  unorganized nearestk  100 2430/709  3.4292
milk organized   radius    10  573/203   2.8173
milk organized   radius    20  1891/703  2.6911
milk organized   nearestk  50  1045/313  3.3345
milk organized   nearestk  100 2230/670  3.3274
milk unorganized radius    10  1335/463  2.8814
milk unorganized radius    20  3318/1254 2.6466
milk unorganized nearestk  50  1471/465  3.1601
milk unorganized nearestk  100 3137/860  3.6459

So the speedup is worse when using radius search instead of nearest-k search. This seems to be due to load imbalance between the threads. I have to do more tests, but it looks like dynamic scheduling improves the speedup in all cases (currently no scheduling strategy is specified, which apparently defaults to static scheduling, at least with GCC).
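To illustrate the difference, a minimal OpenMP sketch (not PCL's actual loop; estimate_normal is a placeholder for the per-point work, and the chunk size 256 is an arbitrary example value):

```cpp
#include <cstddef>

// Placeholder for the per-point work (neighborhood search + plane fit).
// With radius search its cost varies strongly from point to point.
void estimate_normal(std::ptrdiff_t /*i*/) {}

void estimate_all(std::ptrdiff_t n, int threads)
{
  // No schedule clause: the schedule is implementation-defined; GCC
  // effectively uses static, i.e. equal contiguous chunks are assigned
  // up front, so variable per-point cost leaves some threads idle.
  #pragma omp parallel for num_threads(threads)
  for (std::ptrdiff_t i = 0; i < n; ++i)
    estimate_normal(i);

  // schedule(dynamic, 256): threads grab the next chunk of 256
  // iterations as they become free, evening out the load at the cost
  // of a little scheduling overhead.
  #pragma omp parallel for num_threads(threads) schedule(dynamic, 256)
  for (std::ptrdiff_t i = 0; i < n; ++i)
    estimate_normal(i);
}
```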

@mynickmynick (Contributor Author)

OK, interesting.

@mynickmynick (Contributor Author)

@mvieth I have pushed the rephrase. Regarding the details of multithreading: as long as the algorithm is split between a number of threads <= physical cores, as you did, I think scheduling mainly matters when competing for the CPU with threads of other running applications. (It would be better to test the algorithm on a light embedded system with little else going on: no antivirus, no desktop GUI, etc.)

What I think is more relevant in this context, as I mentioned before, is the sharing of (and therefore contention for) memory access. I would like to know more about this topic: I suppose memory access is only parallel at the lower cache level(s), while at the higher cache level(s) and at RAM level the access is not parallel; that is the main bottleneck around which to optimize memory sharing and access (though often there is not much control over it, I think). Even after the critical-section bottlenecks have been optimized and minimized, this bottleneck remains, and it is hard to debug and monitor.
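If someone does run such a test, a minimal, self-contained timing harness for measuring the speed-up m (run_once is a hypothetical placeholder to be replaced with one full normal-estimation pass):

```cpp
#include <chrono>
#include <iostream>

// Hypothetical placeholder: substitute one full normal-estimation pass,
// configured with the given number of threads.
void run_once(int threads) { (void)threads; }

// Wall-clock time of one pass, in milliseconds.
double time_ms(int threads)
{
  const auto t0 = std::chrono::steady_clock::now();
  run_once(threads);
  const auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
  const double t_seq = time_ms(1); // single-threaded baseline
  const double t_par = time_ms(8); // example thread count n = 8
  std::cout << "speed-up m = " << t_seq / t_par << '\n';
}
```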

@mvieth (Member) left a comment

Thanks.

I currently don't really have the time, nor do I have a "light embedded system" to test the way you suggested. If you do, feel free to share the results.
However, I have tested sufficiently to say that dynamic scheduling is better than the current, default scheduling, and I will create a PR to change the scheduling soon.

@mvieth changed the title from "Update normal_estimation.rst" to "Clarify expected speedup with NormalEstimationOMP" on Jul 17, 2023
@mvieth merged commit 6bb70d6 into PointCloudLibrary:master on Jul 17, 2023
7 checks passed