
Clarify expected speedup with NormalEstimationOMP #5721

Merged

Conversation

mynickmynick (Contributor)

I think this is a more correct comment for the tutorial (as a consequence of my test mentioned in #5710).

@mvieth (Member) commented May 25, 2023

While I am not against a rephrasing, your suggestion with n/m and 1<=m<n/2 is not very easy to understand and gives an unnecessarily large range (e.g. for 8 cores this would translate to 2-8 times faster), in my opinion.
Of course, if it says 6-8 times faster, one can find situations where the speed-up is worse, but I don't think anyone would take this as a guarantee? It is more meant as a "typical" value, to give the user a rough idea what to expect. And from what I can recall, I usually did get a speed-up between 6 and 8 on 8 cores (will try to verify this again).
I think the main points are: 1) under perfect conditions, the estimation would be as many times faster as there are CPU cores (e.g. 8 times faster with 8 cores); 2) in reality the speed-up will be less, and how much less depends on the CPU architecture, the point cloud, etc.
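For context, a minimal usage sketch of NormalEstimationOMP (the point type, search radius, and thread count below are arbitrary example values, not taken from the tutorial):

```cpp
#include <pcl/features/normal_3d_omp.h>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>

int main()
{
  pcl::PointCloud<pcl::PointXYZ>::Ptr cloud(new pcl::PointCloud<pcl::PointXYZ>);
  // ... fill or load `cloud` here ...

  pcl::NormalEstimationOMP<pcl::PointXYZ, pcl::Normal> ne;
  ne.setInputCloud(cloud);
  ne.setRadiusSearch(0.03); // example radius; nearest-k via setKSearch works too
  ne.setNumberOfThreads(8); // 0 selects the number of threads automatically

  pcl::PointCloud<pcl::Normal> normals;
  ne.compute(normals); // the speed-up discussed here is measured against the
                       // single-threaded pcl::NormalEstimation
}
```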

@mynickmynick (Contributor Author) commented May 25, 2023

the rephrase could be
"On a system with n cores, you should get m times faster computation, with m<=n depending on several factors including CPU architecture, point cloud characteristics, search parameters chosen, CPU global load."

In my experiment m = 3.33 with n = 8 threads, but on 12 cores (if it had been on 8 cores, m would have been less than 3.33).

A fixed value like 6 has basically no meaning and is misleading; it would only have meaning for a specific CPU, in a specific year, in a specific benchmark.
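(For concreteness: the speed-up is m = t_1 / t_n, the single-threaded run time over the n-threaded run time. The measurement above, m = 3.33 with n = 8 threads, corresponds to a parallel efficiency of m/n = 3.33/8 ≈ 0.42, well below the ideal m = n.)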

@mvieth (Member) commented Jul 2, 2023

the rephrase could be "On a system with n cores, you should get m times faster computation, with m<=n depending on several factors including CPU architecture, point cloud characteristics, search parameters chosen, CPU global load."

I am fine with this phrasing. Feel free to update the PR accordingly.

By the way, I extended the benchmark of the normal estimation (in benchmarks/features/normal_3d.cpp); here is what I got (the first column is the point cloud name, the fifth column shows the run times without/with parallelization, and the last column is the speedup, which would optimally be 6 on Laptop 1 and 4 on Laptop 2):

Laptop 1 (6 physical cores, 12 logical cores, number of threads set to 6):
name type        search   prm seq/par   speedup
mug  organized   radius   10  512/146   3.506849315
mug  organized   radius   20  1664/508  3.275590551
mug  organized   nearestk 50  551/129   4.271317829
mug  organized   nearestk 100 1167/268  4.354477612
mug  unorganized radius   10  1108/337  3.287833828
mug  unorganized radius   20  3158/1042 3.030710173
mug  unorganized nearestk 50  902/207   4.357487923
mug  unorganized nearestk 100 1882/434  4.33640553
milk organized   radius   10  345/106   3.254716981
milk organized   radius   20  1089/343  3.174927114
milk organized   nearestk 50  767/180   4.261111111
milk organized   nearestk 100 1651/394  4.19035533
milk unorganized radius   10  879/260   3.380769231
milk unorganized radius   20  2251/703  3.201991465
milk unorganized nearestk 50  1056/245  4.310204082
milk unorganized nearestk 100 2205/512  4.306640625

Laptop 2 (4 physical cores, 8 logical cores, number of threads set to 4):
name type        search    prm seq/par   speedup
mug  organized   radius    10  839/353   2.3783
mug  organized   radius    20  2888/1251 2.3080
mug  organized   nearestk  50  774/214   3.6189
mug  organized   nearestk  100 1632/452  3.6099
mug  unorganized radius    10  1683/696  2.4180
mug  unorganized radius    20  4782/2158 2.2156
mug  unorganized nearestk  50  1278/366  3.4894
mug  unorganized nearestk  100 2430/709  3.4292
milk organized   radius    10  573/203   2.8173
milk organized   radius    20  1891/703  2.6911
milk organized   nearestk  50  1045/313  3.3345
milk organized   nearestk  100 2230/670  3.3274
milk unorganized radius    10  1335/463  2.8814
milk unorganized radius    20  3318/1254 2.6466
milk unorganized nearestk  50  1471/465  3.1601
milk unorganized nearestk  100 3137/860  3.6459

So the speedup is worse when using radius search instead of nearest-k search. This seems to be due to load imbalance between the threads. I have to do more tests, but it looks like dynamic scheduling improves the speedup in all cases (currently no scheduling strategy is specified, which apparently defaults to static scheduling, at least with GCC).
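To illustrate the difference, a minimal OpenMP sketch (not PCL's actual loop; estimate_normal is a placeholder for the per-point work, and the chunk size 256 is an arbitrary example value):

```cpp
#include <cstddef>

// Placeholder for the per-point work (neighborhood search + plane fit).
// With radius search its cost varies strongly from point to point.
void estimate_normal(std::ptrdiff_t /*i*/) {}

void estimate_all(std::ptrdiff_t n, int threads)
{
  // No schedule clause: the schedule is implementation-defined; GCC
  // effectively uses static, i.e. equal contiguous chunks are assigned
  // up front, so variable per-point cost leaves some threads idle.
  #pragma omp parallel for num_threads(threads)
  for (std::ptrdiff_t i = 0; i < n; ++i)
    estimate_normal(i);

  // schedule(dynamic, 256): threads grab the next chunk of 256
  // iterations as they become free, evening out the load at the cost
  // of a little scheduling overhead.
  #pragma omp parallel for num_threads(threads) schedule(dynamic, 256)
  for (std::ptrdiff_t i = 0; i < n; ++i)
    estimate_normal(i);
}
```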

@mynickmynick (Contributor Author)

OK, interesting.

@mynickmynick (Contributor Author)

@mvieth I have pushed the rephrase. Regarding the details of multithreading: as long as the algorithm is split between a number of threads <= physical cores, as you did, I think scheduling mainly matters when competing for the CPU with threads of other running applications. (It would be better to test the algorithm on a light embedded system with little else going on: no antivirus, no desktop GUI, etc.)

What I think is more relevant in this context, as I mentioned before, is the sharing of (and therefore contention for) memory access. I would like to know more about this topic: I suppose memory access is only parallel at the lower cache level(s), while at the higher cache level(s) and at RAM level the access is not parallel; that is the main bottleneck around which to optimize memory sharing and access (though often there is not much control over it, I think). Even after the critical-section bottlenecks have been optimized and minimized, this bottleneck remains, and it is hard to debug and monitor.
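If someone does run such a test, a minimal, self-contained timing harness for measuring the speed-up m (run_once is a hypothetical placeholder to be replaced with one full normal-estimation pass):

```cpp
#include <chrono>
#include <iostream>

// Hypothetical placeholder: substitute one full normal-estimation pass,
// configured with the given number of threads.
void run_once(int threads) { (void)threads; }

// Wall-clock time of one pass, in milliseconds.
double time_ms(int threads)
{
  const auto t0 = std::chrono::steady_clock::now();
  run_once(threads);
  const auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
  const double t_seq = time_ms(1); // single-threaded baseline
  const double t_par = time_ms(8); // example thread count n = 8
  std::cout << "speed-up m = " << t_seq / t_par << '\n';
}
```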

@mvieth (Member) left a comment

Thanks.

I currently don't really have the time, nor do I have a "light embedded system" to test the way you suggested. If you do, feel free to share the results.
However, I have tested sufficiently to say that dynamic scheduling is better than the current, default scheduling, and I will create a PR to change the scheduling soon.

@mvieth changed the title from "Update normal_estimation.rst" to "Clarify expected speedup with NormalEstimationOMP" on Jul 17, 2023
@mvieth merged commit 6bb70d6 into PointCloudLibrary:master on Jul 17, 2023
7 checks passed