Clarify expected speedup with NormalEstimationOMP #5721
Conversation
While I am not against a rephrasing, your suggestion with […]
The rephrase could be: in my experiment m=3.33 with n=8 threads, but on 12 cores (with only 8 cores it would be less than 3.33). The value 6 has basically no meaning and is misleading; it would only be meaningful on a specific CPU, in a specific year, in a specific benchmark.
I am fine with this phrasing. Feel free to update the PR accordingly. By the way, I extended the benchmark of the normal estimation (in […]).
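For readers who want to reproduce such numbers, here is a minimal, hypothetical timing sketch (not the extended benchmark referred to above); the input cloud and the choice k=30 are assumptions:

```cpp
// Hypothetical sketch: time NormalEstimationOMP for a given thread count,
// so the speedup m = t(1 thread) / t(n threads) can be computed.
#include <pcl/features/normal_3d_omp.h>
#include <pcl/point_types.h>
#include <omp.h>

double
timeNormalEstimation (const pcl::PointCloud<pcl::PointXYZ>::ConstPtr& cloud,
                      unsigned int n_threads)
{
  pcl::NormalEstimationOMP<pcl::PointXYZ, pcl::Normal> ne;
  ne.setNumberOfThreads (n_threads);
  ne.setInputCloud (cloud);
  ne.setKSearch (30); // nearest-k search; k = 30 is an arbitrary choice here

  pcl::PointCloud<pcl::Normal> normals;
  const double start = omp_get_wtime ();
  ne.compute (normals);
  return omp_get_wtime () - start;
}
```

With these timings, m = t(1)/t(n), and the parallel efficiency is m/n; for m = 3.33 at n = 8 threads that is roughly 0.42.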
So the speedup is worse when using radius search instead of nearest-k search. This seems to be due to load imbalance between threads. I have to do more tests, but it looks like dynamic scheduling improves the speedup in all cases (currently no scheduling strategy is specified, which apparently defaults to static scheduling, at least on gcc).
ok, interesting
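As a side note, the scheduling change under discussion amounts to a one-clause difference. A toy sketch (not PCL's actual loop; `do_work` is a made-up stand-in for the per-point cost):

```cpp
#include <cmath>

// Stand-in for the per-point work (e.g. neighborhood search + plane fit);
// its cost varies per point, which is what makes static scheduling imbalanced.
double do_work (int i) { return std::sqrt (static_cast<double> (i)); }

void
run_default (int n, double* out)
{
  // No schedule clause: implementation-defined, typically static on gcc,
  // i.e. each thread gets one fixed, contiguous block of iterations.
  #pragma omp parallel for
  for (int i = 0; i < n; ++i)
    out[i] = do_work (i);
}

void
run_dynamic (int n, double* out)
{
  // schedule(dynamic, 64): chunks of 64 iterations are handed out on demand,
  // so threads that finish cheap points early pick up more work.
  #pragma omp parallel for schedule(dynamic, 64)
  for (int i = 0; i < n; ++i)
    out[i] = do_work (i);
}
```

With radius search the per-point neighborhood size (and hence cost) varies a lot, which is where the on-demand chunk distribution pays off.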
comment on multithread OpenMP speed up
@mvieth I have pushed the rephrase. Regarding the details of multithreading: as long as the algorithm is split between a number of threads <= physical cores, as you did, I think scheduling mainly influences the competition for the CPU with threads of other running applications. (It would be better to test the algorithm in multithreading on a light embedded system with little else running: no antivirus, no desktop GUI, etc.) What I think is more relevant in this context, as I mentioned before, is the sharing (and therefore contention) of memory access. I would like to know more about this topic: I suppose the only parallel level of memory access is at the lower cache level(s), while at the higher cache level(s) and at the RAM level access is not parallel. That is the main bottleneck on which to try to optimize memory sharing and access (though often there is not much control over it, I think). Even after you have optimized and minimized the critical-section bottlenecks, that bottleneck is still there and is hard to debug and monitor.
Thanks.
I currently don't really have the time, nor do I have a "light embedded system" to test the way you suggested. If you do, feel free to share the results.
However, I have tested sufficiently to say that dynamic scheduling is better than the current, default scheduling, and I will create a PR to change the scheduling soon.
I think this (as a consequence of my test mentioned in #5710) is a more accurate comment for the tutorial.