NormalEstimationOMP: use dynamic scheduling for faster computation #5775
So far, no schedule was specified, which seems to result in behaviour similar to static scheduling. This is suboptimal, as the workload is not balanced well between the threads, especially when using radius search. With dynamic scheduling (default chunk size of 256), the speedup (ratio of the computation time of NormalEstimation to that of NormalEstimationOMP) is better. The speedup for organized datasets is slightly higher than for unorganized datasets, possibly because FLANN (used for unorganized datasets) already uses some parallelization, while OrganizedNeighbor does not.

Laptop 1 (6 physical cores, 12 logical cores, number of threads set to 6):

dataset | cloud | search | k or radius (mm) | speedup before | speedup after
-----|-------------|----------|------|----------------|--------------
mug | organized | radius | 10 | 3.4857 | 5.2508
mug | organized | radius | 20 | 3.3441 | 5.1059
mug | organized | nearestk | 50 | 4.7033 | 5.0594
mug | organized | nearestk | 100 | 4.5808 | 4.9751
mug | unorganized | radius | 10 | 3.3374 | 4.8992
mug | unorganized | radius | 20 | 3.0206 | 4.7978
mug | unorganized | nearestk | 50 | 4.5841 | 4.9189
mug | unorganized | nearestk | 100 | 4.7062 | 4.8844
milk | organized | radius | 10 | 3.5140 | 5.1686
milk | organized | radius | 20 | 3.2605 | 5.1719
milk | organized | nearestk | 50 | 4.3245 | 4.9924
milk | organized | nearestk | 100 | 4.4170 | 4.9207
milk | unorganized | radius | 10 | 3.4451 | 4.8029
milk | unorganized | radius | 20 | 3.1887 | 4.8810
milk | unorganized | nearestk | 50 | 4.3789 | 4.6894
milk | unorganized | nearestk | 100 | 4.2717 | 4.7473

Laptop 2 (4 physical cores, 8 logical cores, number of threads set to 4):

dataset | cloud | search | k or radius (mm) | speedup before | speedup after
-----|-------------|----------|------|----------------|--------------
mug | organized | radius | 10 | 2.3783 | 3.9812
mug | organized | radius | 20 | 2.3080 | 3.9753
mug | organized | nearestk | 50 | 3.6190 | 3.9595
mug | organized | nearestk | 100 | 3.6100 | 3.9590
mug | unorganized | radius | 10 | 2.4181 | 3.7466
mug | unorganized | radius | 20 | 2.2157 | 3.8890
mug | unorganized | nearestk | 50 | 3.4894 | 3.6551
mug | unorganized | nearestk | 100 | 3.4293 | 3.7825
milk | organized | radius | 10 | 2.8174 | 3.8209
milk | organized | radius | 20 | 2.6911 | 3.9722
milk | organized | nearestk | 50 | 3.3346 | 3.9433
milk | organized | nearestk | 100 | 3.3275 | 3.9798
milk | unorganized | radius | 10 | 2.8815 | 3.5443
milk | unorganized | radius | 20 | 2.6467 | 3.7990
milk | unorganized | nearestk | 50 | 3.1602 | 3.6469
milk | unorganized | nearestk | 100 | 3.6460 | 3.7981
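For illustration, here is a minimal standalone sketch (not the actual PCL code; the workload and numbers are placeholders) of why a dynamic schedule helps when the per-iteration cost is uneven, as it is with radius search:

```cpp
// Standalone illustration (not the PCL source): per-iteration cost is uneven,
// like a radius search where some points have many more neighbors than others.
// Without a schedule clause most compilers split the loop statically, so the
// thread that receives the expensive iterations finishes last while the others
// idle; schedule(dynamic, 256) lets idle threads grab the next chunk instead.
#include <cmath>
#include <cstdio>
#include <vector>
#include <omp.h>

int main()
{
  const int n = 100000;
  std::vector<double> result(n);

  const double start = omp_get_wtime();
  #pragma omp parallel for schedule(dynamic, 256) num_threads(6)
  for (int i = 0; i < n; ++i)
  {
    // Fake workload that grows with i: a static split would overload the last thread.
    double acc = 0.0;
    for (int j = 0; j < i / 100; ++j)
      acc += std::sqrt(static_cast<double>(j));
    result[i] = acc;
  }
  std::printf("elapsed: %.3f s (checksum %.1f)\n", omp_get_wtime() - start, result[n - 1]);
}
```

Compile with e.g. `g++ -O2 -fopenmp`; removing the `schedule(dynamic, 256)` clause shows the imbalance.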
Looks good 👍 I also read that there are guided as well as runtime (set via the OMP_SCHEDULE environment variable) scheduling options. By the way, should we create a directory for all these "compare" programs that get written? I assume it would be nice to have when working on improvements. Or do you use the output of the already added Google benchmarks and then do the math (speedup factor calculations) elsewhere?
Yes, that's true. Technically, the guided schedule should have less overhead than the dynamic schedule. However, I read somewhere that the guided schedule is implemented poorly in some OpenMP implementations, namely that the first chunk is too large and thus the work is again unbalanced between the threads. If I remember correctly, I tested the guided schedule some time ago and it was worse than the dynamic schedule for the normal estimation.
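For completeness, a tiny sketch (a made-up reduction loop, nothing PCL-specific) of `schedule(runtime)`, which defers the choice to the `OMP_SCHEDULE` environment variable so different schedules can be compared without recompiling:

```cpp
#include <cstdio>

int main()
{
  double sum = 0.0;
  // The schedule is taken from OMP_SCHEDULE at run time, e.g.
  //   OMP_SCHEDULE="dynamic,256" ./a.out   or   OMP_SCHEDULE="guided,64" ./a.out
  #pragma omp parallel for schedule(runtime) reduction(+:sum)
  for (int i = 0; i < 1000000; ++i)
    sum += 1.0 / (i + 1);
  std::printf("sum = %f\n", sum);
}
```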
I wrote a quick Python script (see below) that reads a json file created by the Google benchmark and computes the speedup overview. But I don't think it is nice enough to put into the repo permanently. I did, however, extend our Google benchmark for the normal estimation; maybe I can open a pull request to add that sometime.

```python
#!/usr/bin/env python3
# Compute the speedup (NormalEstimation time / NormalEstimationOMP time) and the
# parallelization factor (cpu_time / real_time) from a Google Benchmark json file.
import json
import sys

with open(sys.argv[1]) as json_data:
    data = json.load(json_data)

average_speedup = 0
average_parallelization = 0
for dataset in ["mug", "milk"]:
    for typ in ["organized", "unorganized"]:
        search = "radius"
        for param in [10, 20]:
            time_w_omp = 1
            time_wo_omp = 1
            time_w_omp_cpu = 1
            for benchmark in data["benchmarks"]:
                if benchmark["name"] == "BM_NormalEstimation_" + dataset + "_" + typ + "_radius/" + str(param) + "/iterations:5/repeats:3_mean":
                    time_wo_omp = benchmark["real_time"]
                if benchmark["name"] == "BM_NormalEstimationOMP_" + dataset + "_" + typ + "_radius/" + str(param) + "/iterations:10/repeats:3/process_time/real_time_mean":
                    time_w_omp = benchmark["real_time"]
                    time_w_omp_cpu = benchmark["cpu_time"]
            print(dataset, typ, search, param, int(time_wo_omp + 0.5), "/", int(time_w_omp + 0.5), time_wo_omp / time_w_omp, time_w_omp_cpu / time_w_omp)
            average_speedup += time_wo_omp / time_w_omp
            average_parallelization += time_w_omp_cpu / time_w_omp
        search = "nearestk"
        for param in [50, 100]:
            time_w_omp = 1
            time_wo_omp = 1
            time_w_omp_cpu = 1
            for benchmark in data["benchmarks"]:
                if benchmark["name"] == "BM_NormalEstimation_" + dataset + "_" + typ + "_nearest_k/" + str(param) + "/iterations:5/repeats:3_mean":
                    time_wo_omp = benchmark["real_time"]
                if benchmark["name"] == "BM_NormalEstimationOMP_" + dataset + "_" + typ + "_nearest_k/" + str(param) + "/iterations:10/repeats:3/process_time/real_time_mean":
                    time_w_omp = benchmark["real_time"]
                    time_w_omp_cpu = benchmark["cpu_time"]
            print(dataset, typ, search, param, int(time_wo_omp + 0.5), "/", int(time_w_omp + 0.5), time_wo_omp / time_w_omp, time_w_omp_cpu / time_w_omp)
            average_speedup += time_wo_omp / time_w_omp
            average_parallelization += time_w_omp_cpu / time_w_omp

print("average speedup=", average_speedup / 16)
print("average parallelization=", average_parallelization / 16)
```
I guess the other OMP implementations could use this setting as well, since most of them also perform some neighbor search, which can vary a lot and hence vary the computation time between iterations?
Each iteration does a radius search, which does not take the same amount of time for each point. Specifying no schedule usually results in a static schedule. Related to PointCloudLibrary#5775

Benchmarks with table_scene_mug_stereo_textured.pcd (nan points removed before convolution) on Intel Core i7-9850H:

GCC:

threads | 1 | 2 | 3 | 4 | 5 | 6
--------|---|---|---|---|---|---
before  | 2267 | 1725 | 1283 | 1039 | 863 | 744
dynamic | 2269 | 1155 | 795 | 611 | 497 | 427

MSVC 2022 (release configuration):

threads | 1 | 2 | 3 | 4 | 5 | 6
--------|---|---|---|---|---|---
before  | 2400 | 1886 | 1478 | 1176 | 972 | 857
dynamic | 2501 | 1281 | 919 | 704 | 593 | 537
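At 6 threads with GCC, for example, this corresponds to a speedup over the single-threaded run of 2267 / 744 ≈ 3.0 before versus 2269 / 427 ≈ 5.3 with dynamic scheduling.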