NormalEstimationOMP: use dynamic scheduling for faster computation #5775
So far, no schedule was specified, which seems to result in behaviour similar to static scheduling. This is suboptimal, as the workload is not balanced well between the threads, especially when using radius search. With dynamic scheduling (default chunk size of 256), the speedup (ratio of the computation time of NormalEstimation to that of NormalEstimationOMP) is better. The speedup for organized datasets is slightly higher than for unorganized datasets, possibly because FLANN (used for unorganized datasets) already uses some parallelization, while OrganizedNeighbor does not.

Laptop 1 (6 physical cores, 12 logical cores, number of threads set to 6):

dataset | cloud | search | k or radius (mm) | speedup before | speedup after
-----|-------------|----------|------|----------------|--------------
mug | organized | radius | 10 | 3.4857 | 5.2508
mug | organized | radius | 20 | 3.3441 | 5.1059
mug | organized | nearestk | 50 | 4.7033 | 5.0594
mug | organized | nearestk | 100 | 4.5808 | 4.9751
mug | unorganized | radius | 10 | 3.3374 | 4.8992
mug | unorganized | radius | 20 | 3.0206 | 4.7978
mug | unorganized | nearestk | 50 | 4.5841 | 4.9189
mug | unorganized | nearestk | 100 | 4.7062 | 4.8844
milk | organized | radius | 10 | 3.5140 | 5.1686
milk | organized | radius | 20 | 3.2605 | 5.1719
milk | organized | nearestk | 50 | 4.3245 | 4.9924
milk | organized | nearestk | 100 | 4.4170 | 4.9207
milk | unorganized | radius | 10 | 3.4451 | 4.8029
milk | unorganized | radius | 20 | 3.1887 | 4.8810
milk | unorganized | nearestk | 50 | 4.3789 | 4.6894
milk | unorganized | nearestk | 100 | 4.2717 | 4.7473

Laptop 2 (4 physical cores, 8 logical cores, number of threads set to 4):

dataset | cloud | search | k or radius (mm) | speedup before | speedup after
-----|-------------|----------|------|----------------|--------------
mug | organized | radius | 10 | 2.3783 | 3.9812
mug | organized | radius | 20 | 2.3080 | 3.9753
mug | organized | nearestk | 50 | 3.6190 | 3.9595
mug | organized | nearestk | 100 | 3.6100 | 3.9590
mug | unorganized | radius | 10 | 2.4181 | 3.7466
mug | unorganized | radius | 20 | 2.2157 | 3.8890
mug | unorganized | nearestk | 50 | 3.4894 | 3.6551
mug | unorganized | nearestk | 100 | 3.4293 | 3.7825
milk | organized | radius | 10 | 2.8174 | 3.8209
milk | organized | radius | 20 | 2.6911 | 3.9722
milk | organized | nearestk | 50 | 3.3346 | 3.9433
milk | organized | nearestk | 100 | 3.3275 | 3.9798
milk | unorganized | radius | 10 | 2.8815 | 3.5443
milk | unorganized | radius | 20 | 2.6467 | 3.7990
milk | unorganized | nearestk | 50 | 3.1602 | 3.6469
milk | unorganized | nearestk | 100 | 3.6460 | 3.7981
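For illustration, here is a minimal standalone sketch (not the actual PCL code; the workload and numbers are placeholders) of why a dynamic schedule helps when the per-iteration cost is uneven, as it is with radius search:

```cpp
// Standalone illustration (not the PCL source): per-iteration cost is uneven,
// like a radius search where some points have many more neighbors than others.
// Without a schedule clause most compilers split the loop statically, so the
// thread that receives the expensive iterations finishes last while the others
// idle; schedule(dynamic, 256) lets idle threads grab the next chunk instead.
#include <cmath>
#include <cstdio>
#include <vector>
#include <omp.h>

int main()
{
  const int n = 100000;
  std::vector<double> result(n);

  const double start = omp_get_wtime();
  #pragma omp parallel for schedule(dynamic, 256) num_threads(6)
  for (int i = 0; i < n; ++i)
  {
    // Fake workload that grows with i: a static split would overload the last thread.
    double acc = 0.0;
    for (int j = 0; j < i / 100; ++j)
      acc += std::sqrt(static_cast<double>(j));
    result[i] = acc;
  }
  std::printf("elapsed: %.3f s (checksum %.1f)\n", omp_get_wtime() - start, result[n - 1]);
}
```

Compile with e.g. `g++ -O2 -fopenmp`; removing the `schedule(dynamic, 256)` clause shows the imbalance.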
Looks good 👍 I also read that there are guided as well as runtime (set via the OMP_SCHEDULE environment variable) scheduling options. By the way, should we create a directory for all these "compare" programs that get written? I assume it would be nice to have when working on improvements. Or do you use the output of the already added Google benchmarks and then do the math (speedup factor calculations) elsewhere?
Yes, that's true. Technically, the guided schedule should have less overhead than the dynamic schedule. However, I read somewhere that the guided schedule is implemented poorly in some OpenMP implementations, namely that the first chunk is too large and thus the work is again unbalanced between the threads. If I remember correctly, I tested the guided schedule some time ago and it was worse than the dynamic schedule for the normal estimation.
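For completeness, a tiny sketch (a made-up reduction loop, nothing PCL-specific) of `schedule(runtime)`, which defers the choice to the `OMP_SCHEDULE` environment variable so different schedules can be compared without recompiling:

```cpp
#include <cstdio>

int main()
{
  double sum = 0.0;
  // The schedule is taken from OMP_SCHEDULE at run time, e.g.
  //   OMP_SCHEDULE="dynamic,256" ./a.out   or   OMP_SCHEDULE="guided,64" ./a.out
  #pragma omp parallel for schedule(runtime) reduction(+:sum)
  for (int i = 0; i < 1000000; ++i)
    sum += 1.0 / (i + 1);
  std::printf("sum = %f\n", sum);
}
```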
I wrote a quick Python script (see below) that reads a json file created by the Google benchmark and computes the speedup overview. But I don't think it is nice enough to put into the repo permanently. I did, however, extend our Google benchmark for the normal estimation; maybe I can open a pull request to add that sometime.

```python
#!/usr/bin/env python3
# Compute the speedup (NormalEstimation time / NormalEstimationOMP time) and the
# parallelization factor (cpu_time / real_time) from a Google Benchmark json file.
import json
import sys

with open(sys.argv[1]) as json_data:
    data = json.load(json_data)

average_speedup = 0
average_parallelization = 0
for dataset in ["mug", "milk"]:
    for typ in ["organized", "unorganized"]:
        search = "radius"
        for param in [10, 20]:
            time_w_omp = 1
            time_wo_omp = 1
            time_w_omp_cpu = 1
            for benchmark in data["benchmarks"]:
                if benchmark["name"] == "BM_NormalEstimation_" + dataset + "_" + typ + "_radius/" + str(param) + "/iterations:5/repeats:3_mean":
                    time_wo_omp = benchmark["real_time"]
                if benchmark["name"] == "BM_NormalEstimationOMP_" + dataset + "_" + typ + "_radius/" + str(param) + "/iterations:10/repeats:3/process_time/real_time_mean":
                    time_w_omp = benchmark["real_time"]
                    time_w_omp_cpu = benchmark["cpu_time"]
            print(dataset, typ, search, param, int(time_wo_omp + 0.5), "/", int(time_w_omp + 0.5), time_wo_omp / time_w_omp, time_w_omp_cpu / time_w_omp)
            average_speedup += time_wo_omp / time_w_omp
            average_parallelization += time_w_omp_cpu / time_w_omp
        search = "nearestk"
        for param in [50, 100]:
            time_w_omp = 1
            time_wo_omp = 1
            time_w_omp_cpu = 1
            for benchmark in data["benchmarks"]:
                if benchmark["name"] == "BM_NormalEstimation_" + dataset + "_" + typ + "_nearest_k/" + str(param) + "/iterations:5/repeats:3_mean":
                    time_wo_omp = benchmark["real_time"]
                if benchmark["name"] == "BM_NormalEstimationOMP_" + dataset + "_" + typ + "_nearest_k/" + str(param) + "/iterations:10/repeats:3/process_time/real_time_mean":
                    time_w_omp = benchmark["real_time"]
                    time_w_omp_cpu = benchmark["cpu_time"]
            print(dataset, typ, search, param, int(time_wo_omp + 0.5), "/", int(time_w_omp + 0.5), time_wo_omp / time_w_omp, time_w_omp_cpu / time_w_omp)
            average_speedup += time_wo_omp / time_w_omp
            average_parallelization += time_w_omp_cpu / time_w_omp

print("average speedup=", average_speedup / 16)
print("average parallelization=", average_parallelization / 16)
```
I guess the other OMP implementations could use this setting as well, since most of them also perform some neighbor search, which can vary a lot and hence vary the computation time between iterations?
Each iteration does a radius search, which does not take the same amount of time for each point. Specifying no schedule usually results in a static schedule. Related to PointCloudLibrary#5775

Benchmarks with table_scene_mug_stereo_textured.pcd (nan points removed before convolution) on Intel Core i7-9850H:

GCC:

threads | 1 | 2 | 3 | 4 | 5 | 6
--------|---|---|---|---|---|---
before  | 2267 | 1725 | 1283 | 1039 | 863 | 744
dynamic | 2269 | 1155 | 795 | 611 | 497 | 427

MSVC 2022 (release configuration):

threads | 1 | 2 | 3 | 4 | 5 | 6
--------|---|---|---|---|---|---
before  | 2400 | 1886 | 1478 | 1176 | 972 | 857
dynamic | 2501 | 1281 | 919 | 704 | 593 | 537
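At 6 threads with GCC, for example, this corresponds to a speedup over the single-threaded run of 2267 / 744 ≈ 3.0 before versus 2269 / 427 ≈ 5.3 with dynamic scheduling.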