
[ConditionalEuclideanClustering::segment] std::(multi)thread (only) 21-46% faster on non-organized clouds, much faster apparently on organized clouds #5710

Open
mynickmynick opened this issue May 11, 2023 · 29 comments
Labels
status: triage Labels incomplete

Comments

@mynickmynick
Contributor

mynickmynick commented May 11, 2023

I have implemented a std::thread multithreaded version of ConditionalEuclideanClustering::segment and called it

```cpp
void
segmentMT (IndicesClusters &clusters, const size_t threadNumber=2);
```

on my fork branch attemptConditionalEuclideanMT.
On my PC it improves the time to produce the results by only 21-38%, using 3 or 4 threads:
12th Gen Intel(R) Core(TM) i7-12700 2.10 GHz
20 Logical CPUs - 32.0 GB
Windows 10 Pro

Setting1:
cloud is dense and NOT organized and has 2 million points XYZRGB,
so the search used is a kd-tree (pcl::search::KdTree, since the cloud is not organized),
which takes 350 msec just to execute searcher_->setInputCloud;
11 clusters are segmented

5 threads: 2.1-2.3 sec
4 threads: 1.9-2.2 sec
3 threads: 1.8-2 sec
2 threads: 2.1 sec
1 thread: 2.4sec
21% improvement

Setting2:
cloud is dense and NOT organized and has 300 000 points XYZRGB
3 threads: 268-295 msec
1 thread: 420msec
4 clusters are found
32% improvement

Setting3:
cloud is NOT dense and NOT organized and has 485 000 points XYZRGB
3 threads: 480-500 msec
1 thread: 790msec
1 clusters is found
38% improvement

Setting 4: like Setting 1, but running on a different computer, a laptop that is less reliable for benchmarking because heavy software runs in the background:
4 threads: 2.9-3.4 sec
1 thread: 5.6-6.1 sec
46% improvement

I am still investigating the reasons for this limited improvement. On a first analysis, the main reason could be that the algorithm is intrinsically not very parallelizable. I can split the growing of the clusters across threads by splitting the point sequence into subsequences, but then at the end I have to reconnect the clusters that have grown separately in different threads (I record the connections to be made at the end).
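The reconnect-at-the-end step described here is essentially a union-find problem. A minimal sketch of that merge stage (names like `mergeClusters` and the data layout are illustrative, not the fork's actual code):

```cpp
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Minimal union-find used to merge clusters that were grown by different
// threads and later found to touch (the recorded "connections").
struct UnionFind {
  std::vector<std::size_t> parent;
  explicit UnionFind(std::size_t n) : parent(n) {
    std::iota(parent.begin(), parent.end(), 0);
  }
  std::size_t find(std::size_t x) {
    while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; }
    return x;
  }
  void unite(std::size_t a, std::size_t b) { parent[find(a)] = find(b); }
};

// Merge per-thread clusters given the recorded cross-thread connection pairs.
std::vector<std::vector<int>>
mergeClusters(const std::vector<std::vector<int>>& clusters,
              const std::vector<std::pair<std::size_t, std::size_t>>& connections)
{
  UnionFind uf(clusters.size());
  for (const auto& c : connections) uf.unite(c.first, c.second);
  std::vector<std::vector<int>> merged(clusters.size());
  for (std::size_t i = 0; i < clusters.size(); ++i) {
    auto& dst = merged[uf.find(i)];  // all connected clusters land in one slot
    dst.insert(dst.end(), clusters[i].begin(), clusters[i].end());
  }
  // Drop the empty slots left behind by merged-away clusters.
  std::vector<std::vector<int>> out;
  for (auto& m : merged)
    if (!m.empty()) out.push_back(std::move(m));
  return out;
}
```

The merge itself is cheap (near-linear in the number of clusters plus connections), so it should not limit the speed-up.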

You can check that I have optimized the critical sections quite a bit, also using shared_mutexes.

Still, I think it would be an interesting alternative to OpenMP or CUDA, also for other tools/features that, like this one, do not look feasible for OpenMP or CUDA.

I have tried 3 versions: one with a searcher for each thread, one with a unique searcher prepared by the main thread, and one with a unique searcher prepared by the first launched thread. The latter would be faster, but once in a while it gets much slower, probably because I had to implement it with a std::condition_variable that, when triggered, probably slows down the thread pool. So in the end I opted for the second solution (no std::condition_variable).

@mynickmynick mynickmynick added the status: triage Labels incomplete label May 11, 2023
@mynickmynick mynickmynick changed the title [ConditionalEuclideanClustering::segment] std::(multi)thread (only) 20% faster [ConditionalEuclideanClustering::segment] std::(multi)thread (only) 21-38% faster May 11, 2023
@mynickmynick mynickmynick changed the title [ConditionalEuclideanClustering::segment] std::(multi)thread (only) 21-38% faster [ConditionalEuclideanClustering::segment] std::(multi)thread (only) 21-46% faster May 13, 2023
@mynickmynick
Contributor Author

mynickmynick commented May 15, 2023

I made a mistake in designing the sharing strategy, so I am reviewing it.
@mvieth I noticed you set C++14; why not use a more recent standard? In this implementation I would need scoped_lock, which in theory requires C++17. I am now compiling my fork with C++20 and see no reason why not; please advise on this, thanks.

```cmake
# Set target C++ standard and required compiler features
set(CMAKE_CXX_STANDARD 20 CACHE STRING "The target C++ standard. PCL requires C++20 or higher.")
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_CXX_EXTENSIONS OFF)
set(PCL_CXX_COMPILE_FEATURES cxx_std_20)

set(CMAKE_CUDA_STANDARD 20 CACHE STRING "The target CUDA/C++ standard. PCL requires CUDA/C++ 20 or higher.")
set(CMAKE_CUDA_STANDARD_REQUIRED ON)
```

@mvieth
Member

mvieth commented May 15, 2023

We currently still aim to have PCL compatible with C++14 so that people using older compilers (which do not fully support C++17) can still use PCL. This may change in the future, but there is no plan yet when we drop C++14 and switch to C++17.

Regarding your multithreaded algorithm: to be honest, 21-46% improvement isn't that great when using 3-4 threads. I think when using 4 threads for example, one should expect to get at most half of the one-thread computing time, better only a third. Have you analysed where the segmentation algorithm spends most of its time?

@mynickmynick
Contributor Author

mynickmynick commented May 15, 2023

@mvieth

We currently still aim to have PCL compatible with C++14 so that people using older compilers (which do not fully support C++17) can still use PCL. This may change in the future, but there is no plan yet when we drop C++14 and switch to C++17.

Why not set a compiler option for it? C++14 is 9 years old and there is a lot of useful stuff from C++17-20. You could introduce a compilation option like WITH_C20, or something like that, which if false disables the methods that rely on C++20. By the way, I would not extend to C++17; I would extend to C++20.

Regarding your multithreaded algorithm: to be honest, 21-46% improvement isn't that great when using 3-4 threads. I think when using 4 threads for example, one should expect to get at most half of the one-thread computing time, better only a third. Have you analysed where the segmentation algorithm spends most of its time?

Notice I wrote "only" in the title. A 46% improvement is almost twice as fast anyway; not all scenarios have the same speed-up.
I am working on it.
What you describe would happen if it were a bunch of data to be processed independently; in that case OpenMP would be sufficient.
Here the processing is not independent, it is deeply correlated: the threads have to share in read/write the vector processed[], the segmentation results, and also the intermediary connection data (connections between clusters grown by different threads that need to be joined in the final stage). They start independently from different points, but then they intersect as the clusters grow and touch each other. I am testing different strategies, but there is a lot of shared-mutex mutual blocking. One of these strategies requires a scoped_lock (C++17), which if not available might be inefficiently mimicked by more complex mutex-lock calls. PCL compiles in C++20: it is sufficient to change those lines I copied and pasted above. Then you could introduce a compilation option like WITH_C20 or something like that. In any case, not all the strategies I am using require C++17.
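To illustrate the scoped_lock point: `std::scoped_lock` (C++17) locks several mutexes deadlock-free in one line, and under C++14 the same guarantee is obtained with `std::lock` plus adopt-lock guards. A minimal sketch with illustrative names (`processed_mtx`, `clusters_mtx`, `shared_updates` are not the fork's actual identifiers):

```cpp
#include <mutex>

std::mutex processed_mtx;  // guards the shared processed[] flags (illustrative)
std::mutex clusters_mtx;   // guards the shared cluster/connection records
int shared_updates = 0;    // stand-in for the shared state being mutated

void update_shared_state()
{
#if defined(__cpp_lib_scoped_lock)  // C++17 and later
  // Locks both mutexes with the built-in deadlock-avoidance algorithm.
  std::scoped_lock lock(processed_mtx, clusters_mtx);
#else                               // C++14 fallback: same guarantee, more verbose
  std::lock(processed_mtx, clusters_mtx);
  std::lock_guard<std::mutex> g1(processed_mtx, std::adopt_lock);
  std::lock_guard<std::mutex> g2(clusters_mtx, std::adopt_lock);
#endif
  ++shared_updates;  // ... mutate processed[] / connection data here ...
}
```

So the C++14 route is workable, just more verbose, rather than strictly less efficient.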

A main bottleneck, I think, is the use of Indices and the kd-tree. I only have unorganized clouds available. I bet that with an organized cloud the speed-up would be higher. Can you suggest a big organized cloud you already used for other testing?

Do you think you could get better results on this algorithm with OpenMP or CUDA? I don't think so. Have you ever tried standard C++ multithreading in PCL? Would this be the first time?

@mvieth
Member

mvieth commented May 15, 2023

Why not set a compiler option for it?

There is an option to control the C++ standard used to compile PCL: https://github.com/PointCloudLibrary/pcl/blob/master/CMakeLists.txt#L11

C++14 is 9 years old and there is a lot of useful stuff from C++17-20

Totally agree that they are useful.

You could introduce a compilation option like WITH_C20, or something like that, which if false disables the methods that rely on C++20. By the way, I would not extend to C++17; I would extend to C++20.

According to Wikipedia, only Visual Studio has full support for C++20, while GCC and Clang only have partial support, and only in relatively new compiler versions. So requiring C++20 is not a good idea currently, in my opinion. I am open to discussing requiring C++17.
Enabling or disabling whole methods or classes based on the C++ standard may be very confusing and I would refrain from that. I would be okay with choosing the algorithm based on the standard, so that a method can run faster (but produce the same result) with a newer standard. However, for the sake of code simplicity and maintainability, I would only do that if there is really no other way to do the code improvement with the older C++ standard.

Notice I wrote "only" in the title. [...] In any case, not all the strategies I am using require C++17.

A main bottleneck, I think, is the use of Indices and the kd-tree. I only have unorganized clouds available. I bet that with an organized cloud the speed-up would be higher. Can you suggest a big organized cloud you already used for other testing?

The biggest organized clouds used are currently 480x640 (milk_cartoon_all_small_clorox.pcd and table_scene_mug_stereo_textured.pcd for example). So not very big.

Do you think you could get better results on this algorithm with OpenMP or CUDA? I don't think so. Have you ever tried standard C++ multithreading in PCL? Would this be the first time?

I can't think of a place where standard C++ multithreading is currently used in PCL.

I would suspect that most of the time is spent searching for neighbours, but I haven't done any exact analysis yet (that's why I was interested whether you have done this). If most of the time is indeed spent this way, an option could be to first search for the neighbours of all points, and store the results. This should be easy to parallelize, with OpenMP or with standard multithreading. Then the rest of the algorithm could use the stored results. Would just be important to check that this approach does not require gigabytes of RAM.
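The precompute-then-cluster idea is easy to sketch: give each thread a disjoint set of indices, let each thread write only its own result slots, and no locking is needed. In the sketch below, `precomputeNeighbours` and the brute-force 1-D stand-in for `searcher_->radiusSearch` are purely illustrative; the real code would call the PCL searcher instead:

```cpp
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// Precompute the neighbour list of every point in parallel. Each thread owns
// a strided, disjoint set of indices and writes only its own slots of `nn`,
// so no mutex is required. `radius_search` is a brute-force stand-in on 1-D
// "points" just to keep the sketch self-contained and runnable.
std::vector<std::vector<int>>
precomputeNeighbours(const std::vector<float>& pts, float radius,
                     unsigned num_threads)
{
  std::vector<std::vector<int>> nn(pts.size());
  auto radius_search = [&](std::size_t i) {
    std::vector<int> out;
    for (std::size_t j = 0; j < pts.size(); ++j)
      if (j != i && std::fabs(pts[j] - pts[i]) <= radius)
        out.push_back(static_cast<int>(j));
    return out;
  };
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < num_threads; ++t)
    workers.emplace_back([&, t] {
      for (std::size_t i = t; i < pts.size(); i += num_threads)
        nn[i] = radius_search(i);  // each i is written by exactly one thread
    });
  for (auto& w : workers) w.join();
  return nn;
}
```

The clustering pass would then read `nn[i]` instead of calling radiusSearch, turning the hard-to-parallelize part into a pure lookup.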

@mynickmynick
Contributor Author

I would suspect that most of the time is spent searching for neighbours, but I haven't done any exact analysis yet (that's why I was interested whether you have done this). If most of the time is indeed spent this way, an option could be to first search for the neighbours of all points, and store the results. This should be easy to parallelize, with OpenMP or with standard multithreading. Then the rest of the algorithm could use the stored results. Would just be important to check that this approach does not require gigabytes of RAM.

I am trying different implementations, but all give more or less the same speed results. In Setting 1 above, of the 2 sec, 350 msec are required by searcher_->setInputCloud, which does some preparation work; the rest is required by parsing the neighbours and checking the connection and condition functions. I do not understand why it speeds up so little (and less with higher numbers of threads (5, 6)). I even tested with the mutex locks disabled and the processing time would hardly change, so it doesn't seem to be a problem of excessive mutex locking. I tried making thread-local copies of the cloud and indices and the processing time increased (I expected that, but I just wanted to be sure). I suppose the main problem is the use of Indices, which makes memory access very inefficient and scattered (so different threads continuously access different RAM areas, making the cache of little use).

If most of the time is indeed spent this way, an option could be to first search for the neighbours of all points, and store the results.

I don't know how to do this, as the neighbourhood changes at every run, as a function of a different cloud and also a potentially different radius (the user can change its value at any time). I suppose it would also be huge in RAM.

@mynickmynick
Contributor Author

mynickmynick commented May 16, 2023

Even if the cloud is accessed without indices, the class builds a trivial Indices with values 0, 1, 2, 3, etc., enforcing double indirection and inefficient memory access, because the searcher class wants Indices.

@mvieth
Member

mvieth commented May 16, 2023

In Setting 1 above, of the 2 sec, 350 msec are required by searcher_->setInputCloud, which does some preparation work

That time is probably spent building the kd-tree (used when the cloud is not organized).

I suppose the main problem is the use of Indices, which makes memory access very inefficient and scattered (so different threads continuously access different RAM areas, making the cache of little use).

How would that be relevant to multi-threading? Cloud and indices are only read from, never written to inside the segment function. So they don't have to be synchronised between caches. I don't see how this could influence the multi-threading speed-up.

I don't know how to do this, as the neighbourhood changes at every run, as a function of a different cloud and also a potentially different radius (the user can change its value at any time). I suppose it would also be huge in RAM.

I mean at the beginning of the segment function, where the input cloud and the radius do not change any more.
I don't know if it would really use so much RAM, let's say for example 300000 points, each point might have approx 50 neighbours inside the radius (just a guess, no idea if this is plausible), 4 byte needed for the neighbour index and 4 byte for the neighbour distance, would result in 114 megabytes. That would still be acceptable, in my opinion. If the condition function was also applied by each thread directly after searching for neighbours, it could "forget" about the distances immediately afterwards and only store for each point, which other points (indices) are close enough and fulfil the condition function.
I just did a test with an unorganized point cloud (approx 200k points), 86% of the time was spent only inside radiusSearch. Let's say we manage to parallelize that nicely and use 4 threads, then we could get almost 3 times faster (100-86+86/4=35.5)
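The projection above is Amdahl's law. As a worked check of the arithmetic (with fraction p of the runtime parallelized over n threads, the remaining time fraction is (1 - p) + p/n; for p = 0.86 and n = 4 this gives 0.355, i.e. roughly a 2.8x speed-up):

```cpp
// Amdahl-style projection used above: with fraction p of the runtime
// parallelized across n threads, the remaining time fraction is
// (1 - p) + p / n, so the projected speed-up is its reciprocal.
double projected_time_fraction(double p, unsigned n)
{
  return (1.0 - p) + p / n;
}
// p = 0.86, n = 4:  0.14 + 0.215 = 0.355  ->  speed-up of about 2.8x
```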

@mynickmynick
Contributor Author

mynickmynick commented May 16, 2023

I am doing tests with the organized cloud, and the speed-up (as expected) is much higher and grows with the number of threads.
milk_cartoon_all_small_clorox.pcd, filtered in z (300-800)
10.0 mm Euclidean connection
2 segments (the same)

12th Gen Intel(R) Core(TM) i7-12700 2.10 GHz
20 Logical CPUs - 32.0 GB
Windows 10 Pro

ordinary (1-thread) method: 910 msec
4 threads: 420 msec
8 threads: 220 msec (x4.13 speed-up)

Another PC
12 logical CPUs
ordinary (1-thread) method: 1300 msec
4 threads: 500 msec
8 threads: 325 msec (x4 speed-up)

@mynickmynick
Contributor Author

(image attachment)

@mynickmynick
Contributor Author

mynickmynick commented May 16, 2023

I mean at the beginning of the segment function, where the input cloud and the radius do not change any more.
I don't know if it would really use so much RAM, let's say for example 300000 points, each point might have approx 50 neighbours inside the radius (just a guess, no idea if this is plausible), 4 byte needed for the neighbour index and 4 byte for the neighbour distance, would result in 114 megabytes. That would still be acceptable, in my opinion. If the condition function was also applied by each thread directly after searching for neighbours, it could "forget" about the distances immediately afterwards and only store for each point, which other points (indices) are close enough and fulfil the condition function.
I just did a test with an unorganized point cloud (approx 200k points), 86% of the time was spent only inside radiusSearch. Let's say we manage to parallelize that nicely and use 4 threads, then we could get almost 3 times faster (100-86+86/4=35.5)

I see, OK, I'll think about it. Right now, I think that in my multithread implementation radiusSearch is already called by only one thread, and only once for each point added to a cluster, while points that are not added to a cluster are not associated with a radiusSearch call ( searcher_->radiusSearch ((*input_)[current_cluster[cii]], cluster_tolerance_, nn_indices, nn_distances) ), so you would not diminish the computation by calling it in advance;
the number of calls to radiusSearch would be the same or more, I think, but I will check that.

@mynickmynick
Contributor Author

mynickmynick commented May 16, 2023

What I would also do is investigate the use of different searchers for unorganized clouds, searchers that might be more multithread-friendly. Any suggestions about this are welcome.

@mynickmynick
Contributor Author

mynickmynick commented May 16, 2023

How would that be relevant to multi-threading? Cloud and indices are only read from, never written to inside the segment function. So they don't have to be synchronised between caches. I don't see how this could influence the multi-threading speed-up.

The fact that organized clouds are sped up so much more might confirm my suspicions about it: more contiguous memory access, fewer cache invalidations, and more time spent by different threads on different non-contiguous areas. It's just a rule-of-thumb guess.

@mynickmynick mynickmynick changed the title [ConditionalEuclideanClustering::segment] std::(multi)thread (only) 21-46% faster [ConditionalEuclideanClustering::segment] std::(multi)thread (only) 21-46% faster on non-organized clouds, much faster apparently on organized clouds May 16, 2023
@mvieth
Member

mvieth commented May 17, 2023

I am not really convinced that organized vs unorganized makes such a big difference. To eliminate all other effects, you could use an organized cloud, and make that unorganized by

```cpp
cloud.width = cloud.width * cloud.height;
cloud.height = 1;
```

and then compare the speed-up when the cloud is organized vs when it is not organized.

Honestly, a x4 or x4.13 speedup when using 8 threads is still not great, considering a good parallelization would lead to a speedup somewhat close to x8

Another thought: doing the tests with a point cloud and a radius that result in only 2 clusters (top of milk carton and the rest) might not be really representative (not how this clustering would likely be used in practice IMO), and might give an unrealistic impression of the performance benefit (not sure whether higher or lower than realistic).

... so you would not diminish the computation by calling it in advance;
the number of calls to radiusSearch would be the same or more, I think, but I will check that.

radiusSearch should be called exactly once for each point in the input cloud, or, when indices are used, exactly once for each index.

What I would also do is investigate the use of different searchers for unorganized clouds, searchers that might be more multithread-friendly. Any suggestions about this are welcome.

PCL also has an octree search ( https://pointclouds.org/documentation/classpcl_1_1search_1_1_octree.html ), but the kdtree search is more common for unorganized point clouds.

@mynickmynick
Contributor Author

mynickmynick commented May 17, 2023

radiusSearch should be called exactly once for each point in the input cloud, or, when indices are used, exactly once for each index.

Yes, correct: now in the multithread implementation radiusSearch is called exactly once for every point (I was mistaken before), either as the first point of a cluster or as a point added to a cluster. So I don't know if making all the calls at the beginning and storing the results for later retrieval would speed things up. A speed-up could come from somehow exploiting the symmetry of the neighbour relationship to cut the time by 2x, but I don't think the radiusSearch methods do that.

@mynickmynick
Contributor Author

Another thought: doing the tests with a point cloud and a radius that result in only 2 clusters (top of milk carton and the rest) might not be really representative (not how this clustering would likely be used in practice IMO), and might give an unrealistic impression of the performance benefit (not sure whether higher or lower than realistic).

I am doing other stuff; this was just a quick test. I have no interest in segmenting milk bottles right now in this application, and I have no condition function designed for that. The qualitative result is totally different, though, in growing with the number of threads, while with other unorganized clouds it would improve only up to 4 threads in ALL segmentation configurations: 2, 4, 10 or more segments.

@mynickmynick
Contributor Author

mynickmynick commented May 17, 2023

Honestly, a x4 or x4.13 speedup when using 8 threads is still not great, considering a good parallelization would lead to a speedup somewhat close to x8

This algorithm is intrinsically not very parallelizable, in terms of memory access, memory thread assignment/separation, and point parsing. I wonder what level of speed-up is documented for other parallelized PCL methods, which very probably (like normals) are more parallelizable? I have in mind some other possible strategy improvements anyway.

@mynickmynick
Contributor Author

mynickmynick commented May 17, 2023

PCL also has an octree search ( https://pointclouds.org/documentation/classpcl_1_1search_1_1_octree.html ), but the kdtree search is more common for unorganized point clouds.

I had tried it early this morning with a resolution of 128.0 and it was extremely slow compared to the kd-tree. I don't know how to tune the resolution, though. FLANN seems to give results similar to the kd-tree.

@mynickmynick
Contributor Author

mynickmynick commented May 18, 2023

Honestly, a x4 or x4.13 speedup when using 8 threads is still not great, considering a good parallelization would lead to a speedup somewhat close to x8

https://unix.stackexchange.com/questions/80424/why-using-more-threads-makes-it-slower-than-using-less-threads

https://stackoverflow.com/questions/46759930/multithreading-slower-than-single-threading

I suppose this method is mainly memory-bound, not because of the data size but because of the very scattered access as the clusters grow and intersect on different threads; hence cache-efficiency problems and mutual-locking problems. The example of the guys in the corridor gives the idea. I am investigating a more efficient memory-sharing scheme.

@mynickmynick
Contributor Author

mynickmynick commented May 18, 2023

I obtained an interesting result working only with Setting 1 (to be tested with the other settings... it's late at night... tomorrow).

First I made several optimizations to the strategy and the memory sharing, which apparently led to modest improvements (about 21% -> 26%).

But then I modified this call as follows:

```cpp
searcher_->radiusSearch ((*input_)[current_cluster[cii]], cluster_tolerance_, nn_indices, nn_distances, 8)
```

and that (probably also playing together with the previous optimizations) revolutionized the results. Now I have:

original segment with 1 thread: 2400 msec

segment with 1 thread, but with

```cpp
radiusSearch ((*input_)[current_cluster[cii]], cluster_tolerance_, nn_indices, nn_distances, 8)
```

2200 msec

multithread with

```cpp
radiusSearch ((*input_)[current_cluster[cii]], cluster_tolerance_, nn_indices, nn_distances, 8)
```

3 threads: about 1250 msec
4 threads: about 1070 msec
8 threads: about 900 msec

Before, 4 threads was already worse than 3 threads; now it keeps improving, even if more slowly.

I should study the search algorithms in more detail to better understand this, interpret the speed results, and deduce whether the result is guaranteed to be nearly the same or not. Here the result is not exactly the same; there is a tiny difference. What is fundamental is whether the max_nn returned neighbours are just the first max_nn found or the max_nn nearest ones (I guess the first option). Probably running with a much bigger max_nn is a better compromise, to be tested: for example max_nn greater than the maximum number of points in a sphere at that resolution, or something similar.

ordinary segment:
------------- SEGMENTATION: 2.472 seconds.
Found a nr of Cluster: 11
Cluster 0 size 1092682
Cluster 1 size 127381
Cluster 2 size 115088
Cluster 3 size 117979
Cluster 4 size 117065
Cluster 5 size 39080
Cluster 6 size 7364
Cluster 7 size 52760
Cluster 8 size 284160
Cluster 9 size 12231
Cluster 10 size 13224

new segment 3 threads
------------- SEGMENTATION: 1.191 seconds.
Found a nr of Cluster: 11
Cluster 0 size 115088
Cluster 1 size 1092682
Cluster 2 size 117576
Cluster 3 size 127381
Cluster 4 size 117061
Cluster 5 size 39080
Cluster 6 size 52760
Cluster 7 size 284160
Cluster 8 size 12231
Cluster 9 size 13224
Cluster 10 size 7364

difference
ordinary segment:
Cluster 3 size 117979
Cluster 4 size 117065

new segment 3 threads
Cluster 2 size 117576
Cluster 4 size 117061

any suggestions are welcome

@mynickmynick
Contributor Author

mynickmynick commented May 18, 2023

That result might confirm my hypothesis that the multithreading efficiency is memory-bound.
Setting 1, 8 threads, max_nn = 64: about 950 msec.

@mynickmynick
Contributor Author

@mvieth please don't tell me you're not happy even with this ;)) !!
And I am still waiting to hear the figures from other parallel PCL benchmarks :))

@mynickmynick
Contributor Author

mynickmynick commented May 19, 2023

max_nn could be handled as follows:

- it is given as an input, with default = indices.size()
- if it is defaulted, or equal to indices.size(), the current behaviour of running radiusSearch with the default max_nn still holds
- else, if it is > 0, its value is passed to radiusSearch
- else, if it is == 0, an automatic setting of max_nn applies, which could be: launch some radiusSearch(max_nn = 0) calls and then launch radiusSearch with max_nn = max_factor * an exponentially-filtered average of nn_indices.size(), with max_factor defaulted to 1 or given as another input

The user can decide which mode to use depending on the application; the default is the simplest, slowest one.
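The dispatch described here could look roughly like the following sketch (`effectiveMaxNN` and `auto_estimate` are illustrative names, not PCL API; it maps the proposed modes onto the value actually handed to radiusSearch, where 0 means "unlimited"):

```cpp
// Sketch of the proposed max_nn handling:
//  - max_nn == indices_size (the default) -> keep the current behaviour,
//    i.e. an unlimited radiusSearch (max_nn = 0);
//  - max_nn > 0  -> pass the explicit user cap through;
//  - max_nn == 0 -> automatic mode, here a precomputed stand-in for the
//    exponentially-filtered average of neighbourhood sizes.
unsigned
effectiveMaxNN(unsigned max_nn, unsigned indices_size, unsigned auto_estimate)
{
  if (max_nn == indices_size) return 0;        // default: unlimited search
  if (max_nn > 0)             return max_nn;   // explicit user cap
  return auto_estimate;                        // automatic mode
}
```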

This automatic handling is not very convincing anyway; I have to study this matter better.

I would like to understand what it is that makes multithreading so inefficient when max_nn defaults to 0; probably a more scattered memory access. Probably radiusSearch causes false cache invalidations...

@mynickmynick
Contributor Author

I did a test just to have a "cross-benchmark" with OpenMP.

I tested on this same setting

NormalEstimation: 3300msec
NormalEstimationOMP(3): 1400msec
NormalEstimationOMP(4):1200msec
NormalEstimationOMP(6):920msec
NormalEstimationOMP(8):840-900msec

and I didn't take note of some spikes of higher values that occur once in a while.

So even though the normal computation should in theory be more localized (even if it uses search too), and should therefore be less memory-inefficient under multithreading, with fewer cache invalidations and less locking, you can see that here too we are far from the 1/numberOfThreads value.

@mynickmynick
Contributor Author

mynickmynick commented May 23, 2023

I have now benchmarked a std::thread multithread implementation I did of NormalEstimation.
The speed is nearly the same as NormalEstimationOMP, just a little worse, very probably because I create and join the threads on each compute() call, while OpenMP perhaps keeps server threads in the background (so there is probably no overhead for creating and joining threads, just some probable condition wait).

The results (on another machine, so different timings than in the previous comment), in msec:

MT(4): 1816-1860msec
MT(5): 1660-1700msec
MT(6): 1610-1750
MT(8): 1530-1600

MP(4): 1800-1930
MP(5): 1588-1660
MP(6): 1640-1800
MP(8): 1500-1580

1 thread: 5000-5150 msec

I will reuse some useful information I gathered from this exercise back on the conditional Euclidean clustering.

@mynickmynick
Contributor Author

The previous benchmark confirms some of my ideas:
(1) multithreading (either std::thread or OpenMP) in general will not multiply processing time by 1/ThreadsNumber, and may be very far from it with increasing ThreadsNumber
(2) what you do with OpenMP can generally be done with std::thread, but not vice versa. In general it is convenient to use OpenMP for simpler cases, when the compiler can figure out from a few programmer directives how to optimize

In the case of conditional Euclidean clustering I suppose OpenMP is not a good candidate, for reason (2).

@mynickmynick
Contributor Author

With recent optimizations and a more complex condition functor, the situation of the multithreaded implementation of

```cpp
pcl::ConditionalEuclideanClustering::

void
segment (IndicesClusters &clusters);

void
segmentMT (IndicesClusters &clusters, size_t threadNumber=2);
```

is as follows.

With

```cpp
// Search for neighbors around the current seed point of the current cluster
if (searcher_->radiusSearch ((*input_)[current_cluster[cii]], cluster_tolerance_, nn_indices, nn_distances, 64) < 1)
```

(applied also to the single-thread implementation), on an unorganized cloud of 2 million XYZRGB points there is no difference in the resulting clusters (there is a negligible difference in the resulting clusters compared with the case of `if (searcher_->radiusSearch ((*input_)[current_cluster[cii]], cluster_tolerance_, nn_indices, nn_distances) < 1)`):

one thread: 2.3 sec
4 threads: 1-1.1 sec
6 threads: 0.9-0.950 sec
8 threads: 0.85-0.9 sec
12 threads: 0.95 sec

speed-up of x2.7

With

```cpp
// Search for neighbors around the current seed point of the current cluster
if (searcher_->radiusSearch ((*input_)[current_cluster[cii]], cluster_tolerance_, nn_indices, nn_distances) < 1)
```

on an unorganized cloud of 2 million XYZRGB points there is no difference in the resulting clusters:

one thread: 2.5-2.6 sec
3 threads: 1.9 sec
4 threads: 1.9-2.1 sec
6 threads: 2.3 sec

modest speed-up of x1.3

I did not repeat the test with the organized cloud, which I expect to remain similar to the previous result:

ordinary (1-thread) method: 910 msec
4 threads: 420 msec
8 threads: 220 msec (x4.13 speed-up)

@mynickmynick
Contributor Author

So, shall we merge this implementation into PCL?

It could be, as I mentioned before, in one of these forms (with an extension also of the ordinary method):

```cpp
pcl::ConditionalEuclideanClustering::

void
segment (IndicesClusters &clusters, unsigned int max_nn = 0);

void
segmentMT (IndicesClusters &clusters, size_t threadNumber=2, unsigned int max_nn = 0);
```

or

```cpp
pcl::ConditionalEuclideanClustering::

void
segment (IndicesClusters &clusters);

void
segmentMT (IndicesClusters &clusters, size_t threadNumber=2);

void set_max_nn(unsigned int max_nn);
```

@mynickmynick
Contributor Author

@mvieth please give an update on this multithread implementation I did, and also on why not move to C++17 or C++20 support. From here:
https://en.cppreference.com/w/cpp/17
https://en.cppreference.com/w/cpp/20

I suppose C++17 support is complete for the main compilers, and C++20 is mostly covered as well? Isn't that so? As for my multithread implementation, I have to check again how much it needs C++17; I don't remember now, but it probably doesn't need it, as I wrote it to run with or without C++17. The use of shared_mutex I made at first was probably eliminated later, but again I have to check.

@mynickmynick
Contributor Author

mynickmynick commented Sep 11, 2023

@mvieth this multithread implementation does not "need" C++17.
There are some alignas(std::hardware_destructive_interference_size) directives, wrapped in macros that are automatically disabled if the configuration does not support std >= C++17 (and I can delete them from the push anyway).
They are optimizations for multithreading and cache efficiency, which are desirable but not necessary.
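The macro pattern described here might look like the following sketch (`PCL_CACHE_ALIGN` and `PerThreadState` are illustrative names, not the fork's actual code): pad per-thread state to a cache line when C++17 is available, and compile the directive away otherwise.

```cpp
#include <new>  // std::hardware_destructive_interference_size (C++17)

// Cache-line alignment macro: active under C++17 when the library advertises
// the feature, a no-op otherwise (desirable optimization, not required).
#if defined(__cpp_lib_hardware_interference_size)
#define PCL_CACHE_ALIGN alignas(std::hardware_destructive_interference_size)
#else
#define PCL_CACHE_ALIGN
#endif

struct PerThreadState {
  // Aligning per-thread counters to separate cache lines avoids false
  // sharing between neighbouring threads' state.
  PCL_CACHE_ALIGN long processed_count = 0;
};
```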

So could you please give an update on this issue with a view to a pull request? It could be the first std::thread multithread implementation of a tool in PCL, in a case where OpenMP cannot do the job. I have it and will keep it in my fork anyway.
