[ConditionalEuclideanClustering::segment] std::(multi)thread (only) 21-46% faster on non-organized clouds, much faster apparently on organized clouds #5710
I made a mistake in designing the sharing strategy, so I am reviewing it.

```cmake
# Set target C++ standard and required compiler features
set(CMAKE_CXX_STANDARD 20 CACHE STRING "The target C++ standard. PCL requires C++20 or higher.")
set(CMAKE_CUDA_STANDARD 20 CACHE STRING "The target CUDA/C++ standard. PCL requires CUDA/C++ 20 or higher.")
```
We currently still aim to have PCL compatible with C++14, so that people using older compilers (which do not fully support C++17) can still use PCL. This may change in the future, but there is no plan yet for when we will drop C++14 and switch to C++17. Regarding your multithreaded algorithm: to be honest, a 21-46% improvement isn't that great when using 3-4 threads. When using 4 threads, for example, one should expect at most half of the one-thread computing time, ideally only a third. Have you analysed where the segmentation algorithm spends most of its time?
Why not add a compiler option for it? C++14 is 9 years old and there is a lot of useful stuff in C++17-20. You could introduce a compilation option like WITH_C20, or something like that, which if false disables the methods that rely on C++20. By the way, I would not extend to C++17, I would extend to C++20.
Notice I wrote "only" in the title; 46% is twice as fast anyway, and not all scenarios show the same speed-up. A main bottleneck, I think, is the use of Indices and the KdTree. I only have unorganized clouds available; I bet that with an organized cloud the speed-up would be higher. Can you suggest a big organized cloud you have already used for other testing? Do you think OpenMP or CUDA could get better results on this algorithm? I don't think so. Have you ever tried standard C++ multithreading in PCL? Would this be the first time?
There is an option to control the C++ standard used to compile PCL: https://github.com/PointCloudLibrary/pcl/blob/master/CMakeLists.txt#L11
Totally agree that they are useful.
According to Wikipedia, only Visual Studio has full support for C++20, while GCC and Clang only have partial support, and only in relatively new compiler versions. So requiring C++20 is not a good idea currently, in my opinion. I am open to discuss requiring C++17.
The biggest organized clouds used are currently 480x640 (milk_cartoon_all_small_clorox.pcd and table_scene_mug_stereo_textured.pcd for example). So not very big.
I can't think of a place where standard C++ multithreading is currently used in PCL. I would suspect that most of the time is spent searching for neighbours, but I haven't done any exact analysis yet (that's why I was interested in whether you have done this). If most of the time is indeed spent this way, an option could be to first search for the neighbours of all points and store the results. This should be easy to parallelize, with OpenMP or with standard multithreading. Then the rest of the algorithm could use the stored results. It would just be important to check that this approach does not require gigabytes of RAM.
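A minimal sketch of that precompute-then-cluster idea (a hypothetical helper, not existing PCL code; it assumes the searcher has already been set up with the cloud and that concurrent read-only radiusSearch queries are safe, which is how PCL's OpenMP-based estimators already use a shared kd-tree; the RAM caveat above still applies):

```cpp
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <pcl/search/search.h>
#include <thread>
#include <vector>

// Precompute the radius neighbours of every point once, in parallel,
// so that the (serial) clustering pass afterwards only reads stored results.
std::vector<pcl::Indices>
precomputeNeighbours (const pcl::PointCloud<pcl::PointXYZ>::ConstPtr &cloud,
                      const pcl::search::Search<pcl::PointXYZ> &searcher,
                      const double radius,
                      const std::size_t num_threads)
{
  std::vector<pcl::Indices> all_neighbours (cloud->size ());
  std::vector<std::thread> workers;
  for (std::size_t t = 0; t < num_threads; ++t)
    workers.emplace_back ([&, t] {
      pcl::Indices nn_indices;
      std::vector<float> nn_sqr_dists;
      // static, interleaved split of the point range over the threads
      for (std::size_t i = t; i < cloud->size (); i += num_threads)
      {
        searcher.radiusSearch ((*cloud)[i], radius, nn_indices, nn_sqr_dists);
        all_neighbours[i] = nn_indices; // each thread writes to disjoint slots
      }
    });
  for (auto &w : workers)
    w.join ();
  return all_neighbours;
}
```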
I am trying different implementations, but all of them give more or less the same speed results. In Setting 1 above, of the 2 sec, 350 msec are required by searcher_->setInputCloud, which does some preparation work; the rest is required by parsing the neighbours and checking the connection and condition function. I do not understand why it speeds up so little (and less with higher thread counts of 5 or 6). I even tested with the mutex locks disabled and the processing time barely changed, so it does not seem to be a problem of excessive mutex locking. I tried making thread-local copies of the cloud and indices and the processing time increased (I expected that, but I just wanted to be sure). I suppose the main problem is the use of Indices, which makes memory access very inefficient and scattered (different threads continuously access different RAM areas, making the cache of little use).
I don't know how to do this, as the neighbourhood changes at every run, as a function of a different cloud and also a potentially different radius (the user can change its value at any time). I suppose it would also be huge in RAM.
Even if the cloud is accessed without indices, the class builds a trivial Indices with values 0, 1, 2, 3, etc., enforcing double indirection and inefficient memory access, because the searcher class wants Indices.
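For illustration only (hypothetical functions, not PCL source), the difference between going through an index list and iterating the cloud directly:

```cpp
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <pcl/types.h>

// Access via an index list: every point read goes through a second array,
// and the reads become scattered if the indices are not sorted/contiguous.
float
sumZViaIndices (const pcl::PointCloud<pcl::PointXYZ> &cloud, const pcl::Indices &indices)
{
  float sum = 0.0f;
  for (const auto idx : indices)
    sum += cloud[idx].z;
  return sum;
}

// Direct iteration: contiguous, cache-friendly reads.
float
sumZDirect (const pcl::PointCloud<pcl::PointXYZ> &cloud)
{
  float sum = 0.0f;
  for (const auto &pt : cloud)
    sum += pt.z;
  return sum;
}
```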
That time is probably spent building the kd-tree (used when the cloud is not organized).
How would that be relevant to multi-threading? Cloud and indices are only read from, never written to, inside the segmentation.
I mean making the copies at the beginning of the segmentMT call.
I am doing tests with the organized cloud and the speed-up (as expected) is much higher and grows with the number of threads.

12th Gen Intel(R) Core(TM) i7-12700 2.10 GHz: ordinary (1 thread) method in 910 msec

Another PC:
I see, OK, I don't know, I'll think about it. Right now, in my multithread implementation, radiusSearch is already called by only one thread, only once, for each point added to a cluster, while points that are not added to a cluster are not associated with a radiusSearch call (searcher_->radiusSearch ((*input_)[current_cluster[cii]], cluster_tolerance_, nn_indices, nn_distances)), so you would not reduce the computation by calling it in advance.
What I would also do is investigate the use of different searchers for unorganized clouds, searchers that might be more multithread-friendly. Any suggestions about this are welcome.
The fact that organized clouds are sped up so much more might confirm my suspicion about this: more contiguous memory access, fewer cache invalidations, and less time spent by different threads on scattered, non-contiguous areas. It's just a rule-of-thumb guess.
I am not really convinced that organized vs unorganized makes such a big difference. To eliminate all other effects, you could use an organized cloud and make it unorganized by

```cpp
cloud.width = cloud.width * cloud.height;
cloud.height = 1;
```

and then compare the speed-up when the cloud is organized vs when it is not organized. Honestly, a x4 or x4.13 speed-up when using 8 threads is still not great, considering a good parallelization would lead to a speed-up somewhat close to x8.

Another thought: doing the tests with a point cloud and a radius that result in only 2 clusters (top of the milk carton and the rest) might not be really representative (not how this clustering would likely be used in practice, IMO), and might give an unrealistic impression of the performance benefit (not sure whether higher or lower than realistic).
PCL also has an octree search ( https://pointclouds.org/documentation/classpcl_1_1search_1_1_octree.html ), but the kdtree search is more common for unorganized point clouds.
Yes, correct: in the multithread implementation radiusSearch is now called exactly once for every point (I was mistaken before), either as the first point of a cluster or as a point added to a cluster. So I don't know whether making all the calls at the beginning and storing the results for later retrieval would speed things up. A speed-up could come from somehow exploiting the symmetry of the relationship to cut the time in half, but I don't think the radiusSearch methods do that.
I am doing other stuff; this was just a quick test. I have no interest in segmenting milk bottles right now in this application, and I have no condition function designed for that. The qualitative result is totally different, though, in that it keeps improving with the number of threads, while with other unorganized clouds it would improve only up to 4 threads in ALL segmentation configurations, whether 2 or 4 or 10 or more segments.
This algorithm is intrinsically not very parallelizable in terms of memory access, memory thread assignment/separation, and point parsing. I wonder what level of speed-up is documented for other parallelized PCL methods, which very probably (like normals) are more parallelizable? I have in mind some other possible strategy improvements anyway.
I had tried it early this morning with a resolution of 128.0 and it was much slower than the kd-tree. I don't know how to tune the resolution, though. FLANN seems to give results similar to the kd-tree.
https://stackoverflow.com/questions/46759930/multithreading-slower-than-single-threading I suppose this method is mainly memory bound, not because of the data size but because of the very scattered access as the clusters grow and intersect across different threads, hence cache efficiency problems and mutual interlocking problems. The example of the guys in the corridor gives the idea. I am investigating a more efficient way of sharing memory.
I obtained an interesting result working only with Setting 1 (to be tested with the other settings... it's late at night... tomorrow).

First I made several optimizations to the strategy and memory sharing, which apparently led to modest improvements (about 21% -> 26%). But then I modified this call as follows:

```cpp
searcher_->radiusSearch ((*input_)[current_cluster[cii]], cluster_tolerance_, nn_indices, nn_distances, 8)
```

and that (probably also playing together with the previous optimizations) revolutionized the results. Now I have:

- original segment, 1 thread: 2400 msec
- multithreaded segment, 3 threads: about 1250 msec

Before, 4 threads was already worse than 3 threads; now it keeps improving, even if more slowly. I should study the search algorithms in more detail to better understand this, interpret the speed results, and deduce whether the result is guaranteed to be (nearly) the same or not. Here the result is not exactly the same; there is a tiny difference. It is fundamental whether the unsigned int max_nn neighbours are just the first max_nn found or the max_nn nearest ones (I guess the first option). Probably running with a much bigger max_nn is a better compromise, to be tested, like max_nn greater than the maximum number of points in the sphere at that resolution, or something similar.

(Screenshots: ordinary segment; new segment, 3 threads; difference.)

Any suggestions are welcome.
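For reference, the difference between the two calls is only the last argument of radiusSearch: max_nn = 0 (the default) returns every neighbour inside the radius, while a positive value caps how many neighbours are returned. A small hypothetical wrapper to make the two variants explicit:

```cpp
#include <pcl/point_types.h>
#include <pcl/search/search.h>
#include <vector>

// Hypothetical wrapper: query the neighbours of one point, optionally capping
// the number of returned neighbours via max_nn (0 = no cap, return all).
int
queryNeighbours (const pcl::search::Search<pcl::PointXYZ> &searcher,
                 const pcl::PointXYZ &point, const double radius,
                 pcl::Indices &nn_indices, std::vector<float> &nn_sqr_dists,
                 const unsigned int max_nn = 0)
{
  // With max_nn == 0 the output vectors can grow as large as the neighbourhood;
  // with e.g. max_nn == 8 they stay small, which reduces memory traffic per query.
  return searcher.radiusSearch (point, radius, nn_indices, nn_sqr_dists, max_nn);
}
```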
That result might confirm my hypothesis that the multithreading efficiency here is memory bound.
@mvieth please don't tell me you're not happy even with this ;)) !!
max_nn could be handled automatically, but this automatic handling is not very convincing anyway; I have to study this matter better. I would like to understand what makes multithreading so inefficient when max_nn defaults to 0, probably a more scattered memory access; probably radiusSearch causes false cache invalidations...
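One possible heuristic for choosing max_nn automatically (purely illustrative and hypothetical, not necessarily the handling that was being considered here) would be to follow the "max_nn greater than the maximum number of points in the sphere" idea from above and derive it from an estimated point density:

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper: estimate an upper bound for the number of neighbours
// inside a sphere of radius `tolerance`, given an estimated point density
// (points per unit volume), and use that instead of max_nn = 0 (unlimited).
unsigned int
estimateMaxNN (const double tolerance, const double points_per_unit_volume,
               const double safety_factor = 2.0)
{
  constexpr double kPi = 3.14159265358979323846;
  const double sphere_volume = 4.0 / 3.0 * kPi * tolerance * tolerance * tolerance;
  const double expected = sphere_volume * points_per_unit_volume * safety_factor;
  return static_cast<unsigned int> (std::max (1.0, std::ceil (expected)));
}
```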
I did a test just to have a "cross benchmark" with OpenMP. I tested NormalEstimation on this same setting: 3300 msec (and I didn't take note of the spikes of higher values it shows once in a while). So even if the normal computation should in theory be more localized (even though it uses search too), and therefore be less memory-inefficient under multithreading, with fewer cache invalidations and less interlocking, you can see that here too we are far from the 1/numberOfThreads value.
I have now benchmarked a std::thread multithreaded implementation I did of NormalEstimation. The results (on another machine, so different timings than in the previous comment), in msec:

- MT(4): 1816-1860 msec
- MP(4): 1800-1930 msec
- 1 thread: 5000-5150 msec

I will reuse some useful information I gathered from this exercise back on the conditional euclidean clustering.
The previous benchmark confirms some of my ideas. In the case of conditional euclidean clustering, I suppose OpenMP is not a good candidate, for reason (2).
With recent optimizations and a more complex condition functor, pcl::ConditionalEuclideanClustering::segmentMT performs as follows.

With one variant of the "// Search for neighbors around the current seed point of the current cluster" call, on an unorganized cloud of 2 million XYZRGB points:
- one thread: 2.3 sec
- speed-up of x2.7

With the other variant of that call:
- one thread: 2.5-2.6 sec
- modest speed-up of x1.3

I did not repeat the test with the organized cloud, which I expect to remain similar to the previous result (ordinary, 1-thread method in 910 msec).
So, shall we insert this implementation into PCL? As I mentioned before, it could take one of these forms (with an extension also of the ordinary method): a `pcl::ConditionalEuclideanClustering::segmentMT` method on its own, or `segmentMT` together with a `void set_max_nn(unsigned int max_nn);` setter.
@mvieth please give an update on this multithread implementation I did, and also on why not move to C++17 or C++20 support. From here I suppose C++17 support is complete for the main compilers and C++20 is mostly covered as well; isn't that so? As for my multithread implementation, I have to check again how much it needs C++17; I don't remember now, but it probably doesn't need it, as I wrote it to run with or without C++17. The use of shared_mutex I made at first was probably eliminated later, but again I have to check.
@mvieth this multithread implementation does not "need" C++17. So please give an update on this issue, so it can go into a pull request? It could be the first std multithread implementation of a tool in PCL, in a case where OpenMP cannot do its job. I have it and will keep it in my fork anyway.
I have implemented a std multithread version of ConditionalEuclideanClustering::segment, on my fork branch attemptConditionalEuclideanMT, and called it:

```cpp
void
segmentMT (IndicesClusters &clusters, const size_t threadNumber=2);
```
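A sketch of how the proposed method would be used (a hypothetical setup: the point type, condition function, tolerance and thread count are placeholders, and segmentMT is the method proposed in this issue, not something already in PCL):

```cpp
#include <cmath>
#include <pcl/point_types.h>
#include <pcl/segmentation/conditional_euclidean_clustering.h>

// Placeholder condition: cluster points whose intensities are similar.
bool
enforceSimilarIntensity (const pcl::PointXYZI &a, const pcl::PointXYZI &b, float /*sqr_dist*/)
{
  return std::abs (a.intensity - b.intensity) < 5.0f;
}

void
clusterWithThreads (const pcl::PointCloud<pcl::PointXYZI>::Ptr &cloud)
{
  pcl::ConditionalEuclideanClustering<pcl::PointXYZI> cec;
  cec.setInputCloud (cloud);
  cec.setConditionFunction (&enforceSimilarIntensity);
  cec.setClusterTolerance (0.05f);

  pcl::IndicesClusters clusters;
  cec.segmentMT (clusters, 4); // proposed multithreaded variant, 4 worker threads
  // cec.segment (clusters);   // existing single-threaded method, for comparison
}
```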
On my PC it improves the time to produce the results by only 21-38%, using 3 or 4 threads.
12th Gen Intel(R) Core(TM) i7-12700 2.10 GHz
20 Logical CPUs - 32.0 GB
Windows 10 Pro
Setting 1:
cloud is dense and NOT organized and has 2 million points XYZRGB
so the search used is pcl::search::KdTree
which takes 350 msec just to execute searcher_->setInputCloud
11 clusters are segmented
5 threads: 2.1-2.3 sec
4 threads: 1.9-2.2 sec
3 threads: 1.8-2 sec
2 threads: 2.1 sec
1 thread: 2.4 sec
21% improvement
Setting 2:
cloud is dense and NOT organized and has 300 000 points XYZRGB
3 threads: 268-295 msec
1 thread: 420msec
4 clusters are found
32% improvement
Setting 3:
cloud is NOT dense and NOT organized and has 485 000 points XYZRGB
3 threads: 480-500 msec
1 thread: 790msec
1 clusters is found
38% improvement
Setting 4: like Setting 1 but running on a different computer, a laptop that is less stable (less reliable for benchmarking) because of heavy software running in the background:
4 threads: 2.9-3.4 sec
1 thread: 5.6-6.1 sec
46% improvement
I am still investigating the reasons for this limited improvement. On a first analysis, the main reason could be that the algorithm is intrinsically not very parallelizable. I can grow the clusters in parallel by splitting the point sequence into subsequences, but at the end I then have to reconnect the clusters that have grown separately in different threads (I record the connections to be made at the end).
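A minimal sketch of that final reconnection step (illustrative only, assuming each thread produces provisionally-labelled clusters plus a list of cross-thread connections; this is not the actual code on the branch):

```cpp
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Merge provisional cluster labels using the recorded cross-thread connections.
// `connections` holds pairs of provisional cluster ids found to touch each other.
// Returns, for every provisional id, the id of the merged cluster it belongs to.
std::vector<std::size_t>
mergeClusterLabels (const std::size_t num_provisional_clusters,
                    const std::vector<std::pair<std::size_t, std::size_t>> &connections)
{
  std::vector<std::size_t> parent (num_provisional_clusters);
  std::iota (parent.begin (), parent.end (), 0);

  // union-find "find" with path halving
  const auto find = [&parent] (std::size_t x) {
    while (parent[x] != x)
      x = parent[x] = parent[parent[x]];
    return x;
  };

  for (const auto &c : connections) // union the two sides of every connection
    parent[find (c.first)] = find (c.second);

  for (std::size_t i = 0; i < parent.size (); ++i)
    parent[i] = find (i); // flatten so each entry is its final merged label
  return parent;
}
```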
You can check that I have quite optimized the critical sections, also using shared_mutexes.
Still, I think it would be an interesting alternative to OpenMP or CUDA, also for other tools/features that, like this one, do not look feasible for OpenMP or CUDA.
I have tried 3 versions: one with a searcher for each thread, one with a unique searcher prepared by the main thread, and one with a unique searcher prepared by the first launched thread. The latter would be faster, but once in a while it gets much slower, probably because I had to implement it with a std::condition_variable which, when triggered, probably slows down the thread pool. So in the end I opted for the 2nd solution (no std::condition_variable).