I plan on porting this SDK to CUDA 7.0 from the ancient OpenCL implementation. A few quick questions. #29
Actually, a few more questions. What are OP_REC2 and OP_REC1? Are you doing a tree traversal with those, and they're showing which way it branches next? (Sorry if I create a plethora of questions; porting uncommented code often leads to confusion on all ends, haha.)
I made the current OpenCL version, but it is actually slower than the CPU even on GPUs with many ALUs. The reason is that I used a suboptimal strategy: I reimplemented the stack-based recursion mechanism in OpenCL by creating a software stack, which means that all ALUs in a group wait until the longest execution path has terminated. This is clearly not a good solution. What is your parallelization strategy for the CUDA implementation?
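For readers who want to see the problem concretely, here is a minimal CUDA sketch of that software-stack strategy; the node layout and all names are hypothetical illustrations, not libnabo's actual code, and it finds a single nearest neighbor per thread. Each lane loops until its own stack empties, but the warp only retires when the longest path does, which is exactly the divergence described above.

```cuda
// Hypothetical flat node layout -- an illustration, not libnabo's code.
struct Node {
    float splitVal;    // split coordinate along splitDim
    int   splitDim;    // -1 marks a leaf
    int   left, right; // child indices into the node array
    int   pointIdx;    // point index, valid for leaves
};

__global__ void nnStackKernel(const Node* nodes, const float* points,
                              const float* queries, int dim,
                              int nQueries, int* results)
{
    int q = blockIdx.x * blockDim.x + threadIdx.x;
    if (q >= nQueries) return;
    const float* query = &queries[q * dim];

    int stack[64];        // per-thread software stack replacing recursion
    int top = 0;
    stack[top++] = 0;     // push the root

    float bestDist = 1e30f;
    int bestIdx = -1;

    while (top > 0) {     // each lane loops until ITS stack is empty,
        const Node n = nodes[stack[--top]];   // but the warp waits for all
        if (n.splitDim < 0) {                 // leaf: test its point
            float d = 0.f;
            for (int i = 0; i < dim; ++i) {
                float diff = points[n.pointIdx * dim + i] - query[i];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; bestIdx = n.pointIdx; }
        } else {                              // inner node
            float delta = query[n.splitDim] - n.splitVal;
            int near = (delta <= 0.f) ? n.left : n.right;
            int far  = (delta <= 0.f) ? n.right : n.left;
            if (delta * delta < bestDist)     // backtracking test
                stack[top++] = far;
            stack[top++] = near;              // near child pops first
        }
    }
    results[q] = bestIdx;
}
```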
This code is almost 5 years old, so I have to re-interpret it almost as much as you ;-)
Feel free to ask further questions if this answer is not clear. Also, if you come up with a better strategy in CUDA, it would be great to back-port it to OpenCL.
Well, if it wasn't already, I planned to change how the trees themselves are stored before being copied to the GPU; I was going to make sure the GPU version is a linear array of nodes. It seems that doing a search against the leaves is quite a bad thing to let a GPU try by default (after an hour or two of research). The approach I was considering would limit depth to 22 in most cases, so I need to go back to the drawing board. I was going to recursively call the kernel every time I followed a branch. I think there is a way to alleviate this limit, but I can't test that since I recently bricked my Titan and only have a 760 to play around with. The 760 only supports a depth of 22 at most.

I was also considering presorting search keys if libnabo doesn't do that already. That alone could create a 3x performance increase GPU-side and perhaps an equally large increase CPU-side, so it would almost certainly improve performance even for the normal CPU side of things. I'm not entirely sure; I need to do more research into exactly how the library works.

Oh, just a heads up: I will primarily be testing the code I write for this on Debian. I could loosely ensure that it'll work on Windows, but no solid promises.
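For what it's worth, the "linear array of nodes" idea usually looks something like the following on the host side. This is a hedged sketch: TreeNode, FlatNode, and flatten are illustrative names and assumptions about the layout, not libnabo's API. Storing children by index instead of by pointer makes the whole tree one contiguous buffer that a single cudaMemcpy can upload.

```cuda
#include <vector>

// Illustrative pointer-based CPU node (an assumption about the host layout).
struct TreeNode {
    float splitVal; int splitDim;      // splitDim == -1 for leaves
    TreeNode *left, *right; int pointIdx;
};

// GPU-friendly flat node: children become indices into one array.
struct FlatNode {
    float splitVal;
    int   splitDim;     // -1 for leaves
    int   left, right;  // child indices, unused for leaves
    int   pointIdx;
};

// Pre-order flattening: returns the index of the node just emitted.
int flatten(const TreeNode* n, std::vector<FlatNode>& out)
{
    int idx = (int)out.size();
    out.push_back({n->splitVal, n->splitDim, -1, -1, n->pointIdx});
    if (n->splitDim >= 0) {
        int l = flatten(n->left, out);   // left subtree directly follows
        int r = flatten(n->right, out);
        out[idx].left = l;
        out[idx].right = r;
    }
    return idx;
}
// The resulting out.data() buffer can be uploaded with one cudaMemcpy.
```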
I would, but I haven't done any work with OpenCL in years :( I'm very rusty, and I know I would do a far better job in CUDA. A few other questions though: How and where is tree splitting done? I've seen a fair share of approaches to GPGPU k-d trees, and almost always the split length is given to the GPU to help speed things up a fair bit. Is it there, just under a different naming convention? Also, I am going to rewrite the sort function as well. It would not perform well at all in CUDA :P (Seeing how there is no sort function, haha.)
Okay, I've got a really good architecture in mind (senior software architect here; I usually nerd out more about the architectures than I do the actual code, haha). Thought I'd share it.
I think (at least I hope) this would be roughly 9x faster than the CPU version. There should be very little uncoalesced memory traffic; most memory accesses would go through shared memory. The memory overhead should be small, since most of this would be copy-once-and-forget.

There are a few more ways to expand this, though, mostly in how querying is done. I could implement a quick merge sort that runs over the query points before the search function executes. That would be fine for most low-dimension cases (e.g., k < 20 or so), but beyond that it would be awful, and I would highly recommend some form of clustering algorithm instead. (I would advise against SVMs; I always get awful performance no matter what the implementation is.) This really only works breadth-first, though. I was trying to incorporate as much caching as possible in order to reduce potential memory issues with larger trees. If the tree's depth is significant, I can have a second mode for depth-first searching.
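One concrete way to realize the "sort the query points first" idea is to order 3D queries by Morton code with Thrust (swapping the merge sort mentioned above for a spatial sort, since what matters is that neighboring threads get neighboring queries). This is a hedged sketch: the bit-spreading trick is the standard one, everything else is illustrative.

```cuda
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>

// Spread the lower 10 bits of v so they occupy every third bit.
__host__ __device__ unsigned int expandBits(unsigned int v)
{
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
}

// 30-bit Morton code for a point in the unit cube.
__host__ __device__ unsigned int morton3D(float x, float y, float z)
{
    unsigned int xx = expandBits((unsigned int)fminf(fmaxf(x * 1024.f, 0.f), 1023.f));
    unsigned int yy = expandBits((unsigned int)fminf(fmaxf(y * 1024.f, 0.f), 1023.f));
    unsigned int zz = expandBits((unsigned int)fminf(fmaxf(z * 1024.f, 0.f), 1023.f));
    return xx * 4 + yy * 2 + zz;
}

// Sort query indices by Morton code so adjacent threads process spatially
// adjacent queries and (mostly) follow the same tree branches.
void sortQueries(thrust::device_vector<unsigned int>& codes,
                 thrust::device_vector<int>& order)
{
    thrust::sequence(order.begin(), order.end());
    thrust::sort_by_key(codes.begin(), codes.end(), order.begin());
}
```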
I also have a few more very complex ideas, and I'm not sure if they'll work, but I want to try them (e.g., GPU-based clustering and broad-phase collision checking against the k-d tree: any node that doesn't collide with us is skipped). This could potentially greatly reduce the number of steps it takes to find where we are in the k-d tree, but clustering like this may take a while. Edit: They didn't work :(
Unfortunately, OpenCL does not support dynamic parallelism or ballots. I do believe that Vulkan does, though, so this should be easily portable once Vulkan becomes available to the general public. Also: what was the bug? And: it works (excluding the fact that it crashes on 3/4ths of the runs, and most branches don't function, haha). Edit:
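Since ballots come up here and in the "vote-ballot architecture" mentioned later in the thread, here is roughly what a warp-level vote on traversal direction looks like in CUDA. This is a sketch: __ballot_sync, __activemask, and __popc are real intrinsics from CUDA 9 onward, while CUDA 7.x spelled the first one __ballot, without the mask argument.

```cuda
// Each lane computes which child its own query prefers; the warp then
// descends into the majority child first so lanes stay together longer.
__device__ int voteOnChild(float queryCoord, float splitVal)
{
    unsigned int mask     = __activemask();
    unsigned int wantLeft = __ballot_sync(mask, queryCoord <= splitVal);
    int leftVotes = __popc(wantLeft);
    int total     = __popc(mask);
    // Majority decision: every lane in the warp gets the same answer.
    return (2 * leftVotes >= total) ? 0 /*left first*/ : 1 /*right first*/;
}
```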
I do not remember; it seems right reading it now, but it is four years old.
So I have some news. In most cases it is significantly slower than the CPU, since I can only do so much to prevent divergence, but in some cases it fully matches. It hasn't yet exceeded CPU performance. It also doesn't help that the CPU I'm running this on is one of the fastest single-threaded CPUs on the market, while the GPU is quite awful, haha. I need to finish hooking it into libnabo, but I'll push after that.
One strategy that would certainly work well on a GPU is to do multiple searches in parallel without backtracking; that is the strategy used by FLANN. The bad thing is that it leads to non-deterministic and unbounded error, while the epsilon-based approximation allows a controlled and bounded error. But backtracking is really bad on a GPU. I imagine that people have been researching other strategies. I am currently not working on that question any more, but it seems that a Google Scholar search for kd-tree GPU leads to interesting results.
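A minimal sketch of that no-backtracking strategy, reusing the hypothetical FlatNode layout from the flattening sketch above: each thread descends greedily from the root to a single leaf and reports that leaf's point. It is divergence-friendly, but, as noted, the error is unbounded.

```cuda
__global__ void approxNNKernel(const FlatNode* nodes, const float* queries,
                               int dim, int nQueries, int* results)
{
    int q = blockIdx.x * blockDim.x + threadIdx.x;
    if (q >= nQueries) return;
    const float* query = &queries[q * dim];

    int cur = 0;                          // start at the root
    while (nodes[cur].splitDim >= 0) {    // greedy descent, no backtracking
        const FlatNode n = nodes[cur];
        cur = (query[n.splitDim] <= n.splitVal) ? n.left : n.right;
    }
    results[q] = nodes[cur].pointIdx;     // candidate from the leaf reached
}
```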
I tried countless approaches, and nothing seemed to create any form of benefit over a CPU implementation :/ It may be possible, but it would require more research than I am willing to dedicate right now. I'll push some sample CUDA code to the main repo if you want, so people can tinker around with it.
@MatrixCompSci Thanks for the status update, and thanks for spending time on this. It would be great if you could push your state and a description of your experiences as a pull request so we can archive it. Thanks again for spending time on this.
@LouisCastricato Hope it is OK to re-surface this issue. Your experiments were very cool and had a strong potential to improve performance of the library (which is already quite good). I have a few questions. Did you try or consider
Thank you very much in advance. Best, |
Funnily enough, I'm actually working on an LSH (locality-sensitive hashing) variant of this with some folks at ETH Zurich (seeing that libnabo is also from ETH, I'm collaborating with Bastian Rieck) and a security company.

So for kNN, this method, the one I created here, is awful. The vote-ballot architecture does not scale well as you increase k, since the cache-coherency penalty is massive. Linearly searching through all the points is awful. The key to preserving cache and SIMD coherency is skipping branches. That's why you can use LSH. Let's say that your data is N-dimensional. Provably, you can construct an n << N dimensional coordinate space using LSH such that you can determine the first n < m < N branches that your data is under.

The dataset I am working with is in the range of 5PB - 10PB (can't give specifics, sorry). If we were to construct a tree such that the GPU kernel thread blocks diverged early on, we would be heavily straining the bandwidth of the card as well as the network infrastructure, and losing all hope of SIMD (given a non-balanced tree). That being said, assuming we can predict the first m branches, we can greatly reduce the strain on the GPU. This is the main issue I found when trying to port libnabo to CUDA: the strain was massive due to early divergence. As such, the best way to do this would be to cluster on hashes and then execute kNN.

So, that answers (2): don't use a k-d tree, or at least, do not use exclusively k-d trees; you won't get anywhere. When trying to make this parallel, your number one priority is figuring out how to order the queries optimally. (3) The later steps are by far the slowest, since that's when all the threads have diverged. Increasing performance here requires decreasing divergence early on. (4) The bound is roughly 50% memory bandwidth and 50% SIMD vs. MIMD.

Bastian and the fellow at the security company are discussing more abstract methods, which I don't know if I am allowed to discuss, for predicting early branches. Personally, I am about to start a CS+Math PhD (wow, have things changed since I posted this), so I might eventually come back to something like this. I doubt it would be libnabo specifically.

Edit: Clustering is faster because it removes the memory-bandwidth penalty for the first m steps, and it diminishes the cache-coherency penalty after that.
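For readers unfamiliar with the trick: signed random projections are one standard LSH family that yields an n-bit code with n << N, as described above. Here is a hedged host-side sketch (all names illustrative, and whether this matches the variant being built at ETH is an assumption). Points whose codes share a long common prefix tend to be close, so the code can predict the first branches of the search.

```cuda
#include <cstdint>
#include <random>
#include <vector>

// Signed-random-projection LSH: hash an N-dimensional point to an n-bit code.
struct SrpHash {
    int N, n;
    std::vector<float> planes; // n random hyperplanes, each of dimension N

    SrpHash(int dimN, int bits, unsigned seed = 42)
        : N(dimN), n(bits), planes((size_t)bits * dimN)
    {
        std::mt19937 gen(seed);
        std::normal_distribution<float> gauss(0.f, 1.f);
        for (auto& w : planes) w = gauss(gen);
    }

    uint64_t operator()(const float* x) const {
        uint64_t code = 0;
        for (int b = 0; b < n; ++b) {
            float dot = 0.f;
            for (int i = 0; i < N; ++i) dot += planes[(size_t)b * N + i] * x[i];
            code = (code << 1) | (dot >= 0.f); // one sign bit per hyperplane
        }
        return code;
    }
};
```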
@LouisCastricato Thank you for your thorough and prompt reply. I'm writing a few answers/questions below your comments:
Sorry if some of the questions are too basic. I'm not a kNN expert, so I'd like to make sure that I understand what you mean in full.

On a separate note, I've run libnabo on 3D point clouds through Intel VTune (hotspots, uarch, mem-access) and Advisor. Something quite interesting is that the footprint of this library is super small, as mentioned in the paper; it is really well designed. But it seems to have some back-end bounds (execution units parallelized on multiple cores through OpenMP taking longer than expected), which become much more influential if data is not aligned properly (25x more). This is only an initial exploration; I'll resume it as I gain deeper knowledge at the algorithmic level, since otherwise any low-level optimization might be low-gain. Best,
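On the alignment observation: if misaligned point data really costs that much, pinning the buffers to a 64-byte boundary is cheap to try. A minimal sketch, assuming C++17's std::aligned_alloc; the wrapper name is illustrative.

```cuda
#include <cstdlib>
#include <new>

// Allocate the point-cloud buffer on a 64-byte boundary so SIMD loads and
// cache lines line up. std::aligned_alloc requires the size to be a
// multiple of the alignment, hence the round-up.
float* allocPoints(std::size_t count)
{
    std::size_t bytes = count * sizeof(float);
    bytes = (bytes + 63) & ~std::size_t(63);          // round up to 64
    auto* p = static_cast<float*>(std::aligned_alloc(64, bytes));
    if (!p) throw std::bad_alloc();
    return p;
}
// Free with std::free(p), not delete[].
```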
Kind ping @LouisCastricato :)
Are there any gotchas in the original implementation? I plan, at least initially, to make my port as similar as possible to the original code just to get it functioning first (e.g., no dynamic parallelism and minimal shared memory). Anything I should look out for?
I'll probably do this sometime next week and (hopefully) create a pull request before the end of the month.
I'd port it to a newer version of OpenCL, but there's no reason to bother until Vulkan comes out :P