HDBSCAN bug on A100 #4024
@divyegala, I'm a little bit confused by this. Since this is a BFS over each level of a tree, each node is visited only once, with a separate kernel launch per level. This means a child will never be visited on the same kernel launch as its parent, and the threads all act independently of one another within each launch. The EDIT: Never mind, I just remembered that
@cjnolet that is true, I forgot that there is a read in the if conditional as well. In that case, we can get away without creating temporary arrays by doing a grid sync:
I'm not familiar w/ the cooperative groups grid sync (and I'm still out of office but just wanted to chime in on this thread to make sure this problem was actually solved). We just need to make sure it's not possible for any thread (which could be scheduled in different blocks at different times depending on the number of data samples) to be able to read the frontier in the same kernel after its parent marks its children in the frontier. According to this blog,
@cjnolet you may be right here. For the cooperative groups grid sync to work, I think we need to be able to guarantee that all threads can fit on the device at the same time, which I don't think we can. Let me think a little more on this, otherwise, I'll go ahead and use the intermediate array solution.
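To make the co-residency concern above concrete, here is a small host-side C++ sketch (not code from this PR). It only does the arithmetic: one thread per tree node, and a grid-wide sync is assumed valid only when every block can be resident at once. The function names, the 256-thread block size, and the blocks-per-SM figure are all hypothetical; in real CUDA code the per-SM figure would come from `cudaOccupancyMaxActiveBlocksPerMultiprocessor` and the SM count from `cudaDeviceProp::multiProcessorCount`.

```cpp
#include <cstddef>

// Number of thread blocks needed to give every tree node its own thread.
// (Hypothetical helper, not part of cuML.)
std::size_t blocks_needed(std::size_t n_nodes, std::size_t threads_per_block) {
  return (n_nodes + threads_per_block - 1) / threads_per_block;  // ceiling division
}

// A grid-wide sync requires all blocks of the launch to be resident on the
// device simultaneously. max_blocks_per_sm is an assumed, kernel-dependent
// occupancy figure; sm_count is the number of SMs on the device.
bool fits_on_device(std::size_t n_nodes,
                    std::size_t threads_per_block,
                    std::size_t max_blocks_per_sm,
                    std::size_t sm_count) {
  return blocks_needed(n_nodes, threads_per_block) <= max_blocks_per_sm * sm_count;
}
```

For example, with 108 SMs (as on an A100), an assumed 8 resident blocks per SM, and 256 threads per block, `fits_on_device(100000, 256, 8, 108)` holds but `fits_on_device(1000000, 256, 8, 108)` does not, so the one-thread-per-node launch cannot rely on a grid sync for arbitrary input sizes.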
Codecov Report
@@            Coverage Diff             @@
##             branch-21.08    #4024   +/-   ##
================================================
  Coverage                ?   85.46%
  Files                   ?      230
  Lines                   ?    18133
  Branches                ?        0
================================================
  Hits                    ?    15498
  Misses                  ?     2635
  Partials                ?        0
@cjnolet just tagging you for when you are back in office: I implemented the intermediate frontier solution, and I'd appreciate your review on this.
LGTM
@gpucibot merge
While this issue only appeared on A100, it could have appeared on any other GPU. In this kernel https://github.com/rapidsai/cuml/blob/c6f992a5fcbccf5677ca6d639af6b84e93aa8108/cpp/src/hdbscan/detail/kernels/condense.cuh#L85, we launch a thread for every node of a binary tree on the GPU. The problem is then:
1. Each node marks itself out of the frontier https://github.com/rapidsai/cuml/blob/c6f992a5fcbccf5677ca6d639af6b84e93aa8108/cpp/src/hdbscan/detail/kernels/condense.cuh#L94
2. Every node that is not a leaf marks its left and right child into the frontier https://github.com/rapidsai/cuml/blob/c6f992a5fcbccf5677ca6d639af6b84e93aa8108/cpp/src/hdbscan/detail/kernels/condense.cuh#L117
This is UB because it is a data race: the thread for a non-leaf node could be marking itself out of the frontier while the thread for its parent is marking that same node into the frontier, with no ordering between the two writes.
Edit: Dropped the `__threadfence()` solution as it wasn't fully correct. Using a `next_frontier` array instead to keep track of the frontier for the next BFS iteration.
Authors:
- Divye Gala (https://github.com/divyegala)
Approvers:
- Dante Gama Dessavre (https://github.com/dantegd)
- Corey J. Nolet (https://github.com/cjnolet)
URL: rapidsai#4024
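The `next_frontier` idea described in the commit message above can be sketched on the host in plain C++. This is a sequential illustration under stated assumptions, not the cuML kernel: the `Tree` layout, `process_level`, and `bfs_levels` are hypothetical names, and the inner loop stands in for one kernel launch in which each node has its own thread. The point it shows is that each pass only reads `frontier` and only writes `next_frontier`, so no node is both read and marked in the same pass.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical array-form binary tree: left[i] / right[i] are the child
// indices of node i, or -1 when that child does not exist.
struct Tree {
  std::vector<int> left;
  std::vector<int> right;
};

// Stands in for one kernel launch (one BFS level). Reads only `frontier`
// and writes only `next_frontier`. Returns how many nodes were visited.
std::size_t process_level(const Tree& tree,
                          const std::vector<char>& frontier,
                          std::vector<char>& next_frontier) {
  std::size_t visited = 0;
  for (std::size_t node = 0; node < frontier.size(); ++node) {
    if (!frontier[node]) continue;  // node is not part of this level
    ++visited;
    if (tree.left[node] >= 0) next_frontier[tree.left[node]] = 1;
    if (tree.right[node] >= 0) next_frontier[tree.right[node]] = 1;
  }
  return visited;
}

// Runs the level-by-level traversal and returns the number of nodes
// visited at each level.
std::vector<std::size_t> bfs_levels(const Tree& tree, int root) {
  const std::size_t n = tree.left.size();
  std::vector<char> frontier(n, 0), next_frontier(n, 0);
  frontier[root] = 1;
  std::vector<std::size_t> per_level;
  while (true) {
    const std::size_t visited = process_level(tree, frontier, next_frontier);
    if (visited == 0) break;  // no node left in the frontier
    per_level.push_back(visited);
    std::swap(frontier, next_frontier);                        // next level becomes current
    std::fill(next_frontier.begin(), next_frontier.end(), 0);  // clear the old frontier
  }
  return per_level;
}
```

On a complete binary tree with seven nodes rooted at index 0, `bfs_levels` reports 1, 2, and 4 nodes for the three levels. Clearing the old buffer after the swap replaces the per-node "mark myself out of the frontier" write from the original kernel, which is one of the two conflicting writes.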