-
Notifications
You must be signed in to change notification settings - Fork 508
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add branch detection functionality #648
Conversation
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional comments and questions about specific parts of the code:
@@ -124,7 +124,7 @@ cpdef np.ndarray[np.double_t, ndim=2] mst_linkage_core_vector( | |||
continue | |||
|
|||
right_value = current_distances[j] | |||
right_source = current_sources[j] | |||
right_source = <np.intp_t> current_sources[j] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes here solved the following test errors on my machine:
FAILED hdbscan/tests/test_hdbscan.py::test_hdbscan_prims_kdtree - TypeError: 'float' object cannot be interpreted as an integer
FAILED hdbscan/tests/test_hdbscan.py::test_hdbscan_prims_balltree - TypeError: 'float' object cannot be interpreted as an integer
FAILED hdbscan/tests/test_hdbscan.py::test_hdbscan_high_dimensional - TypeError: 'float' object cannot be interpreted as an integer
FAILED hdbscan/tests/test_rsl.py::test_rsl_prims_balltree - TypeError: 'float' object cannot be interpreted as an integer
FAILED hdbscan/tests/test_rsl.py::test_rsl_prims_kdtree - TypeError: 'float' object cannot be interpreted as an integer
These tests do not appear to be broken on the test runner for the master branch but were crashing my machine.
Could this be related to a specific Cython version?
union_find.union_(<np.intp_t> row[0], cluster) | ||
union_find.union_(<np.intp_t> row[1], cluster) | ||
union_find.union_(np.intp(row[0]), cluster) | ||
union_find.union_(np.intp(row[1]), cluster) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes here solved to following tests on my machine:
FAILED hdbscan/tests/test_hdbscan.py::test_hdbscan_dbscan_clustering - TypeError: 'numpy.float64' object cannot be interpreted as an integer
FAILED hdbscan/tests/test_rsl.py::test_rsl_distance_matrix - TypeError: 'numpy.float64' object cannot be interpreted as an integer
FAILED hdbscan/tests/test_rsl.py::test_rsl_feature_vector - TypeError: 'numpy.float64' object cannot be interpreted as an integer
FAILED hdbscan/tests/test_rsl.py::test_rsl_callable_metric - TypeError: 'numpy.float64' object cannot be interpreted as an integer
FAILED hdbscan/tests/test_rsl.py::test_rsl_boruvka_balltree - TypeError: 'numpy.float64' object cannot be interpreted as an integer
FAILED hdbscan/tests/test_rsl.py::test_rsl_prims_balltree - TypeError: 'numpy.float64' object cannot be interpreted as an integer
FAILED hdbscan/tests/test_rsl.py::test_rsl_prims_kdtree - TypeError: 'numpy.float64' object cannot be interpreted as an integer
FAILED hdbscan/tests/test_rsl.py::test_rsl_hierarchy - TypeError: 'numpy.float64' object cannot be interpreted as an integer
FAILED hdbscan/tests/test_rsl.py::test_rsl_high_dimensional - TypeError: 'numpy.float64' object cannot be interpreted as an integer
Similar to the previous comment, these tests do not appear to be broken on the test runner for the master branch but were crashing my machine.
@@ -483,10 +542,8 @@ cdef np.ndarray[np.intp_t, ndim=1] do_labelling( | |||
return result_arr | |||
|
|||
|
|||
cdef get_probabilities(np.ndarray tree, dict cluster_map, np.ndarray labels): | |||
|
|||
cdef get_probabilities(np.ndarray tree, dict cluster_map, np.ndarray labels, np.ndarray deaths): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moving deaths to be an input parameter allows me to re-use the function for branch hierarchies.
cpdef np.intp_t traverse_upwards(np.ndarray cluster_tree, np.double_t cluster_selection_epsilon, np.intp_t leaf, np.intp_t allow_single_cluster): | ||
|
||
root = cluster_tree['parent'].min() | ||
parent = cluster_tree[cluster_tree['child'] == leaf]['parent'] | ||
parent = cluster_tree[cluster_tree['child'] == leaf]['parent'][0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change and the one on line 691 fixed the following tests on my machine:
FAILED hdbscan/tests/test_flat.py::test_flat_base_epsilon - TypeError: only integer scalar arrays can be converted to a scalar index
FAILED hdbscan/tests/test_flat.py::test_switch_to_leaf - TypeError: only integer scalar arrays can be converted to a scalar index
FAILED hdbscan/tests/test_flat.py::test_approx_predict_same_clusters - TypeError: only integer scalar arrays can be converted to a scalar index
FAILED hdbscan/tests/test_flat.py::test_approx_predict_diff_clusters - TypeError: only integer scalar arrays can be converted to a scalar index
FAILED hdbscan/tests/test_flat.py::test_mem_vec_same_clusters - TypeError: only integer scalar arrays can be converted to a scalar index
FAILED hdbscan/tests/test_flat.py::test_mem_vec_diff_clusters - TypeError: only integer scalar arrays can be converted to a scalar index
FAILED hdbscan/tests/test_flat.py::test_all_points_mem_vec_same_clusters - TypeError: only integer scalar arrays can be converted to a scalar index
FAILED hdbscan/tests/test_flat.py::test_all_points_mem_vec_diff_clusters - TypeError: only integer scalar arrays can be converted to a scalar index
Again, these tests do not appear to be broken on the test runner but were crashing my machine.
@@ -656,6 +715,78 @@ cpdef set epsilon_search(set leaves, np.ndarray cluster_tree, np.double_t cluste | |||
|
|||
return set(selected_clusters) | |||
|
|||
|
|||
cpdef np.ndarray simplify_branch_hierarchy(np.ndarray condensed_tree, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea of persistence differs for clusters and branches. With cluster, all data points enter the filtration at distance=0 so a persistence threshold can be implemented as a minimum epsilon value. With branches, points themselves have an eccentricity value, so we need to look at the eccentricity range to determine persistence. It was easier to implement the persistence threshold by simplifying the hierarchy rather than to adapt the epsilon search approach.
@@ -1275,6 +1293,38 @@ def generate_prediction_data(self): | |||
"than mere distances is required!" | |||
) | |||
|
|||
def generate_branch_detection_data(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure how useful this function is. It does not re-compute the minimum spanning tree (MST) because that is quite involved, but the MST is required for detecting branches. That leaves a potentially common unsupported pattern:
# This works
clusterer = HDBSCAN().fit(data)
clusterer.generate_branch_detection_data()
# This breaks because MST not available...
branch_detector = BranchDetector(clusterer)
Is there a better way to deal with this case? Should I fail early within generate_branch_detection_data()
when an MST is not available? Is there an easy way to re-compute the MST?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think failing early with a clear error message and suggestion for how to fix it would be the best approach here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for such a complete and comprehensive PR! I especially appreciate the comprehensive documentation and tests. That makes this process so much easier for me. I'm just going to merge as is because I would really like to have this functionality available, and I'll handle the early error on lack on min_spanning_tree
.
Thanks again for all your work on this. I really do think this is a major contribution and will be very useful for a great many people.
@@ -1275,6 +1293,38 @@ def generate_prediction_data(self): | |||
"than mere distances is required!" | |||
) | |||
|
|||
def generate_branch_detection_data(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think failing early with a clear error message and suggestion for how to fix it would be the best approach here.
Dear maintainers,
This pull request adds the branch detection functionality from our preprint (https://arxiv.org/abs/2311.15887). Our main goal was to detect branch hierarchies within already detected clusters. These hierarchies describe cluster shapes, which can reveal subgroups not expressed in the density profile.
I have tried to add the functionality in way that minimises its impact on the codebase. I settled on the pattern used by the prediction module. The main usage pattern now looks like this:
The
BranchDetector
class mimics theHDBSCAN
class and provides access to labels, membership probabilities, the detected hierarchies, and more. This way, end-users that just want clusters do not have to interact with the branch detection functionality at all.I needed to make a couple of unrelated changes in Cython code to make all tests pass on my machine. I will try to mark these changes with review comments in the PR. Please advice on whether I should remove these changes from the PR or keep them in.
I hope you will consider merging this PR. Let me know if things need to be fixed / changed to better match your vision for the project.
Kind regards,
Jelmer Bot