Add branch detection functionality #648

JelmerBot · 2024-07-24T13:07:46Z

Dear maintainers,

This pull request adds the branch detection functionality from our preprint (https://arxiv.org/abs/2311.15887). Our main goal was to detect branch hierarchies within already detected clusters. These hierarchies describe cluster shapes, which can reveal subgroups not expressed in the density profile.

I have tried to add the functionality in way that minimises its impact on the codebase. I settled on the pattern used by the prediction module. The main usage pattern now looks like this:

from hdbscan import HDBSCAN, BranchDetector

clusterer = HDBSCAN(branch_detection_data=True).fit(data)
branch_detector = BranchDetector().fit(clusterer)

The BranchDetector class mimics the HDBSCAN class and provides access to labels, membership probabilities, the detected hierarchies, and more. This way, end-users that just want clusters do not have to interact with the branch detection functionality at all.

I needed to make a couple of unrelated changes in Cython code to make all tests pass on my machine. I will try to mark these changes with review comments in the PR. Please advice on whether I should remove these changes from the PR or keep them in.

I hope you will consider merging this PR. Let me know if things need to be fixed / changed to better match your vision for the project.

Kind regards,

Jelmer Bot

review-notebook-app · 2024-07-24T13:07:51Z

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

JelmerBot

Additional comments and questions about specific parts of the code:

JelmerBot · 2024-07-24T13:11:16Z

hdbscan/_hdbscan_linkage.pyx

@@ -124,7 +124,7 @@ cpdef np.ndarray[np.double_t, ndim=2] mst_linkage_core_vector(
                continue

            right_value = current_distances[j]
-            right_source = current_sources[j]
+            right_source = <np.intp_t> current_sources[j]


The changes here solved the following test errors on my machine:

FAILED hdbscan/tests/test_hdbscan.py::test_hdbscan_prims_kdtree - TypeError: 'float' object cannot be interpreted as an integer FAILED hdbscan/tests/test_hdbscan.py::test_hdbscan_prims_balltree - TypeError: 'float' object cannot be interpreted as an integer FAILED hdbscan/tests/test_hdbscan.py::test_hdbscan_high_dimensional - TypeError: 'float' object cannot be interpreted as an integer FAILED hdbscan/tests/test_rsl.py::test_rsl_prims_balltree - TypeError: 'float' object cannot be interpreted as an integer FAILED hdbscan/tests/test_rsl.py::test_rsl_prims_kdtree - TypeError: 'float' object cannot be interpreted as an integer

These tests do not appear to be broken on the test runner for the master branch but were crashing my machine.
Could this be related to a specific Cython version?

JelmerBot · 2024-07-24T13:16:13Z

hdbscan/_hdbscan_tree.pyx

-            union_find.union_(<np.intp_t> row[0], cluster)
-            union_find.union_(<np.intp_t> row[1], cluster)
+            union_find.union_(np.intp(row[0]), cluster)
+            union_find.union_(np.intp(row[1]), cluster)


The changes here solved to following tests on my machine:

FAILED hdbscan/tests/test_hdbscan.py::test_hdbscan_dbscan_clustering - TypeError: 'numpy.float64' object cannot be interpreted as an integer FAILED hdbscan/tests/test_rsl.py::test_rsl_distance_matrix - TypeError: 'numpy.float64' object cannot be interpreted as an integer FAILED hdbscan/tests/test_rsl.py::test_rsl_feature_vector - TypeError: 'numpy.float64' object cannot be interpreted as an integer FAILED hdbscan/tests/test_rsl.py::test_rsl_callable_metric - TypeError: 'numpy.float64' object cannot be interpreted as an integer FAILED hdbscan/tests/test_rsl.py::test_rsl_boruvka_balltree - TypeError: 'numpy.float64' object cannot be interpreted as an integer FAILED hdbscan/tests/test_rsl.py::test_rsl_prims_balltree - TypeError: 'numpy.float64' object cannot be interpreted as an integer FAILED hdbscan/tests/test_rsl.py::test_rsl_prims_kdtree - TypeError: 'numpy.float64' object cannot be interpreted as an integer FAILED hdbscan/tests/test_rsl.py::test_rsl_hierarchy - TypeError: 'numpy.float64' object cannot be interpreted as an integer FAILED hdbscan/tests/test_rsl.py::test_rsl_high_dimensional - TypeError: 'numpy.float64' object cannot be interpreted as an integer

Similar to the previous comment, these tests do not appear to be broken on the test runner for the master branch but were crashing my machine.

JelmerBot · 2024-07-24T13:18:16Z

hdbscan/_hdbscan_tree.pyx

@@ -483,10 +542,8 @@ cdef np.ndarray[np.intp_t, ndim=1] do_labelling(
    return result_arr


-cdef get_probabilities(np.ndarray tree, dict cluster_map, np.ndarray labels):
-
+cdef get_probabilities(np.ndarray tree, dict cluster_map, np.ndarray labels, np.ndarray deaths):


Moving deaths to be an input parameter allows me to re-use the function for branch hierarchies.

JelmerBot · 2024-07-24T13:22:59Z

hdbscan/_hdbscan_tree.pyx

 cpdef np.intp_t traverse_upwards(np.ndarray cluster_tree, np.double_t cluster_selection_epsilon, np.intp_t leaf, np.intp_t allow_single_cluster):

    root = cluster_tree['parent'].min()
-    parent = cluster_tree[cluster_tree['child'] == leaf]['parent']
+    parent = cluster_tree[cluster_tree['child'] == leaf]['parent'][0]


This change and the one on line 691 fixed the following tests on my machine:

FAILED hdbscan/tests/test_flat.py::test_flat_base_epsilon - TypeError: only integer scalar arrays can be converted to a scalar index FAILED hdbscan/tests/test_flat.py::test_switch_to_leaf - TypeError: only integer scalar arrays can be converted to a scalar index FAILED hdbscan/tests/test_flat.py::test_approx_predict_same_clusters - TypeError: only integer scalar arrays can be converted to a scalar index FAILED hdbscan/tests/test_flat.py::test_approx_predict_diff_clusters - TypeError: only integer scalar arrays can be converted to a scalar index FAILED hdbscan/tests/test_flat.py::test_mem_vec_same_clusters - TypeError: only integer scalar arrays can be converted to a scalar index FAILED hdbscan/tests/test_flat.py::test_mem_vec_diff_clusters - TypeError: only integer scalar arrays can be converted to a scalar index FAILED hdbscan/tests/test_flat.py::test_all_points_mem_vec_same_clusters - TypeError: only integer scalar arrays can be converted to a scalar index FAILED hdbscan/tests/test_flat.py::test_all_points_mem_vec_diff_clusters - TypeError: only integer scalar arrays can be converted to a scalar index

Again, these tests do not appear to be broken on the test runner but were crashing my machine.

JelmerBot · 2024-07-24T13:28:01Z

hdbscan/_hdbscan_tree.pyx

@@ -656,6 +715,78 @@ cpdef set epsilon_search(set leaves, np.ndarray cluster_tree, np.double_t cluste

    return set(selected_clusters)

+
+cpdef np.ndarray simplify_branch_hierarchy(np.ndarray condensed_tree, 


The idea of persistence differs for clusters and branches. With cluster, all data points enter the filtration at distance=0 so a persistence threshold can be implemented as a minimum epsilon value. With branches, points themselves have an eccentricity value, so we need to look at the eccentricity range to determine persistence. It was easier to implement the persistence threshold by simplifying the hierarchy rather than to adapt the epsilon search approach.

JelmerBot · 2024-07-24T13:51:39Z

hdbscan/hdbscan_.py

@@ -1275,6 +1293,38 @@ def generate_prediction_data(self):
                "than mere distances is required!"
            )

+    def generate_branch_detection_data(self):


Not sure how useful this function is. It does not re-compute the minimum spanning tree (MST) because that is quite involved, but the MST is required for detecting branches. That leaves a potentially common unsupported pattern:

# This works clusterer = HDBSCAN().fit(data) clusterer.generate_branch_detection_data() # This breaks because MST not available... branch_detector = BranchDetector(clusterer)

Is there a better way to deal with this case? Should I fail early within generate_branch_detection_data() when an MST is not available? Is there an easy way to re-compute the MST?

I think failing early with a clear error message and suggestion for how to fix it would be the best approach here.

lmcinnes

Thanks for such a complete and comprehensive PR! I especially appreciate the comprehensive documentation and tests. That makes this process so much easier for me. I'm just going to merge as is because I would really like to have this functionality available, and I'll handle the early error on lack on min_spanning_tree.

Thanks again for all your work on this. I really do think this is a major contribution and will be very useful for a great many people.

lmcinnes · 2024-08-05T16:42:15Z

hdbscan/hdbscan_.py

@@ -1275,6 +1293,38 @@ def generate_prediction_data(self):
                "than mere distances is required!"
            )

+    def generate_branch_detection_data(self):


I think failing early with a clear error message and suggestion for how to fix it would be the best approach here.

Add branch detection functionality

f58b545

JelmerBot commented Jul 24, 2024

View reviewed changes

Deal with os variance

04c77fe

JelmerBot commented Jul 24, 2024

View reviewed changes

lmcinnes reviewed Aug 5, 2024

View reviewed changes

lmcinnes merged commit a9f74b2 into scikit-learn-contrib:master Aug 5, 2024
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add branch detection functionality #648

Add branch detection functionality #648

JelmerBot commented Jul 24, 2024

review-notebook-app bot commented Jul 24, 2024

JelmerBot left a comment

JelmerBot Jul 24, 2024

JelmerBot Jul 24, 2024

JelmerBot Jul 24, 2024

JelmerBot Jul 24, 2024

JelmerBot Jul 24, 2024

JelmerBot Jul 24, 2024

lmcinnes Aug 5, 2024

lmcinnes left a comment

lmcinnes Aug 5, 2024

		@@ -656,6 +715,78 @@ cpdef set epsilon_search(set leaves, np.ndarray cluster_tree, np.double_t cluste

		return set(selected_clusters)


		cpdef np.ndarray simplify_branch_hierarchy(np.ndarray condensed_tree,

Add branch detection functionality #648

Add branch detection functionality #648

Conversation

JelmerBot commented Jul 24, 2024

review-notebook-app bot commented Jul 24, 2024

JelmerBot left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lmcinnes left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment