Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add branch detection functionality #648

Merged
merged 2 commits into from
Aug 5, 2024

Conversation

JelmerBot
Copy link
Contributor

Dear maintainers,

This pull request adds the branch detection functionality from our preprint (https://arxiv.org/abs/2311.15887). Our main goal was to detect branch hierarchies within already detected clusters. These hierarchies describe cluster shapes, which can reveal subgroups not expressed in the density profile.

I have tried to add the functionality in way that minimises its impact on the codebase. I settled on the pattern used by the prediction module. The main usage pattern now looks like this:

from hdbscan import HDBSCAN, BranchDetector

clusterer = HDBSCAN(branch_detection_data=True).fit(data)
branch_detector = BranchDetector().fit(clusterer)

The BranchDetector class mimics the HDBSCAN class and provides access to labels, membership probabilities, the detected hierarchies, and more. This way, end-users that just want clusters do not have to interact with the branch detection functionality at all.

I needed to make a couple of unrelated changes in Cython code to make all tests pass on my machine. I will try to mark these changes with review comments in the PR. Please advice on whether I should remove these changes from the PR or keep them in.

I hope you will consider merging this PR. Let me know if things need to be fixed / changed to better match your vision for the project.

Kind regards,

Jelmer Bot

Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

Copy link
Contributor Author

@JelmerBot JelmerBot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional comments and questions about specific parts of the code:

@@ -124,7 +124,7 @@ cpdef np.ndarray[np.double_t, ndim=2] mst_linkage_core_vector(
continue

right_value = current_distances[j]
right_source = current_sources[j]
right_source = <np.intp_t> current_sources[j]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes here solved the following test errors on my machine:

FAILED hdbscan/tests/test_hdbscan.py::test_hdbscan_prims_kdtree - TypeError: 'float' object cannot be interpreted as an integer
FAILED hdbscan/tests/test_hdbscan.py::test_hdbscan_prims_balltree - TypeError: 'float' object cannot be interpreted as an integer
FAILED hdbscan/tests/test_hdbscan.py::test_hdbscan_high_dimensional - TypeError: 'float' object cannot be interpreted as an integer
FAILED hdbscan/tests/test_rsl.py::test_rsl_prims_balltree - TypeError: 'float' object cannot be interpreted as an integer
FAILED hdbscan/tests/test_rsl.py::test_rsl_prims_kdtree - TypeError: 'float' object cannot be interpreted as an integer

These tests do not appear to be broken on the test runner for the master branch but were crashing my machine.
Could this be related to a specific Cython version?

union_find.union_(<np.intp_t> row[0], cluster)
union_find.union_(<np.intp_t> row[1], cluster)
union_find.union_(np.intp(row[0]), cluster)
union_find.union_(np.intp(row[1]), cluster)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes here solved to following tests on my machine:

FAILED hdbscan/tests/test_hdbscan.py::test_hdbscan_dbscan_clustering - TypeError: 'numpy.float64' object cannot be interpreted as an integer
FAILED hdbscan/tests/test_rsl.py::test_rsl_distance_matrix - TypeError: 'numpy.float64' object cannot be interpreted as an integer
FAILED hdbscan/tests/test_rsl.py::test_rsl_feature_vector - TypeError: 'numpy.float64' object cannot be interpreted as an integer
FAILED hdbscan/tests/test_rsl.py::test_rsl_callable_metric - TypeError: 'numpy.float64' object cannot be interpreted as an integer
FAILED hdbscan/tests/test_rsl.py::test_rsl_boruvka_balltree - TypeError: 'numpy.float64' object cannot be interpreted as an integer
FAILED hdbscan/tests/test_rsl.py::test_rsl_prims_balltree - TypeError: 'numpy.float64' object cannot be interpreted as an integer
FAILED hdbscan/tests/test_rsl.py::test_rsl_prims_kdtree - TypeError: 'numpy.float64' object cannot be interpreted as an integer
FAILED hdbscan/tests/test_rsl.py::test_rsl_hierarchy - TypeError: 'numpy.float64' object cannot be interpreted as an integer
FAILED hdbscan/tests/test_rsl.py::test_rsl_high_dimensional - TypeError: 'numpy.float64' object cannot be interpreted as an integer

Similar to the previous comment, these tests do not appear to be broken on the test runner for the master branch but were crashing my machine.

@@ -483,10 +542,8 @@ cdef np.ndarray[np.intp_t, ndim=1] do_labelling(
return result_arr


cdef get_probabilities(np.ndarray tree, dict cluster_map, np.ndarray labels):

cdef get_probabilities(np.ndarray tree, dict cluster_map, np.ndarray labels, np.ndarray deaths):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving deaths to be an input parameter allows me to re-use the function for branch hierarchies.

cpdef np.intp_t traverse_upwards(np.ndarray cluster_tree, np.double_t cluster_selection_epsilon, np.intp_t leaf, np.intp_t allow_single_cluster):

root = cluster_tree['parent'].min()
parent = cluster_tree[cluster_tree['child'] == leaf]['parent']
parent = cluster_tree[cluster_tree['child'] == leaf]['parent'][0]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change and the one on line 691 fixed the following tests on my machine:

FAILED hdbscan/tests/test_flat.py::test_flat_base_epsilon - TypeError: only integer scalar arrays can be converted to a scalar index
FAILED hdbscan/tests/test_flat.py::test_switch_to_leaf - TypeError: only integer scalar arrays can be converted to a scalar index
FAILED hdbscan/tests/test_flat.py::test_approx_predict_same_clusters - TypeError: only integer scalar arrays can be converted to a scalar index
FAILED hdbscan/tests/test_flat.py::test_approx_predict_diff_clusters - TypeError: only integer scalar arrays can be converted to a scalar index
FAILED hdbscan/tests/test_flat.py::test_mem_vec_same_clusters - TypeError: only integer scalar arrays can be converted to a scalar index
FAILED hdbscan/tests/test_flat.py::test_mem_vec_diff_clusters - TypeError: only integer scalar arrays can be converted to a scalar index
FAILED hdbscan/tests/test_flat.py::test_all_points_mem_vec_same_clusters - TypeError: only integer scalar arrays can be converted to a scalar index
FAILED hdbscan/tests/test_flat.py::test_all_points_mem_vec_diff_clusters - TypeError: only integer scalar arrays can be converted to a scalar index

Again, these tests do not appear to be broken on the test runner but were crashing my machine.

@@ -656,6 +715,78 @@ cpdef set epsilon_search(set leaves, np.ndarray cluster_tree, np.double_t cluste

return set(selected_clusters)


cpdef np.ndarray simplify_branch_hierarchy(np.ndarray condensed_tree,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea of persistence differs for clusters and branches. With cluster, all data points enter the filtration at distance=0 so a persistence threshold can be implemented as a minimum epsilon value. With branches, points themselves have an eccentricity value, so we need to look at the eccentricity range to determine persistence. It was easier to implement the persistence threshold by simplifying the hierarchy rather than to adapt the epsilon search approach.

@@ -1275,6 +1293,38 @@ def generate_prediction_data(self):
"than mere distances is required!"
)

def generate_branch_detection_data(self):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how useful this function is. It does not re-compute the minimum spanning tree (MST) because that is quite involved, but the MST is required for detecting branches. That leaves a potentially common unsupported pattern:

# This works
clusterer = HDBSCAN().fit(data)
clusterer.generate_branch_detection_data()

# This breaks because MST not available...
branch_detector = BranchDetector(clusterer)

Is there a better way to deal with this case? Should I fail early within generate_branch_detection_data() when an MST is not available? Is there an easy way to re-compute the MST?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think failing early with a clear error message and suggestion for how to fix it would be the best approach here.

Copy link
Collaborator

@lmcinnes lmcinnes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for such a complete and comprehensive PR! I especially appreciate the comprehensive documentation and tests. That makes this process so much easier for me. I'm just going to merge as is because I would really like to have this functionality available, and I'll handle the early error on lack on min_spanning_tree.

Thanks again for all your work on this. I really do think this is a major contribution and will be very useful for a great many people.

@@ -1275,6 +1293,38 @@ def generate_prediction_data(self):
"than mere distances is required!"
)

def generate_branch_detection_data(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think failing early with a clear error message and suggestion for how to fix it would be the best approach here.

@lmcinnes lmcinnes merged commit a9f74b2 into scikit-learn-contrib:master Aug 5, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants