Add a k-d tree implementation, and integrate it with Hdbscan and KNN. #231
Conversation
Oracle requires that contributors to all of its open-source projects sign the Oracle Contributor Agreement (OCA).
In order to sign the OCA, you need to create an Oracle account and sign the OCA in Oracle's Contributor Agreement Application by following the steps on the homepage. When signing the OCA, please provide your GitHub username. By doing so, this PR will be automatically updated once the signed OCA has been approved by Oracle.
I'd like to properly nail down the semantics of the k-d tree building (which may just be documentation fixes now), and there are a couple of other small points.
Clustering/Hdbscan/src/test/java/org/tribuo/clustering/hdbscan/TestHdbscan.java
@@ -72,6 +73,11 @@ public enum EnsembleCombinerType {
     */
    @Option(longName = "knn-voting", usage = "Parallel backend to use.")
    public EnsembleCombinerType knnEnsembleCombiner = EnsembleCombinerType.VOTING;
    /**
     * The nearest neighbour implementation factory to use. Defaults to {@link NeighboursQueryFactoryType#KD_TREE}.
This javadoc is incorrect now, as the default is BRUTE_FORCE.
@@ -135,7 +140,8 @@ public void knnClassificationMultithreadedTest() {

    @Test
    public void knnClassificationSingleThreadedTest() {
-        KNNTrainer<Label> classificationTrainer = new KNNTrainer<>(5, DistanceType.L2, 1, new VotingCombiner(), KNNModel.Backend.INNERTHREADPOOL);
+        KNNTrainer<Label> classificationTrainer = new KNNTrainer<>(5, DistanceType.L2, 1,
Similarly to TestHdbscan, can one of these tests use the new constructor that accepts a NeighboursQueryFactory directly?
Math/src/main/java/org/tribuo/math/neighbour/NeighboursQueryFactory.java
Math/src/main/java/org/tribuo/math/neighbour/bruteforce/NeighboursBruteForceFactory.java
    /**
     * Partition the array into two parts around the provided array index. Those less than the value at the provided
     * array index will be on the left. Those greater than or equal to the value at the provided array index will fall
     * on the left. The operation is performed for a specific dimension of the vectors defined by the comparator.
I think this should say "on the right" as the values less than are on the left.
        int store = left;
        for (int idx = left; idx < right; idx++) {
            if (compareByDimension(points[idx], pivot, dimension) <= 0) {
Shouldn't this be a strict less than? The docs say it's less than on the left and greater than or equal on the right.
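The strict-vs-non-strict distinction being discussed here can be pinned down with a small standalone sketch. This is not Tribuo's actual code: the `double[][]` points and the helper names are simplified stand-ins for the PR's `IntAndVector` array and its dimension comparator. With a strict `<`, the documented invariant holds: everything left of the returned index is strictly less than the pivot value, and everything from that index onward is greater than or equal to it.

```java
import java.util.Arrays;

public class PartitionSketch {
    // Simplified stand-in: each point is a double[] of coordinates.
    static int compareByDimension(double[] a, double[] b, int dimension) {
        return Double.compare(a[dimension], b[dimension]);
    }

    /**
     * Lomuto-style partition around the value at pivotIdx for one dimension.
     * Afterwards, points strictly less than the pivot value sit to the left of
     * the returned index, and points greater than or equal sit at or right of it.
     * Returns the pivot's final position.
     */
    static int partitionOnIndex(double[][] points, int left, int right, int pivotIdx, int dimension) {
        double[] pivot = points[pivotIdx];
        swap(points, pivotIdx, right);          // move the pivot out of the way
        int store = left;
        for (int idx = left; idx < right; idx++) {
            if (compareByDimension(points[idx], pivot, dimension) < 0) {  // strict, matching the javadoc
                swap(points, idx, store);
                store++;
            }
        }
        swap(points, store, right);             // place the pivot in its final slot
        return store;
    }

    static void swap(double[][] a, int i, int j) {
        double[] tmp = a[i]; a[i] = a[j]; a[j] = tmp;
    }

    public static void main(String[] args) {
        double[][] points = {{5}, {2}, {8}, {2}, {7}, {1}};
        int p = partitionOnIndex(points, 0, points.length - 1, 0, 0);
        // Every point left of p is < pivot, every point from p onward is >= pivot.
        for (int i = 0; i < p; i++) if (points[i][0] >= points[p][0]) throw new AssertionError();
        for (int i = p; i < points.length; i++) if (points[i][0] < points[p][0]) throw new AssertionError();
        System.out.println("pivot landed at index " + p + ": " + Arrays.deepToString(points));
    }
}
```

With `<=` instead, duplicates of the pivot drift to the left side, which is exactly the mismatch with the javadoc noted above.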
    /**
     * Set the median point for an array of {@link IntAndVector}s based, for a specific dimension, through recursive partitioning
     * ensuring that points before it (with lower index) will be <= median, although not sorted, and points after it
     * (with higher index) will be >= median, again not sorted. The order of the array will almost certainly be changed.
So this says that values equal to the median can be found on both sides of the tree, but that seems to conflict with the documented behaviour of the partitionOnIndex function (though not its actual behaviour).
Thanks for catching these documentation discrepancies. For logic like this, it's really important that the docs are perfectly clear.
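For reference, the behaviour these comments describe is essentially quickselect: partition repeatedly until the median index holds the value it would have in sorted order, leaving `<=` values before it and `>=` values after it, neither side sorted. A sketch of that idea, again with simplified stand-in types rather than the PR's actual `IntAndVector` implementation; note that duplicates of the median value really can end up on either side, which is the discrepancy raised above.

```java
import java.util.Random;

public class MedianSelectSketch {
    /**
     * Quickselect-style median placement for one dimension: repeatedly partition
     * until the middle index holds the value it would hold in sorted order.
     * Points before that index are <= the median and points after are >= it;
     * neither side is sorted, and duplicates of the median may land on either side.
     */
    static void medianOnDimension(double[][] points, int dimension) {
        int target = points.length / 2;
        int left = 0, right = points.length - 1;
        Random rng = new Random(42);                       // fixed seed for reproducibility
        while (left < right) {
            int pivotIdx = left + rng.nextInt(right - left + 1);
            int p = partitionOnIndex(points, left, right, pivotIdx, dimension);
            if (p == target) return;                       // median is in place
            if (p < target) left = p + 1; else right = p - 1;
        }
    }

    // Lomuto partition with a strict comparison: < pivot on the left, >= on the right.
    static int partitionOnIndex(double[][] points, int left, int right, int pivotIdx, int dimension) {
        double[] pivot = points[pivotIdx];
        swap(points, pivotIdx, right);
        int store = left;
        for (int idx = left; idx < right; idx++) {
            if (Double.compare(points[idx][dimension], pivot[dimension]) < 0) {
                swap(points, idx, store++);
            }
        }
        swap(points, store, right);
        return store;
    }

    static void swap(double[][] a, int i, int j) { double[] t = a[i]; a[i] = a[j]; a[j] = t; }

    public static void main(String[] args) {
        double[][] points = {{3}, {9}, {1}, {4}, {7}, {4}, {2}};
        medianOnDimension(points, 0);
        int mid = points.length / 2;
        for (int i = 0; i < mid; i++) if (points[i][0] > points[mid][0]) throw new AssertionError();
        for (int i = mid; i < points.length; i++) if (points[i][0] < points[mid][0]) throw new AssertionError();
        System.out.println("median of dimension 0 is " + points[mid][0]);
    }
}
```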
            throw new RuntimeException("Parallel execution failed", e);
        }
    }
    return new ArrayList<>(Arrays.asList(indexDistancePairListArray));
I think you missed converting this over to just returning Arrays.asList.
LGTM
Description
This PR introduces a k-d tree implementation, which can be used for nearest neighbour queries, and integrates this feature with Hdbscan and KNN. Initial performance measurements demonstrate that this k-d tree implementation is, in general, much faster than using a brute-force approach for larger datasets.
Once these changes have been reviewed, a corresponding update to the Hdbscan tutorial will be made.
This PR supersedes #230 which contains a lot of interesting code review discussion, but was closed as the result of the branch being recreated.
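For readers unfamiliar with the data structure, here is a minimal, self-contained k-d tree sketch. It is illustrative only: the names and structure are hypothetical, and it deliberately builds each subtree with a full sort rather than the median-partitioning scheme discussed in the review above. The query descends toward the target point and backtracks into the far subtree only when the splitting plane is closer than the best distance found so far, which is what makes the tree faster than brute force on larger datasets.

```java
import java.util.Arrays;
import java.util.Comparator;

public class KdTreeSketch {
    // Hypothetical minimal node; Tribuo's actual implementation differs.
    static final class Node {
        final double[] point;
        final Node left, right;
        Node(double[] point, Node left, Node right) { this.point = point; this.left = left; this.right = right; }
    }

    /** Build by sorting on the splitting dimension and recursing on the median. */
    static Node build(double[][] points, int depth) {
        if (points.length == 0) return null;
        int dim = depth % points[0].length;
        Arrays.sort(points, Comparator.comparingDouble(p -> p[dim]));
        int mid = points.length / 2;
        return new Node(points[mid],
                build(Arrays.copyOfRange(points, 0, mid), depth + 1),
                build(Arrays.copyOfRange(points, mid + 1, points.length), depth + 1));
    }

    /** Nearest neighbour: descend toward the query, backtrack into the far
     *  subtree only when the splitting plane is closer than the current best. */
    static double[] nearest(Node node, double[] query, int depth, double[] best) {
        if (node == null) return best;
        if (best == null || dist2(query, node.point) < dist2(query, best)) best = node.point;
        int dim = depth % query.length;
        double diff = query[dim] - node.point[dim];
        Node near = diff < 0 ? node.left : node.right;
        Node far  = diff < 0 ? node.right : node.left;
        best = nearest(near, query, depth + 1, best);
        if (diff * diff < dist2(query, best)) {   // the plane may hide a closer point
            best = nearest(far, query, depth + 1, best);
        }
        return best;
    }

    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return s;
    }

    public static void main(String[] args) {
        double[][] points = {{2, 3}, {5, 4}, {9, 6}, {4, 7}, {8, 1}, {7, 2}};
        Node root = build(points, 0);
        double[] nn = nearest(root, new double[]{9, 2}, 0, null);
        System.out.println(Arrays.toString(nn));  // nearest to (9,2) is (8,1)
    }
}
```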
Motivation
K-d trees can provide faster nearest neighbour queries compared to a brute-force technique. This results in faster model training times.
Paper reference
This is the original paper which proposes a k-d tree:
J.L. Bentley, "Multidimensional Binary Search Trees Used for Associative Searching", Communications of the ACM, Vol. 18, No. 9, Sept. 1975, pp. 509–517.