Add a k-d tree implementation, and integrate it with Hdbscan and KNN. #231
Conversation
Oracle requires that contributors to all of its open-source projects sign the Oracle Contributor Agreement (OCA).
In order to sign the OCA, you need to create an Oracle account and sign the OCA in Oracle's Contributor Agreement Application by following the steps on the homepage. When signing the OCA, please provide your GitHub username. By doing so, this PR will be automatically updated once the signed OCA has been approved by Oracle.
I'd like to properly nail down the semantics of the k-d tree building (which may just be documentation fixes now), and there are a couple of other small points.
Clustering/Hdbscan/src/test/java/org/tribuo/clustering/hdbscan/TestHdbscan.java
@@ -72,6 +73,11 @@ public enum EnsembleCombinerType {
     */
    @Option(longName = "knn-voting", usage = "Parallel backend to use.")
    public EnsembleCombinerType knnEnsembleCombiner = EnsembleCombinerType.VOTING;
    /**
     * The nearest neighbour implementation factory to use. Defaults to {@link NeighboursQueryFactoryType#KD_TREE}.
This javadoc is incorrect now, as the default is BRUTE_FORCE.
@@ -135,7 +140,8 @@ public void knnClassificationMultithreadedTest() {

    @Test
    public void knnClassificationSingleThreadedTest() {
-        KNNTrainer<Label> classificationTrainer = new KNNTrainer<>(5, DistanceType.L2, 1, new VotingCombiner(), KNNModel.Backend.INNERTHREADPOOL);
+        KNNTrainer<Label> classificationTrainer = new KNNTrainer<>(5, DistanceType.L2, 1,
Similarly to TestHdbscan, can one of these tests use the new constructor that accepts a NeighboursQueryFactory directly?
Math/src/main/java/org/tribuo/math/neighbour/NeighboursQueryFactory.java
Math/src/main/java/org/tribuo/math/neighbour/bruteforce/NeighboursBruteForceFactory.java
    /**
     * Partition the array into two parts around the provided array index. Those less than the value at the provided
     * array index will be on the left. Those greater than or equal to the value at the provided array index will fall
     * on the left. The operation is performed for a specific dimension of the vectors defined by the comparator.
I think this should say "on the right" as the values less than are on the left.
        int store = left;
        for (int idx = left; idx < right; idx++) {
            if (compareByDimension(points[idx], pivot, dimension) <= 0) {
Shouldn't this be a strict less than? The docs say it's less than on the left and greater than or equal on the right.
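The strict-vs-non-strict distinction being discussed here can be pinned down with a small standalone sketch. This is not Tribuo's actual code: the `double[][]` points and the helper names are simplified stand-ins for the PR's `IntAndVector` array and its dimension comparator. With a strict `<`, the documented invariant holds: everything left of the returned index is strictly less than the pivot value, and everything from that index onward is greater than or equal to it.

```java
import java.util.Arrays;

public class PartitionSketch {
    // Simplified stand-in: each point is a double[] of coordinates.
    static int compareByDimension(double[] a, double[] b, int dimension) {
        return Double.compare(a[dimension], b[dimension]);
    }

    /**
     * Lomuto-style partition around the value at pivotIdx for one dimension.
     * Afterwards, points strictly less than the pivot value sit to the left of
     * the returned index, and points greater than or equal sit at or right of it.
     * Returns the pivot's final position.
     */
    static int partitionOnIndex(double[][] points, int left, int right, int pivotIdx, int dimension) {
        double[] pivot = points[pivotIdx];
        swap(points, pivotIdx, right);          // move the pivot out of the way
        int store = left;
        for (int idx = left; idx < right; idx++) {
            if (compareByDimension(points[idx], pivot, dimension) < 0) {  // strict, matching the javadoc
                swap(points, idx, store);
                store++;
            }
        }
        swap(points, store, right);             // place the pivot in its final slot
        return store;
    }

    static void swap(double[][] a, int i, int j) {
        double[] tmp = a[i]; a[i] = a[j]; a[j] = tmp;
    }

    public static void main(String[] args) {
        double[][] points = {{5}, {2}, {8}, {2}, {7}, {1}};
        int p = partitionOnIndex(points, 0, points.length - 1, 0, 0);
        // Every point left of p is < pivot, every point from p onward is >= pivot.
        for (int i = 0; i < p; i++) if (points[i][0] >= points[p][0]) throw new AssertionError();
        for (int i = p; i < points.length; i++) if (points[i][0] < points[p][0]) throw new AssertionError();
        System.out.println("pivot landed at index " + p + ": " + Arrays.deepToString(points));
    }
}
```

With `<=` instead, duplicates of the pivot drift to the left side, which is exactly the mismatch with the javadoc noted above.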
    /**
     * Set the median point for an array of {@link IntAndVector}s based, for a specific dimension, through recursive partitioning
     * ensuring that points before it (with lower index) will be <= median, although not sorted, and points after it
     * (with higher index) will be >= median, again not sorted. The order of the array will almost certainly be changed.
So this says that values equal to the median can be found on both sides of the tree, but that seems to conflict with the documented behaviour of the partitionOnIndex function (though not its actual behaviour).
Thanks for catching these documentation discrepancies. For logic like this, it's really important that the docs are perfectly clear.
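For reference, the behaviour these comments describe is essentially quickselect: partition repeatedly until the median index holds the value it would have in sorted order, leaving `<=` values before it and `>=` values after it, neither side sorted. A sketch of that idea, again with simplified stand-in types rather than the PR's actual `IntAndVector` implementation; note that duplicates of the median value really can end up on either side, which is the discrepancy raised above.

```java
import java.util.Random;

public class MedianSelectSketch {
    /**
     * Quickselect-style median placement for one dimension: repeatedly partition
     * until the middle index holds the value it would hold in sorted order.
     * Points before that index are <= the median and points after are >= it;
     * neither side is sorted, and duplicates of the median may land on either side.
     */
    static void medianOnDimension(double[][] points, int dimension) {
        int target = points.length / 2;
        int left = 0, right = points.length - 1;
        Random rng = new Random(42);                       // fixed seed for reproducibility
        while (left < right) {
            int pivotIdx = left + rng.nextInt(right - left + 1);
            int p = partitionOnIndex(points, left, right, pivotIdx, dimension);
            if (p == target) return;                       // median is in place
            if (p < target) left = p + 1; else right = p - 1;
        }
    }

    // Lomuto partition with a strict comparison: < pivot on the left, >= on the right.
    static int partitionOnIndex(double[][] points, int left, int right, int pivotIdx, int dimension) {
        double[] pivot = points[pivotIdx];
        swap(points, pivotIdx, right);
        int store = left;
        for (int idx = left; idx < right; idx++) {
            if (Double.compare(points[idx][dimension], pivot[dimension]) < 0) {
                swap(points, idx, store++);
            }
        }
        swap(points, store, right);
        return store;
    }

    static void swap(double[][] a, int i, int j) { double[] t = a[i]; a[i] = a[j]; a[j] = t; }

    public static void main(String[] args) {
        double[][] points = {{3}, {9}, {1}, {4}, {7}, {4}, {2}};
        medianOnDimension(points, 0);
        int mid = points.length / 2;
        for (int i = 0; i < mid; i++) if (points[i][0] > points[mid][0]) throw new AssertionError();
        for (int i = mid; i < points.length; i++) if (points[i][0] < points[mid][0]) throw new AssertionError();
        System.out.println("median of dimension 0 is " + points[mid][0]);
    }
}
```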
            throw new RuntimeException("Parallel execution failed", e);
        }
    }
    return new ArrayList<>(Arrays.asList(indexDistancePairListArray));
I think you missed converting this over to just returning Arrays.asList.
LGTM
Description
This PR introduces a k-d tree implementation, which can be used for nearest neighbour queries, and integrates this feature with Hdbscan and KNN. Initial performance measurements demonstrate that this k-d tree implementation is, in general, much faster than using a brute-force approach for larger datasets.
Once these changes have been reviewed, a corresponding update to the Hdbscan tutorial will be made.
This PR supersedes #230 which contains a lot of interesting code review discussion, but was closed as the result of the branch being recreated.
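For readers unfamiliar with the data structure, here is a minimal, self-contained k-d tree sketch. It is illustrative only: the names and structure are hypothetical, and it deliberately builds each subtree with a full sort rather than the median-partitioning scheme discussed in the review above. The query descends toward the target point and backtracks into the far subtree only when the splitting plane is closer than the best distance found so far, which is what makes the tree faster than brute force on larger datasets.

```java
import java.util.Arrays;
import java.util.Comparator;

public class KdTreeSketch {
    // Hypothetical minimal node; Tribuo's actual implementation differs.
    static final class Node {
        final double[] point;
        final Node left, right;
        Node(double[] point, Node left, Node right) { this.point = point; this.left = left; this.right = right; }
    }

    /** Build by sorting on the splitting dimension and recursing on the median. */
    static Node build(double[][] points, int depth) {
        if (points.length == 0) return null;
        int dim = depth % points[0].length;
        Arrays.sort(points, Comparator.comparingDouble(p -> p[dim]));
        int mid = points.length / 2;
        return new Node(points[mid],
                build(Arrays.copyOfRange(points, 0, mid), depth + 1),
                build(Arrays.copyOfRange(points, mid + 1, points.length), depth + 1));
    }

    /** Nearest neighbour: descend toward the query, backtrack into the far
     *  subtree only when the splitting plane is closer than the current best. */
    static double[] nearest(Node node, double[] query, int depth, double[] best) {
        if (node == null) return best;
        if (best == null || dist2(query, node.point) < dist2(query, best)) best = node.point;
        int dim = depth % query.length;
        double diff = query[dim] - node.point[dim];
        Node near = diff < 0 ? node.left : node.right;
        Node far  = diff < 0 ? node.right : node.left;
        best = nearest(near, query, depth + 1, best);
        if (diff * diff < dist2(query, best)) {   // the plane may hide a closer point
            best = nearest(far, query, depth + 1, best);
        }
        return best;
    }

    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return s;
    }

    public static void main(String[] args) {
        double[][] points = {{2, 3}, {5, 4}, {9, 6}, {4, 7}, {8, 1}, {7, 2}};
        Node root = build(points, 0);
        double[] nn = nearest(root, new double[]{9, 2}, 0, null);
        System.out.println(Arrays.toString(nn));  // nearest to (9,2) is (8,1)
    }
}
```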
Motivation
K-d trees can provide faster nearest neighbour queries compared to a brute-force technique. This results in faster model training times.
Paper reference
This is the original paper which proposes a k-d tree:
J.L. Bentley, "Multidimensional Binary Search Trees Used for Associative Searching", Communications of the ACM, Vol. 18, No. 9, Sept. 1975, pp. 509–517.