Add a k-d tree implementation, and integrate it with Hdbscan and KNN. #231

Merged (6 commits) on Apr 25, 2022

Member

@geoffreydstewart geoffreydstewart commented Apr 20, 2022

Description

This PR introduces a k-d tree implementation, which can be used for nearest neighbour queries, and integrates it with Hdbscan and KNN. Initial performance measurements show that, on larger datasets, this k-d tree implementation is generally much faster than a brute-force approach.

Once these changes have been reviewed, the Hdbscan tutorial will need a corresponding update.

This PR supersedes #230, which contains a lot of interesting code review discussion but was closed because the branch was recreated.

Motivation

K-d trees can provide faster nearest neighbour queries compared to a brute-force technique. This results in faster model training times.
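
To make the comparison concrete, here is a minimal sketch of a k-d tree nearest neighbour query next to a brute-force scan. It is not the code in this PR: `KdTreeSketch`, `nearest`, `bruteForceNearest` and `sqDist` are hypothetical names, the tree is built by sorting each sub-range rather than by median selection, and it assumes a non-empty set of points with equal dimensionality and squared Euclidean distance.

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative sketch only; names are hypothetical and do not mirror the classes in this PR.
final class KdTreeSketch {

    private static final class Node {
        final double[] point;
        final Node left;
        final Node right;

        Node(double[] point, Node left, Node right) {
            this.point = point;
            this.left = left;
            this.right = right;
        }
    }

    private final int dims;
    private final Node root;

    /** Assumes a non-empty array of points that all have the same dimensionality. */
    KdTreeSketch(double[][] points) {
        this.dims = points[0].length;
        this.root = build(points.clone(), 0, points.length, 0);
    }

    // Sort the sub-range on the current axis and split at the median element.
    private Node build(double[][] pts, int from, int to, int depth) {
        if (from >= to) {
            return null;
        }
        int axis = depth % dims;
        Arrays.sort(pts, from, to, Comparator.comparingDouble((double[] p) -> p[axis]));
        int mid = (from + to) >>> 1;
        return new Node(pts[mid], build(pts, from, mid, depth + 1), build(pts, mid + 1, to, depth + 1));
    }

    /** Returns the stored point closest to the query. */
    double[] nearest(double[] query) {
        return nearest(root, query, 0, null);
    }

    private double[] nearest(Node node, double[] query, int depth, double[] best) {
        if (node == null) {
            return best;
        }
        if (best == null || sqDist(query, node.point) < sqDist(query, best)) {
            best = node.point;
        }
        int axis = depth % dims;
        double diff = query[axis] - node.point[axis];
        Node near = diff < 0 ? node.left : node.right;
        Node far = diff < 0 ? node.right : node.left;
        best = nearest(near, query, depth + 1, best);
        // Visit the far side only if the splitting plane is closer than the best match so far.
        if (diff * diff < sqDist(query, best)) {
            best = nearest(far, query, depth + 1, best);
        }
        return best;
    }

    /** Brute-force baseline: checks every point. */
    static double[] bruteForceNearest(double[][] points, double[] query) {
        double[] best = points[0];
        for (double[] p : points) {
            if (sqDist(query, p) < sqDist(query, best)) {
                best = p;
            }
        }
        return best;
    }

    static double sqDist(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }
}
```

The pruning test in `nearest` is where the saving comes from: whole subtrees are skipped when the splitting plane is already further away than the current best match, whereas the brute-force scan always visits every point.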

Paper reference

This is the original paper which proposes a k-d tree:
J.L. Bentley, "Multidimensional Binary Search Trees Used for Associative Searching", Commun. ACM, Vol. 18, Sept. 1975, pp. 509–517

@oracle-contributor-agreement oracle-contributor-agreement bot added the "OCA Required" label (At least one contributor does not have an approved Oracle Contributor Agreement.) on Apr 20, 2022
@oracle-contributor-agreement

Oracle requires that contributors to all of its open-source projects sign the Oracle Contributor Agreement (OCA).
The following contributors of this PR have not signed the OCA:

To sign the OCA, create an Oracle account and sign the agreement in Oracle's Contributor Agreement Application by following the steps on the homepage.

When signing the OCA, please provide your GitHub username. This PR will then be updated automatically once the signed OCA has been approved by Oracle.

Member

@Craigacp Craigacp left a comment

I'd like to properly nail down the semantics of the k-d tree building (which may just be documentation fixes now), and there are a couple of other small points.

@@ -72,6 +73,11 @@ public enum EnsembleCombinerType {
*/
@Option(longName = "knn-voting", usage = "Parallel backend to use.")
public EnsembleCombinerType knnEnsembleCombiner = EnsembleCombinerType.VOTING;
/**
* The nearest neighbour implementation factory to use. Defaults to {@link NeighboursQueryFactoryType#KD_TREE}.
Member

This javadoc is incorrect now as the default is BRUTE_FORCE.

@@ -135,7 +140,8 @@ public void knnClassificationMultithreadedTest() {

@Test
public void knnClassificationSingleThreadedTest() {
- KNNTrainer<Label> classificationTrainer = new KNNTrainer<>(5, DistanceType.L2, 1, new VotingCombiner(), KNNModel.Backend.INNERTHREADPOOL);
+ KNNTrainer<Label> classificationTrainer = new KNNTrainer<>(5, DistanceType.L2, 1,
Member

As in TestHdbscan, can one of these tests use the new constructor that accepts a NeighboursQueryFactory directly?

/**
* Partition the array into two parts around the provided array index. Those less than the value at the provided
* array index will be on the left. Those greater than or equal to the value at the provided array index will fall
* on the left. The operation is performed for a specific dimension of the vectors defined by the comparator.
Member

I think this should say "on the right" as the values less than are on the left.


int store = left;
for (int idx = left; idx < right; idx++) {
if (compareByDimension(points[idx], pivot, dimension) <= 0) {
Member

Shouldn't this be a strict less than? The docs say it's less than on the left and greater than or equal on the right.
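
For reference, a minimal sketch of a partition that matches the documented semantics (strictly less than on the left, greater than or equal on the right). It works on a plain `double[]` rather than on vectors compared by dimension, and `PartitionSketch` and its method names are hypothetical, not the PR's `partitionOnIndex`.

```java
// Illustrative sketch only; operates on doubles instead of vectors compared by dimension.
final class PartitionSketch {

    /**
     * Partitions values[left..right] (both inclusive) around the value at pivotIndex.
     * On return, every element before the returned index is strictly less than the pivot,
     * and every element at or after it is greater than or equal to the pivot.
     */
    static int partition(double[] values, int left, int right, int pivotIndex) {
        double pivot = values[pivotIndex];
        swap(values, pivotIndex, right);      // park the pivot at the end of the range
        int store = left;
        for (int idx = left; idx < right; idx++) {
            if (values[idx] < pivot) {        // strict comparison: equal values stay on the right
                swap(values, store, idx);
                store++;
            }
        }
        swap(values, right, store);           // move the pivot into its final position
        return store;
    }

    private static void swap(double[] values, int i, int j) {
        double tmp = values[i];
        values[i] = values[j];
        values[j] = tmp;
    }
}
```

With `<= 0` in place of the strict comparison, values equal to the pivot would move to the left instead, which is the mismatch with the javadoc raised above.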

/**
* Set the median point for an array of {@link IntAndVector}s based, for a specific dimension, through recursive partitioning
* ensuring that points before it (with lower index) will be <= median, although not sorted, and points after it
* (with higher index) will be >= median, again not sorted. The order of the array will almost certainly be changed.
Member

So this says that values equal to the median can be found on both sides of the tree, but that seems to conflict with the documented behaviour of the partitionOnIndex function (though not its actual behaviour).

Member Author

Thanks for catching these documentation discrepancies. For logic like this, it's really important that the docs are perfectly clear.
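
To illustrate why the two javadocs can both hold, here is a rough sketch of median selection by repeated partitioning (quickselect) on a plain `double[]`. `MedianSelectSketch` and its methods are hypothetical names and the middle-element pivot choice is an assumption, so this is not the PR's implementation. Each individual partition is strict (less than on the left), yet after selection duplicates of the median can sit on either side, so the overall guarantee is only `<=` before and `>=` after.

```java
// Illustrative sketch only; names and pivot choice are assumptions, not the PR's code.
final class MedianSelectSketch {

    /**
     * Rearranges values so that index k holds the element a full sort would place there.
     * Elements before k are <= values[k] and elements after k are >= values[k], unsorted.
     */
    static double select(double[] values, int k) {
        int left = 0;
        int right = values.length - 1;
        while (left < right) {
            int p = partition(values, left, right, (left + right) >>> 1);
            if (p == k) {
                break;
            } else if (p < k) {
                left = p + 1;    // the target lies to the right of the pivot
            } else {
                right = p - 1;   // the target lies to the left of the pivot
            }
        }
        return values[k];
    }

    // Strictly-less-than values go left of the returned index; equal values stay right.
    private static int partition(double[] values, int left, int right, int pivotIndex) {
        double pivot = values[pivotIndex];
        swap(values, pivotIndex, right);
        int store = left;
        for (int idx = left; idx < right; idx++) {
            if (values[idx] < pivot) {
                swap(values, store, idx);
                store++;
            }
        }
        swap(values, right, store);
        return store;
    }

    private static void swap(double[] values, int i, int j) {
        double tmp = values[i];
        values[i] = values[j];
        values[j] = tmp;
    }
}
```

For example, selecting index 1 of `{2, 2, 2}` leaves a 2 on each side of the median even though no single partition step ever moved an equal value to the left.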

throw new RuntimeException("Parallel execution failed", e);
}
}
return new ArrayList<>(Arrays.asList(indexDistancePairListArray));
Member

I think you missed converting this over to just returning Arrays.asList.
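
For context on the difference: `Arrays.asList` returns a fixed-size list view backed by the array, while `new ArrayList<>(Arrays.asList(...))` allocates a resizable copy. The small sketch below shows both; `AsListSketch` and its method names are hypothetical and only illustrate the standard library behaviour.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; shows the two ways of turning an array into a List.
final class AsListSketch {

    /** Fixed-size view over the array: no copy, writes go through to the array. */
    static List<String> wrapped(String[] items) {
        return Arrays.asList(items);
    }

    /** Independent, resizable copy: costs an extra allocation and element copy. */
    static List<String> copied(String[] items) {
        return new ArrayList<>(Arrays.asList(items));
    }
}
```

If callers only read the result, the view avoids the extra copy; calling `add` or `remove` on it throws `UnsupportedOperationException`.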

@oracle-contributor-agreement oracle-contributor-agreement bot removed the "OCA Required" label (At least one contributor does not have an approved Oracle Contributor Agreement.) on Apr 22, 2022
Member

@Craigacp Craigacp left a comment

LGTM

@Craigacp Craigacp merged commit f4b4b5c into oracle:main Apr 25, 2022
@geoffreydstewart geoffreydstewart deleted the kd-tree branch June 6, 2022 17:41