Support as many distances as possible from scipy.spatial.distance #848
Comments
I see. I will push this up on the priority list and write a pairwise distance kernel (akin to GEMM but more general) that will have a pluggable functor for the implementation, but these functors will be chosen at compile time.
Commenting on the CPU side:
From scipy.spatial.distance:
Observations:
Bugfixes:
- slow scanning of inverted lists (#836).
Features:
- add basic support for 6 new metrics in CPU `IndexFlat` and `IndexHNSW` (#848);
- add support for `IndexIDMap`/`IndexIDMap2` with binary indexes (#780).
Misc:
- throw python exception for OOM (#758);
- make `DistanceComputer` available for all random access indexes;
- gradually moving from `long` to `int64_t` for portability.
I wanted to check in on this in hopes there may be a timeline to get these distances supported. We could really use this feature in UMAP on RAPIDS cuML.
@cjnolet I'm starting this week on this, sorry for the delay.
No problem. We are super excited to have this capability!
The latest GitHub push includes this change, at least for the metrics that don't require pre- or post-processing with L2: https://github.com/facebookresearch/faiss/blob/master/MetricType.h#L21. The preferred GPU API has changed as well, to https://github.com/facebookresearch/faiss/blob/master/gpu/GpuDistance.h#L119
A big thank you for providing these new metrics, @wickedfoo and @mdouze! I'm currently in the process of integrating these into cuML.
For cosine distance, the dot product of unit vectors is going to return the k-farthest-neighbors. Are there any tricks we could do to flip the computation into a distance using only the vector norm? One solution could be to expose an option on
I am not sure I understand the problem. If you search with max inner product, you get the top-k largest dot products. To make it a cosine distance, you can use the following identity, which holds for unit-norm x and y: ||x - y||^2 = ||x||^2 + ||y||^2 - 2 <x, y> = 2 - 2 <x, y>
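For anyone who wants to check the identity numerically, here is a minimal sketch in plain Python (no FAISS involved, so it does not reflect the library's API). It assumes all vectors are normalized to unit length; the helper names `normalize`, `dot`, and `sq_l2` are made up for this illustration. It takes the top-k largest inner products, converts them with 2 - 2 <x, y>, and compares the result against a direct squared-L2 search.

```python
import math
import random

def normalize(v):
    """Scale v to unit L2 norm (hypothetical helper for this sketch)."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def sq_l2(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

random.seed(0)
d, n, k = 8, 50, 5
database = [normalize([random.gauss(0, 1) for _ in range(d)]) for _ in range(n)]
query = normalize([random.gauss(0, 1) for _ in range(d)])

# Max inner product search: keep the k largest dot products.
ips = [dot(query, y) for y in database]
top_ip = sorted(range(n), key=lambda i: -ips[i])[:k]

# Convert the similarities into squared L2 distances: 2 - 2 <x, y>.
dist_from_ip = [2.0 - 2.0 * ips[i] for i in top_ip]

# Reference: brute-force squared L2 search, k smallest distances.
l2 = [sq_l2(query, y) for y in database]
top_l2 = sorted(range(n), key=lambda i: l2[i])[:k]

print(top_ip == top_l2)  # same neighbors, in the same order
```

Since the cosine distance of unit vectors is 1 - <x, y>, it is exactly half of the squared L2 distance computed here, so either quantity gives the same neighbor ranking.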
@mdouze, this works. For some reason, I thought the inner product was still using a min heap so I was trying to flip the resulting sort order beforehand. Thanks!
There are several algorithms in cuML which need support for different distances. For instance, our NearestNeighbors algorithm should be able to support the distances that sklearn's NearestNeighbors supports.
UMAP-contrib uses nearest neighbor descent and is able to support all of the distances in `scipy.spatial.distance`, using a pairwise evaluation function for those which are not L_p based.
It would be really helpful if we were able to do this within FAISS, both supporting more L_p variants within the brute force kNN computation and supporting more distance types in the ANN algorithms overall.
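To make the request concrete, here is a rough sketch of the pairwise evaluation idea: a brute-force kNN where the metric is passed in as a plain function. This is only an illustration in pure Python, not how FAISS, cuML, or UMAP implement it, and the names `brute_force_knn`, `manhattan`, and `canberra` are hypothetical. The Canberra formula follows the usual definition, with 0/0 terms counted as 0.

```python
import heapq

def manhattan(x, y):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def canberra(x, y):
    """Canberra distance: sum of |a - b| / (|a| + |b|), skipping 0/0 terms."""
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom > 0:
            total += abs(a - b) / denom
    return total

def brute_force_knn(queries, database, k, metric):
    """For each query, return the k smallest (distance, index) pairs."""
    results = []
    for q in queries:
        scored = ((metric(q, y), i) for i, y in enumerate(database))
        results.append(heapq.nsmallest(k, scored))
    return results

database = [[0.0, 0.0], [1.0, 0.5], [3.0, 0.0], [5.0, 5.0]]
queries = [[1.0, 0.0]]

l1_neighbors = brute_force_knn(queries, database, 2, manhattan)
canberra_neighbors = brute_force_knn(queries, database, 2, canberra)

print(l1_neighbors)
print(canberra_neighbors)
```

On this toy data the two metrics pick different nearest neighbors for the same query, which is why swapping the metric matters for downstream algorithms. Ties are broken by database index because the pairs are compared as tuples.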