Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support as many distances as possible from scipy.spatial.distance #848

Open
cjnolet opened this issue May 31, 2019 · 11 comments
Open

Support as many distances as possible from scipy.spatial.distance #848

cjnolet opened this issue May 31, 2019 · 11 comments

Comments

@cjnolet
Copy link
Contributor

cjnolet commented May 31, 2019

There are several algorithms in cuML which need support for different distances. For instance, our NearestNeighbors algorithm should be able to support the distances that sklearn's NearestNeighbors supports.

UMAP-contrib uses nearest neighbor descent and is able to support all of the distances in `scipy.spatial.distance, using a pairwise evaluation function for those which are not L_p based.

It would be really helpful if we were able to do this within FAISS, both supporting more L_p variants within the brute force kNN computation and supporting more distance types in the ANN algorithms overall.

@cjnolet cjnolet changed the title Support distances from scipy.spatial.distance Support as many distances as possible from scipy.spatial.distance May 31, 2019
@wickedfoo
Copy link
Contributor

I see. I will push this up on the priority list and write a pairwise distance kernel (akin to GEMM but more general) that will have a pluggable functor for the implementation, but these functors will be chosen at compile time.

@mdouze
Copy link
Contributor

mdouze commented Jun 3, 2019

Commenting on the CPU side:
Supporting different types of distances should not be too hard for the flat index. The issues are:

  • some metrics need an additional parameter (eg. L_p), that has to be carried around
  • will IP remain the unique metric that has to be maximized rather than minimized?

@mdouze
Copy link
Contributor

mdouze commented Jun 3, 2019

From scipy.spatial.distance:

  • braycurtis(u, v[, w]): OK (=> can be supported)
  • canberra(u, v[, w]): OK
  • chebyshev(u, v[, w]): OK
  • cityblock(u, v[, w]): OK
  • correlation(u, v[, w, centered]): equivalent of cosine
  • cosine(u, v[, w]): equivalent to IP with preprocessing
  • euclidean(u, v[, w]): already supported (L2)
  • jensenshannon(p, q[, base]): OK
  • mahalanobis(u, v, VI): equivalent to L2 with preprocessing
  • minkowski(u, v[, p, w]): OK, requires an additional parameter
  • seuclidean(u, v, V): unclear what that is ???
  • sqeuclidean(u, v[, w]): equivalent of L2 with post-processing
  • wminkowski(u, v, p, w): equivalent of Minkowski with pre-processing

Observations:

  • many distances are equivalent up to a pre- or post-processing. These should be carried out by the caller.
  • most metrics take a weight vector in addition to the vectors to compare -> is a pre-processing
  • in fact only L_p has an additional parameter (aka minkowski)

beauby pushed a commit that referenced this issue Jun 19, 2019
Bugfixes:
- slow scanning of inverted lists (#836).

Features:
- add basic support for 6 new metrics in CPU `IndexFlat` and `IndexHNSW` (#848);
- add support for `IndexIDMap`/`IndexIDMap2` with binary indexes (#780).

Misc:
- throw python exception for OOM (#758);
- make `DistanceComputer` available for all random access indexes;
- gradually moving from `long` to `int64_t` for portability.
@cjnolet
Copy link
Contributor Author

cjnolet commented Feb 10, 2020

I wanted to check in on this in hopes there may be a timeline to get these distances supported. We could really use this feature in UMAP on RAPIDS cuML.

@wickedfoo
Copy link
Contributor

@cjnolet I'm starting this week on this, sorry for the delay.

@cjnolet
Copy link
Contributor Author

cjnolet commented Feb 10, 2020

No problem. We are super excited to have this capability!

@wickedfoo
Copy link
Contributor

The latest github push includes this change, at least using metrics that don't require pre- or post-processing with L2.

https://github.com/facebookresearch/faiss/blob/master/MetricType.h#L21

The preferred GPU API has changed as well to bfKnn. This supports float16 input data as well (though at the moment both sets of vectors must be float16).

https://github.com/facebookresearch/faiss/blob/master/gpu/GpuDistance.h#L119

@cjnolet
Copy link
Contributor Author

cjnolet commented May 7, 2020

A big thank you for providing these new metrics, @wickedfoo and @mdouze! I'm currently in the process of integrating these into cuML

@cjnolet
Copy link
Contributor Author

cjnolet commented May 14, 2020

@mdouze,

For cosine distance, the dot product of unit vectors is going to return the k-farthest-neighbors. Are there any tricks we could do to flip the computation into a distance using only the vector norm?

One solution could be to expose an option on bfknn() to set whether we want a max or a min-heap. I'm wondering if this might be simplest option? Another option could be to expose a cosine distance. Correlation distance can be derived from cosine, so neither option would be a one-off.

@mdouze
Copy link
Contributor

mdouze commented May 15, 2020

I am not sure I understand the problem. If you search with max inner product you get the top-k largest dot products. To make it a cosine distance, you can use

||x - y||^2 = ||x||^2 + ||y||^2 - 2 <x, y> = 2 - 2 <x, y>

@cjnolet
Copy link
Contributor Author

cjnolet commented May 15, 2020

@mdouze, this works. For some reason, I thought the inner product was still using a min heap so I was trying to flip the resulting sort order beforehand. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants