Research on L1 distance metric #12
Comments
Hi! Thanks for your interest in Faiss. Could you consider the following points:
Regarding 1.: Matthijs and I are the authors of the Yael library (http://yael.gforge.inria.fr/, a CPU library), which included a k-means++ initialization. Our observation was that, on all our experimental data, this initialization was actually costly and not BLAS3-friendly, even though we put some effort into parallelizing it. As Matthijs mentioned, the gain from k-means++ was not worth this cost in our experiments (it was very small). For a fixed time budget, we found that it was better (w.r.t. the loss) to add a few iterations to the regular k-means, which is why random selection was the default choice in our Matlab interface.

Regarding 2.: as mentioned by Matthijs, Faiss includes a spherical k-means, which aims to maximize a cosine-similarity objective (sketched below).
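Assuming the standard spherical k-means formulation (my reading of the objective, not necessarily the exact expression used in Faiss), with assignments $a(i)$ and unit-norm centroids $c_j$:

$$\max_{\{c_j\},\,a} \;\sum_{i=1}^{n} \frac{\langle x_i,\, c_{a(i)}\rangle}{\lVert x_i\rVert\,\lVert c_{a(i)}\rVert}$$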
This is the algorithm described in Figure 1 of the referenced paper. We are aware that there are some variants of this algorithm, including one I co-authored that modifies the cost because the objective function is different (see Section 6.5 in https://arxiv.org/pdf/1412.3328.pdf). Which algorithm do you refer to? Thanks for your feedback!
Hi @vmarkovtsev, Indeed the port to Python 3 was not complete; I have added instructions for it. There are three possible levels at which other k-means algorithms could be integrated:
At which level do you think your changes to the clustering could be integrated? Thanks
@mdouze, I guess my proposed additions fall under (1).
@vmarkovtsev If you have done some experiments with k-means++ vs. random init (large-scale; we are not much interested in datasets comprising fewer than 1 million input vectors and fewer than 10k centroids), I am of course curious about the results.
I see, thanks. Our datasets are somewhat NLP-ish; https://data.world/vmarkovtsev has examples.
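For reference, a minimal sketch of the kind of init comparison asked for above, using scikit-learn rather than Faiss; the sizes, seed, and iteration count are placeholders, not a prescribed protocol:

```python
import time
import numpy as np
from sklearn.cluster import KMeans

# Placeholder sizes; the discussion above targets >= 1M vectors and >= 10k centroids.
n, d, k = 100_000, 64, 1_000
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d)).astype(np.float32)

for init in ("k-means++", "random"):
    t0 = time.time()
    km = KMeans(n_clusters=k, init=init, n_init=1, max_iter=20, random_state=0)
    km.fit(X)
    # compare final loss (inertia) and wall-clock time for the two init strategies
    print(f"{init:10s}  inertia={km.inertia_:.3e}  time={time.time() - t0:.1f}s")
```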
Any implementation on the GPU aiming for pairwise L1 distance will likely be a lot slower, since there would need to be an equivalent of the GEMM kernel that computes pairwise sums of absolute differences rather than inner products (see the sketch below). However, cuBLAS GEMM itself is quite inefficient for small reduction dimensions. A kernel could probably be written for L1 using a GEMM-like strategy, but it would take a lot of time to get right or to tune appropriately for different architectures.
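A minimal numpy sketch (illustration only, not Faiss or CUDA code; names and sizes are made up) of why pairwise L2 maps onto a GEMM while pairwise L1 does not:

```python
import numpy as np

nq, nb, d = 4, 6, 8
X = np.random.rand(nq, d).astype(np.float32)   # queries
Y = np.random.rand(nb, d).astype(np.float32)   # database

# Squared L2 decomposes as ||x||^2 + ||y||^2 - 2 <x, y>:
# the cross term is a single matrix multiply (GEMM), which GPUs do very fast.
l2_sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T

# L1 has no such decomposition into per-vector terms plus an inner product,
# so a naive computation materializes all nq * nb * d differences.
l1 = np.abs(X[:, None, :] - Y[None, :, :]).sum(-1)

# sanity check of the GEMM-based L2 against the direct computation
ref = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
assert np.allclose(l2_sq, ref, atol=1e-4)
```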
@vmarkovtsev: L1 is not derivable from an inner product. Employing suitable embeddings, however, you can map a vector into a space where L2/inner-product search machinery approximates the original L1 distances.
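One classical construction along these lines (a hypothetical illustration, not necessarily the embedding meant above): a unary "thermometer" encoding of quantized coordinates turns L1 distance into squared L2 distance on binary codes:

```python
import numpy as np

B = 16  # assumed upper bound on the (quantized) integer coordinate values

def thermometer(x, B=B):
    """Encode an integer vector x (values in [0, B]) as a concatenation of
    length-B unary codes, one per coordinate."""
    x = np.asarray(x, dtype=np.int64)
    return (np.arange(B)[None, :] < x[:, None]).astype(np.float32).ravel()

x = np.array([3, 7, 0, 12])
y = np.array([5, 7, 2, 9])

l1 = np.abs(x - y).sum()
l2_sq = ((thermometer(x) - thermometer(y)) ** 2).sum()
print(l1, l2_sq)  # both equal 7: L1 on the integers == squared L2 on the codes
```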
Thank you everybody! So, to summarize my initial 3 points.
Thus, I am renaming the topic to better reflect the intent.
Duplicate of #848; merging into a more general task. We might support some of this for GPU over the next couple of months.
Hi!
This project kicks our https://github.com/src-d/kmcuda's ass to a major extent. I think the best way for us to move forward is to integrate the parts that are missing (for source{d}) into Faiss. Those are:
If you agree with these, I will start making PRs. If you don't, I will have to incorporate Faiss inside kmcuda as the second non-free backend. Of course, I would prefer option 1.