gensim.similarities.Similarity merges results from shards incorrectly (LSI model) #2584
Comments
You're right. This seems an inconsistency / bug introduced by PR #811. @avoskresensky in your use case – is |
If my assumption that it's the absolute values that indicate similarity is correct, then I don't care that much about the raw values. But if I'm wrong and sign of the sim values does mean some form of dissimilarity, then the whole idea of sorting while merging becomes questionable. |
Negative similarity values (close to -1) mean "opposite vectors", which can also be interpreted as "very similar", because it indicates a strong semantic connection between the two inputs. Cossim values around 0.0 mean perpendicular = unrelated. Whether "opposite" really means "similar" or not will depend on how you generated your vectors, and what you're using them for. That's why I ask: this is an "upstream" question, not a technical one. Either way: can you open a PR that fixes Similarity to sort on |
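To make the three cases concrete, here is a minimal sketch in plain Python. The `cossim` helper and the toy 2-D vectors are assumptions made up for illustration; they are not gensim code.

```python
import math


def cossim(u, v):
    """Cosine similarity of two dense vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, chosen only to hit the three cases.
query = [1.0, 2.0]

same = cossim(query, [2.0, 4.0])        # parallel: close to +1
opposite = cossim(query, [-1.0, -2.0])  # anti-parallel: close to -1
unrelated = cossim(query, [2.0, -1.0])  # perpendicular: 0

print(same, opposite, unrelated)
```

Ranking by `abs()` would put the parallel and anti-parallel vectors together at the top; ranking by the raw value would put the anti-parallel one last, below the unrelated one.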
@avoskresensky Ping on this. Are you able to make a PR? |
Not in the next couple of weeks. |
If `num_best` is used, `gensim.similarities.Similarity` runs the query against each of the shards (`MatrixSimilarity` objects) and then merges the results. `MatrixSimilarity` uses `matutils.full2sparse_clipped()` to pick the `num_best` results, which sorts by absolute value. `gensim.similarities.Similarity`, on the other hand, just uses `heapq.nlargest` (in `__getitem__`) to merge the results from each of the shards. So negative sims are either pushed down the list or cut off completely.
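The mismatch described above can be reproduced without gensim. The following is only a sketch: the shard results are made-up `(doc_id, similarity)` pairs standing in for what each shard would return, and the merge step is a simplified stand-in for the real one, not a copy of the library code.

```python
import heapq

# Hypothetical per-shard results, as if each shard had already picked
# its own best hits by absolute similarity.
shard_results = [
    [(0, 0.91), (1, -0.88)],
    [(2, -0.97), (3, 0.42)],
]
num_best = 2

# Flatten the per-shard lists into one list of candidate hits.
all_hits = [hit for shard in shard_results for hit in shard]

# Merging on the raw similarity: strongly negative hits are cut off.
by_raw = heapq.nlargest(num_best, all_hits, key=lambda hit: hit[1])

# Merging on the absolute value: consistent with the per-shard selection.
by_abs = heapq.nlargest(num_best, all_hits, key=lambda hit: abs(hit[1]))

print(by_raw)  # [(0, 0.91), (3, 0.42)]
print(by_abs)  # [(2, -0.97), (0, 0.91)]
```

With the raw key, document 3 (0.42) displaces document 2 (-0.97), even though the shard-level selection treated -0.97 as the stronger hit. Whether the absolute-value ordering is the right one still depends on how the vectors are meant to be interpreted.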