Commit
Implement Levenshtein term similarity matrix and fast SCM between corpora (#2016)

* Wrap docstring for WordEmbeddingsKeyedVectors.similarity_matrix
* Add the gensim.models.levenshtein module
* Add projected density to term similarity matrix logs
* Add tests for the gensim.models.levenshtein.similarity_matrix function
* Separate similarity_matrix methods into director and builder classes.
* Add symmetric parameter to SparseTermSimilarityMatrix
* Add corpus support to SparseTermSimilarityMatrix.inner_product
* Replace scipy.sparse.dok_matrix.has_key with the in operator
* Fix handling of unicode in Python 3 in levsim
* Remove temporary method similarity of LevenshteinSimilarityIndex
* Move models.term_similarity, and levenshtein to similarities
* Make python-Levenshtein a conditional import
* Add default values to gensim.similarities.levenshtein.levsim arguments
* Remove extraneous addition operators from @deprecated annotations
* Remove @deprecated annotation from tests
* Merge test_term_similarity, and test_levenshtein with test_similarities
* Reword TermSimilarityIndex docstring
* Consume no more than topn similarities produced by a TermSimilarityIndex
* Use short uints (<64b) for dok_matrix keys and num_nonzero array
* Write to matrix_nonzero only when building a symmetric matrix
* Ensure UniformTermSimilarityIndex does not yield only topn - 1 values
* Document _shortest_uint_dtype
* Add soft cosine measure benchmark, part 1
* Add soft cosine measure benchmark, part 2
* Make similarity_matrix support non-contiguous dictionaries
  Closes #2041
* Support fast inner product between a document and a corpus
* Support fast inner product between a document and a corpus (python 2.7)
* Add faster sparse matrix slicing
* Make Soft Cosine Measure support non-contiguous dictionaries
* Remove gensim::similarities::levenshtein::similarity_matrix facade
* Implement SoftCosineSimilarity using the inner_product method
* Fix flake8 warnings
* Make Soft Cosine Measure support non-contiguous dictionaries (cont)
* Remove parallelization in gensim::similarities::levenshtein
* Document future work
* Update Soft Cosine Measure benchmark after commits 093d569, and c316b95
* Update SCM tutorial after PR 2016
* Add example to gensim::similarities::termsim::SparseTermSimilarityMatrix
* Add max_distance kwarg to gensim::similarities::levenshtein::levsim
* Replace max_distance kwarg in levsim with min_similarity, add tests
* Remove conditional expression from levsim
* Use less confusing wording in docsting for min_similarity / max_distance
* Defer thresholding in LevenshteinSimilarityIndex.most_similar to levsim
* Allow None value of nonzero_limit parameter in SparseTermSimilarityMatrix
* Add positive_definite parameter to SparseTermSimilarityMatrix
* Split test_building test into a number of atomic unit tests
* Presort dictionary keys in UniformTermSimilarityIndex constructor
* Make documentation of SparseTermSimilarityMatrix more accurate
* Make SparseTermSimilarityMatrix expect negative similarities
* Avoid expensive array copying in dot_product
* Update SCM tutorial, and benchmark after PR 2016
* Remove fluff from stderr in the SCM tutorial notebook
* Add a paper reference to the SCM tutorial notebook
* Directly import Levenshtein package in levdist
* Use embedded URI instead of indirect hyperlink target in documentation
* Assume that max of lens is always an integer
* Make LevenshteinSimilarityIndex.most_similar easier to read
* Make LevenshteinSimilarityIndex.most_similar easier to read
* Add an ordering test for LevenshteinSimilarityIndex.most_similar
* Make WordEmbeddingSimilarityIndex.most_similar easier to read
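
For readers unfamiliar with the two ideas named in the commit title, the following is a minimal pure-Python sketch of a Levenshtein-based term similarity feeding a soft cosine measure (SCM). It deliberately does not use the gensim API touched by this commit: the function names are hypothetical, and the normalization `1 - distance / max(len)` is an illustrative assumption, not necessarily the formula that `levsim` implements.

```python
from math import sqrt


def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def term_similarity(a: str, b: str) -> float:
    """Hypothetical normalized similarity in [0, 1]; 1.0 means identical terms."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


def inner_product(x: dict, y: dict) -> float:
    """Compute x^T S y for bag-of-words dicts, with S given by term_similarity."""
    return sum(
        weight_x * term_similarity(term_x, term_y) * weight_y
        for term_x, weight_x in x.items()
        for term_y, weight_y in y.items()
    )


def soft_cosine(x: dict, y: dict) -> float:
    """Soft cosine measure: x^T S y / sqrt((x^T S x) * (y^T S y))."""
    norm = sqrt(inner_product(x, x) * inner_product(y, y))
    return inner_product(x, y) / norm if norm > 0 else 0.0


# Toy bag-of-words documents (term -> weight), made up for illustration.
doc_a = {"colour": 1, "matrix": 1}
doc_b = {"color": 1, "matrix": 1}
doc_c = {"banana": 1}
```

Under these assumptions `soft_cosine(doc_a, doc_b)` scores higher than `soft_cosine(doc_a, doc_c)`, because "colour" and "color" are one edit apart. The sketch evaluates every term pair directly, which is quadratic in vocabulary size; the sparse matrix, `nonzero_limit`, and `topn` items in the list above suggest the commit avoids that cost by storing only a limited number of similarities per term.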
1 parent 60b381e, commit f3cf463
Showing 12 changed files with 5,771 additions and 229 deletions.