
Commit

Implement Levenshtein term similarity matrix and fast SCM between corpora (#2016)

* Wrap docstring for WordEmbeddingsKeyedVectors.similarity_matrix

* Add the gensim.models.levenshtein module

* Add projected density to term similarity matrix logs

* Add tests for the gensim.models.levenshtein.similarity_matrix function

* Separate similarity_matrix methods into director and builder classes.

* Add symmetric parameter to SparseTermSimilarityMatrix

* Add corpus support to SparseTermSimilarityMatrix.inner_product

* Replace scipy.sparse.dok_matrix.has_key with the in operator

* Fix handling of unicode in Python 3 in levsim

* Remove temporary method similarity of LevenshteinSimilarityIndex

* Move models.term_similarity and levenshtein to similarities

* Make python-Levenshtein a conditional import

* Add default values to gensim.similarities.levenshtein.levsim arguments

* Remove extraneous addition operators from @deprecated annotations

* Remove @deprecated annotation from tests

* Merge test_term_similarity and test_levenshtein with test_similarities

* Reword TermSimilarityIndex docstring

* Consume no more than topn similarities produced by a TermSimilarityIndex

* Use short uints (<64b) for dok_matrix keys and num_nonzero array

* Write to matrix_nonzero only when building a symmetric matrix

* Ensure UniformTermSimilarityIndex does not yield only topn - 1 values

* Document _shortest_uint_dtype

* Add soft cosine measure benchmark, part 1

* Add soft cosine measure benchmark, part 2

* Make similarity_matrix support non-contiguous dictionaries
Closes #2041

* Support fast inner product between a document and a corpus

* Support fast inner product between a document and a corpus (python 2.7)

* Add faster sparse matrix slicing

* Make Soft Cosine Measure support non-contiguous dictionaries

* Remove gensim::similarities::levenshtein::similarity_matrix facade

* Implement SoftCosineSimilarity using the inner_product method

* Fix flake8 warnings

* Make Soft Cosine Measure support non-contiguous dictionaries (cont)

* Remove parallelization in gensim::similarities::levenshtein

* Document future work

* Update Soft Cosine Measure benchmark after commits 093d569 and c316b95

* Update SCM tutorial after PR 2016

* Add example to gensim::similarities::termsim::SparseTermSimilarityMatrix

* Add max_distance kwarg to gensim::similarities::levenshtein::levsim

* Replace max_distance kwarg in levsim with min_similarity, add tests

* Remove conditional expression from levsim

* Use less confusing wording in docstring for min_similarity / max_distance

* Defer thresholding in LevenshteinSimilarityIndex.most_similar to levsim

* Allow None value of nonzero_limit parameter in SparseTermSimilarityMatrix

* Add positive_definite parameter to SparseTermSimilarityMatrix

* Split test_building test into a number of atomic unit tests

* Presort dictionary keys in UniformTermSimilarityIndex constructor

* Make documentation of SparseTermSimilarityMatrix more accurate

* Make SparseTermSimilarityMatrix expect negative similarities

* Avoid expensive array copying in dot_product

* Update SCM tutorial, and benchmark after PR 2016

* Remove fluff from stderr in the SCM tutorial notebook

* Add a paper reference to the SCM tutorial notebook

* Directly import Levenshtein package in levdist

* Use embedded URI instead of indirect hyperlink target in documentation

* Assume that max of lens is always an integer

* Make LevenshteinSimilarityIndex.most_similar easier to read

* Make LevenshteinSimilarityIndex.most_similar easier to read

* Add an ordering test for LevenshteinSimilarityIndex.most_similar

* Make WordEmbeddingSimilarityIndex.most_similar easier to read
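
Taken together, the commits above expose two new public entry points: pluggable term similarity indexes (Levenshtein, word embedding, uniform) and a SparseTermSimilarityMatrix whose inner_product computes the Soft Cosine Measure between documents and whole corpora in one call. The following is a minimal usage sketch, not part of the commit; it assumes the optional python-Levenshtein package is installed and leaves the index's tuning parameters at their defaults.

from gensim.corpora import Dictionary
from gensim.similarities import LevenshteinSimilarityIndex, SparseTermSimilarityMatrix
from gensim.test.utils import common_texts

dictionary = Dictionary(common_texts)
bow_corpus = [dictionary.doc2bow(document) for document in common_texts]

# build a Levenshtein term similarity matrix over the dictionary
termsim_index = LevenshteinSimilarityIndex(dictionary)
similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)

# fast SCM between a query document and an entire corpus in one call
query = dictionary.doc2bow('graph trees computer'.split())
similarities = similarity_matrix.inner_product(query, bow_corpus, normalized=True)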
Witiko authored and menshikh-iv committed Jan 14, 2019
1 parent 60b381e commit f3cf463
Showing 12 changed files with 5,771 additions and 229 deletions.
4,605 changes: 4,605 additions & 0 deletions docs/notebooks/soft_cosine_benchmark.ipynb

Large diffs are not rendered by default.

125 changes: 72 additions & 53 deletions docs/notebooks/soft_cosine_tutorial.ipynb

Large diffs are not rendered by default.

10 changes: 8 additions & 2 deletions gensim/matutils.py
@@ -14,6 +14,7 @@
import math

from gensim import utils
from gensim.utils import deprecated

import numpy as np
import scipy.sparse
@@ -796,6 +797,9 @@ def cossim(vec1, vec2):
return result


@deprecated(
"Function will be removed in 4.0.0, use "
"gensim.similarities.termsim.SparseTermSimilarityMatrix.inner_product instead")
def softcossim(vec1, vec2, similarity_matrix):
"""Get Soft Cosine Measure between two vectors given a term similarity matrix.
@@ -816,8 +820,10 @@ def softcossim(vec1, vec2, similarity_matrix):
vec2 : list of (int, float)
A document vector in the BoW format.
similarity_matrix : {:class:`scipy.sparse.csc_matrix`, :class:`scipy.sparse.csr_matrix`}
A term similarity matrix, typically produced by
:meth:`~gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.similarity_matrix`.
A term similarity matrix. If the matrix is :class:`scipy.sparse.csr_matrix`, it is going
to be transposed. If you rely on the fact that there is at most a constant number of
non-zero elements in a single column, it is your responsibility to ensure that the matrix
is symmetric.
Returns
-------
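
A sketch of the replacement that the deprecation notice above points to, under this PR's API; normalized=True reproduces the cosine normalization that softcossim performed, and the word2vec setup is illustrative rather than taken from the commit.

from gensim.corpora import Dictionary
from gensim.models import Word2Vec, WordEmbeddingSimilarityIndex
from gensim.similarities import SparseTermSimilarityMatrix
from gensim.test.utils import common_texts

model = Word2Vec(common_texts, size=20, min_count=1)
dictionary = Dictionary(common_texts)
matrix = SparseTermSimilarityMatrix(WordEmbeddingSimilarityIndex(model.wv), dictionary)

vec1 = dictionary.doc2bow('graph trees'.split())
vec2 = dictionary.doc2bow('graph minors'.split())
similarity = matrix.inner_product(vec1, vec2, normalized=True)  # replaces softcossim(vec1, vec2, ...)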
2 changes: 1 addition & 1 deletion gensim/models/__init__.py
@@ -13,7 +13,7 @@
from .logentropy_model import LogEntropyModel # noqa:F401
from .word2vec import Word2Vec # noqa:F401
from .doc2vec import Doc2Vec # noqa:F401
from .keyedvectors import KeyedVectors # noqa:F401
from .keyedvectors import KeyedVectors, WordEmbeddingSimilarityIndex # noqa:F401
from .ldamulticore import LdaMulticore # noqa:F401
from .phrases import Phrases # noqa:F401
from .normmodel import NormModel # noqa:F401
139 changes: 63 additions & 76 deletions gensim/models/keyedvectors.py
@@ -160,7 +160,6 @@

from __future__ import division # py3 "true division"

from collections import deque
from itertools import chain
import logging

@@ -173,11 +172,12 @@
double, array, zeros, vstack, sqrt, newaxis, integer, \
ndarray, sum as np_sum, prod, argmax
import numpy as np

from gensim import utils, matutils # utility fnc for pickling, common scipy operations etc
from gensim.corpora.dictionary import Dictionary
from six import string_types, integer_types
from six.moves import zip, range
from scipy import sparse, stats
from scipy import stats
from gensim.utils import deprecated
from gensim.models.utils_any2vec import (
_save_word2vec_format,
@@ -186,6 +186,7 @@
_ft_hash,
_ft_hash_broken
)
from gensim.similarities.termsim import TermSimilarityIndex, SparseTermSimilarityMatrix

logger = logging.getLogger(__name__)

@@ -606,6 +607,9 @@ def similar_by_vector(self, vector, topn=10, restrict_vocab=None):
"""
return self.most_similar(positive=[vector], topn=topn, restrict_vocab=restrict_vocab)

@deprecated(
"Method will be removed in 4.0.0, use "
"gensim.models.keyedvectors.WordEmbeddingSimilarityIndex instead")
def similarity_matrix(self, dictionary, tfidf=None, threshold=0.0, exponent=2.0, nonzero_limit=100, dtype=REAL):
"""Construct a term similarity matrix for computing Soft Cosine Measure.
@@ -615,24 +619,21 @@ def similarity_matrix(self, dictionary, tfidf=None, threshold=0.0, exponent=2.0,
Parameters
----------
dictionary : :class:`~gensim.corpora.dictionary.Dictionary`
A dictionary that specifies a mapping between words and the indices of rows and columns
of the resulting term similarity matrix.
tfidf : :class:`gensim.models.tfidfmodel.TfidfModel`, optional
A model that specifies the relative importance of the terms in the dictionary. The rows
of the term similarity matrix will be built in a decreasing order of importance of terms,
or in the order of term identifiers if None.
A dictionary that specifies the considered terms.
tfidf : :class:`gensim.models.tfidfmodel.TfidfModel` or None, optional
A model that specifies the relative importance of the terms in the dictionary. The
columns of the term similarity matrix will be built in a decreasing order of importance
of terms, or in the order of term identifiers if None.
threshold : float, optional
Only pairs of words whose embeddings are more similar than `threshold` are considered
when building the sparse term similarity matrix.
Only embeddings more similar than `threshold` are considered when retrieving word
embeddings closest to a given word embedding.
exponent : float, optional
The exponent applied to the similarity between two word embeddings when building the term similarity matrix.
Take the word embedding similarities larger than `threshold` to the power of `exponent`.
nonzero_limit : int, optional
The maximum number of non-zero elements outside the diagonal in a single row or column
of the term similarity matrix. Setting `nonzero_limit` to a constant ensures that the
time complexity of computing the Soft Cosine Measure will be linear in the document
length rather than quadratic.
The maximum number of non-zero elements outside the diagonal in a single column of the
sparse term similarity matrix.
dtype : numpy.dtype, optional
Data-type of the term similarity matrix.
Data-type of the sparse term similarity matrix.
Returns
-------
@@ -654,66 +655,10 @@ def similarity_matrix(self, dictionary, tfidf=None, threshold=0.0, exponent=2.0,
<http://www.aclweb.org/anthology/S/S17/S17-2051.pdf>`_.
"""
logger.info("constructing a term similarity matrix")
matrix_order = len(dictionary)
matrix_nonzero = [1] * matrix_order
matrix = sparse.identity(matrix_order, dtype=dtype, format="dok")
num_skipped = 0
# Decide the order of rows.
if tfidf is None:
word_indices = deque(sorted(dictionary.keys()))
else:
assert max(tfidf.idfs) < matrix_order
word_indices = deque([
index for index, _
in sorted(tfidf.idfs.items(), key=lambda x: (x[1], -x[0]), reverse=True)
])

# Traverse rows.
for row_number, w1_index in enumerate(list(word_indices)):
word_indices.popleft()
if row_number % 1000 == 0:
logger.info(
"PROGRESS: at %.02f%% rows (%d / %d, %d skipped, %.06f%% density)",
100.0 * (row_number + 1) / matrix_order, row_number + 1, matrix_order,
num_skipped, 100.0 * matrix.getnnz() / matrix_order**2)
w1 = dictionary[w1_index]
if w1 not in self.vocab:
num_skipped += 1
continue # A word from the dictionary is not present in the word2vec model.

# Traverse upper triangle columns.
if matrix_order <= nonzero_limit + 1: # Traverse all columns.
columns = (
(w2_index, self.similarity(w1, dictionary[w2_index]))
for w2_index in word_indices
if dictionary[w2_index] in self.vocab)
else: # Traverse only columns corresponding to the embeddings closest to w1.
num_nonzero = matrix_nonzero[w1_index] - 1
columns = (
(dictionary.token2id[w2], similarity)
for _, (w2, similarity)
in zip(
range(nonzero_limit - num_nonzero),
self.most_similar(positive=[w1], topn=nonzero_limit - num_nonzero)
)
if w2 in dictionary.token2id
)
columns = sorted(columns, key=lambda x: x[0])

for w2_index, similarity in columns:
# Ensure that we don't exceed `nonzero_limit` by mirroring the upper triangle.
if similarity > threshold and matrix_nonzero[w2_index] <= nonzero_limit:
element = similarity**exponent
matrix[w1_index, w2_index] = element
matrix_nonzero[w1_index] += 1
matrix[w2_index, w1_index] = element
matrix_nonzero[w2_index] += 1
logger.info(
"constructed a term similarity matrix with %0.6f %% nonzero elements",
100.0 * matrix.getnnz() / matrix_order**2
)
return matrix.tocsc()
index = WordEmbeddingSimilarityIndex(self, threshold=threshold, exponent=exponent)
similarity_matrix = SparseTermSimilarityMatrix(
index, dictionary, tfidf=tfidf, nonzero_limit=nonzero_limit, dtype=dtype)
return similarity_matrix.matrix

def wmdistance(self, document1, document2):
"""Compute the Word Mover's Distance between two documents.
@@ -1386,6 +1331,48 @@ def init_sims(self, replace=False):
self.vectors_norm = _l2_norm(self.vectors, replace=replace)


class WordEmbeddingSimilarityIndex(TermSimilarityIndex):
"""
Computes cosine similarities between word embeddings and retrieves the closest word embeddings
by cosine similarity for a given word embedding.
Parameters
----------
keyedvectors : :class:`~gensim.models.keyedvectors.WordEmbeddingsKeyedVectors`
The word embeddings.
threshold : float, optional
Only embeddings more similar than `threshold` are considered when retrieving word embeddings
closest to a given word embedding.
exponent : float, optional
Take the word embedding similarities larger than `threshold` to the power of `exponent`.
kwargs : dict or None
A dict with keyword arguments that will be passed to the `keyedvectors.most_similar` method
when retrieving the word embeddings closest to a given word embedding.
See Also
--------
:class:`~gensim.similarities.termsim.SparseTermSimilarityMatrix`
Build a term similarity matrix and compute the Soft Cosine Measure.
"""
def __init__(self, keyedvectors, threshold=0.0, exponent=2.0, kwargs=None):
assert isinstance(keyedvectors, WordEmbeddingsKeyedVectors)
self.keyedvectors = keyedvectors
self.threshold = threshold
self.exponent = exponent
self.kwargs = kwargs or {}
super(WordEmbeddingSimilarityIndex, self).__init__()

def most_similar(self, t1, topn=10):
if t1 not in self.keyedvectors.vocab:
logger.debug('an out-of-dictionary term "%s"', t1)
else:
most_similar = self.keyedvectors.most_similar(positive=[t1], topn=topn, **self.kwargs)
for t2, similarity in most_similar:
if similarity > self.threshold:
yield (t2, similarity**self.exponent)


class Word2VecKeyedVectors(WordEmbeddingsKeyedVectors):
"""Mapping between words and vectors for the :class:`~gensim.models.Word2Vec` model.
Used to perform operations on the vectors such as vector lookup, distance, similarity etc.
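
A short sketch of the new WordEmbeddingSimilarityIndex in isolation. Note that the constructor asserts it receives keyed vectors, so a trained model must be passed as model.wv; the toy training corpus below is gensim's common_texts, chosen only for illustration.

from gensim.models import Word2Vec, WordEmbeddingSimilarityIndex
from gensim.test.utils import common_texts

model = Word2Vec(common_texts, size=20, min_count=1)
termsim_index = WordEmbeddingSimilarityIndex(model.wv, threshold=0.0, exponent=2.0)
for term, similarity in termsim_index.most_similar('graph', topn=3):
    print(term, similarity)  # cosine similarities raised to the power of `exponent`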
2 changes: 2 additions & 0 deletions gensim/similarities/__init__.py
@@ -4,3 +4,5 @@

# bring classes directly into package namespace, to save some typing
from .docsim import Similarity, MatrixSimilarity, SparseMatrixSimilarity, SoftCosineSimilarity, WmdSimilarity # noqa:F401
from .termsim import TermSimilarityIndex, UniformTermSimilarityIndex, SparseTermSimilarityMatrix # noqa:F401
from .levenshtein import LevenshteinSimilarityIndex # noqa:F401
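
The termsim module also exports UniformTermSimilarityIndex, which, as the name suggests, reports one uniform similarity for every pair of distinct terms and is chiefly useful in tests. A sketch; the term_similarity keyword and its 0.5 value are assumptions about the module's signature, not shown in this diff.

from gensim.corpora import Dictionary
from gensim.similarities import UniformTermSimilarityIndex, SparseTermSimilarityMatrix
from gensim.test.utils import common_texts

dictionary = Dictionary(common_texts)
termsim_index = UniformTermSimilarityIndex(dictionary, term_similarity=0.5)
matrix = SparseTermSimilarityMatrix(termsim_index, dictionary, nonzero_limit=100)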
77 changes: 35 additions & 42 deletions gensim/similarities/docsim.py
@@ -77,6 +77,7 @@
import scipy.sparse

from gensim import interfaces, utils, matutils
from .termsim import SparseTermSimilarityMatrix
from six.moves import map, range, zip


@@ -272,8 +273,6 @@ class Similarity(interfaces.SimilarityABC):
Index similarity (dense with cosine distance).
:class:`~gensim.similarities.docsim.SparseMatrixSimilarity`
Index similarity (sparse with cosine distance).
:class:`~gensim.similarities.docsim.SoftCosineSimilarity`
Index similarity (with soft-cosine distance).
:class:`~gensim.similarities.docsim.WmdSimilarity`
Index similarity (with word-mover distance).
@@ -866,20 +865,18 @@ class SoftCosineSimilarity(interfaces.SimilarityABC):
>>> from gensim.test.utils import common_texts
>>> from gensim.corpora import Dictionary
>>> from gensim.models import Word2Vec
>>> from gensim.similarities import SoftCosineSimilarity
>>> from gensim.models import Word2Vec, WordEmbeddingSimilarityIndex
>>> from gensim.similarities import SoftCosineSimilarity, SparseTermSimilarityMatrix
>>>
>>> model = Word2Vec(common_texts, size=20, min_count=1) # train word-vectors
>>> termsim_index = WordEmbeddingSimilarityIndex(model.wv)
>>> dictionary = Dictionary(common_texts)
>>> bow_corpus = [dictionary.doc2bow(document) for document in common_texts]
>>> similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary) # construct similarity matrix
>>> docsim_index = SoftCosineSimilarity(bow_corpus, similarity_matrix, num_best=10)
>>>
>>> similarity_matrix = model.wv.similarity_matrix(dictionary) # construct similarity matrix
>>> index = SoftCosineSimilarity(bow_corpus, similarity_matrix, num_best=10)
>>>
>>> # Make a query.
>>> query = 'graph trees computer'.split()
>>> # calculate similarity between query and each doc from bow_corpus
>>> sims = index[dictionary.doc2bow(query)]
>>> query = 'graph trees computer'.split() # make a query
>>> sims = docsim_index[dictionary.doc2bow(query)] # calculate similarity of query to each doc from bow_corpus
Check out `Tutorial Notebook
<https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/soft_cosine_tutorial.ipynb>`_
@@ -893,24 +890,32 @@ def __init__(self, corpus, similarity_matrix, num_best=None, chunksize=256):
----------
corpus: iterable of list of (int, float)
A list of documents in the BoW format.
similarity_matrix : :class:`scipy.sparse.csc_matrix`
A term similarity matrix, typically produced by
:meth:`~gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.similarity_matrix`.
similarity_matrix : :class:`gensim.similarities.SparseTermSimilarityMatrix`
A term similarity matrix.
num_best : int, optional
The number of results to retrieve for a query, if None - return similarities with all elements from corpus.
chunksize: int, optional
Size of one corpus chunk.
See Also
--------
:meth:`gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.similarity_matrix`
A term similarity matrix produced from term embeddings.
:func:`gensim.matutils.softcossim`
The Soft Cosine Measure.
:class:`gensim.similarities.SparseTermSimilarityMatrix`
A sparse term similarity matrix built using a term similarity index.
:class:`gensim.similarities.LevenshteinSimilarityIndex`
A term similarity index that computes Levenshtein similarities between terms.
:class:`gensim.models.WordEmbeddingSimilarityIndex`
A term similarity index that computes cosine similarities between word embeddings.
"""
if scipy.sparse.issparse(similarity_matrix):
logger.warn(
"Support for passing an unencapsulated sparse matrix will be removed in 4.0.0, pass "
"a SparseTermSimilarityMatrix instance instead")
self.similarity_matrix = SparseTermSimilarityMatrix(similarity_matrix)
else:
self.similarity_matrix = similarity_matrix

self.corpus = corpus
self.similarity_matrix = similarity_matrix
self.num_best = num_best
self.chunksize = chunksize

@@ -943,31 +948,19 @@ def get_similarities(self, query):
Similarity matrix.
"""
if not self.corpus:
return numpy.array([])

is_corpus, query = utils.is_corpus(query)
if not is_corpus:
if isinstance(query, numpy.ndarray):
# Convert document indexes to actual documents.
query = [self.corpus[i] for i in query]
else:
query = [query]

result = []
for query_document in query:
# Compute similarity for each query.
qresult = [matutils.softcossim(query_document, corpus_document, self.similarity_matrix)
for corpus_document in self.corpus]
qresult = numpy.array(qresult)

# Append single query result to list of all results.
result.append(qresult)

if is_corpus:
result = numpy.array(result)
else:
result = result[0]

return result
if not is_corpus and isinstance(query, numpy.ndarray):
query = [self.corpus[i] for i in query] # convert document indexes to actual documents
result = self.similarity_matrix.inner_product(query, self.corpus, normalized=True)

if scipy.sparse.issparse(result):
return numpy.asarray(result.todense())
if numpy.isscalar(result):
return numpy.array(result)
return numpy.asarray(result)[0]

def __str__(self):
return "%s<%i docs, %i features>" % (self.__class__.__name__, len(self), self.similarity_matrix.shape[0])
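
The rewritten get_similarities above replaces the per-document softcossim loop with a single vectorized inner_product call. A standalone sketch of the same computation; the setup mirrors the class docstring example, and the dense conversion follows the method body above.

import numpy
import scipy.sparse
from gensim.corpora import Dictionary
from gensim.models import Word2Vec, WordEmbeddingSimilarityIndex
from gensim.similarities import SparseTermSimilarityMatrix
from gensim.test.utils import common_texts

model = Word2Vec(common_texts, size=20, min_count=1)
dictionary = Dictionary(common_texts)
bow_corpus = [dictionary.doc2bow(document) for document in common_texts]
similarity_matrix = SparseTermSimilarityMatrix(WordEmbeddingSimilarityIndex(model.wv), dictionary)

# one query against the whole corpus in a single call, as get_similarities now does
query = dictionary.doc2bow('graph trees computer'.split())
sims = similarity_matrix.inner_product(query, bow_corpus, normalized=True)
if scipy.sparse.issparse(sims):
    sims = numpy.asarray(sims.todense())[0]  # one similarity per corpus document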
(Diffs for the remaining changed files are not rendered.)
