Implement Soft Cosine Measure #1827
Conversation
defaf4d to 2b4b47c (force-pushed)
I added numpy-style documentation and unit tests. Hopefully, the code should be good to go now.
Great work @Witiko, in general, looks nice!
"outputs": [], | ||
"source": [ | ||
"from time import time\n", | ||
"start_nb = time()\n", |
unused var
Fixed in 08dea4e.
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"sentence_obama = 'Obama speaks to the media in Illinois'\n", |
Change

sentence_obama = 'Obama speaks to the media in Illinois'
sentence_obama = sentence_obama.lower().split()

to

sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()

The same for the other sentences.
Fixed in 621ed0d.
}
],
"source": [
"start = time()\n",
To track the running time of a cell, it is better to use the "magic" %%time:

%%time
.. <SOME CODE> ..

instead of

start = time()
.. <SOME CODE> ..
print('Cell took %.2f seconds to run.' % (time() - start))

here and everywhere.
Fixed in 8af5f67.
"start = time()\n", | ||
"import os\n", | ||
"\n", | ||
"from gensim.models import KeyedVectors\n", |
It is better to use the gensim-data functionality instead of this part:

import gensim.downloader as api
model = api.load("word2vec-google-news-300")
"w2v_corpus = [] # Documents to train word2vec on (all 6 restaurants).\n", | ||
"scs_corpus = [] # Documents to run queries against (only one restaurant).\n", | ||
"documents = [] # scs_corpus, with no pre-processing (so we can see the original documents).\n", | ||
"with open('/data/review.json') as data_file:\n", |
It would be great if we added this dataset to gensim-data and used it here. Can you investigate, @Witiko, whether this is possible (does the Yelp license allow us to share it or not)?
I don't think we can; the license seems to explicitly forbid any distribution of the dataset (see section C). The dataset is also used in the Word Mover's Distance notebook, but judging from the description, it contained different data back when that notebook was created. This would indicate that the dataset keeps changing over time.
Should I update both the WMD and SCS notebooks to use some open dataset available in gensim-data?
For this case, I see two possible solutions:
- Leave it as is (the Yelp dataset is really good for this task, but we can't store it in gensim-data), and probably add a link where the user can download the dataset.
- Choose other datasets (ones that are really nice for demonstration and whose license permits adding them to gensim-data). @Witiko, do you have good candidates for this?
I use the SemEval 2015–2017 Task 3 Subtask B question answering datasets in my thesis. These are also a good fit for the semantic similarity task and are permissively licensed, so it should be possible to add them to gensim-data.
Nice! Please replace the Yelp dataset with SemEval and create an issue in gensim-data with the needed info (guide: https://github.com/RaRe-Technologies/gensim-data#want-to-add-a-new-corpus-or-model). Adding the other SemEval "tasks" would probably be really good too.
I am running the evaluations. If all looks fine, I will submit an issue to gensim-data. Until then, please consider the current state of the Jupyter notebook temporary.
gensim/models/keyedvectors.py
Outdated
if w1_index != w2_index and dictionary[w2_index] in self.vocab)
else: # Traverse only columns corresponding to the embeddings closest to w1.
num_nonzero = similarity_matrix[w1_index].getnnz() - 1
columns = ((dictionary.token2id[w2], similarity)
Please use hanging indents (instead of vertical).
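For illustration, the generator expression above written with vertical alignment and with a hanging indent (a sketch only; the iterable name rows is an assumption, not the actual code):

# Vertical indentation, aligned with the opening bracket:
columns = ((dictionary.token2id[w2], similarity)
           for w2, similarity in rows)

# Hanging indentation, a plain fixed indent on the continuation line:
columns = (
    (dictionary.token2id[w2], similarity)
    for w2, similarity in rows)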
gensim/models/keyedvectors.py
Outdated
# Ensure that we don't exceed `nonzero_limit` by mirroring the upper triangle.
if similarity > threshold and similarity_matrix[w2_index].getnnz() <= nonzero_limit:
element = similarity**exponent
similarity_matrix[w1_index, w2_index] = element
similarity_matrix is symmetrical; maybe it would be better to store only "half" of this matrix and reduce memory usage by half?
I agree that such a saving would be nice, but there seems to be no support in SciPy for a dot product with a symmetrical matrix that stores only the upper / lower triangle ((vec1.T).dot(similarity_matrix).dot(vec2)[0, 0]). Even beyond SciPy, I don't know of a sparse matrix format that would allow efficient access both row-wise and column-wise.
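For context, a minimal runnable sketch of the product in question (the matrix and vector values are made up):

import numpy as np
from scipy.sparse import csc_matrix

# A symmetric term similarity matrix; both triangles must be materialized.
similarity_matrix = csc_matrix(np.array([
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.2],
    [0.0, 0.2, 1.0]]))
vec1 = csc_matrix(np.array([[1.0], [0.0], [1.0]]))  # column vectors of term weights
vec2 = csc_matrix(np.array([[0.0], [1.0], [1.0]]))

# The inner product vec1^T * S * vec2 from the comment above.
product = vec1.T.dot(similarity_matrix).dot(vec2)[0, 0]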
Sad but true, thanks for the clarification.
Note that this knowledge can still be useful when storing and transmitting the similarity matrix.
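For example, a sketch of such storage with SciPy, continuing with the similarity_matrix from the sketch above (the file name is made up): only the upper triangle is written to disk, and the full symmetric matrix is rebuilt after loading.

import scipy.sparse

upper = scipy.sparse.triu(similarity_matrix)  # keep only the upper triangle
scipy.sparse.save_npz('similarity_matrix.npz', upper)  # roughly half the size

loaded = scipy.sparse.load_npz('similarity_matrix.npz')
# Mirror the triangle and subtract the diagonal so it is not counted twice.
full = loaded + loaded.T - scipy.sparse.diags(loaded.diagonal())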
gensim/models/keyedvectors.py
Outdated
if w1 not in self.vocab:
continue # A word from the dictionary not present in the word2vec model.
# Traverse upper triangle columns.
if len(dictionary) <= nonzero_limit + 1: # Traverse all columns.
if num_rows instead of len(dictionary)?
Fixed in effef71.
gensim/models/keyedvectors.py
Outdated
@@ -559,6 +560,90 @@ def similar_by_vector(self, vector, topn=10, restrict_vocab=None):
"""
return self.most_similar(positive=[vector], topn=topn, restrict_vocab=restrict_vocab)

def similarity_matrix(self, corpus, dictionary, threshold=0.0, exponent=2.0, |
corpus isn't used.
Fixed in 08dea4e.
index = self.cls(texts, self.w2v_model)
else:
index = self.cls(corpus, num_features=len(dictionary))
index = self.factoryMethod()
Are you sure that this is equivalent?
Yes, this exact if-then-else statement currently appears in six places in test_similarities.py on gensim/develop. Rather than add two new lines for SoftCosineSimilarity at each of the six locations, I decided to refactor the if-then-else statement into factory methods.
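A minimal standalone illustration of the factory-method pattern being described (the class names and placeholder return values here are made up; the actual test code builds real similarity indexes):

import unittest

class _IndexTestMixin:
    def test_query(self):
        index = self.factoryMethod()  # replaces the repeated if-then-else
        self.assertIsNotNone(index)   # shared assertions follow here

class WmdIndexTest(_IndexTestMixin, unittest.TestCase):
    def factoryMethod(self):
        return 'wmd index'  # stands in for self.cls(texts, self.w2v_model)

class MatrixIndexTest(_IndexTestMixin, unittest.TestCase):
    def factoryMethod(self):
        return 'matrix index'  # stands in for self.cls(corpus, num_features=len(dictionary))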
Should I turn this refactoring step into a separate pull request?
No (because this is a really small change).
Ok, I was more concerned with the fact that this pull request may take a little longer to finish up, whereas this single refactoring is independent and could be incorporated quickly.
@Witiko also please merge
I will look into these; thank you for taking the time to do the review.
ping @Witiko, how is it going?
I am planning to push new commits this weekend.
@evanmiltenburg That should be interesting; I will see if I can run a comparison by the end of the week.
@evanmiltenburg @Witiko I already worked on that :) Long story short: the soft cosine with respective weights can improve semantic word similarity (and I'm sure, also semantic textual similarity).
@thomasniebler: Note that MEN has been criticized for not being a good measure of similarity. Rather, it measures relatedness between terms. See the SimLex-999 dataset for an alternative.
@evanmiltenburg You're correct. Indeed, improving relatedness was my first goal with this measure. However, since we're talking about skewing the vector space, learning on SimLex data should allow us to learn similarity. Maybe not as strict similarity scores as with other metric-learning-based algorithms (Retrofitting, Counterfitting, Paragram Embeddings), as those do not consider intensity scores, but still. Overall, you have a good point. I will consider this in future work.
I wonder if it can be used for image patch similarity, for example, replacing the L2 loss with this.
Well, the issue for me is not how to learn a model that can predict good similarity measures. The issue is that when people try the soft cosine for themselves, without any transformations of the vector space, what will the model be good at? It's good to know this in advance, and it's important to be precise in your claims about this. (We don't want to give users the wrong impression and then disappoint them.) The STS task is a decent intrinsic test of whether the model can predict that two sentences are similar. It's a nice addition to the (slightly more extrinsic) Community Question Answering task that the author of this issue already tested Soft Cosine Similarity on.
Well, technically, the soft cosine measure is a vector transformation, as it makes use of the dot product, which can be easily parameterized by a quadratic matrix, as is done e.g. in the Mahalanobis distance. By using a parameterized similarity or distance measure, you are already transforming the vector space, as metric spaces are defined by their metrics.
# Conflicts:
#	docs/notebooks/soft_cosine_tutorial.ipynb
@Witiko great work 👍
* Implement Soft Cosine Similarity
* Added numpy-style documentation for Soft Cosine Similarity
* Added unit tests for Soft Cosine Similarity
* Make WmdSimilarity and SoftCosineSimilarity handle empty queries
* Rename Soft Cosine Similarity to Soft Cosine Measure
* Add links to Soft Cosine Measure papers
* Remove unused variables and parameters for Soft Cosine Measure
* Replace explicit timers with magic %time in Soft Cosine Measure notebook
* Rename var in term similarity matrix construction to reflect symmetry
* Update SoftCosineSimilarity class example to define all variables
* Make the code in Soft Cosine Measure notebook more compact
* Use hanging indents in EuclideanKeyedVectors.similarity_matrix
* Simplified expressions in WmdSimilarity and SoftCosineSimilarity
* Extract the sparse2coo function to the global scope
* Fix __str__ of SoftCosineSimilarity
* Use hanging indents in SoftCossim.__init__
* Fix formatting of the matutils module
* Make similarity matrix info messages appear at fixed frequency
* Construct term similarity matrix rows for important terms first
* Optimize softcossim for an estimated 100-fold constant speed increase
* Remove unused import in gensim.similarities.docsim
* Fix imports in gensim.models.keyedvectors
* replace reference to anonymous link
* Update "See Also" references to new *2vec implementation
* Fix formatting error in gensim.models.keyedvectors
* Update Soft Cosine Measure tutorial notebook
* Update Soft Cosine Measure tutorial notebook
* Use smaller glove-wiki-gigaword-50 model in Soft Cosine Measure notebook
* Use gensim-data to load SemEval datasets in Soft Cosine Measure notebook
* Use backwards-compatible syntax in Soft Cosine Similarity notebook
* Remove unnecessary package requirements in Soft Cosine Measure notebook
* Fix Soft Cosine Measure notebook to use true gensim-data dataset names
* fix docs[1]
* fix docs[2]
* fix docs[3]
* small fixes
* small fixes[2]
Introduction
I implemented the Soft Cosine Measure (SCM) [wiki, 1, 2] as a part of research for my thesis [3]. Although the original algorithm [1] has a time complexity that is quadratic in the document length, I implemented a linear-time approximative algorithm that I sketch in [3, sec. 4.4]. Since Gensim was such an indispensable asset in my work, I thought I would give back and contribute code. The implementation is showcased in a Jupyter notebook on corpora from the SemEval 2016 and 2017 competitions.
Description
My original implementation closely followed the Gensim implementation of the Word Mover's Distance (WMD), which is split into a gensim.models.keyedvectors.EuclideanKeyedVectors.wmdistance method that takes two token lists and computes the WMD for them, and into the gensim.similarities.WmdSimilarity class that provides batch similarity queries. However, I was not quite happy with this for the following reasons: packing further functionality into gensim.models.keyedvectors.EuclideanKeyedVectors immediately seemed like a bad idea that would hinder further extensions. For the above reasons, I ultimately decided to split the implementation into a function, a method, and a class as follows:
- The gensim.matutils.softcossim function takes two documents in the bag-of-words representation and a sparse term similarity matrix in the scipy CSC format, and computes the SCM.
- The gensim.models.keyedvectors.EuclideanKeyedVectors.similarity_matrix method takes a corpus of bag-of-words vectors and a dictionary, and produces the sparse term similarity matrix Mrel described by Charlet and Damnati, 2017 [1].
- The gensim.similarities.SoftCosineSimilarity class takes a corpus of bag-of-words vectors and a sparse term similarity matrix in the scipy CSC format, and provides batch SCM queries against the corpus.

The above design achieves a much looser coupling between the individual components and eliminates the original concerns. I demonstrate the implementation in a Jupyter notebook on the corpus of Yelp reviews. The linear-time approximative algorithm for the SCM achieves about the same speed as the linear-time approximative algorithm for the WMD (see the corresponding Jupyter notebook).
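For orientation, a minimal usage sketch of the three components (it assumes the gensim 3.x API introduced by this pull request; the toy corpus and model parameters are made up):

from gensim.corpora import Dictionary
from gensim.matutils import softcossim
from gensim.models import Word2Vec
from gensim.similarities import SoftCosineSimilarity

texts = [
    ['obama', 'speaks', 'media', 'illinois'],
    ['president', 'greets', 'press', 'chicago']]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Train a small word2vec model and build the sparse term similarity matrix.
model = Word2Vec(texts, min_count=1, size=20, seed=42)
similarity_matrix = model.wv.similarity_matrix(dictionary)

# Compute the SCM between two documents directly ...
print(softcossim(corpus[0], corpus[1], similarity_matrix))

# ... or run batch SCM queries against the whole corpus.
index = SoftCosineSimilarity(corpus, similarity_matrix, num_best=2)
query = dictionary.doc2bow(['obama', 'press', 'chicago'])
print(index[query])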
Future work
The gensim.similarities.SoftCosineSimilarity class goes over the entire corpus and computes the SCM between the query and each document separately by calling gensim.matutils.softcossim. If performance is a concern, the SCM can be computed in a single step as q^T * S * C, where q is the normalized query vector, S is the term similarity matrix, C is the normalized term-document matrix of the corpus, and "normalized" in this context stands for a vector v being divided by sqrt(v^T * S * v). This is similar to what e.g. the gensim.similarities.MatrixSimilarity.get_similarities method does, only with the basic cosine similarity rather than the SCM.
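For illustration, a dense numpy sketch of this single-step computation (toy data; a real implementation would keep S and C sparse):

import numpy as np

S = np.array([
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.2],
    [0.0, 0.2, 1.0]])  # term similarity matrix
C = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0]])  # term-document matrix, one column per corpus document
q = np.array([[1.0], [1.0], [0.0]])  # query vector

# Normalize every vector v by dividing it by sqrt(v^T * S * v).
C = C / np.sqrt(np.einsum('ij,ik,kj->j', C, S, C))
q = q / np.sqrt(q.T.dot(S).dot(q)[0, 0])

similarities = q.T.dot(S).dot(C)  # SCM of the query against every document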
References