From 3148b603a6e0b269ccb4b5708884362ccd367b32 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?V=C3=ADt=20Novotn=C3=BD?= Date: Mon, 5 Feb 2018 12:25:43 +0100 Subject: [PATCH] Update Soft Cosine Measure tutorial notebook --- docs/notebooks/soft_cosine_tutorial.ipynb | 589 +++++++++++----------- 1 file changed, 296 insertions(+), 293 deletions(-) diff --git a/docs/notebooks/soft_cosine_tutorial.ipynb b/docs/notebooks/soft_cosine_tutorial.ipynb index 5a9e868ff2..a4b90c2d34 100644 --- a/docs/notebooks/soft_cosine_tutorial.ipynb +++ b/docs/notebooks/soft_cosine_tutorial.ipynb @@ -6,22 +6,23 @@ "source": [ "# Finding similar documents with Word2Vec and Soft Cosine Measure \n", "\n", - "Soft Cosine Measure (SCM) is a promising new tool in machine learning that allows us to submit a query and return the most relevant documents. In **part 1**, we will show how you can compute SCM between two documents using `softcossim`. In **part 2**, we will use `SoftCosineSimilarity` to retrieve documents most similar to a query. Part 1 is optional if you only want use `SoftCosineSimilarity`, but is also useful in it's own merit.\n", + "Soft Cosine Measure (SCM) is a promising new tool in machine learning that allows us to submit a query and return the most relevant documents. In **part 1**, we will show how you can compute SCM between two documents using `softcossim`. In **part 2**, we will use `SoftCosineSimilarity` to retrieve documents most similar to a query and compare the performance against other similarity measures.\n", "\n", - "First, however, we go through the basics of what soft cosine measure is.\n", + "First, however, we go through the basics of what Soft Cosine Measure is.\n", "\n", "## Soft Cosine Measure basics\n", "\n", - "Soft Cosine Measure (SCM) is a method that allows us to assess the similarity between two documents in a meaningful way, even when they have no words in common. 
It uses a measure of similarity between words, which can be derived [2] using [word2vec](http://rare-technologies.com/word2vec-tutorial/) [3] vector embeddings of words. It has been shown to outperform many of the state-of-the-art methods in the semantic text similarity task in the context of community question answering [2].\n", + "Soft Cosine Measure (SCM) is a method that allows us to assess the similarity between two documents in a meaningful way, even when they have no words in common. It uses a measure of similarity between words, which can be derived [2] using [word2vec][] [3] vector embeddings of words. It has been shown to outperform many of the state-of-the-art methods in the semantic text similarity task in the context of community question answering [2].\n", "\n", - "SCM is illustrated below for two very similar sentences. The sentences have no words in common, but by matching the relevant words, SCM is able to accurately measure the similarity between the two sentences. The method also uses the bag-of-words vector representation of the documents (simply put, the word's frequencies in the documents). The intution behind the method is that we compute standard cosine similarity assuming that the document vectors are expressed in a non-orthogonal basis, where the angle between two basis vectors is derived from the angle between the word2vec embeddings of the corresponding words.\n", + "[word2vec]: https://radimrehurek.com/gensim/models/word2vec.html\n", "\n", - "![Soft Cosine Measure](soft_cosine_tutorial.png)\n", + "SCM is illustrated below for two very similar sentences. The sentences have no words in common, but by modeling synonymy, SCM is able to accurately measure the similarity between the two sentences. The method also uses the bag-of-words vector representation of the documents (simply put, the words' frequencies in the documents). 
The intuition behind the method is that we compute standard cosine similarity assuming that the document vectors are expressed in a non-orthogonal basis, where the angle between two basis vectors is derived from the angle between the word2vec embeddings of the corresponding words.\n", "\n", + "![Soft Cosine Measure](soft_cosine_tutorial.png)\n", "\n", - "This method was introduced in the article \"Soft Measure and Soft Cosine Measure: Measure of Features in Vector Space Model\" by Grigori Sidorov, Alexander Gelbukh, Helena Gomez-Adorno, and David Pinto ([link to PDF](http://www.scielo.org.mx/pdf/cys/v18n3/v18n3a7.pdf)).\n", + "This method was perhaps first introduced in the article “Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model” by Grigori Sidorov, Alexander Gelbukh, Helena Gomez-Adorno, and David Pinto ([link to PDF](http://www.scielo.org.mx/pdf/cys/v18n3/v18n3a7.pdf)).\n", "\n", - "In this tutorial, we will learn how to use Gensim's SCM functionality, which consists of the `softcossim` method for distance computation, and the `SoftCosineSimilarity` class for corpus based similarity queries.\n", + "In this tutorial, we will learn how to use Gensim's SCM functionality, which consists of the `softcossim` function for one-off computation, and the `SoftCosineSimilarity` class for corpus-based similarity queries.\n", "\n", "> **Note**:\n", ">\n", @@ -29,9 +30,9 @@ ">\n", "\n", "## Running this notebook\n", - "You can download this [iPython Notebook](http://ipython.org/notebook.html), and run it on your own computer, provided you have installed Gensim, PyEMD, NLTK, Matplotlib, and downloaded the necessary data.\n", + "You can download this [Jupyter notebook](http://jupyter.org/), and run it on your own computer, provided you have installed the `gensim`, `jupyter`, `sklearn`, `pyemd`, `wmd`, and `wget` Python packages.\n", "\n", - "The notebook was run on an Ubuntu machine with an Intel core i7-6700HQ CPU 3.10GHz (4 cores) and 16 GB 
memory. Running the entire notebook on this machine takes about 6 minutes." + "The notebook was run on an Ubuntu machine with an Intel core i7-6700HQ CPU 3.10GHz (4 cores) and 16 GB memory. Assuming all resources required by the notebook have already been downloaded, running the entire notebook on this machine takes about 30 minutes." ] }, { @@ -42,7 +43,7 @@ "source": [ "# Initialize logging.\n", "import logging\n", - "logging.basicConfig(format='%(asctime)s | %(levelname)s : %(message)s')" + "logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)" ] }, { @@ -51,9 +52,11 @@ "source": [ "## Part 1: Computing the Soft Cosine Measure\n", "\n", - "To use SCM, we need some word embeddings first of all. You could train a word2vec (see tutorial [here](http://rare-technologies.com/word2vec-tutorial/)) model on some corpus, but we will start by downloading some pre-trained word2vec embeddings. Download the GoogleNews-vectors-negative300.bin.gz embeddings [here](https://code.google.com/archive/p/word2vec/) (warning: 1.5 GB, file is not needed for part 2). Training your own embeddings can be beneficial, but to simplify this tutorial, we will be using pre-trained embeddings at first.\n", + "To use SCM, we need some word embeddings first of all. You could train a [word2vec][] (see tutorial [here](http://rare-technologies.com/word2vec-tutorial/)) model on some corpus, but we will use pre-trained word2vec embeddings.\n", "\n", - "Let's take some sentences to compute the similarity between." + "[word2vec]: https://radimrehurek.com/gensim/models/word2vec.html\n", + "\n", + "Let's create some sentences to compare." 
] }, { @@ -88,6 +91,13 @@ "[nltk_data] Downloading package stopwords to /home/witiko/nltk_data...\n", "[nltk_data] Package stopwords is already up-to-date!\n" ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2018-02-05 10:47:42,975 : INFO : built Dictionary(11 unique tokens: ['president', 'fruit', 'greets', 'obama', 'illinois']...) from 3 documents (total 11 corpus positions)\n" + ] } ], "source": [ @@ -118,7 +128,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Now, as mentioned earlier, we will be using some downloaded pre-trained embeddings. We load these into a Gensim Word2Vec model class. Note that the embeddings we have chosen here require a lot of memory. We will use the embeddings to construct a term similarity matrix that will be used by the `softcossim` method." + "Now, as we mentioned earlier, we will be using some downloaded pre-trained embeddings. Note that the embeddings we have chosen here require a lot of memory. We will use the embeddings to construct a term similarity matrix that will be used by the `softcossim` function." 
] }, { @@ -126,37 +136,40 @@ "execution_count": 4, "metadata": {}, "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2018-02-05 10:49:29,393 : INFO : constructed a term similarity matrix with 91.735537 % nonzero elements\n" + ] + }, { "name": "stdout", "output_type": "stream", "text": [ - "Cell took 107.69 seconds to run.\n" + "CPU times: user 1min 39s, sys: 3.06 s, total: 1min 42s\n", + "Wall time: 1min 47s\n" ] } ], "source": [ "%%time\n", - "import os\n", - "\n", - "from gensim.models import KeyedVectors\n", - "if not os.path.exists('/data/GoogleNews-vectors-negative300.bin.gz'):\n", - " raise ValueError(\"SKIP: You need to download the google news model\")\n", - " \n", - "model = KeyedVectors.load_word2vec_format('/data/GoogleNews-vectors-negative300.bin.gz', binary=True)\n", - "similarity_matrix = model.similarity_matrix(dictionary)\n", - "del model" + "import gensim.downloader\n", + "\n", + "w2v_model = gensim.downloader.load(\"word2vec-google-news-300\")\n", + "similarity_matrix = w2v_model.similarity_matrix(dictionary)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "So let's compute SCM using the `softcossim` method." + "So let's compute SCM using the `softcossim` function." ] }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 5, "metadata": {}, "outputs": [ { @@ -169,6 +182,7 @@ ], "source": [ "from gensim.matutils import softcossim\n", + "\n", "similarity = softcossim(sentence_obama, sentence_president, similarity_matrix)\n", "print('similarity = %.4f' % similarity)" ] @@ -182,7 +196,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 6, "metadata": {}, "outputs": [ { @@ -203,359 +217,352 @@ "metadata": {}, "source": [ "## Part 2: Similarity queries using `SoftCosineSimilarity`\n", + "You can use SCM to get the most similar documents to a query, using the SoftCosineSimilarity class. 
Its interface is similar to what is described in the [Similarity Queries](https://radimrehurek.com/gensim/tut3.html) Gensim tutorial.\n", "\n", - "You can use SCM to get the most similar documents to a query, using the `SoftCosineSimilarity` class. Its interface is similar to what is described in the [Similarity Queries](https://radimrehurek.com/gensim/tut3.html) Gensim tutorial.\n", + "### Qatar Living unannotated dataset\n", + "Contestants solving the community question answering task in the [SemEval 2016][semeval16] and [2017][semeval17] competitions had an unannotated dataset of 189,941 questions and 1,894,456 comments from the [Qatar Living][ql] discussion forums. As our first step, we will use the same dataset to build a corpus.\n", "\n", - "### Yelp data\n", - "\n", - "Let's try similarity queries using some real world data. For that we'll be using Yelp reviews, available at http://www.yelp.com/dataset_challenge. Specifically, we will be using reviews of a single restaurant, namely the [Mon Ami Gabi](http://en.yelp.be/biz/mon-ami-gabi-las-vegas-2).\n", - "\n", - "To get the Yelp data, you need to register by name and email address. The data is 3.6 GB.\n", - "\n", - "This time around, we are going to train the Word2Vec embeddings on the data ourselves. One restaurant is not enough to train Word2Vec properly, so we use 6 restaurants for that, but only run queries against one of them. In addition to the Mon Ami Gabi, mentioned above, we will be using:\n", - "\n", - "* [Earl of Sandwich](http://en.yelp.be/biz/earl-of-sandwich-las-vegas).\n", - "* [Wicked Spoon](http://en.yelp.be/biz/wicked-spoon-las-vegas).\n", - "* [Serendipity 3](http://en.yelp.be/biz/serendipity-3-las-vegas).\n", - "* [Bacchanal Buffet](http://en.yelp.be/biz/bacchanal-buffet-las-vegas-7).\n", - "* [The Buffet](http://en.yelp.be/biz/the-buffet-las-vegas-6).\n", - "\n", - "The restaurants we chose were those with the highest number of reviews in the Yelp dataset. 
Incidentally, they all are on the Las Vegas Boulevard. The corpus we trained Word2Vec on has 27028 documents (reviews), and the corpus we used for `SoftCosineSimilarity` has 6978 documents.\n", - "\n", - "Below a JSON file with Yelp reviews is read line by line, the text is extracted, tokenized, and stopwords and punctuation are removed.\n" + "[semeval16]: http://alt.qcri.org/semeval2016/task3/\n", + "[semeval17]: http://alt.qcri.org/semeval2017/task3/\n", + "[ql]: http://www.qatarliving.com/forum" ] }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "[nltk_data] Downloading package punkt to /home/witiko/nltk_data...\n", - "[nltk_data] Package punkt is already up-to-date!\n", "[nltk_data] Downloading package stopwords to /home/witiko/nltk_data...\n", - "[nltk_data] Package stopwords is already up-to-date!\n" - ] - } - ], - "source": [ - "# Pre-processing a document.\n", - "from nltk.corpus import stopwords\n", - "from nltk import download, word_tokenize\n", - "download('punkt') # Download data for tokenizer.\n", - "download('stopwords') # Download stopwords list.\n", - "stop_words = stopwords.words('english')\n", - "\n", - "def preprocess(doc):\n", - " doc = doc.lower() # Lower the text.\n", - " doc = word_tokenize(doc) # Split into words.\n", - " doc = [w for w in doc if not w in stop_words] # Remove stopwords.\n", - " doc = [w for w in doc if w.isalpha()] # Remove numbers and punctuation.\n", - " return doc" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Cell took 103.94 seconds to run.\n" + "[nltk_data] Package stopwords is already up-to-date!\n", + "Number of documents: 3\n", + "CPU times: user 1min 59s, sys: 6.06 s, total: 2min 5s\n", + "Wall time: 2min 22s\n" ] } ], "source": [ "%%time\n", + "from itertools import chain\n", "import json\n", + 
"import gzip\n", + "from re import sub\n", + "from os.path import isfile\n", "\n", - "# Business IDs of the restaurants.\n", - "ids = ['4JNXUYY8wbaaDmk3BPzlWw', # Mon Ami Gabi\n", - " 'Ffhe2cmRyloz3CCdRGvHtA', # Earl of Sandwich\n", - " 'K7lWdNUhCbcnEvI0NhGewg', # Wicked Spoon\n", - " 'eoHdUeQDNgQ6WYEnP2aiRw', # Serendipity 3\n", - " 'RESDUcs7fIiihp38-d6_6g', # Bacchanal Buffet\n", - " '2weQS-RnoOBhb1KsHKyoSQ'] # The Buffet\n", - "\n", - "w2v_corpus = [] # Documents to train word2vec on (all 6 restaurants).\n", - "scm_corpus = [] # Documents to run queries against (only one restaurant).\n", - "documents = [] # scm_corpus, with no pre-processing (so we can see the original documents).\n", - "with open('/data/review.json') as data_file:\n", - " for line in data_file:\n", - " json_line = json.loads(line)\n", - " \n", - " if json_line['business_id'] not in ids:\n", - " # Not one of the 6 restaurants.\n", - " continue\n", - " \n", - " # Pre-process document.\n", - " text = json_line['text'] # Extract text from JSON object.\n", - " text = preprocess(text)\n", - " \n", - " # Add to corpus for training Word2Vec.\n", - " w2v_corpus.append(text)\n", - " \n", - " if json_line['business_id'] == ids[0]:\n", - " # Add to corpus for similarity queries.\n", - " scm_corpus.append(text)\n", - " documents.append(json_line['text'])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Below is a plot with a histogram of document lengths and includes the average document length as well. Note that these are the pre-processed documents, meaning stopwords are removed, punctuation is removed, etc. Document lengths have a high impact on the running time of SCM, so when comparing running times with this experiment, the number of documents in query corpus (about 7000) and the length of the documents (about 59 words on average) should be taken into account." 
- ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "\n", - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "%%time\n", - "from matplotlib import cycler, pyplot as plt\n", - "%matplotlib inline\n", - "\n", - "# Document lengths.\n", - "lens = [len(doc) for doc in scm_corpus]\n", - "\n", - "# Plot.\n", - "plt.rc('figure', figsize=(8,6))\n", - "plt.rc('font', size=14)\n", - "plt.rc('lines', linewidth=2)\n", - "plt.rc('axes', prop_cycle=cycler('color', ('#377eb8','#e41a1c','#4daf4a',\n", - " '#984ea3','#ff7f00','#ffff33')))\n", - "# Histogram.\n", - "plt.hist(lens, bins=20, edgecolor=\"k\")\n", - "# Average length.\n", - "avg_len = sum(lens) / float(len(lens))\n", - "plt.axvline(avg_len, color='#e41a1c')\n", - "plt.title('Histogram of document lengths.')\n", - "plt.xlabel('Length')\n", - "plt.xlim((0, 450))\n", - "plt.text(100, 800, 'mean = %.2f' % avg_len)\n", - "plt.show()" + "from gensim.utils import simple_preprocess\n", + "from nltk.corpus import stopwords\n", + "from nltk import download\n", + "import wget\n", + "\n", + "download(\"stopwords\") # Download stopwords list.\n", + "stopwords = set(stopwords.words(\"english\"))\n", + "\n", + "def preprocess(doc):\n", + " doc = sub(r'<img[^<>]+(>|$)', \" image_token \", doc)\n", + " doc = sub(r'<[^<>]+(>|$)', \" \", doc)\n", + " doc = sub(r'\\[img_assist[^]]*?\\]', \" \", doc)\n", + " doc = sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', \" url_token \", doc)\n", + " return [token for token in simple_preprocess(doc, min_len=0, max_len=float(\"inf\")) if token not in stopwords]\n", + "\n", + "if not isfile(\"semeval-2016_2017-task3-subtaskA-unannotated-english.json.gz\"): # TODO: Replace with a gensim-data call.\n", + "
 wget.download(\"https://github.com/Witiko/semeval-2016_2017-task3-subtaskA-unannotated-english/releases/download/2018-01-29/semeval-2016_2017-task3-subtaskA-unannotated-english.json.gz\")\n", + "with gzip.open(\"semeval-2016_2017-task3-subtaskA-unannotated-english.json.gz\", \"rt\") as json_file:\n", + " json_data = json.loads(json_file.read())\n", + " corpus = list(chain(*[\n", + " chain(\n", + " [preprocess(thread[\"RelQuestion\"][\"RelQSubject\"]), preprocess(thread[\"RelQuestion\"][\"RelQBody\"])],\n", + " [preprocess(relcomment[\"RelCText\"]) for relcomment in thread[\"RelComments\"]])\n", + " for thread in json_data]))\n", + "\n", + "print(\"Number of documents: %d\" % len(corpus))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Now we want to initialize the similarity class with a corpus and a word2vec model (which provides the embeddings and the `softcossim` method itself)." + "Using the corpus we have just built, we will now construct a [dictionary][], a [TF-IDF model][tfidf], a [word2vec model][word2vec], and a term similarity matrix.\n", + "\n", + "[dictionary]: https://radimrehurek.com/gensim/corpora/dictionary.html\n", + "[tfidf]: https://radimrehurek.com/gensim/models/tfidfmodel.html\n", + "[word2vec]: https://radimrehurek.com/gensim/models/word2vec.html" ] }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 8, "metadata": { - "scrolled": false + "scrolled": true }, "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2018-02-05 10:52:53,477 : INFO : built Dictionary(462807 unique tokens: ['reclarify', 'depeneded', 'autralia', 'cloudnight', 'openmoko']...) 
from 2274338 documents (total 40096354 corpus positions)\n", + "2018-02-05 10:56:50,633 : INFO : training on a 200481770 raw words (192577574 effective words) took 224.3s, 858402 effective words/s\n", + "2018-02-05 11:13:14,895 : INFO : constructed a term similarity matrix with 0.003564 % nonzero elements\n" + ] + }, { "name": "stdout", "output_type": "stream", "text": [ - "Cell took 41.35 seconds to run.\n" + "Number of unique words: 462807\n", + "CPU times: user 1h 2min 21s, sys: 12min 56s, total: 1h 15min 17s\n", + "Wall time: 21min 27s\n" ] } ], "source": [ "%%time\n", + "from gensim.corpora import Dictionary\n", + "from gensim.models import TfidfModel\n", "from gensim.models import Word2Vec\n", + "from multiprocessing import cpu_count\n", "\n", - "# Train Word2Vec on all the restaurants.\n", - "model = Word2Vec(w2v_corpus, workers=3, size=100)\n", + "dictionary = Dictionary(corpus)\n", + "tfidf = TfidfModel(dictionary=dictionary)\n", + "w2v_model = Word2Vec(corpus, workers=cpu_count(), min_count=5, size=300, seed=12345)\n", + "similarity_matrix = w2v_model.wv.similarity_matrix(dictionary, tfidf, nonzero_limit=100)\n", "\n", - "# Initialize SoftCosineSimilarity.\n", - "from gensim import corpora\n", - "from gensim.similarities import SoftCosineSimilarity\n", - "num_best = 10\n", - "dictionary = corpora.Dictionary(scm_corpus)\n", - "scm_corpus = [dictionary.doc2bow(document) for document in scm_corpus]\n", - "similarity_matrix = model.wv.similarity_matrix(dictionary)\n", - "instance = SoftCosineSimilarity(scm_corpus, similarity_matrix, num_best=num_best)" + "print(\"Number of unique words: %d\" % len(dictionary))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "The `num_best` parameter decides how many results the queries return. Now let's try making a query. 
The output is a list of indeces and similarities of documents in the corpus, sorted by similarity.\n", - "\n", - "Note that the output format is slightly different when `num_best` is `None` (i.e. not assigned). In this case, you get an array of similarities, corresponding to each of the documents in the corpus.\n", - "\n", - "The query below is taken directly from one of the reviews in the corpus. Let's see if there are other reviews that are similar to this one." + "### Evaluation\n", + "Next, we will load the validation and test datasets that were used by the SemEval 2016 and 2017 contestants. The datasets contain 208 original questions posted by the forum members. For each question, there is a list of 10 threads with a human annotation denoting whether or not the thread is relevant to the original question. Our task will be to order the threads so that relevant threads rank above irrelevant threads." ] }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 9, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Cell took 47.43 seconds to run.\n" - ] - } - ], + "outputs": [], "source": [ - "%%time\n", - "sent = 'Yummy! Great view of the Bellagio Fountain show.'\n", - "query = dictionary.doc2bow(preprocess(sent))\n", - "\n", - "sims = instance[query] # A query is simply a \"look-up\" in the similarity class." + "# TODO: Replace with a gensim-data call.\n", + "if not isfile(\"semeval-2016_2017-task3-subtaskB-english.json.gz\"):\n", + " wget.download(\"https://github.com/Witiko/semeval-2016_2017-task3-subtaskB-english/releases/download/2018-01-29/semeval-2016_2017-task3-subtaskB-english.json.gz\")\n", + "with gzip.open(\"semeval-2016_2017-task3-subtaskB-english.json.gz\", \"rt\") as json_file:\n", + " datasets = json.loads(json_file.read())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "The query and the most similar documents, together with the similarities, are printed below. 
We see that the retrieved documents are discussing the same thing as the query, although using different words. The query talks about the food being \"yummy\", while the second best result talk about it being \"good\"." + "Finally, we will perform an evaluation to compare three unsupervised similarity measures – the Soft Cosine Measure, two different implementations of the [Word Mover's Distance][wmd], and standard cosine similarity. We will use the [Mean Average Precision (MAP)][map] as an evaluation measure and 10-fold cross-validation to get an estimate of the variance of MAP for each similarity measure.\n", + "\n", + "[wmd]: http://vene.ro/blog/word-movers-distance-in-python.html\n", + "[map]: https://medium.com/@pds.bangalore/mean-average-precision-abd77d0b9a7e" ] }, { "cell_type": "code", - "execution_count": 13, + "execution_count": 10, "metadata": {}, + "outputs": [], + "source": [ + "from math import isnan\n", + "from time import time\n", + "\n", + "from gensim.similarities import MatrixSimilarity, WmdSimilarity, SoftCosineSimilarity\n", + "import numpy as np\n", + "from sklearn.model_selection import KFold\n", + "from wmd import WMD\n", + "\n", + "def produce_test_data(dataset):\n", + " for orgquestion in datasets[dataset]:\n", + " query = preprocess(orgquestion[\"OrgQSubject\"]) + preprocess(orgquestion[\"OrgQBody\"])\n", + " documents = [\n", + " preprocess(thread[\"RelQuestion\"][\"RelQSubject\"]) + preprocess(thread[\"RelQuestion\"][\"RelQBody\"])\n", + " for thread in orgquestion[\"Threads\"]]\n", + " relevance = [\n", + " thread[\"RelQuestion\"][\"RELQ_RELEVANCE2ORGQ\"] in (\"PerfectMatch\", \"Relevant\")\n", + " for thread in orgquestion[\"Threads\"]]\n", + " yield query, documents, relevance\n", + "\n", + "def cossim(query, documents):\n", + " # Compute cosine similarity between the query and the documents.\n", + " query = tfidf[dictionary.doc2bow(query)]\n", + " index = MatrixSimilarity(\n", + " tfidf[[dictionary.doc2bow(document) for document in 
documents]],\n", + " num_features=len(dictionary))\n", + " similarities = index[query]\n", + " return similarities\n", + "\n", + "def softcossim(query, documents):\n", + " # Compute Soft Cosine Measure between the query and the documents.\n", + " query = tfidf[dictionary.doc2bow(query)]\n", + " index = SoftCosineSimilarity(\n", + " tfidf[[dictionary.doc2bow(document) for document in documents]],\n", + " similarity_matrix)\n", + " similarities = index[query]\n", + " return similarities\n", + "\n", + "def wmd_gensim(query, documents):\n", + " # Compute Word Mover's Distance as implemented in PyEMD by William Mayner\n", + " # between the query and the documents.\n", + " index = WmdSimilarity(documents, w2v_model)\n", + " similarities = index[query]\n", + " return similarities\n", + "\n", + "def wmd_relax(query, documents):\n", + " # Compute Word Mover's Distance as implemented in WMD by Source{d}\n", + " # between the query and the documents.\n", + " words = [word for word in set(chain(query, *documents)) if word in w2v_model.wv]\n", + " indices, words = zip(*sorted((\n", + " (index, word) for (index, _), word in zip(dictionary.doc2bow(words), words))))\n", + " query = dict(tfidf[dictionary.doc2bow(query)])\n", + " query = [\n", + " (new_index, query[dict_index])\n", + " for new_index, dict_index in enumerate(indices)\n", + " if dict_index in query]\n", + " documents = [dict(tfidf[dictionary.doc2bow(document)]) for document in documents]\n", + " documents = [[\n", + " (new_index, document[dict_index])\n", + " for new_index, dict_index in enumerate(indices)\n", + " if dict_index in document] for document in documents]\n", + " embeddings = np.array([w2v_model.wv[word] for word in words], dtype=np.float32)\n", + " nbow = dict(((index, (None, *zip(*document))) for index, document in enumerate(documents)))\n", + " nbow[\"query\"] = (None, *zip(*query))\n", + " distances = WMD(embeddings, nbow, vocabulary_min=1).nearest_neighbors(\"query\")\n", + " similarities = [-distance 
for _, distance in sorted(distances)]\n", + " return similarities\n", + "\n", + "strategies = {\n", + " \"cossim\" : cossim,\n", + " \"softcossim\": softcossim,\n", + " \"wmd-gensim\": wmd_gensim,\n", + " \"wmd-relax\": wmd_relax}\n", + "\n", + "def evaluate(split, strategy):\n", + " # Perform a single round of evaluation.\n", + " results = []\n", + " start_time = time()\n", + " for query, documents, relevance in split:\n", + " similarities = strategies[strategy](query, documents)\n", + " assert len(similarities) == len(documents)\n", + " precision = [\n", + " (num_correct + 1) / (num_total + 1) for num_correct, num_total in enumerate(\n", + " num_total for num_total, (_, relevant) in enumerate(\n", + " sorted(zip(similarities, relevance), reverse=True)) if relevant)]\n", + " average_precision = np.mean(precision) if precision else 0.0\n", + " results.append(average_precision)\n", + " return (np.mean(results) * 100, time() - start_time)\n", + "\n", + "def crossvalidate(args):\n", + " # Perform a cross-validation.\n", + " dataset, strategy = args\n", + " test_data = np.array(list(produce_test_data(dataset)))\n", + " kf = KFold(n_splits=10)\n", + " samples = []\n", + " for _, test_index in kf.split(test_data):\n", + " samples.append(evaluate(test_data[test_index], strategy))\n", + " return (np.mean(samples, axis=0), np.std(samples, axis=0))" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "scrolled": true + }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Query:\n", - "Yummy! Great view of the Bellagio Fountain show.\n", - "\n", - "sim = 1.0000\n", - "Yummy! Great view of the Bellagio Fountain show.\n", - "\n", - "sim = 0.8114\n", - "Food was good. Awesome service. Great view of the water show at the Bellagio\n", - "\n", - "sim = 0.7813\n", - "This is a great place to eat after a show. Great atmosphere with the Bellagio Fountain across the street. 
The food is really good.\n", - "\n", - "sim = 0.7719\n", - "Love this place! It has a great atmosphere, the food is consistently good, if you sit on the patio you can watch the Fountain show of the Bellagio.\n", - "\n", - "sim = 0.7680\n", - "Solid food; great service. Beautiful view of the Bellagio fountains across the street.\n", - "\n", - "sim = 0.7627\n", - "Nice French food with a great view if the Bellagio fountains\n", - "\n", - "sim = 0.7597\n", - "Great environment, great service and great food with relatively affordable price! What can be better than enjoying a glass of sweet Frangria under the sun while watching the fountain show at the Bellagio right across the street during your vacay?\n", - "\n", - "sim = 0.7585\n", - "Amazing food, amazing service, great view of Bellagio fountains\n", - "\n", - "sim = 0.7569\n", - "Consistently good food with a view of the fountains at bellagio.\n", - "\n", - "sim = 0.7565\n", - "Great food with a great view! Time it right with the bellagio fountains!\n" + "CPU times: user 1.49 s, sys: 1.28 s, total: 2.77 s\n", + "Wall time: 1min 42s\n" ] } ], "source": [ - "# Print the query and the retrieved documents, together with their similarities.\n", - "print('Query:')\n", - "print(sent)\n", - "for i in range(num_best):\n", - " print()\n", - " print('sim = %.4f' % sims[i][1])\n", - " print(documents[sims[i][0]])" + "%%time\n", + "from multiprocessing import Pool\n", + "\n", + "args_list = [\n", + " (dataset, technique)\n", + " for dataset in (\"2016-test\", \"2017-test\")\n", + " for technique in (\"softcossim\", \"wmd-gensim\", \"wmd-relax\", \"cossim\")]\n", + "with Pool() as pool:\n", + " results = pool.map(crossvalidate, args_list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Let us now remove the word \"yummy\" from the query. We can see that\n", - "\n", - "> Food was good. Awesome service. 
Great view of the water show at the Bellagio\n", - "\n", - "drops from the second to the seventh place even though it does not actually contain the word \"yummy\"." + "The table below shows the point estimates of means and standard deviations for MAP scores and elapsed times. Baselines and winners for each year are displayed in bold. We can see that the Soft Cosine Measure gives a strong performance on both the 2016 and the 2017 dataset." ] }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 12, "metadata": {}, "outputs": [ { - "name": "stdout", - "output_type": "stream", - "text": [ - "Query:\n", - "Great view of the Bellagio Fountain show.\n", - "\n", - "sim = 0.9591\n", - "Yummy! Great view of the Bellagio Fountain show.\n", - "\n", - "sim = 0.8020\n", - "Solid food; great service. Beautiful view of the Bellagio fountains across the street.\n", - "\n", - "sim = 0.7899\n", - "Nice French food with a great view if the Bellagio fountains\n", - "\n", - "sim = 0.7820\n", - "Food was good. Awesome service. Great view of the water show at the Bellagio\n", - "\n", - "sim = 0.7797\n", - "This is a great place to eat after a show. Great atmosphere with the Bellagio Fountain across the street. The food is really good.\n", - "\n", - "sim = 0.7684\n", - "Love this place! It has a great atmosphere, the food is consistently good, if you sit on the patio you can watch the Fountain show of the Bellagio.\n", - "\n", - "sim = 0.7648\n", - "Consistently good food with a view of the fountains at bellagio.\n", - "\n", - "sim = 0.7641\n", - "Great food with a great view! Time it right with the bellagio fountains!\n", - "\n", - "sim = 0.7631\n", - "Great environment, great service and great food with relatively affordable price! 
What can be better than enjoying a glass of sweet Frangria under the sun while watching the fountain show at the Bellagio right across the street during your vacay?\n", - "\n", - "sim = 0.7519\n", - "They have very unique thin steaks that have great flavor. Directly across from the fountains at the Bellagio so you get a great view with dinner as well.\n", - "Cell took 46.75 seconds to run.\n" - ] + "data": { + "text/markdown": [ + "\n", + "\n", + "Dataset | Strategy | MAP score | Elapsed time (sec)\n", + ":---|:---|:---|---:\n", + "2016-test|softcossim|77.29 ±10.35|0.20 ±0.06\n", + "2016-test|**Winner (UH-PRHLT-primary)**|76.70 ±0.00|\n", + "2016-test|cossim|76.45 ±10.40|0.48 ±0.07\n", + "2016-test|wmd-gensim|76.07 ±11.52|8.36 ±2.05\n", + "2016-test|**Baseline 1 (IR)**|74.75 ±0.00|\n", + "2016-test|wmd-relax|73.01 ±10.33|0.97 ±0.16\n", + "2016-test|**Baseline 2 (random)**|46.98 ±0.00|\n", + "\n", + "\n", + "Dataset | Strategy | MAP score | Elapsed time (sec)\n", + ":---|:---|:---|---:\n", + "2017-test|**Winner (SimBow-primary)**|47.22 ±0.00|\n", + "2017-test|softcossim|46.06 ±18.00|0.15 ±0.03\n", + "2017-test|cossim|44.38 ±14.71|0.43 ±0.07\n", + "2017-test|wmd-gensim|44.20 ±16.02|9.78 ±1.80\n", + "2017-test|**Baseline 1 (IR)**|41.85 ±0.00|\n", + "2017-test|wmd-relax|41.24 ±14.87|1.00 ±0.26\n", + "2017-test|**Baseline 2 (random)**|29.81 ±0.00|" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" } ], "source": [ - "%%time\n", - "sent = 'Great view of the Bellagio Fountain show.'\n", - "query = dictionary.doc2bow(preprocess(sent))\n", - "\n", - "sims = instance[query] # A query is simply a \"look-up\" in the similarity class.\n", - "\n", - "print('Query:')\n", - "print(sent)\n", - "for i in range(num_best):\n", - " print()\n", - " print('sim = %.4f' % sims[i][1])\n", - " print(documents[sims[i][0]])" + "from IPython.display import display, Markdown\n", + "\n", + "output = []\n", + "baselines = [\n", + " ((\"2016-test\", \"**Winner 
(UH-PRHLT-primary)**\"), ((76.70, 0), (0, 0))),\n", + " ((\"2016-test\", \"**Baseline 1 (IR)**\"), ((74.75, 0), (0, 0))),\n", + " ((\"2016-test\", \"**Baseline 2 (random)**\"), ((46.98, 0), (0, 0))),\n", + " ((\"2017-test\", \"**Winner (SimBow-primary)**\"), ((47.22, 0), (0, 0))),\n", + " ((\"2017-test\", \"**Baseline 1 (IR)**\"), ((41.85, 0), (0, 0))),\n", + " ((\"2017-test\", \"**Baseline 2 (random)**\"), ((29.81, 0), (0, 0)))]\n", + "table_header = [\"Dataset | Strategy | MAP score | Elapsed time (sec)\", \":---|:---|:---|---:\"]\n", + "for row, ((dataset, technique), ((mean_map_score, mean_duration), (std_map_score, std_duration))) \\\n", + " in enumerate(sorted(chain(zip(args_list, results), baselines), key=lambda x: (x[0][0], -x[1][0][0]))):\n", + " if row % (len(strategies) + 3) == 0:\n", + " output.extend(chain([\"\\n\"], table_header))\n", + " map_score = \"%.02f ±%.02f\" % (mean_map_score, std_map_score)\n", + " duration = \"%.02f ±%.02f\" % (mean_duration, std_duration) if mean_duration else \"\"\n", + " output.append(\"%s|%s|%s|%s\" % (dataset, technique, map_score, duration))\n", + "\n", + "display(Markdown('\\n'.join(output)))" ] }, { @@ -564,13 +571,9 @@ "source": [ "## References\n", "\n", - "1. Grigori Sidorov et al. [*Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model*][1], 2014.\n", - "* Delphine Charlet and Geraldine Damnati, [*SimBow at SemEval-2017 Task 3: Soft-Cosine Semantic Similarity between Questions for Community Question Answering*][2], 2017.\n", - "* Thomas Mikolov et al. 
[*Efficient Estimation of Word Representations in Vector Space*][3], 2013.\n", - "\n", - " [1]: http://www.scielo.org.mx/pdf/cys/v18n3/v18n3a7.pdf (Soft Measure and Soft Cosine Measure: Measure of Features in Vector Space Model)\n", - " [2]: http://www.aclweb.org/anthology/S17-2051 (Simbow at semeval-2017 task 3: Soft-cosine semantic measure between questions for community question answering)\n", - " [3]: https://github.com/witiko-masters-thesis/thesis/blob/master/main.pdf (Vector Space Representations in Information Retrieval)" + "1. Grigori Sidorov et al. *Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model*, 2014. ([link to PDF](http://www.scielo.org.mx/pdf/cys/v18n3/v18n3a7.pdf))\n", + "2. Delphine Charlet and Geraldine Damnati. *SimBow at SemEval-2017 Task 3: Soft-Cosine Semantic Similarity between Questions for Community Question Answering*, 2017. ([link to PDF](http://www.aclweb.org/anthology/S17-2051))\n", + "3. Tomas Mikolov et al. *Efficient Estimation of Word Representations in Vector Space*, 2013. ([link to PDF](https://arxiv.org/pdf/1301.3781.pdf))" ] } ],