Add model_to_dict one-liner to word2vec notebook. Fix #1269 (#1776)

* Add model to dict method
* Add documentation and one-liner code
* Add benchmark
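
The caching pattern this commit adds to the notebook can be sketched without gensim. This is a minimal illustration, not the notebook's code: `slow_most_similar` is a hypothetical stand-in for `model.wv.most_similar`, and `cached_most_similar` and `calls` are names invented here to make cache hits observable.

```python
# Hypothetical stand-in for model.wv.most_similar(word); it only counts calls
# and returns a dummy neighbour list so the example runs without a model.
calls = {"n": 0}


def slow_most_similar(word):
    calls["n"] += 1
    return [(word + "_neighbour", 1.0)]


cache = {}


def cached_most_similar(word):
    # Check the cache for the query word itself, not for a fixed key.
    if word not in cache:
        cache[word] = slow_most_similar(word)  # miss: query once and store
    return cache[word]  # hit: served straight from the dict


words = ['voted', 'few', 'their', 'around']
first_pass = [cached_most_similar(w) for w in words]   # four misses
second_pass = [cached_most_similar(w) for w in words]  # four hits
print(calls["n"])
```

The second pass never reaches the stand-in function, which is the effect the benchmark cells in the diff try to measure.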
piskvorky · Dec 12, 2017 · bf1b865
1 parent 6248d33
commit bf1b865
Showing 1 changed file with 102 additions and 1 deletion.
diff --git a/docs/notebooks/word2vec.ipynb b/docs/notebooks/word2vec.ipynb
@@ -1295,6 +1295,107 @@
     "print(train_times_table)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Adding a Word2Vec \"model to dict\" method to the production pipeline\n",
+    "Suppose we want to improve query performance further in production.\n",
+    "One good way is to cache the most similar words for each query word in a dictionary.\n",
+    "The next time the same query word comes in, we look it up in the dict first.\n",
+    "On a hit, we return the result directly from the dictionary.\n",
+    "Otherwise, we query the model and cache the result so that the next lookup for that word is a hit."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Precompute the most similar words for every word in the vocabulary\n",
+    "most_similars_precalc = {word: model.wv.most_similar(word) for word in model.wv.index2word}\n",
+    "print(most_similars_precalc)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Comparison with and without caching"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "For the time being, let's take four arbitrary words."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import time\n",
+    "words = ['voted','few','their','around']"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Without caching"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "start = time.time()\n",
+    "for word in words:\n",
+    "    result = model.wv.most_similar(word)\n",
+    "    print(result)\n",
+    "end = time.time()\n",
+    "print(end-start)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now with caching"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "start = time.time()\n",
+    "for word in words:\n",
+    "    if word in most_similars_precalc:\n",
+    "        result = most_similars_precalc[word]\n",
+    "        print(result)\n",
+    "    else:\n",
+    "        result = model.wv.most_similar(word)\n",
+    "        most_similars_precalc[word] = result\n",
+    "        print(result)\n",
+    "    \n",
+    "end = time.time()\n",
+    "print(end-start)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The cached lookups should be faster, and the difference will grow as more query words are considered."
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -1336,7 +1437,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython2",
-   "version": "2.7.13"
+   "version": "2.7.10"
   }
  },
  "nbformat": 4,