diff --git a/docs/notebooks/doc2vec-wikipedia.ipynb b/docs/notebooks/doc2vec-wikipedia.ipynb
new file mode 100644
index 0000000000..c5cd3e70f5
--- /dev/null
+++ b/docs/notebooks/doc2vec-wikipedia.ipynb
@@ -0,0 +1,472 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Doc2Vec on Wikipedia articles"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This notebook replicates **Document Embedding with Paragraph Vectors** (http://arxiv.org/abs/1507.07998).\n",
+    "The paper reports results only for the DBOW architecture on Wikipedia data, so here we run the experiment with both DBOW and DM."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Basic Setup"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's import the Doc2Vec module and the other modules we need."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "from gensim.corpora.wikicorpus import WikiCorpus\n",
+    "from gensim.models.doc2vec import Doc2Vec, TaggedDocument\n",
+    "from pprint import pprint\n",
+    "import multiprocessing"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Preparing the corpus"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "First, download the dump of all Wikipedia articles from [here](http://download.wikimedia.org/enwiki/) (you want the file enwiki-latest-pages-articles.xml.bz2, or enwiki-YYYYMMDD-pages-articles.xml.bz2 for date-specific dumps).\n",
+    "\n",
+    "Second, wrap the dump in a WikiCorpus, which builds a corpus from a Wikipedia (or other MediaWiki-based) database dump.\n",
+    "\n",
+    "For more details on WikiCorpus, see [Corpus from a Wikipedia dump](https://radimrehurek.com/gensim/corpora/wikicorpus.html)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "wiki = WikiCorpus(\"enwiki-latest-pages-articles.xml.bz2\")\n",
+    "#wiki = WikiCorpus(\"enwiki-YYYYMMDD-pages-articles.xml.bz2\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Define a **TaggedWikiDocument** class that converts the WikiCorpus into the form Doc2Vec expects: a stream of TaggedDocument objects, each tagged with its article title."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "class TaggedWikiDocument(object):\n",
+    "    def __init__(self, wiki):\n",
+    "        self.wiki = wiki\n",
+    "        self.wiki.metadata = True\n",
+    "\n",
+    "    def __iter__(self):\n",
+    "        for content, (page_id, title) in self.wiki.get_texts():\n",
+    "            yield TaggedDocument([c.decode(\"utf-8\") for c in content], [title])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "documents = TaggedWikiDocument(wiki)"
+   ]
+  },
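+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Before scanning the whole dump, it can help to peek at a single streamed document to confirm the corpus is wired up correctly. This is only a sketch: it assumes the dump file is present, and it stops after the first article because iterating the full dump is slow."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# Sanity check: the tag of each TaggedDocument is the article title,\n",
+    "# and the words are the tokens produced by WikiCorpus.\n",
+    "first_doc = next(iter(documents))\n",
+    "print(first_doc.tags, first_doc.words[:10])"
+   ]
+  },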
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Preprocessing\n",
+    "To match the vocabulary size used in the original paper, we first work out which **min_count** value to use."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "pre = Doc2Vec(min_count=0)\n",
+    "pre.scan_vocab(documents)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "min_count: 0, size of vocab: 8545782.0\n",
+      "min_count: 1, size of vocab: 8545782.0\n",
+      "min_count: 2, size of vocab: 4227783.0\n",
+      "min_count: 3, size of vocab: 3008772.0\n",
+      "min_count: 4, size of vocab: 2439367.0\n",
+      "min_count: 5, size of vocab: 2090709.0\n",
+      "min_count: 6, size of vocab: 1856609.0\n",
+      "min_count: 7, size of vocab: 1681670.0\n",
+      "min_count: 8, size of vocab: 1546914.0\n",
+      "min_count: 9, size of vocab: 1437367.0\n",
+      "min_count: 10, size of vocab: 1346177.0\n",
+      "min_count: 11, size of vocab: 1267916.0\n",
+      "min_count: 12, size of vocab: 1201186.0\n",
+      "min_count: 13, size of vocab: 1142377.0\n",
+      "min_count: 14, size of vocab: 1090673.0\n",
+      "min_count: 15, size of vocab: 1043973.0\n",
+      "min_count: 16, size of vocab: 1002395.0\n",
+      "min_count: 17, size of vocab: 964684.0\n",
+      "min_count: 18, size of vocab: 930382.0\n",
+      "min_count: 19, size of vocab: 898725.0\n"
+     ]
+    }
+   ],
+   "source": [
+    "# scale_vocab() reports estimated memory use; its 'vocab' entry assumes roughly\n",
+    "# 700 bytes per word, so dividing by 700 recovers the vocabulary size.\n",
+    "for num in range(0, 20):\n",
+    "    print('min_count: {}, size of vocab: '.format(num), pre.scale_vocab(min_count=num, dry_run=True)['memory']['vocab']/700)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The original paper uses a vocabulary of 915,715 words. We get a similarly sized vocabulary (898,725 words) by setting **min_count** = 19."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Training the Doc2Vec Model\n",
+    "To train Doc2Vec with the two architectures, DBOW and DM, we define a list containing one model of each kind."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {
+    "collapsed": false,
+    "scrolled": false
+   },
+   "outputs": [],
+   "source": [
+    "cores = multiprocessing.cpu_count()\n",
+    "\n",
+    "models = [\n",
+    "    # PV-DBOW\n",
+    "    Doc2Vec(dm=0, dbow_words=1, size=200, window=8, min_count=19, iter=10, workers=cores),\n",
+    "    # PV-DM w/average\n",
+    "    Doc2Vec(dm=1, dm_mean=1, size=200, window=8, min_count=19, iter=10, workers=cores),\n",
+    "]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Doc2Vec(dbow+w,d200,hs,w8,mc19,t8)\n",
+      "Doc2Vec(dm/m,d200,hs,w8,mc19,t8)\n"
+     ]
+    }
+   ],
+   "source": [
+    "models[0].build_vocab(documents)\n",
+    "print(str(models[0]))\n",
+    "\n",
+    "# reset_from() re-uses the vocabulary scanned by the first model, so the\n",
+    "# expensive corpus scan only has to happen once.\n",
+    "models[1].reset_from(models[0])\n",
+    "print(str(models[1]))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now we’re ready to train Doc2Vec on the English Wikipedia."
+   ]
+  },
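+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Training over the full English Wikipedia takes many hours, so it is worth enabling gensim's INFO-level logging first to monitor progress. This step is optional; a minimal sketch using only the standard library:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "import logging\n",
+    "\n",
+    "# gensim logs training progress (vocabulary scans, epochs, words/sec) at INFO level.\n",
+    "logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)"
+   ]
+  },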
" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "collapsed": false, + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 5d 18h 24min 30s, sys: 26min 6s, total: 5d 18h 50min 36s\n", + "Wall time: 1d 2h 58min 58s\n", + "CPU times: user 1d 1h 28min 2s, sys: 33min 15s, total: 1d 2h 1min 18s\n", + "Wall time: 15h 27min 18s\n" + ] + } + ], + "source": [ + "for model in models:\n", + " %%time model.train(documents)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Similarity interface" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "After that, let's test both models! DBOW model show the simillar results with the original paper. First, calculating cosine simillarity of \"Machine learning\" using Paragraph Vector. Word Vector and Document Vector are separately stored. We have to add .docvecs after model name to extract Document Vector from Doc2Vec Model." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Doc2Vec(dbow+w,d200,hs,w8,mc19,t8)\n", + "[('Theoretical computer science', 0.7256590127944946),\n", + " ('Artificial neural network', 0.7162272930145264),\n", + " ('Pattern recognition', 0.6948175430297852),\n", + " ('Data mining', 0.6938608884811401),\n", + " ('Bayesian network', 0.6938260197639465),\n", + " ('Support vector machine', 0.6706081628799438),\n", + " ('Glossary of artificial intelligence', 0.670173704624176),\n", + " ('Computational learning theory', 0.6648679971694946),\n", + " ('Outline of computer science', 0.6638073921203613),\n", + " ('List of important publications in computer science', 0.663051187992096),\n", + " ('Mathematical optimization', 0.655048131942749),\n", + " ('Theory of computation', 0.6508707404136658),\n", + " ('Word-sense disambiguation', 0.6505812406539917),\n", + " ('Reinforcement learning', 0.6480429172515869),\n", + " (\"Solomonoff's theory of inductive inference\", 0.6459559202194214),\n", + " ('Computational intelligence', 0.6458009481430054),\n", + " ('Information visualization', 0.6437181234359741),\n", + " ('Algorithmic composition', 0.643247127532959),\n", + " ('Ray Solomonoff', 0.6425477862358093),\n", + " ('Kriging', 0.6425424814224243)]\n", + "Doc2Vec(dm/m,d200,hs,w8,mc19,t8)\n", + "[('Artificial neural network', 0.640324592590332),\n", + " ('Pattern recognition', 0.6244156360626221),\n", + " ('Data stream mining', 0.6140210032463074),\n", + " ('Theoretical computer science', 0.5964258909225464),\n", + " ('Outline of computer science', 0.5862746834754944),\n", + " ('Supervised learning', 0.5847170352935791),\n", + " ('Data mining', 0.5817658305168152),\n", + " ('Decision tree learning', 0.5785809755325317),\n", + " ('Bayesian network', 0.5768401622772217),\n", + " ('Computational intelligence', 0.5717238187789917),\n", + " ('Theory of computation', 0.5703311562538147),\n", + " ('Bayesian programming', 0.5693561434745789),\n", + " ('Reinforcement learning', 0.564978837966919),\n", + " ('Helmholtz machine', 0.564972460269928),\n", + " ('Inductive logic programming', 0.5631471276283264),\n", + " ('Algorithmic learning theory', 0.563083291053772),\n", + " ('Semi-supervised learning', 0.5628935694694519),\n", + " ('Early stopping', 0.5597405433654785),\n", + " ('Decision tree', 0.5596889853477478),\n", + " ('Artificial intelligence', 0.5569720268249512)]\n" + ] + } + 
], + "source": [ + "for model in models:\n", + " print(str(model))\n", + " pprint(model.docvecs.most_similar(positive=[\"Machine learning\"], topn=20))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "DBOW model interpret the word 'Machine Learning' as a part of Computer Science field, and DM model as Data Science related field.\n", + "\n", + "Second, calculating cosine simillarity of \"Lady Gaga\" using Paragraph Vector." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "collapsed": false, + "scrolled": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Doc2Vec(dbow+w,d200,hs,w8,mc19,t8)\n", + "[('Katy Perry', 0.7374469637870789),\n", + " ('Adam Lambert', 0.6972734928131104),\n", + " ('Miley Cyrus', 0.6212848424911499),\n", + " ('List of awards and nominations received by Lady Gaga', 0.6138384938240051),\n", + " ('Nicole Scherzinger', 0.6092700958251953),\n", + " ('Christina Aguilera', 0.6062655448913574),\n", + " ('Nicki Minaj', 0.6019431948661804),\n", + " ('Taylor Swift', 0.5973174571990967),\n", + " ('The Pussycat Dolls', 0.5888757705688477),\n", + " ('Beyoncé', 0.5844652652740479)]\n", + "Doc2Vec(dm/m,d200,hs,w8,mc19,t8)\n", + "[('ArtRave: The Artpop Ball', 0.5719832181930542),\n", + " ('Artpop', 0.5651129484176636),\n", + " ('Katy Perry', 0.5571318864822388),\n", + " ('The Fame', 0.5388195514678955),\n", + " ('The Fame Monster', 0.5380634069442749),\n", + " ('G.U.Y.', 0.5365751385688782),\n", + " ('Beautiful, Dirty, Rich', 0.5329179763793945),\n", + " ('Applause (Lady Gaga song)', 0.5328119993209839),\n", + " ('The Monster Ball Tour', 0.5299569368362427),\n", + " ('Lindsey Stirling', 0.5281971096992493)]\n" + ] + } + ], + "source": [ + "for model in models:\n", + " print(str(model))\n", + " pprint(model.docvecs.most_similar(positive=[\"Lady Gaga\"], topn=10))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "collapsed": true + }, + "source": [ + "DBOW model reveal the similar singer in the U.S., and DM model understand that many of Lady Gaga's songs are similar with the word \"Lady Gaga\".\n", + "\n", + "Third, calculating cosine simillarity of \"Lady Gaga\" - \"American\" + \"Japanese\" using Document vector and Word Vectors. \"American\" and \"Japanese\" are Word Vectors, not Paragraph Vectors. Word Vectors are already converted to lowercases by WikiCorpus." 
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {
+    "collapsed": false,
+    "scrolled": false
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Doc2Vec(dbow+w,d200,hs,w8,mc19,t8)\n",
+      "[('Game (Perfume album)', 0.5571034550666809),\n",
+      " ('Katy Perry', 0.5537782311439514),\n",
+      " ('Taboo (Kumi Koda song)', 0.5304880142211914),\n",
+      " ('Kylie Minogue', 0.5234110355377197),\n",
+      " ('Ayumi Hamasaki', 0.5110630989074707),\n",
+      " (\"Girls' Generation\", 0.4996713399887085),\n",
+      " ('Britney Spears', 0.49094343185424805),\n",
+      " ('Koda Kumi', 0.48719698190689087),\n",
+      " ('Perfume (Japanese band)', 0.48536181449890137),\n",
+      " ('Kara (South Korean band)', 0.48507893085479736)]\n",
+      "Doc2Vec(dm/m,d200,hs,w8,mc19,t8)\n",
+      "[('Artpop', 0.47699037194252014),\n",
+      " ('Jessie J', 0.4439432621002197),\n",
+      " ('Haus of Gaga', 0.43463900685310364),\n",
+      " ('The Fame', 0.4278091788291931),\n",
+      " ('List of awards and nominations received by Lady Gaga', 0.4268512427806854),\n",
+      " ('Applause (Lady Gaga song)', 0.41563737392425537),\n",
+      " ('New Cutie Honey', 0.4152414798736572),\n",
+      " ('M.I.A. (rapper)', 0.4091864228248596),\n",
+      " ('Mama Do (Uh Oh, Uh Oh)', 0.4044945538043976),\n",
+      " ('The Fame Monster', 0.40421998500823975)]\n"
+     ]
+    }
+   ],
+   "source": [
+    "for model in models:\n",
+    "    print(str(model))\n",
+    "    # Document vector for the \"Lady Gaga\" article, shifted by two word vectors.\n",
+    "    vec = [model.docvecs[\"Lady Gaga\"] - model[\"american\"] + model[\"japanese\"]]\n",
+    "    # topn=11 because the \"Lady Gaga\" article itself may appear in the results; we filter it out.\n",
+    "    pprint([m for m in model.docvecs.most_similar(vec, topn=11) if m[0] != \"Lady Gaga\"])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The DBOW model now surfaces artists similar to Lady Gaga in Japan, such as 'Perfume', a well-known Japanese pop group. The DM model, on the other hand, returns no Japanese artists in its top 10; its results are almost the same as the plain \"Lady Gaga\" query without the vector arithmetic.\n",
+    "\n",
+    "This suggests that the DBOW architecture used in the original paper is better suited to similarity queries that mix document vectors and word vectors."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.5.2"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}