diff --git a/docs/notebooks/doc2vec-IMDB.ipynb b/docs/notebooks/doc2vec-IMDB.ipynb index 9beb99935f..f4a2ec2ae3 100644 --- a/docs/notebooks/doc2vec-IMDB.ipynb +++ b/docs/notebooks/doc2vec-IMDB.ipynb @@ -4,22 +4,51 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# gensim doc2vec & IMDB sentiment dataset" + "# Gensim Doc2vec Tutorial on the IMDB Sentiment Dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "TODO: section on introduction & motivation\n", + "## Introduction\n", "\n", - "TODO: prerequisites + dependencies (statsmodels, patsy, ?)\n", + "In this tutorial, we will learn how to apply Doc2vec with gensim by recreating the results of Le and Mikolov (2014). \n", + "\n", + "### Bag-of-words Model\n", + "Previous state-of-the-art document representations were based on the bag-of-words model, which represents each input document as a fixed-length vector of word counts. For example, borrowing from the Wikipedia article, the two documents \n", + "(1) `John likes to watch movies. Mary likes movies too.` \n", + "(2) `John also likes to watch football games.` \n", + "are used to construct a length 10 list of words \n", + "`[\"John\", \"likes\", \"to\", \"watch\", \"movies\", \"Mary\", \"too\", \"also\", \"football\", \"games\"]` \n", + "so that we can represent the two documents as fixed-length vectors whose elements are the frequencies of the corresponding words in our list \n", + "(1) `[1, 2, 1, 1, 2, 1, 1, 0, 0, 0]` \n", + "(2) `[1, 1, 1, 1, 0, 0, 0, 1, 1, 1]` \n", + "Bag-of-words models are surprisingly effective, but they lose all information about word order. Bag-of-n-grams models, which count phrases of length n instead of single words, capture some local word order, but they suffer from data sparsity and high dimensionality.\n", + "\n", + "### Word2vec Model\n", + "Word2vec is a more recent model that embeds words in a dense, lower-dimensional vector space using a shallow neural network. The result is a set of word vectors where vectors close together in vector space have similar meanings based on context, and word vectors distant from each other have differing meanings. For example, `strong` and `powerful` would be close together, while `strong` and `Paris` would be relatively far apart. There are two versions of this model, based on skip-grams and continuous bag of words.\n", + "\n", + "\n", + "#### Word2vec - Skip-gram Model\n", + "The skip-gram word2vec model takes in pairs (word1, word2) generated by moving a window across text data, and trains a 1-hidden-layer neural network on the fake task of predicting, for a given input word, a probability distribution over the words likely to appear near it. The words are fed to the network in one-hot encoding, and the learned weights between the input layer and the hidden layer give us the word embeddings: if the hidden layer has 300 neurons, the network gives us 300-dimensional word embeddings.\n", + "\n", + "#### Word2vec - Continuous-bag-of-words Model\n", + "Continuous-bag-of-words Word2vec is very similar to the skip-gram model. It is also a 1-hidden-layer neural network, but the fake task is reversed: given the one-hot encoded context words in a window around a center word, predict the center word. Again, the learned input weights give us the word embeddings.\n",
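"\n",
"Both flavours are available in gensim's `Word2Vec` class. As a quick, illustrative sketch only (the toy sentences and parameter values below are invented for this example and are not part of the original tutorial):\n",
"\n",
"```python\n",
"from gensim.models import Word2Vec\n",
"\n",
"toy_sentences = [['john', 'likes', 'movies'], ['mary', 'likes', 'movies', 'too']]\n",
"skipgram_model = Word2Vec(toy_sentences, size=100, window=5, min_count=1, sg=1)  # skip-gram\n",
"cbow_model = Word2Vec(toy_sentences, size=100, window=5, min_count=1, sg=0)  # continuous bag of words\n",
"```\n",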
+ "\n", + "### Paragraph Vector\n", + "Le and Mikolov (2014) introduce the Paragraph Vector, which outperforms more naïve representations of documents, such as averaging the Word2vec word vectors of a document. The idea is straightforward: we act as if a paragraph (or document) is just another vector, like a word vector, and we call it a paragraph vector. The embedding of the paragraph is learned in the same way as the word embeddings. Like bag-of-n-grams, the paragraph vector model captures some local word order, but it gives us a dense representation in vector space rather than a sparse, high-dimensional one.\n", + "\n", + "#### Paragraph Vector - Distributed Memory (PV-DM)\n", + "This is the Paragraph Vector model analogous to Continuous-bag-of-words Word2vec. The paragraph vectors are obtained by training a neural network on the fake task of inferring a center word from its context words plus a context paragraph: the paragraph vector acts as an extra piece of context shared by every word in that paragraph. \n", + "\n", + "#### Paragraph Vector - Distributed Bag of Words (PV-DBOW)\n", + "This is the Paragraph Vector model analogous to Skip-gram Word2vec. The paragraph vectors are obtained by training a neural network on the fake task of predicting a probability distribution over words randomly sampled from the paragraph, given only the paragraph vector.\n", "\n", "### Requirements\n", - "Following are the dependencies for this tutorial:\n", - " - testfixtures\n", - " - statsmodels\n", - " " + "The following Python modules are dependencies for this tutorial:\n", + "* testfixtures ( `pip install testfixtures` )\n", + "* statsmodels ( `pip install statsmodels` )" ] }, { @@ -33,19 +62,20 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Fetch and prep exactly as in Mikolov's go.sh shell script. (Note this cell tests for existence of required files, so steps won't repeat once the final summary file (`aclImdb/alldata-id.txt`) is available alongside this notebook.)" + "Let's download the IMDB archive if it is not already downloaded (84 MB). This will be our text data for this tutorial. \n", + "The data can be found here: http://ai.stanford.edu/~amaas/data/sentiment/" ] }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "total running time: 41.018378\n" + "Total running time: 0.00035199999999990794\n" ] } ], @@ -71,14 +101,11 @@ "# Convert text to lower-case and strip punctuation/symbols from words\n", "def normalize_text(text):\n", " norm_text = text.lower()\n", - "\n", " # Replace breaks with spaces\n", " norm_text = norm_text.replace('<br />
', ' ')\n", - "\n", " # Pad punctuation with spaces on both sides\n", " for char in ['.', '\"', ',', '(', ')', '!', '?', ';', ':']:\n", " norm_text = norm_text.replace(char, ' ' + char + ' ')\n", - "\n", " return norm_text\n", "\n", "import time\n", @@ -88,42 +115,34 @@ " if not os.path.isdir(dirname):\n", " if not os.path.isfile(filename):\n", " # Download IMDB archive\n", + " print(\"Downloading IMDB archive...\")\n", " url = u'http://ai.stanford.edu/~amaas/data/sentiment/' + filename\n", " r = requests.get(url)\n", " with open(filename, 'wb') as f:\n", " f.write(r.content)\n", - "\n", " tar = tarfile.open(filename, mode='r')\n", " tar.extractall()\n", " tar.close()\n", "\n", - " # Concat and normalize test/train data\n", + " # Concatenate and normalize test/train data\n", + " print(\"Cleaning up dataset...\")\n", " folders = ['train/pos', 'train/neg', 'test/pos', 'test/neg', 'train/unsup']\n", " alldata = u''\n", - "\n", " for fol in folders:\n", " temp = u''\n", " output = fol.replace('/', '-') + '.txt'\n", - "\n", " # Is there a better pattern to use?\n", " txt_files = glob.glob(os.path.join(dirname, fol, '*.txt'))\n", - "\n", " for txt in txt_files:\n", " with smart_open.smart_open(txt, \"rb\") as t:\n", " t_clean = t.read().decode(\"utf-8\")\n", - "\n", " for c in control_chars:\n", " t_clean = t_clean.replace(c, ' ')\n", - "\n", " temp += t_clean\n", - "\n", " temp += \"\\n\"\n", - "\n", " temp_norm = normalize_text(temp)\n", - "\n", " with smart_open.smart_open(os.path.join(dirname, output), \"wb\") as n:\n", " n.write(temp_norm.encode(\"utf-8\"))\n", - "\n", " alldata += temp_norm\n", "\n", " with smart_open.smart_open(os.path.join(dirname, 'alldata-id.txt'), 'wb') as f:\n", @@ -132,7 +151,7 @@ " f.write(num_line.encode(\"utf-8\"))\n", "\n", "end = time.clock()\n", - "print (\"total running time: \", end-start)" + "print (\"Total running time: \", end-start)" ] }, { @@ -151,7 +170,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The data is small enough to be read into memory. " + "The text data is small enough to be read into memory. 
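\n",
"\n",
"As a quick optional sanity check (assuming the cell above has already created `aclImdb/alldata-id.txt`), we can peek at the first line of the combined file; the leading token is a document id (compare `tokens[0]` in the next cell) and the rest is the normalized review text:\n",
"\n",
"```python\n",
"with open('aclImdb/alldata-id.txt', encoding='utf-8') as f:\n",
"    print(f.readline()[:80])\n",
"```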
" ] }, { @@ -174,19 +193,19 @@ "\n", "SentimentDocument = namedtuple('SentimentDocument', 'words tags split sentiment')\n", "\n", - "alldocs = [] # will hold all docs in original order\n", + "alldocs = [] # Will hold all docs in original order\n", "with open('aclImdb/alldata-id.txt', encoding='utf-8') as alldata:\n", " for line_no, line in enumerate(alldata):\n", " tokens = gensim.utils.to_unicode(line).split()\n", " words = tokens[1:]\n", - " tags = [line_no] # `tags = [tokens[0]]` would also work at extra memory cost\n", - " split = ['train','test','extra','extra'][line_no//25000] # 25k train, 25k test, 25k extra\n", + " tags = [line_no] # 'tags = [tokens[0]]' would also work at extra memory cost\n", + " split = ['train', 'test', 'extra', 'extra'][line_no//25000] # 25k train, 25k test, 25k extra\n", " sentiment = [1.0, 0.0, 1.0, 0.0, None, None, None, None][line_no//12500] # [12.5K pos, 12.5K neg]*2 then unknown\n", " alldocs.append(SentimentDocument(words, tags, split, sentiment))\n", "\n", "train_docs = [doc for doc in alldocs if doc.split == 'train']\n", "test_docs = [doc for doc in alldocs if doc.split == 'test']\n", - "doc_list = alldocs[:] # for reshuffling per pass\n", + "doc_list = alldocs[:] # For reshuffling per pass\n", "\n", "print('%d docs: %d train-sentiment, %d test-sentiment' % (len(doc_list), len(train_docs), len(test_docs)))" ] @@ -202,17 +221,16 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Approximating experiment of Le & Mikolov [\"Distributed Representations of Sentences and Documents\"](http://cs.stanford.edu/~quocle/paragraph_vector.pdf), also with guidance from Mikolov's [example go.sh](https://groups.google.com/d/msg/word2vec-toolkit/Q49FIrNOQRo/J6KG8mUj45sJ):\n", + "We approximate the experiment of Le & Mikolov [\"Distributed Representations of Sentences and Documents\"](http://cs.stanford.edu/~quocle/paragraph_vector.pdf) with guidance from Mikolov's [example go.sh](https://groups.google.com/d/msg/word2vec-toolkit/Q49FIrNOQRo/J6KG8mUj45sJ):\n", "\n", "`./word2vec -train ../alldata-id.txt -output vectors.txt -cbow 0 -size 100 -window 10 -negative 5 -hs 0 -sample 1e-4 -threads 40 -binary 0 -iter 20 -min-count 1 -sentence-vectors 1`\n", "\n", - "Parameter choices below vary:\n", - "\n", - "* 100-dimensional vectors, as the 400d vectors of the paper don't seem to offer much benefit on this task\n", - "* similarly, frequent word subsampling seems to decrease sentiment-prediction accuracy, so it's left out\n", + "We vary the following parameter choices:\n", + "* 100-dimensional vectors, as the 400-d vectors of the paper don't seem to offer much benefit on this task\n", + "* Similarly, frequent word subsampling seems to decrease sentiment-prediction accuracy, so it's left out\n", "* `cbow=0` means skip-gram which is equivalent to the paper's 'PV-DBOW' mode, matched in gensim with `dm=0`\n", - "* added to that DBOW model are two DM models, one which averages context vectors (`dm_mean`) and one which concatenates them (`dm_concat`, resulting in a much larger, slower, more data-hungry model)\n", - "* a `min_count=2` saves quite a bit of model memory, discarding only words that appear in a single doc (and are thus no more expressive than the unique-to-each doc vectors themselves)" + "* Added to that DBOW model are two DM models, one which averages context vectors (`dm_mean`) and one which concatenates them (`dm_concat`, resulting in a much larger, slower, more data-hungry model)\n", + "* A `min_count=2` saves quite a bit of model memory, discarding only 
words that appear in a single doc (and are thus no more expressive than the unique-to-each doc vectors themselves)" ] }, { @@ -237,19 +255,19 @@ "import multiprocessing\n", "\n", "cores = multiprocessing.cpu_count()\n", - "assert gensim.models.doc2vec.FAST_VERSION > -1, \"this will be painfully slow otherwise\"\n", + "assert gensim.models.doc2vec.FAST_VERSION > -1, \"This will be painfully slow otherwise\"\n", "\n", "simple_models = [\n", - " # PV-DM w/concatenation - window=5 (both sides) approximates paper's 10-word total window size\n", + " # PV-DM w/ concatenation - window=5 (both sides) approximates paper's 10-word total window size\n", " Doc2Vec(dm=1, dm_concat=1, size=100, window=5, negative=5, hs=0, min_count=2, workers=cores),\n", " # PV-DBOW \n", " Doc2Vec(dm=0, size=100, negative=5, hs=0, min_count=2, workers=cores),\n", - " # PV-DM w/average\n", + " # PV-DM w/ average\n", " Doc2Vec(dm=1, dm_mean=1, size=100, window=10, negative=5, hs=0, min_count=2, workers=cores),\n", "]\n", "\n", - "# speed setup by sharing results of 1st model's vocabulary scan\n", - "simple_models[0].build_vocab(alldocs) # PV-DM/concat requires one special NULL word so it serves as template\n", + "# Speed up setup by sharing results of the 1st model's vocabulary scan\n", + "simple_models[0].build_vocab(alldocs) # PV-DM w/ concat requires one special NULL word so it serves as template\n", "print(simple_models[0])\n", "for model in simple_models[1:]:\n", " model.reset_from(simple_models[0])\n", @@ -262,7 +280,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Following the paper, we also evaluate models in pairs. These wrappers return the concatenation of the vectors from each model. (Only the singular models are trained.)" + "Le and Mikolov note that combining a paragraph vector from Distributed Bag of Words (DBOW) with one from Distributed Memory (DM) improves performance, so we follow suit and also evaluate the models in pairs. Here, we concatenate the paragraph vectors obtained from each model; only the individual models are trained." ] }, { @@ -289,7 +307,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Helper methods for evaluating error rate." + "Let's define some helper methods for evaluating the performance of our Doc2vec models. We will classify document sentiments using a logistic regression model based on our paragraph embeddings, and compare the error rates obtained with the paragraph vectors from our various Doc2vec models." ] }, { @@ -301,8 +319,8 @@ "name": "stderr", "output_type": "stream", "text": [ - "/usr/lib/python3.4/importlib/_bootstrap.py:321: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.\n", - " return f(*args, **kwds)\n" + "/Users/daniel/miniconda3/envs/gensim/lib/python3.6/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version.
Please use the pandas.tseries module instead.\n", + " from pandas.core import datetools\n" ] } ], @@ -311,7 +329,7 @@ "import statsmodels.api as sm\n", "from random import sample\n", "\n", - "# for timing\n", + "# For timing\n", "from contextlib import contextmanager\n", "from timeit import default_timer\n", "import time \n", @@ -327,7 +345,7 @@ "def logistic_predictor_from_data(train_targets, train_regressors):\n", " logit = sm.Logit(train_targets, train_regressors)\n", " predictor = logit.fit(disp=0)\n", - " #print(predictor.summary())\n", + " # print(predictor.summary())\n", " return predictor\n", "\n", "def error_rate_for_model(test_model, train_set, test_set, infer=False, infer_steps=3, infer_alpha=0.1, infer_subsample=0.1):\n", @@ -346,7 +364,7 @@ " test_regressors = [test_model.docvecs[doc.tags[0]] for doc in test_docs]\n", " test_regressors = sm.add_constant(test_regressors)\n", " \n", - " # predict & evaluate\n", + " # Predict & evaluate\n", " test_predictions = predictor.predict(test_regressors)\n", " corrects = sum(np.rint(test_predictions) == [doc.sentiment for doc in test_data])\n", " errors = len(test_predictions) - corrects\n", @@ -365,11 +383,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Using explicit multiple-pass, alpha-reduction approach as sketched in [gensim doc2vec blog post](http://radimrehurek.com/2014/12/doc2vec-tutorial/) – with added shuffling of corpus on each pass.\n", + "We use an explicit multiple-pass, alpha-reduction approach as sketched in this [gensim doc2vec blog post](http://radimrehurek.com/2014/12/doc2vec-tutorial/) with added shuffling of corpus on each pass.\n", "\n", "Note that vector training is occurring on *all* documents of the dataset, which includes all TRAIN/TEST/DEV docs.\n", "\n", - "Evaluation of each model's sentiment-predictive power is repeated after each pass, as an error rate (lower is better), to see the rates-of-relative-improvement. The base numbers reuse the TRAIN and TEST vectors stored in the models for the logistic regression, while the _inferred_ results use newly-inferred TEST vectors. \n", + "We evaluate each model's sentiment predictive power based on error rate, and the evaluation is repeated after each pass so we can see the rates of relative improvement. The base numbers reuse the TRAIN and TEST vectors stored in the models for the logistic regression, while the _inferred_ results use newly-inferred TEST vectors. 
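\n",
"\n",
"For example, the two kinds of test vectors come from calls roughly like these (a sketch only; here `model` stands for any of the trained models above and `doc` for a single `SentimentDocument`):\n",
"\n",
"```python\n",
"stored_vec = model.docvecs[doc.tags[0]]  # vector learned for this doc during bulk training\n",
"inferred_vec = model.infer_vector(doc.words, steps=3, alpha=0.1)  # vector re-inferred from the words\n",
"```\n",
"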
\n", "\n", "(On a 4-core 2.6Ghz Intel Core i7, these 20 passes training and evaluating 3 main models takes about an hour.)" ] @@ -383,7 +401,7 @@ "outputs": [], "source": [ "from collections import defaultdict\n", - "best_error = defaultdict(lambda :1.0) # to selectively-print only best errors achieved" + "best_error = defaultdict(lambda: 1.0) # To selectively print only best errors achieved" ] }, { @@ -395,159 +413,159 @@ "name": "stdout", "output_type": "stream", "text": [ - "START 2017-06-06 15:19:50.208091\n", - "*0.408320 : 1 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 131.9s 33.6s\n", - "*0.341600 : 1 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4)_inferred 131.9s 48.3s\n", - "*0.239960 : 1 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 35.3s 45.9s\n", - "*0.193200 : 1 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4)_inferred 35.3s 48.3s\n", - "*0.268640 : 1 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 48.6s 48.5s\n", - "*0.208000 : 1 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4)_inferred 48.6s 47.4s\n", - "*0.216160 : 1 passes : dbow+dmm 0.0s 168.9s\n", - "*0.176000 : 1 passes : dbow+dmm_inferred 0.0s 176.4s\n", - "*0.237280 : 1 passes : dbow+dmc 0.0s 169.3s\n", - "*0.194400 : 1 passes : dbow+dmc_inferred 0.0s 183.9s\n", - "completed pass 1 at alpha 0.025000\n", - "*0.346760 : 2 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 133.4s 42.2s\n", - "*0.145280 : 2 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 29.0s 42.8s\n", - "*0.210920 : 2 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 38.8s 42.2s\n", - "*0.139120 : 2 passes : dbow+dmm 0.0s 173.2s\n", - "*0.147120 : 2 passes : dbow+dmc 0.0s 191.8s\n", - "completed pass 2 at alpha 0.023800\n", - "*0.314920 : 3 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 112.3s 37.6s\n", - "*0.126720 : 3 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 28.4s 42.6s\n", - "*0.191920 : 3 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 37.9s 42.2s\n", - "*0.121640 : 3 passes : dbow+dmm 0.0s 190.8s\n", - "*0.127040 : 3 passes : dbow+dmc 0.0s 188.1s\n", - "completed pass 3 at alpha 0.022600\n", - "*0.282080 : 4 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 104.9s 36.3s\n", - "*0.115520 : 4 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 27.6s 49.9s\n", - "*0.181280 : 4 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 40.7s 42.2s\n", - "*0.114760 : 4 passes : dbow+dmm 0.0s 188.6s\n", - "*0.116040 : 4 passes : dbow+dmc 0.0s 192.5s\n", - "completed pass 4 at alpha 0.021400\n", - "*0.257560 : 5 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 102.5s 35.8s\n", - "*0.265200 : 5 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4)_inferred 102.5s 48.6s\n", - "*0.110880 : 5 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 27.0s 46.5s\n", - "*0.117600 : 5 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4)_inferred 27.0s 50.5s\n", - "*0.171240 : 5 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 39.1s 43.7s\n", - "*0.207200 : 5 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4)_inferred 39.1s 47.5s\n", - "*0.108920 : 5 passes : dbow+dmm 0.0s 203.4s\n", - "*0.114800 : 5 passes : dbow+dmm_inferred 0.0s 213.4s\n", - "*0.111520 : 5 passes : dbow+dmc 0.0s 189.5s\n", - "*0.132000 : 5 passes : dbow+dmc_inferred 0.0s 202.6s\n", - "completed pass 5 at alpha 0.020200\n", - "*0.240440 : 6 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 117.6s 39.2s\n", - "*0.107600 : 6 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 32.3s 52.1s\n", - "*0.166800 : 6 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 46.4s 40.8s\n", - "*0.108160 : 6 passes : dbow+dmm 0.0s 197.8s\n", - "*0.109920 : 6 passes : 
dbow+dmc 0.0s 189.4s\n", - "completed pass 6 at alpha 0.019000\n", - "*0.225280 : 7 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 102.8s 36.0s\n", - "*0.105560 : 7 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 31.0s 47.0s\n", - "*0.164320 : 7 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 38.6s 43.7s\n", - "*0.104760 : 7 passes : dbow+dmm 0.0s 187.1s\n", - "*0.107600 : 7 passes : dbow+dmc 0.0s 182.9s\n", - "completed pass 7 at alpha 0.017800\n", - "*0.214280 : 8 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 99.2s 41.1s\n", - "*0.102400 : 8 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 28.6s 47.3s\n", - "*0.161000 : 8 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 36.4s 40.9s\n", - "*0.102720 : 8 passes : dbow+dmm 0.0s 188.2s\n", - "*0.104280 : 8 passes : dbow+dmc 0.0s 187.3s\n", - "completed pass 8 at alpha 0.016600\n", - "*0.206840 : 9 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 96.9s 41.4s\n", - " 0.102920 : 9 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 27.1s 46.4s\n", - "*0.158600 : 9 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 40.3s 40.7s\n", - "*0.101880 : 9 passes : dbow+dmm 0.0s 188.1s\n", - "*0.103960 : 9 passes : dbow+dmc 0.0s 192.2s\n", - "completed pass 9 at alpha 0.015400\n", - "*0.198960 : 10 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 116.0s 43.0s\n", - "*0.194000 : 10 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4)_inferred 116.0s 54.2s\n", - "*0.102120 : 10 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 27.8s 47.1s\n", - "*0.100000 : 10 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4)_inferred 27.8s 50.4s\n", - "*0.156640 : 10 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 38.3s 41.9s\n", - "*0.178400 : 10 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4)_inferred 38.3s 46.8s\n", - " 0.102520 : 10 passes : dbow+dmm 0.0s 192.5s\n", - "*0.104000 : 10 passes : dbow+dmm_inferred 0.0s 207.3s\n", - "*0.103560 : 10 passes : dbow+dmc 0.0s 191.0s\n", - "*0.115200 : 10 passes : dbow+dmc_inferred 0.0s 203.5s\n", - "completed pass 10 at alpha 0.014200\n", - "*0.192000 : 11 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 97.3s 42.7s\n", - " 0.102840 : 11 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 26.8s 45.1s\n", - " 0.156680 : 11 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 36.9s 41.1s\n", - "*0.101600 : 11 passes : dbow+dmm 0.0s 187.8s\n", - " 0.103880 : 11 passes : dbow+dmc 0.0s 187.9s\n", - "completed pass 11 at alpha 0.013000\n", - "*0.190440 : 12 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 99.1s 44.5s\n", - " 0.103640 : 12 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 34.7s 45.9s\n", - "*0.154640 : 12 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 37.3s 41.8s\n", - " 0.103400 : 12 passes : dbow+dmm 0.0s 190.1s\n", - " 0.103640 : 12 passes : dbow+dmc 0.0s 190.6s\n", - "completed pass 12 at alpha 0.011800\n", - "*0.186840 : 13 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 99.1s 41.0s\n", - " 0.102560 : 13 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 26.7s 44.5s\n", - "*0.153880 : 13 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 35.9s 40.0s\n", - " 0.103760 : 13 passes : dbow+dmm 0.0s 182.8s\n", - " 0.103680 : 13 passes : dbow+dmc 0.0s 174.8s\n", - "completed pass 13 at alpha 0.010600\n", - "*0.184600 : 14 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 92.0s 38.6s\n", - " 0.103080 : 14 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 26.7s 44.5s\n", - "*0.153760 : 14 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 35.8s 39.0s\n", - " 0.103120 : 14 passes : dbow+dmm 0.0s 177.6s\n", - " 0.103960 : 14 passes : dbow+dmc 0.0s 176.0s\n", - "completed pass 14 at 
alpha 0.009400\n", - "*0.182720 : 15 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 91.7s 38.7s\n", - "*0.179600 : 15 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4)_inferred 91.7s 50.8s\n", - " 0.103280 : 15 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 26.7s 43.5s\n", - " 0.104400 : 15 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4)_inferred 26.7s 47.8s\n", - "*0.153720 : 15 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 36.0s 39.0s\n", - " 0.187200 : 15 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4)_inferred 36.0s 43.7s\n", - " 0.103520 : 15 passes : dbow+dmm 0.0s 174.9s\n", - " 0.105600 : 15 passes : dbow+dmm_inferred 0.0s 183.2s\n", - " 0.103680 : 15 passes : dbow+dmc 0.0s 175.9s\n", - "*0.106000 : 15 passes : dbow+dmc_inferred 0.0s 189.9s\n", - "completed pass 15 at alpha 0.008200\n", - "*0.181040 : 16 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 91.6s 41.2s\n", - " 0.103240 : 16 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 26.7s 45.3s\n", - "*0.153600 : 16 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 36.1s 40.6s\n", - " 0.103960 : 16 passes : dbow+dmm 0.0s 175.9s\n", - "*0.103400 : 16 passes : dbow+dmc 0.0s 175.9s\n", - "completed pass 16 at alpha 0.007000\n", - "*0.180080 : 17 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 92.1s 40.3s\n", - " 0.102760 : 17 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 26.3s 44.9s\n", - "*0.152880 : 17 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 35.4s 39.0s\n", - " 0.103200 : 17 passes : dbow+dmm 0.0s 182.5s\n", - "*0.103280 : 17 passes : dbow+dmc 0.0s 178.0s\n", - "completed pass 17 at alpha 0.005800\n", - "*0.178720 : 18 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 91.1s 39.0s\n", - "*0.101640 : 18 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 26.4s 44.3s\n", - "*0.152280 : 18 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 35.6s 39.5s\n", - " 0.102360 : 18 passes : dbow+dmm 0.0s 183.8s\n", - " 0.103320 : 18 passes : dbow+dmc 0.0s 179.0s\n", - "completed pass 18 at alpha 0.004600\n", - "*0.178600 : 19 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 91.1s 38.9s\n", - " 0.102320 : 19 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 26.3s 45.7s\n", - "*0.151920 : 19 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 35.5s 40.7s\n", - " 0.102240 : 19 passes : dbow+dmm 0.0s 181.7s\n", - "*0.103000 : 19 passes : dbow+dmc 0.0s 181.7s\n", - "completed pass 19 at alpha 0.003400\n", - "*0.177360 : 20 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 90.9s 40.0s\n", - " 0.190800 : 20 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4)_inferred 90.9s 52.1s\n" + "START 2017-07-08 17:48:01.470463\n", + "*0.404640 : 1 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 80.4s 2.3s\n", + "*0.361200 : 1 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4)_inferred 80.4s 10.9s\n", + "*0.247520 : 1 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 31.0s 1.1s\n", + "*0.201200 : 1 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4)_inferred 31.0s 3.5s\n", + "*0.264120 : 1 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 38.5s 0.7s\n", + "*0.203600 : 1 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4)_inferred 38.5s 4.7s\n", + "*0.216600 : 1 passes : dbow+dmm 0.0s 1.7s\n", + "*0.199600 : 1 passes : dbow+dmm_inferred 0.0s 10.6s\n", + "*0.244800 : 1 passes : dbow+dmc 0.0s 2.0s\n", + "*0.219600 : 1 passes : dbow+dmc_inferred 0.0s 15.0s\n", + "Completed pass 1 at alpha 0.025000\n", + "*0.349560 : 2 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 52.7s 0.6s\n", + "*0.147400 : 2 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 20.3s 0.5s\n", + "*0.209200 : 2 passes : 
Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 28.3s 0.5s\n", + "*0.140280 : 2 passes : dbow+dmm 0.0s 1.4s\n", + "*0.149360 : 2 passes : dbow+dmc 0.0s 2.2s\n", + "Completed pass 2 at alpha 0.023800\n", + "*0.308760 : 3 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 50.4s 0.6s\n", + "*0.126880 : 3 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 19.5s 0.5s\n", + "*0.192560 : 3 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 37.8s 0.7s\n", + "*0.124440 : 3 passes : dbow+dmm 0.0s 1.8s\n", + "*0.126280 : 3 passes : dbow+dmc 0.0s 1.7s\n", + "Completed pass 3 at alpha 0.022600\n", + "*0.277160 : 4 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 75.2s 0.7s\n", + "*0.119120 : 4 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 31.0s 2.6s\n", + "*0.177960 : 4 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 48.3s 0.8s\n", + "*0.118000 : 4 passes : dbow+dmm 0.0s 2.2s\n", + "*0.119400 : 4 passes : dbow+dmc 0.0s 2.0s\n", + "Completed pass 4 at alpha 0.021400\n", + "*0.256040 : 5 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 75.2s 0.8s\n", + "*0.256800 : 5 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4)_inferred 75.2s 9.0s\n", + "*0.115120 : 5 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 34.0s 1.6s\n", + "*0.115200 : 5 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4)_inferred 34.0s 3.5s\n", + "*0.171840 : 5 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 42.5s 0.9s\n", + "*0.202400 : 5 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4)_inferred 42.5s 6.2s\n", + "*0.111920 : 5 passes : dbow+dmm 0.0s 2.0s\n", + "*0.118000 : 5 passes : dbow+dmm_inferred 0.0s 11.6s\n", + "*0.113040 : 5 passes : dbow+dmc 0.0s 2.2s\n", + "*0.115600 : 5 passes : dbow+dmc_inferred 0.0s 17.3s\n", + "Completed pass 5 at alpha 0.020200\n", + "*0.236880 : 6 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 70.1s 2.0s\n", + "*0.109720 : 6 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 32.2s 0.9s\n", + "*0.166320 : 6 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 44.8s 0.9s\n", + "*0.108720 : 6 passes : dbow+dmm 0.0s 2.1s\n", + "*0.108480 : 6 passes : dbow+dmc 0.0s 2.0s\n", + "Completed pass 6 at alpha 0.019000\n", + "*0.221640 : 7 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 84.7s 0.9s\n", + "*0.107120 : 7 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 31.3s 1.9s\n", + "*0.164000 : 7 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 43.0s 0.9s\n", + "*0.106160 : 7 passes : dbow+dmm 0.0s 2.0s\n", + "*0.106680 : 7 passes : dbow+dmc 0.0s 2.0s\n", + "Completed pass 7 at alpha 0.017800\n", + "*0.209360 : 8 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 64.0s 0.8s\n", + "*0.106200 : 8 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 31.2s 0.8s\n", + "*0.161360 : 8 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 43.0s 0.9s\n", + "*0.104480 : 8 passes : dbow+dmm 0.0s 3.0s\n", + "*0.105640 : 8 passes : dbow+dmc 0.0s 2.0s\n", + "Completed pass 8 at alpha 0.016600\n", + "*0.203520 : 9 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 66.6s 1.0s\n", + "*0.105120 : 9 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 39.1s 1.1s\n", + "*0.160960 : 9 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 43.7s 0.7s\n", + " 0.104840 : 9 passes : dbow+dmm 0.0s 2.0s\n", + "*0.104240 : 9 passes : dbow+dmc 0.0s 2.0s\n", + "Completed pass 9 at alpha 0.015400\n", + "*0.195840 : 10 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 66.5s 1.7s\n", + "*0.197600 : 10 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4)_inferred 66.5s 10.1s\n", + "*0.104280 : 10 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 31.3s 0.8s\n", + " 0.115200 : 10 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4)_inferred 31.3s 
4.7s\n", + "*0.158800 : 10 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 44.5s 0.9s\n", + "*0.182800 : 10 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4)_inferred 44.5s 6.3s\n", + "*0.102760 : 10 passes : dbow+dmm 0.0s 3.1s\n", + "*0.110000 : 10 passes : dbow+dmm_inferred 0.0s 11.3s\n", + "*0.103920 : 10 passes : dbow+dmc 0.0s 2.2s\n", + "*0.109200 : 10 passes : dbow+dmc_inferred 0.0s 16.4s\n", + "Completed pass 10 at alpha 0.014200\n", + "*0.190800 : 11 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 71.3s 1.0s\n", + "*0.103840 : 11 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 33.8s 0.8s\n", + "*0.157440 : 11 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 44.5s 0.9s\n", + " 0.103240 : 11 passes : dbow+dmm 0.0s 3.0s\n", + " 0.104360 : 11 passes : dbow+dmc 0.0s 2.1s\n", + "Completed pass 11 at alpha 0.013000\n", + "*0.188520 : 12 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 65.4s 0.8s\n", + " 0.104600 : 12 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 33.3s 1.0s\n", + "*0.157240 : 12 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 53.5s 1.7s\n", + " 0.103880 : 12 passes : dbow+dmm 0.0s 2.8s\n", + " 0.104640 : 12 passes : dbow+dmc 0.0s 2.6s\n", + "Completed pass 12 at alpha 0.011800\n", + "*0.185760 : 13 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 71.8s 1.7s\n", + " 0.104040 : 13 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 31.9s 1.0s\n", + "*0.155960 : 13 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 45.7s 0.8s\n", + "*0.102720 : 13 passes : dbow+dmm 0.0s 2.0s\n", + " 0.104120 : 13 passes : dbow+dmc 0.0s 1.9s\n", + "Completed pass 13 at alpha 0.010600\n", + "*0.181960 : 14 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 80.3s 0.8s\n", + "*0.103680 : 14 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 23.1s 0.7s\n", + "*0.155040 : 14 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 31.4s 1.5s\n", + "*0.102440 : 14 passes : dbow+dmm 0.0s 1.6s\n", + "*0.103680 : 14 passes : dbow+dmc 0.0s 1.7s\n", + "Completed pass 14 at alpha 0.009400\n", + "*0.180680 : 15 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 48.5s 0.7s\n", + "*0.186000 : 15 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4)_inferred 48.5s 12.0s\n", + " 0.104840 : 15 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 23.4s 0.7s\n", + "*0.101600 : 15 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4)_inferred 23.4s 4.3s\n", + "*0.154000 : 15 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 53.2s 2.0s\n", + " 0.191600 : 15 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4)_inferred 53.2s 4.8s\n", + " 0.102960 : 15 passes : dbow+dmm 0.0s 3.1s\n", + "*0.108400 : 15 passes : dbow+dmm_inferred 0.0s 11.4s\n", + " 0.104280 : 15 passes : dbow+dmc 0.0s 1.7s\n", + "*0.098400 : 15 passes : dbow+dmc_inferred 0.0s 14.1s\n", + "Completed pass 15 at alpha 0.008200\n", + "*0.180320 : 16 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 68.3s 1.0s\n", + "*0.103600 : 16 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 28.5s 2.1s\n", + " 0.154640 : 16 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 43.4s 0.7s\n", + " 0.102520 : 16 passes : dbow+dmm 0.0s 1.9s\n", + "*0.102480 : 16 passes : dbow+dmc 0.0s 2.9s\n", + "Completed pass 16 at alpha 0.007000\n", + "*0.178160 : 17 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 63.4s 2.0s\n", + "*0.103360 : 17 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 31.5s 0.8s\n", + " 0.154160 : 17 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 40.9s 1.0s\n", + "*0.102320 : 17 passes : dbow+dmm 0.0s 3.0s\n", + " 0.102680 : 17 passes : dbow+dmc 0.0s 2.0s\n", + "Completed pass 17 at alpha 0.005800\n", + "*0.177520 : 18 passes : 
Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 55.1s 0.8s\n", + "*0.103120 : 18 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 24.8s 0.7s\n", + "*0.153040 : 18 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 32.9s 0.8s\n", + " 0.102440 : 18 passes : dbow+dmm 0.0s 1.7s\n", + "*0.102480 : 18 passes : dbow+dmc 0.0s 2.6s\n", + "Completed pass 18 at alpha 0.004600\n", + "*0.177240 : 19 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 57.2s 1.5s\n", + "*0.103080 : 19 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 20.6s 1.8s\n", + "*0.152680 : 19 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 43.8s 0.8s\n", + " 0.102800 : 19 passes : dbow+dmm 0.0s 1.8s\n", + " 0.102600 : 19 passes : dbow+dmc 0.0s 1.7s\n", + "Completed pass 19 at alpha 0.003400\n", + "*0.176080 : 20 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 50.2s 0.6s\n", + " 0.188000 : 20 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4)_inferred 50.2s 8.5s\n", + " 0.103400 : 20 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 19.7s 0.7s\n", + " 0.111600 : 20 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4)_inferred 19.7s 4.1s\n", + "*0.152680 : 20 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 30.5s 0.6s\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ - " 0.102520 : 20 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 26.4s 45.2s\n", - " 0.108800 : 20 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4)_inferred 26.4s 48.7s\n", - "*0.151680 : 20 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 35.5s 40.8s\n", - " 0.182400 : 20 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4)_inferred 35.5s 45.3s\n", - " 0.102320 : 20 passes : dbow+dmm 0.0s 183.5s\n", - " 0.113200 : 20 passes : dbow+dmm_inferred 0.0s 192.3s\n", - "*0.102800 : 20 passes : dbow+dmc 0.0s 183.3s\n", - " 0.111200 : 20 passes : dbow+dmc_inferred 0.0s 196.1s\n", - "completed pass 20 at alpha 0.002200\n", - "END 2017-06-06 19:46:10.508929\n" + " 0.182800 : 20 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4)_inferred 30.5s 4.7s\n", + " 0.102600 : 20 passes : dbow+dmm 0.0s 1.6s\n", + " 0.112800 : 20 passes : dbow+dmm_inferred 0.0s 8.8s\n", + "*0.102440 : 20 passes : dbow+dmc 0.0s 2.1s\n", + " 0.103600 : 20 passes : dbow+dmc_inferred 0.0s 12.4s\n", + "Completed pass 20 at alpha 0.002200\n", + "END 2017-07-08 18:39:42.878219\n" ] } ], @@ -561,17 +579,17 @@ "print(\"START %s\" % datetime.datetime.now())\n", "\n", "for epoch in range(passes):\n", - " shuffle(doc_list) # shuffling gets best results\n", + " shuffle(doc_list) # Shuffling gets best results\n", " \n", " for name, train_model in models_by_name.items():\n", - " # train\n", + " # Train\n", " duration = 'na'\n", " train_model.alpha, train_model.min_alpha = alpha, alpha\n", " with elapsed_timer() as elapsed:\n", " train_model.train(doc_list, total_examples=len(doc_list), epochs=1)\n", " duration = '%.1f' % elapsed()\n", " \n", - " # evaluate\n", + " # Evaluate\n", " eval_duration = ''\n", " with elapsed_timer() as eval_elapsed:\n", " err, err_count, test_count, predictor = error_rate_for_model(train_model, train_docs, test_docs)\n", @@ -593,7 +611,7 @@ " best_indicator = '*'\n", " print(\"%s%f : %i passes : %s %ss %ss\" % (best_indicator, infer_err, epoch + 1, name + '_inferred', duration, eval_duration))\n", "\n", - " print('completed pass %i at alpha %f' % (epoch + 1, alpha))\n", + " print('Completed pass %i at alpha %f' % (epoch + 1, alpha))\n", " alpha -= alpha_delta\n", " \n", "print(\"END %s\" % str(datetime.datetime.now()))" @@ -615,21 +633,23 @@ "name": "stdout", "output_type": "stream", "text": [ - "0.100000 
Doc2Vec(dbow,d100,n5,mc2,s0.001,t4)_inferred\n", - "0.101600 dbow+dmm\n", - "0.101640 Doc2Vec(dbow,d100,n5,mc2,s0.001,t4)\n", - "0.102800 dbow+dmc\n", - "0.104000 dbow+dmm_inferred\n", - "0.106000 dbow+dmc_inferred\n", - "0.151680 Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4)\n", - "0.177360 Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4)\n", - "0.178400 Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4)_inferred\n", - "0.179600 Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4)_inferred\n" + "Err rate Model\n", + "0.098400 dbow+dmc_inferred\n", + "0.101600 Doc2Vec(dbow,d100,n5,mc2,s0.001,t4)_inferred\n", + "0.102320 dbow+dmm\n", + "0.102440 dbow+dmc\n", + "0.103080 Doc2Vec(dbow,d100,n5,mc2,s0.001,t4)\n", + "0.108400 dbow+dmm_inferred\n", + "0.152680 Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4)\n", + "0.176080 Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4)\n", + "0.182800 Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4)_inferred\n", + "0.186000 Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4)_inferred\n" ] } ], "source": [ - "# print best error rates achieved\n", + "# Print best error rates achieved\n", + "print(\"Err rate Model\")\n", "for rate, name in sorted((rate, name) for name, rate in best_error.items()):\n", " print(\"%f %s\" % (rate, name))" ] @@ -638,7 +658,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In my testing, unlike the paper's report, DBOW performs best. Concatenating vectors from different models only offers a small predictive improvement. The best results I've seen are still just under 10% error rate, still a ways from the paper's 7.42%.\n" + "In our testing, contrary to the results of the paper, PV-DBOW performs best. Concatenating vectors from different models only offers a small predictive improvement over averaging vectors. There best results reproduced are just under 10% error rate, still a long way from the paper's reported 7.42% error rate." ] }, { @@ -664,18 +684,18 @@ "name": "stdout", "output_type": "stream", "text": [ - "for doc 47495...\n", + "for doc 73872...\n", "Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4):\n", - " [(47495, 0.8063223361968994), (28683, 0.4661555588245392), (10030, 0.3962923586368561)]\n", + " [(73872, 0.7427197694778442), (43744, 0.42404329776763916), (75113, 0.41938722133636475)]\n", "Doc2Vec(dbow,d100,n5,mc2,s0.001,t4):\n", - " [(47495, 0.9660482406616211), (17469, 0.5925078392028809), (52349, 0.5742233991622925)]\n", + " [(73872, 0.9305995106697083), (64147, 0.6267511248588562), (80042, 0.6207213401794434)]\n", "Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4):\n", - " [(47495, 0.8801028728485107), (60782, 0.5431949496269226), (42472, 0.5375599265098572)]\n" + " [(73872, 0.7893393039703369), (67773, 0.7167356014251709), (32802, 0.6937947273254395)]\n" ] } ], "source": [ - "doc_id = np.random.randint(simple_models[0].docvecs.count) # pick random doc; re-run cell for more examples\n", + "doc_id = np.random.randint(simple_models[0].docvecs.count) # Pick random doc; re-run cell for more examples\n", "print('for doc %d...' % doc_id)\n", "for model in simple_models:\n", " inferred_docvec = model.infer_vector(alldocs[doc_id].words)\n", @@ -705,15 +725,15 @@ "name": "stdout", "output_type": "stream", "text": [ - "TARGET (43375): «the film \" chaos \" takes its name from gleick's 1988 pop science explanation of chaos theory . what does the book or anything related to the content of the book have to do with the plot of the movie \" chaos \" ? nothing . 
the film makers seem to have skimmed the book ( obviously without understanding a thing about it ) looking for a \" theme \" to united the series of mundane action sequences that overlie the flimsy string of events that acts in place of a plot in the film . in this respect , the movie \" choas \" resembles the canadian effort \" cube , \" in which prime numbers function as a device to mystify the audience so that the ridiculousness of the plot will not be noticed : in \" cube \" a bunch of prime numbers are tossed in so that viewers will attribute their lack of understanding to lack of knowledge about primes : the same approach is taken in \" chaos \" : disconnected extracts from gleick's books are thrown in make the doings of the bad guy in the film seem fiendishly clever . this , of course , is an insultingly condescending treatment of the audience , and any literate viewer of \" chaos \" who can stand to sit through the entire film will end up bewildered . how could a film so bad be made ? rewritten as a novel , the story in \" chaos \" would probably not even make it past a literary agent's secretary's desk . how could ( at least ) hundreds of thousands ( and probably millions ) of dollars have been thrown away on what can only be considered a waste of time for everyone except those who took home money from the film ? regarding what's in the movie , every performance is phoned in . save for technical glitches , it would be astonishing if more than one take was used for any one scene . the story is uniformly senseless : the last time i saw a story to disconnected it was the production of a literal eight-year-old . among other massive shortcomings are the following : the bad guy leaves hints for the police to follow . he has no reason whatsoever for leaving such hints . police officers do not carry or use radios . dupes of the bad guy have no reason to act in concert with the bad guy . let me strongly recommend that no one watch this film . if there is any other movie you like ( or even simply do not hate ) watch that instead .»\n", + "TARGET (71919): «tweety is perched in his cage on the ledge and sylvester is across the street at the \" bird watching society \" building on about the same level . both are looking through binoculars , and they spot each other . tweety then utters his famous phrase , \" i taught i taw a puddy cat . \" ( thought i saw a pussy cat . ) sylvester scampers over to grab the bird . tweety flies out of his cage and granny comes to the rescue , bashing the cat and driving it away . the rest of the animated short shows a series of attempts by sylvester to grab tweetie - a familiar theme - and how either bad luck or granny thwarts him every time . the cat dons disguises and tries a number of clever schemes . . . all of which are funny and very entertaining . in all , a good cartoon and fun to watch .»\n", "\n", "SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4):\n", "\n", - "MOST (48890, 0.5806792378425598): «asmali konak has arguably become one of the best tv series to come out of turkey . with its unique cinematography and visual approach to filming , the series has gained a wide following base with rating records continuously broken . personally i do not agree with singers becoming actors ( hence , ozcan deniz - the lead actor ) but i guess the figures speak for themselves . in relation to the movie , it was disgusting to see how much someone can destroy such a plotline . 
years in the making , this movie was able to oversee every descent story that existed within the series . not only that , the cultural mistakes were unacceptable , with an idiotic scene involving the family members dancing ( greek style ) and breaking plates , which does not exists anywhere within the turkish culture . some argue the movie should be taken as a stand alone movie not as a continuation of the tv series but this theory has one major fall , the way the movie was marketed was that it will be picking up where the series left off and will conclude the series once and for all . so with that note in mind , me and everyone i know , would have asked for a refund and accepted to stand outside the theatre to warn other victims .»\n", + "MOST (30440, 0.752430260181427): «in tweety's s . o . s , sylvester goes from picking garbage cans to being a stowaway on a cruise ship that happens to carry a certain canary bird-and granny , his owner . uh-oh ! once again , tweety and granny provide many obstacles to the cat's attempts to get the bird . sylvester also gets seasick quite a few times , too . and the second time the red-nosed feline goes to the place on the ship that has something that cures his ailments , tweety replaces it with nitroglycerin . so now sylvester can blow fire ! i'll stop here and say this is another excellent cartoon directed by friz freling starring the popular cat-and-bird duo . tweety's s . o . s is most highly recommended .»\n", "\n", - "MEDIAN (93452, 0.22335509955883026): «this is the second film ( dead men walking ) set in a prison by theasylum . the mythos behind plot is very good , russian mafia has this demon do there dirty work and the rainbow array of inmates have to defend their bars & mortar . jennifer lee ( see interview ) wiggins stars as a prison guard who has a inmate , who maybe a demon . the monster suit is awesome and frightening , and a different look that almost smacks of a toy franchise , hey if full moon and todd mcfarlane can make action figures for any character . . why not the beast from bray road wolfette , shapeshifter with medallion accessory , or the rhett giles everyman hero with removable appendages .»\n", + "MEDIAN (32141, 0.3800385594367981): «my entire family enjoyed this film , including 2 small children . great values without sex , violence , drugs , nudity , or profanity . also no zillion dollar special effects were added to try to misdirect viewers from a poorly written storyline . a simple little family fun movie . we especially like the songs in the movie . but we only got to hear a portion of the songs . . . mostly during the end credits . . . would love to buy a sound track cd from this movie . this is my 4th bill hillman movie and they all have the same guidelines as mentioned above . with all the movies out there that you don't want your kids to watch , this hillman fella has a no risk rating . we love his movies .»\n", "\n", - "LEAST (57989, -0.22353392839431763): «saw this movie on re-run just once , when i was about 13 , in 1980 . it completely matched my teenaged fantasies of sweet , gentle , interesting — and let's face it — hot — \" older \" guys . just ordered it from cd universe about a month ago , and have given it about four whirls in the two weeks since . as somebody mentioned — i'm haunted by it . as somebody else mentioned — i think it's part of a midlife crisis as well ! 
being 39 and realizing how much has changed since those simpler '70s times when girls of 13 actually did take buses and go to malls together and had a lot more freedom away from the confines of modern suburbia makes me sad for my daughter — who is nearly 13 herself . thirteen back then was in many ways a lot more grown up . the film is definitely '70s but not in a super-dated cheesy way , in fact the outfits denise miller as jessie wears could be current now ! you know what they say , everything that goes around . . . although the short-short jogging shorts worn by rex with the to-the-knees sweat socks probably won't make a comeback . the subject matter is handled in a very sensitive way and the characters are treated with a lot of respect . it's not the most chatty movie going — i often wished for more to be said between jessie and michael that would cement why he was also attracted to her . but the acting is solid , the movie is sweet and atmospheric , and the fringe characters give great performances . mary beth manning as jessie's friend caroline is a total hoot — i think we all had friends like her . maia danziger as the relentless flirt with michael gives a wiggy , stoned-out performance that just makes you laugh — because we also all knew girls that acted like that . denise miller knocked her performance out of the ballpark with a very down-to-earth quality likely credited to her uknown status and being new to the industry . and i think not a little of the credit for the film's theatre-grade quality comes from the very capable , brilliant hands of the story's authors , carole and the late bruce hart , who also wrote for sesame street . they really cared about the message of the movie , which was not an overt in-your-face thing , while at the same time understanding how eager many girls are to grow up at that age . one thing that made me love the film then as much as now is not taking the cliché , easy , tied-with-a-bow but sort of let-down ending . in fact it's probably the end that has caused so many women to return to viewing the movie in their later years . re-watching sooner or later has me absolutely sick with nostalgia for those simpler times , and has triggered a ridiculous and sudden obsession with catching up with rex smith — whom while i enjoyed his albums sooner or later and forever when i was young , i never plastered his posters on my walls as i did some of my other faves . in the past week , i've put his music on my ipod , read fan sites , found interviews ( and marveled in just how brilliant he really is — the man has a fascinating way of thinking ) , watched clips on youtube — what am i , 13 ? i guess that's the biggest appeal of this movie . remembering what it was like to be 13 and the whole world was ahead of you .»\n", + "LEAST (57712, -0.051298510283231735): «in a recent biography of burt lancaster , go tell the spartans is described as the best vietnam war film that nobody ever saw . hopefully with television and video products that will be corrected . i prefer to think of it as a prequel to platoon . this film is set in 1964 when america's participation was limited to advisers by this time raised to about 20 , 000 of them by president kennedy . whether if kennedy had lived and won a second term he would have increased our commitment to a half a million men as lyndon johnson did is open to much historical speculation . major burt lancaster heads such an advisory team with his number two captain marc singer . 
they get some replacements and a new assignment to build a fortress where the french tried years ago and failed . the replacements are a really mixed bag , a sergeant who lancaster has served with before and respects highly in jonathan goldsmith , a very green and eager second lieutenant in joe unger , a demolitions man who is a draftee and at that time vietnam service was a strictly volunteer thing in craig wasson , and a medic who is also a junkie in dennis howard . for one reason or another all of these get sent forward to build that outpost in a place that suddenly has acquired military significance . i said before this could be a prequel to platoon . platoon is set in the time a few years later when the usa was fully militarily committed in vietnam . platoon raises the same issues about the futility of that war , but i think go tell the spartans does a much better job . hard to bring your best effort into the fight since who and what you're fighting and fighting for seems to change weekly . originally this project was for william holden and i'm surprised holden passed on it . maybe for the better because lancaster strikes just the right note as the professional soldier in what was a backwater assignment who politics has passed over for promotion . knowing all that you will understand why lancaster makes the final decision he does . two others of note are evan kim who is the head of the south vietnamese regulars and interpreter who lancaster and company are training . he epitomizes the brutality of the struggle for us in a way that we can't appreciate from the other side because we never meet any of the viet cong by name . dolph sweet plays the general in charge of the american vietnam commitment , a general harnitz . he is closest to a real character because the general in charge their before johnson raised the troop levels and put in william westmoreland was paul harkins . joe unger is who i think gives the best performance as the shavetail lieutenant with all the conventional ideas of war and believes we have got to be with the good guys since we are americans . he learns fast that you issue uniforms for a reason and wars against people who don't have them are the most difficult . i think one could get a deep understanding of just what america faced in 1964 in vietnam by watching go tell the spartans .»\n", "\n" ] } @@ -764,70 +784,70 @@ "name": "stdout", "output_type": "stream", "text": [ - "most similar words for 'gymnast' (36 occurences)\n" + "most similar words for 'thrilled' (276 occurences)\n" ] }, { "data": { "text/html": [ - "
Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4)Doc2Vec(dbow,d100,n5,mc2,s0.001,t4)Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4)
[('scientist', 0.530441164970398),
\n", - "('psychotherapist', 0.527083694934845),
\n", - "('parapsychologist', 0.5239906907081604),
\n", - "('cringer', 0.5199892520904541),
\n", - "('samir', 0.5048707127571106),
\n", - "('reporter', 0.49532145261764526),
\n", - "('swimmer', 0.4937909245491028),
\n", - "('thrill-seeker', 0.4905340373516083),
\n", - "('chiara', 0.48281964659690857),
\n", - "('psychiatrist', 0.4788440763950348),
\n", - "('nerd', 0.4779984951019287),
\n", - "('surgeon', 0.47712844610214233),
\n", - "('jock', 0.4741038382053375),
\n", - "('geek', 0.4714686870574951),
\n", - "('mumu', 0.47104766964912415),
\n", - "('painter', 0.4689804017543793),
\n", - "('cheater', 0.4655175805091858),
\n", - "('hypnotist', 0.4645438492298126),
\n", - "('whizz', 0.46407681703567505),
\n", - "('cryptozoologist', 0.4627385437488556)]
[('bang-bang', 0.4289792478084564),
\n", - "('master', 0.41190674901008606),
\n", - "('greenleaf', 0.38207903504371643),
\n", - "('122', 0.3811250925064087),
\n", - "('fingernails', 0.3794997036457062),
\n", - "('cardboard-cutout', 0.3740081787109375),
\n", - "(\"album'\", 0.3706256151199341),
\n", - "('sex-starved', 0.3696949779987335),
\n", - "('creme-de-la-creme', 0.36426788568496704),
\n", - "('destroyed', 0.3638569116592407),
\n", - "('imminent', 0.3612757921218872),
\n", - "('cruisers', 0.3568859398365021),
\n", - "(\"emo's\", 0.35605981945991516),
\n", - "('lavransdatter', 0.3534432649612427),
\n", - "(\"'video'\", 0.3508487641811371),
\n", - "('garris', 0.3507363796234131),
\n", - "('romanzo', 0.3495352268218994),
\n", - "('tombes', 0.3494585454463959),
\n", - "('story-writers', 0.3461073637008667),
\n", - "('georgette', 0.34602558612823486)]
[('ex-marine', 0.5273298621177673),
\n", - "('koichi', 0.5020822882652283),
\n", - "('dorkish', 0.49750325083732605),
\n", - "('fenyö', 0.4765225946903229),
\n", - "('castleville', 0.46756264567375183),
\n", - "('smoorenburg', 0.46484801173210144),
\n", - "('chimp', 0.46456438302993774),
\n", - "('swimmer', 0.46236276626586914),
\n", - "('falcone', 0.4614230990409851),
\n", - "('yak', 0.45991501212120056),
\n", - "('gms', 0.4542686939239502),
\n", - "('iván', 0.4503802955150604),
\n", - "('spidy', 0.4494086503982544),
\n", - "('arnie', 0.44659116864204407),
\n", - "('hobo', 0.4465593695640564),
\n", - "('evelyne', 0.4455353617668152),
\n", - "('pandey', 0.4452363848686218),
\n", - "('hector', 0.4442984461784363),
\n", - "('baboon', 0.44382452964782715),
\n", - "('miao', 0.4437481164932251)]
" + "
Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4)Doc2Vec(dbow,d100,n5,mc2,s0.001,t4)Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4)
[('pleased', 0.8135600090026855),
\n", + "('excited', 0.7601636648178101),
\n", + "('surprised', 0.7497514486312866),
\n", + "('delighted', 0.740871012210846),
\n", + "('impressed', 0.7300887107849121),
\n", + "('disappointed', 0.715817391872406),
\n", + "('shocked', 0.7109759449958801),
\n", + "('intrigued', 0.7000594139099121),
\n", + "('amazed', 0.6994709968566895),
\n", + "('fascinated', 0.6952326893806458),
\n", + "('saddened', 0.68060702085495),
\n", + "('satisfied', 0.674963116645813),
\n", + "('apprehensive', 0.6572576761245728),
\n", + "('entertained', 0.654381275177002),
\n", + "('disgusted', 0.6502282023429871),
\n", + "('overjoyed', 0.6485082507133484),
\n", + "('stunned', 0.6478738784790039),
\n", + "('entranced', 0.6438385844230652),
\n", + "('amused', 0.6437265872955322),
\n", + "('dissappointed', 0.6427538394927979)]
[(\"ifans'\", 0.44280144572257996),
\n", + "('shay', 0.4335209131240845),
\n", + "('crappers', 0.4007232189178467),
\n", + "('overflow', 0.40028804540634155),
\n", + "('yum', 0.3929170072078705),
\n", + "(\"monkey'\", 0.38661277294158936),
\n", + "('kholi', 0.38401469588279724),
\n", + "('fun-bloodbath', 0.38145124912261963),
\n", + "('breathed', 0.373812735080719),
\n", + "(\"eszterhas'\", 0.3729144334793091),
\n", + "('nob', 0.3723628520965576),
\n", + "(\"meatloaf's\", 0.3720172643661499),
\n", + "('ruegger', 0.3683895468711853),
\n", + "(\"haynes'\", 0.36665791273117065),
\n", + "('feigning', 0.36445197463035583),
\n", + "('torches', 0.35865518450737),
\n", + "('sirens', 0.3581739068031311),
\n", + "('insides', 0.35690629482269287),
\n", + "('swackhamer', 0.35603001713752747),
\n", + "('trolls', 0.3526684641838074)]
[('pleased', 0.7576382160186768),
\n", + "('excited', 0.7351139187812805),
\n", + "('delighted', 0.7220871448516846),
\n", + "('intrigued', 0.6748061180114746),
\n", + "('surprised', 0.6552557945251465),
\n", + "('shocked', 0.6505781412124634),
\n", + "('disappointed', 0.6428648233413696),
\n", + "('impressed', 0.6426182389259338),
\n", + "('overjoyed', 0.6259098052978516),
\n", + "('saddened', 0.6148285865783691),
\n", + "('anxious', 0.6140503883361816),
\n", + "('fascinated', 0.6126223802566528),
\n", + "('skeptical', 0.6025052070617676),
\n", + "('suprised', 0.5986943244934082),
\n", + "('upset', 0.596437931060791),
\n", + "('relieved', 0.593376874923706),
\n", + "('psyched', 0.5923721790313721),
\n", + "('captivated', 0.5753644704818726),
\n", + "('astonished', 0.574415922164917),
\n", + "('horrified', 0.5716636180877686)]
" ], "text/plain": [ "" @@ -876,28 +896,20 @@ }, { "cell_type": "code", - "execution_count": 15, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4): 31.50% correct (3154 of 10012)\n", - "Doc2Vec(dbow,d100,n5,mc2,s0.001,t4): 0.00% correct (0 of 10012)\n", - "Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4): 32.24% correct (3228 of 10012)\n" - ] - } - ], + "execution_count": 14, + "metadata": { + "collapsed": true + }, + "outputs": [], "source": [ - "# assuming something like\n", - "# https://word2vec.googlecode.com/svn/trunk/questions-words.txt \n", - "# is in local directory\n", - "# note: this takes many minutes\n", - "for model in word_models:\n", - " sections = model.accuracy('questions-words.txt')\n", - " correct, incorrect = len(sections[-1]['correct']), len(sections[-1]['incorrect'])\n", - " print('%s: %0.2f%% correct (%d of %d)' % (model, float(correct*100)/(correct+incorrect), correct, correct+incorrect))" + "# Download this file: https://github.com/nicholas-leonard/word2vec/blob/master/questions-words.txt\n", + "# and place it in the local directory\n", + "# Note: this takes many minutes\n", + "if os.path.isfile('question-words.txt'):\n", + " for model in word_models:\n", + " sections = model.accuracy('questions-words.txt')\n", + " correct, incorrect = len(sections[-1]['correct']), len(sections[-1]['incorrect'])\n", + " print('%s: %0.2f%% correct (%d of %d)' % (model, float(correct*100)/(correct+incorrect), correct, correct+incorrect))" ] }, { @@ -916,10 +928,8 @@ }, { "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": true - }, + "execution_count": 15, + "metadata": {}, "outputs": [], "source": [ "This cell left intentionally erroneous." @@ -989,7 +999,7 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "Python [default]", "language": "python", "name": "python3" }, @@ -1003,7 +1013,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.4.3" + "version": "3.6.1" } }, "nbformat": 4,