diff --git a/docs/notebooks/ldaseqmodel.ipynb b/docs/notebooks/ldaseqmodel.ipynb new file mode 100644 index 0000000000..50b1465a3f --- /dev/null +++ b/docs/notebooks/ldaseqmodel.ipynb @@ -0,0 +1,853 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Dynamic Topic Models" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Imagine you have a gigantic corpus which spans a couple of years. You want to find semantically similar documents: one from the very beginning of your time-line, and one from the very end. How would you do that?\n", + "This is where Dynamic Topic Models come in. By adding a time-based element to topics, context is preserved while key-words may change.\n", + "\n", + "Dynamic Topic Models are used to model the evolution of topics in a corpus over time. The Dynamic Topic Model is part of a class of probabilistic topic models, like LDA. \n", + "\n", + "While most traditional topic mining algorithms do not expect time-tagged data or take into account any prior ordering, Dynamic Topic Models (DTM) leverage the knowledge of different documents belonging to different time-slices in an attempt to map how the words in a topic change over time.\n", + "\n", + "David Blei does a good job explaining the theory behind this in this [Google talk](https://www.youtube.com/watch?v=7BMsuyBPx90). If you prefer to directly read the [paper on DTM by Blei and Lafferty](http://repository.cmu.edu/cgi/viewcontent.cgi?article=2036&context=compsci), that should get you up to speed too." + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Motivation" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "But - why even undertake this, especially when Gensim itself already has a wrapper?\n", + "The main motivation was the lack of documentation in the original code - and the fact that a pure python version makes it easier to use gensim building blocks. For example, for setting up the Sufficient Statistics to initialize the DTM, you can just pass a pre-trained gensim LDA model!\n", + "\n", + "There is some clarity on how they built their code now - Variational Inference using Kalman Filters. I've tried to make things as clear as possible in the code, but it still needs some polishing. \n", + "\n", + "Any help through PRs would be greatly appreciated!\n", + "\n", + "I have been regularly blogging about my progress with implementing this, which you can find [here](http://rare-technologies.com/author/bhargav/)." + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Use Case " + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you have seen the video or read the paper, its use case should be pretty clear - and the example of modelling it on Science research papers gives us some pretty interesting results. It was used not only to track how various themes of research such as Physics or Neuroscience evolved over the decades, but also to identify similar documents in a way not many other modelling algorithms can. While words may change over time, the fact that DTM can identify topics over time can help us find semantically similar documents over a long time-period.\n", + "\n", + "[This](http://rare-technologies.com/understanding-and-coding-dynamic-topic-models/) blog post is also useful in breaking down the ideas and theory behind DTM."
+ ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Using LdaSeqModel for DTM" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Gensim already has a wrapper for the original C++ DTM code, but the `LdaSeqModel` class is an effort to have a pure python implementation of the same.\n", + "Using it is very similar to using any other gensim topic-modelling algorithm; all you need to start is an iterable gensim corpus, an id2word mapping and a list with the number of documents in each of your time-slices." + ] + },
+ { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "# setting up our imports\n", + "\n", + "from gensim.models import ldaseqmodel\n", + "from gensim.corpora import Dictionary, bleicorpus\n", + "import numpy\n", + "from gensim.matutils import hellinger" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We will be loading the corpus and dictionary from disk. Here our corpus is in the Blei corpus format, but it can be any iterable corpus.\n", + "The data set here consists of news reports over 3 months downloaded from here and cleaned. \n", + "\n", + "TODO: better, more interesting data-set.\n", + "\n", + "### What is a time-slice?\n", + "A very important input for DTM to work is the `time_slice` input. It should be a list which contains the number of documents in each time-slice. In our case, the first month had 438 articles, the second 430 and the last month had 456 articles. This means we'd need an input which looks like this: `time_slice = [438, 430, 456]`. \n", + "\n", + "Once you have your corpus, id2word and time_slice ready, we're good to go!" + ] + },
+ { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# loading our corpus and dictionary\n", + "dictionary = Dictionary.load('Corpus/news_dictionary')\n", + "corpus = bleicorpus.BleiCorpus('Corpus/news_corpus')\n", + "# it's very important that your corpus is saved in order of your time-slices!\n", + "\n", + "time_slice = [438, 430, 456]" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For DTM to work, it first needs the Sufficient Statistics from an LDA model trained on the _*same*_ dataset. \n", + "By default LdaSeqModel trains its own model and passes those values on, but it can also accept a pre-trained gensim LDA model, or a numpy matrix which contains the sufficient statistics.\n", + "\n", + "We will be training our model in default mode, so LDA will first be performed on the dataset. The `passes` parameter controls how many passes the initial LdaModel makes over the corpus."
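Since the cell above mentions that a pre-trained LDA model or a raw sufficient-statistics matrix can be supplied instead of the default mode, here is a minimal, hypothetical sketch of what those two alternative initialisations could look like (the variable names are illustrative; `initialize`, `lda_model` and `sstats` are the constructor arguments defined in `ldaseqmodel.py` further below):

```python
from gensim.models import ldamodel, ldaseqmodel
import numpy

# 1) reuse a previously trained gensim LDA model as the initialisation
lda = ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=5, passes=20)
ldaseq_from_lda = ldaseqmodel.LdaSeqModel(
    corpus=corpus, id2word=dictionary, time_slice=time_slice,
    num_topics=5, initialize='ldamodel', lda_model=lda)

# 2) pass the sufficient statistics directly as a numpy matrix of
#    shape (vocab_len, num_topics), i.e. the transpose of lda.state.sstats
sstats = numpy.transpose(lda.state.sstats)
ldaseq_from_sstats = ldaseqmodel.LdaSeqModel(
    corpus=corpus, id2word=dictionary, time_slice=time_slice,
    num_topics=5, initialize='own', sstats=sstats)
```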
+ ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "collapsed": false, + "scrolled": true + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/bhargavvader/Open_Source/gensim/gensim/models/ldaseqmodel.py:237: RuntimeWarning: divide by zero encountered in double_scalars\n", + " convergence = numpy.fabs((bound - old_bound) / old_bound)\n" + ] + } + ], + "source": [ + "ldaseq = ldaseqmodel.LdaSeqModel(corpus=corpus, id2word=dictionary, time_slice=time_slice, num_topics=5, passes=20)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now that our model is trained, let's see what our results look like.\n", + "\n", + "### Results\n", + "Much like LDA, the points of interest would be in what the topics are and how the documents are made up of these topics.\n", + "In DTM we have the added interest of seeing how these topics evolve over time.\n", + "\n", + "Let's go through some of the functions to print Topics and analyse documents." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "data": { + "text/plain": [ + "[[(0.01, 'best'),\n", + " (0.0060000000000000001, 'film'),\n", + " (0.0050000000000000001, 'music'),\n", + " (0.0050000000000000001, 'last'),\n", + " (0.0040000000000000001, 'number'),\n", + " (0.0040000000000000001, 'tv'),\n", + " (0.0040000000000000001, 'show'),\n", + " (0.0040000000000000001, 'top'),\n", + " (0.0030000000000000001, 'uk'),\n", + " (0.0030000000000000001, 'first'),\n", + " (0.0030000000000000001, 'year'),\n", + " (0.0030000000000000001, 'band'),\n", + " (0.002, 'award'),\n", + " (0.002, 'million'),\n", + " (0.002, 'record'),\n", + " (0.002, 'three'),\n", + " (0.002, 'sales'),\n", + " (0.002, 'bbc'),\n", + " (0.002, 'including'),\n", + " (0.002, 'british')],\n", + " [(0.0040000000000000001, 'mobile'),\n", + " (0.0030000000000000001, 'technology'),\n", + " (0.0030000000000000001, 'use'),\n", + " (0.0030000000000000001, 'last'),\n", + " (0.002, 'market'),\n", + " (0.002, 'firm'),\n", + " (0.002, 'firms'),\n", + " (0.002, 'net'),\n", + " (0.002, 'much'),\n", + " (0.002, 'phone'),\n", + " (0.002, 'year'),\n", + " (0.002, 'make'),\n", + " (0.002, 'companies'),\n", + " (0.002, 'uk'),\n", + " (0.002, 'digital'),\n", + " (0.002, 'european'),\n", + " (0.002, 'economic'),\n", + " (0.002, 'company'),\n", + " (0.002, 'growth'),\n", + " (0.002, 'government')],\n", + " [(0.0040000000000000001, 'think'),\n", + " (0.0040000000000000001, 'club'),\n", + " (0.0040000000000000001, 'like'),\n", + " (0.0040000000000000001, 'want'),\n", + " (0.0030000000000000001, \"don't\"),\n", + " (0.0030000000000000001, 'game'),\n", + " (0.0030000000000000001, 'football'),\n", + " (0.0030000000000000001, 'last'),\n", + " (0.0030000000000000001, 'make'),\n", + " (0.0030000000000000001, 'way'),\n", + " (0.0030000000000000001, 'go'),\n", + " (0.0030000000000000001, 'time'),\n", + " (0.0030000000000000001, 'real'),\n", + " (0.002, 'players'),\n", + " (0.002, 'bbc'),\n", + " (0.002, 'going'),\n", + " (0.002, 'know'),\n", + " (0.002, 'manager'),\n", + " (0.002, 'liverpool'),\n", + " (0.002, 'got')],\n", + " [(0.0050000000000000001, 'government'),\n", + " (0.0040000000000000001, 'blair'),\n", + " (0.0040000000000000001, 'labour'),\n", + " (0.0030000000000000001, 'minister'),\n", + " (0.0030000000000000001, 'security'),\n", + " (0.0030000000000000001, 'public'),\n", + " (0.0030000000000000001, 'prime'),\n", + " (0.0030000000000000001, 'election'),\n", 
+ " (0.0030000000000000001, 'party'),\n", + " (0.002, 'brown'),\n", + " (0.002, 'search'),\n", + " (0.002, 'make'),\n", + " (0.002, 'users'),\n", + " (0.002, 'howard'),\n", + " (0.002, 'bbc'),\n", + " (0.002, 'lord'),\n", + " (0.002, 'say'),\n", + " (0.002, 'home'),\n", + " (0.002, 'tory'),\n", + " (0.002, 'secretary')],\n", + " [(0.0050000000000000001, 'game'),\n", + " (0.0050000000000000001, 'chelsea'),\n", + " (0.0050000000000000001, 'cup'),\n", + " (0.0040000000000000001, 'first'),\n", + " (0.0040000000000000001, 'win'),\n", + " (0.0040000000000000001, 'united'),\n", + " (0.0040000000000000001, 'games'),\n", + " (0.0030000000000000001, 'league'),\n", + " (0.0030000000000000001, 'arsenal'),\n", + " (0.0030000000000000001, 'last'),\n", + " (0.0030000000000000001, 'home'),\n", + " (0.0030000000000000001, 'play'),\n", + " (0.0030000000000000001, 'players'),\n", + " (0.0030000000000000001, 'good'),\n", + " (0.0030000000000000001, 'side'),\n", + " (0.0030000000000000001, 'world'),\n", + " (0.0030000000000000001, 'goal'),\n", + " (0.002, 'ball'),\n", + " (0.002, 'second'),\n", + " (0.002, 'champions')]]" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# to print all topics, use `print_topics`. \n", + "# the input parameter to `print_topics` is only a time-slice option. By passing `0` we are seeing the topics in the 1st time-slice.\n", + "ldaseq.print_topics(time=0)" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "data": { + "text/plain": [ + "[[(0.0040000000000000001, 'mobile'),\n", + " (0.0030000000000000001, 'technology'),\n", + " (0.0030000000000000001, 'use'),\n", + " (0.0030000000000000001, 'last'),\n", + " (0.002, 'market'),\n", + " (0.002, 'firm'),\n", + " (0.002, 'firms'),\n", + " (0.002, 'net'),\n", + " (0.002, 'much'),\n", + " (0.002, 'phone'),\n", + " (0.002, 'year'),\n", + " (0.002, 'make'),\n", + " (0.002, 'companies'),\n", + " (0.002, 'uk'),\n", + " (0.002, 'digital'),\n", + " (0.002, 'european'),\n", + " (0.002, 'economic'),\n", + " (0.002, 'company'),\n", + " (0.002, 'growth'),\n", + " (0.002, 'government')],\n", + " [(0.0030000000000000001, 'mobile'),\n", + " (0.0030000000000000001, 'technology'),\n", + " (0.0030000000000000001, 'use'),\n", + " (0.002, 'market'),\n", + " (0.002, 'last'),\n", + " (0.002, 'firms'),\n", + " (0.002, 'firm'),\n", + " (0.002, 'phone'),\n", + " (0.002, 'much'),\n", + " (0.002, 'net'),\n", + " (0.002, 'make'),\n", + " (0.002, 'year'),\n", + " (0.002, 'digital'),\n", + " (0.002, 'uk'),\n", + " (0.002, 'companies'),\n", + " (0.002, 'economic'),\n", + " (0.002, 'european'),\n", + " (0.002, 'company'),\n", + " (0.002, 'growth'),\n", + " (0.002, 'broadband')],\n", + " [(0.0040000000000000001, 'mobile'),\n", + " (0.0030000000000000001, 'technology'),\n", + " (0.0030000000000000001, 'use'),\n", + " (0.0030000000000000001, 'market'),\n", + " (0.002, 'phone'),\n", + " (0.002, 'firms'),\n", + " (0.002, 'last'),\n", + " (0.002, 'much'),\n", + " (0.002, 'firm'),\n", + " (0.002, 'make'),\n", + " (0.002, 'net'),\n", + " (0.002, 'digital'),\n", + " (0.002, 'year'),\n", + " (0.002, 'uk'),\n", + " (0.002, 'companies'),\n", + " (0.002, 'broadband'),\n", + " (0.002, 'economic'),\n", + " (0.002, 'european'),\n", + " (0.002, 'company'),\n", + " (0.002, 'next')]]" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# to fix a topic and see it evolve, use 
`print_topic_times`\n", + "\n", + "ldaseq.print_topic_times(topic=1) # evolution of topic number 1" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you look at the lower frequencies, the word broadband is creeping up into prominence in topic number 1. \n", + "We've had our fun looking at topics; now let us see how to analyse documents.\n", + "\n", + "### Doc-Topics\n", + "The function `doc_topics` checks the topic proportions of documents already trained on. It accepts the document number in the corpus as an input.\n", + "\n", + "Let's pick document number 558 arbitrarily and have a look." + ] + },
+ { + "cell_type": "code", + "execution_count": 37, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['set', 'time,\"', 'chairman', 'decision', 'news', 'director', 'former', 'vowed', '\"it', 'results', 'club', 'third', 'home', 'paul', 'saturday.', 'south', 'conference', 'leading', '\"some', 'survival', 'needed', 'coach', \"don't\", 'every', 'trouble', 'desperate', 'eight', 'first', 'win', 'going', 'park', 'near', 'chance', 'manager', 'league', 'milan', 'games', 'go', 'game', 'foot', 'say', 'upset', \"i'm\", 'poor', 'season.', 'executive', 'road', '24', 'debut', 'portsmouth.', 'give', 'claiming', 'steve', 'break', 'rivals', 'boss', 'kevin', 'premiership', 'little', 'left', 'table.', 'life', 'join', 'years.', 'bring', 'season,', 'director.', 'became', 'st', 'according', 'official', 'hope', 'shocked', 'though', 'phone', 'charge', '14', 'website.', 'time,', 'claimed', 'kept', 'bond', 'appointment', 'unveil', 'november', 'picked', 'confirmed,', 'believed', 'deep', 'position', 'surprised', 'negotiations', 'talks', 'gmt', 'middlesbrough', 'replaced', 'appear', 'football,', '\"i\\'m', 'charge.', 'saints', 'southampton', 'sturrock', 'wednesday.', 'harry', 'poised', 'ninth', 'quit', 'relieved', 'chance.\"', 'decision.\"', 'hero', 'redknapp,', 'redknapp', \"saints'\", 'first-team', \"wouldn't\", \"mary's.\", 'portsmouth', \"redknapp's\", 'pompey', 'academy', \"harry's\", 'cult', 'rupert', 'time\".', 'coast', '57,', 'succeed', 'duties', \"'i\", 'bitter,', \"mandaric's\", \"portsmouth's\", 'wigley,', 'wigley', \"southampton',\", '1500', 'mandaric', \"'absolutely\", 'lowe', '\"disappointed\"', 'velimir', 'not\\',\"', 'disgusted', 'disappointed,', 'mandaric,', 'fratton', 'replaces', 'masterminding', 'angry,', 'vowed:', 'informed.\"', 'zajec']\n" + ] + } + ], + "source": [ + "# to check Document - Topic proportions, use `doc_topics`\n", + "words = [dictionary[word_id] for word_id, count in ldaseq.corpus.corpus[558]]\n", + "print (words)" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It's pretty clear that it's a news article about football. What topics will it likely be comprised of?" + ] + },
+ { + "cell_type": "code", + "execution_count": 49, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[ 5.46298825e-05 5.46298825e-05 7.24590312e-01 5.46298825e-05\n", + " 2.75245799e-01]\n" + ] + } + ], + "source": [ + "doc_1 = ldaseq.doc_topics(558) # check the topic distribution of the 558th document in the corpus\n", + "print (doc_1)" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It's largely made up of topics 3 and 5 - and if we go back and inspect our topics, it's quite a good match."
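To line those proportions up with the topic words without scrolling back, a small helper along these lines could be used (an illustrative sketch, not a cell from the notebook; it only relies on the `doc_topics` and `print_topic` methods defined in `ldaseqmodel.py` below):

```python
# pair each topic's proportion in document 558 with that topic's top words,
# so the dominant topics can be checked at a glance
doc_proportions = ldaseq.doc_topics(558)
for topic_id, proportion in enumerate(doc_proportions):
    if proportion > 0.1:  # only show the dominant topics
        top_words = [word for _, word in ldaseq.print_topic(topic_id, time=0, top_terms=5)]
        print(topic_id, round(float(proportion), 3), top_words)
```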
+ ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If we wish to analyse a document not in our training set, we can simply pass the doc to the model, similar to the `__getitem__` function of `LdaModel`.\n", + "\n", + "Let our document be a hypothetical news article about the effects of Ryan Giggs buying mobiles on the British economy." + ] + },
+ { + "cell_type": "code", + "execution_count": 53, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[ 0.00110497 0.65302349 0.34366159 0.00110497 0.00110497]\n" + ] + } + ], + "source": [ + "doc_2 = ['economy', 'bank', 'mobile', 'phone', 'markets', 'buy', 'football', 'united', 'giggs']\n", + "doc_2 = dictionary.doc2bow(doc_2)\n", + "doc_2 = ldaseq[doc_2]\n", + "print (doc_2)" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Pretty neat! Topics 2 and 3 are about technology, the market and football, so this works well for us." + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Distances between documents\n", + "\n", + "One of the handier uses of DTM topic modelling is that we can compare documents across different time-frames and see how similar they are topic-wise. This is very useful when the words themselves may not overlap across these time-periods.\n", + "\n", + "The current dataset doesn't provide enough diversity for this to be an effective example, but we will nevertheless illustrate how to do it." + ] + },
+ { + "cell_type": "code", + "execution_count": 54, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "data": { + "text/plain": [ + "0.69071218819511226" + ] + }, + "execution_count": 54, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "hellinger(doc_1, doc_2)" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Hellinger distance lies between 0 and 1, where 0 means two distributions are identical; since our two documents are dominated by different topics, we get a fairly large distance here.\n", + "For more information on how to use the gensim distance metrics, check out [this notebook](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/distance_metrics.ipynb)." + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Performance\n", + "\n", + "The code currently runs 5 to 7 times slower than the original C++ DTM code. The bottleneck is in the scipy `optimize.fmin_cg` method for updating obs. Speeding this up would help a lot!\n", + "\n", + "Since it uses iterable gensim corpora, the memory footprint is also smaller, so the corpus size doesn't matter." + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The advantage of the python port is that, unlike the C++ code, we needn't treat it like a black-box; PRs to help make the code better are welcome, as is help to make the documentation clearer and to improve performance. It is also pure python and doesn't need any dependency outside of what gensim already needs. The added ability to analyse new documents is also a plus!\n", + "\n", + "### DTM wrapper comparison\n", + "Let's now compare these results with the DTM wrapper."
+ ] + }, + { + "cell_type": "code", + "execution_count": 56, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from gensim.models.wrappers.dtmmodel import DtmModel\n", + "\n", + "\n", + "dtm_path = \"/Users/bhargavvader/Downloads/dtm_release/dtm/main\"\n", + "dtm_model = DtmModel(dtm_path, corpus, time_slice, num_topics=5, id2word=dictionary, initialize_lda=True)\n", + "dtm_model.save('dtm_news')\n", + "ldaseq.save('ldaseq_news')\n", + "\n", + "# if we've saved before simply load the model\n", + "dtm_model = DtmModel.load('dtm_news')" + ] + }, + { + "cell_type": "code", + "execution_count": 58, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# setting up the DTM wrapper for \n", + "\n", + "from gensim import matutils\n", + "num_topics = 5\n", + "topic_term = dtm_model.lambda_[:,:,0] # the lambda matrix contains \n", + "\n", + "def validate(topic_term):\n", + " topic_term = numpy.exp(topic_term)\n", + " topic_term = topic_term / topic_term.sum()\n", + " topic_term = topic_term * num_topics\n", + " return topic_term\n", + "\n", + "def get_topics(topic_terms, topic_number):\n", + " topic_terms = topic_terms[topic_number]\n", + " bestn = matutils.argsort(topic_terms, 20, reverse=True)\n", + " beststr = [dictionary[id_] for id_ in bestn]\n", + " return beststr\n", + "\n", + "topic_term = validate(topic_term)\n", + "# next is doc_topic_dist\n", + "doc_topic = dtm_model.gamma_\n", + "# next is the vocabulary, which we already have\n", + "\n", + "vocab = []\n", + "for i in range(0, len(dictionary)):\n", + " vocab.append(dictionary[i])\n", + "\n", + "# we now need term-frequency and doc_lengths\n", + "\n", + "def term_frequency(corpus, dictionary):\n", + " term_frequency = [0] * len(dictionary)\n", + " doc_lengths = []\n", + " for doc in corpus:\n", + " doc_lengths.append(len(doc))\n", + " for pair in doc:\n", + " term_frequency[pair[0]] += pair[1]\n", + " return term_frequency, doc_lengths\n", + "\n", + "topics_wrapper = []\n", + "for i in range(0, num_topics):\n", + " topics_wrapper.append(get_topics(topic_term, i))\n", + " \n", + " \n", + "term_frequency, doc_lengths = term_frequency(corpus, dictionary)" + ] + }, + { + "cell_type": "code", + "execution_count": 59, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "\n", + "
\n", + "" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 59, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import pyLDAvis\n", + "vis_wrapper = pyLDAvis.prepare(topic_term_dists=topic_term, doc_topic_dists=doc_topic, doc_lengths=doc_lengths, vocab=vocab, term_frequency=term_frequency)\n", + "pyLDAvis.display(vis_wrapper)" + ] + }, + { + "cell_type": "code", + "execution_count": 60, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n", + "\n", + "\n", + "
\n", + "" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 60, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# now let us visualize the DTM python port.\n", + "\n", + "# getting a list of just words for each topics\n", + "dtm_tp = ldaseq.print_topics()\n", + "dtm_topics = []\n", + "for topic in dtm_tp:\n", + " topics = []\n", + " for prob, word in topic:\n", + " topics.append(word)\n", + " dtm_topics.append(topics)\n", + " \n", + "# getting dtm python doc-topic proportions\n", + "doc_topic = numpy.copy(ldaseq.gammas)\n", + "doc_topic /= doc_topic.sum(axis=1)[:, numpy.newaxis]\n", + "\n", + "# getting dtm topic_word proportions for first time_slice\n", + "def get_topic_term(ldaseq, topic, time=0):\n", + " topic = numpy.transpose(ldaseq.topic_chains[topic].e_log_prob)\n", + " topic = topic[time]\n", + " topic = numpy.exp(topic)\n", + " topic = topic / topic.sum()\n", + " return topic\n", + "\n", + "# get_topic_term(ldaseq, 0).shape\n", + "topic_term =numpy.array(numpy.split(numpy.concatenate((get_topic_term(ldaseq, 0), get_topic_term(ldaseq, 1), get_topic_term(ldaseq, 2), get_topic_term(ldaseq, 3), get_topic_term(ldaseq, 4))), 5))\n", + "vis_dtm = pyLDAvis.prepare(topic_term_dists=topic_term, doc_topic_dists=doc_topic, doc_lengths=doc_lengths, vocab=vocab, term_frequency=term_frequency)\n", + "pyLDAvis.display(vis_dtm)" + ] + }, + { + "cell_type": "code", + "execution_count": 61, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "-1.8017627313\n", + "-1.93767499394\n", + "0.474753976324\n", + "0.511561156095\n" + ] + } + ], + "source": [ + "from gensim.models.coherencemodel import CoherenceModel\n", + "import pickle\n", + "\n", + "\n", + "cm_wrapper = CoherenceModel(topics=topics_wrapper, corpus=corpus, dictionary=dictionary, coherence='u_mass')\n", + "cm_DTM = CoherenceModel(topics=dtm_topics, corpus=corpus, dictionary=dictionary, coherence='u_mass')\n", + "\n", + "print (cm_wrapper.get_coherence())\n", + "print (cm_DTM.get_coherence())\n", + "\n", + "# to use 'c_v' we need texts, which we have saved to disk.\n", + "texts = pickle.load(open('Corpus/texts', 'rb'))\n", + "cm_wrapper = CoherenceModel(topics=topics_wrapper, texts=texts, dictionary=dictionary, coherence='c_v')\n", + "cm_DTM = CoherenceModel(topics=dtm_topics, texts=texts, dictionary=dictionary, coherence='c_v')\n", + "\n", + "print (cm_wrapper.get_coherence())\n", + "print (cm_DTM.get_coherence())" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "collapsed": true + }, + "source": [ + "So while `u_mass` coherence prefers the wrapper topics, `c_v` seems to favor our python port better. :)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Conclusion\n", + "\n", + "So while there is already a python wrapper of DTM, a pure python implementation will be useful to better understand what goes on undcer the hood and better the code. When it comes to performance, the C++ is undoubtably faster, but we can continue to work on ours to make it as fast.\n", + "As for evaluating the results, our topics are on par if not better than the wrapper!" 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.5.2" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/gensim/models/ldaseqmodel.py b/gensim/models/ldaseqmodel.py new file mode 100644 index 0000000000..2ea9d47cc1 --- /dev/null +++ b/gensim/models/ldaseqmodel.py @@ -0,0 +1,1108 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- +# +# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html +# Based on Copyright (C) 2016 Radim Rehurek + + +""" + +Inspired by the Blei's original DTM code and paper. +Original DTM C/C++ code: https://github.com/blei-lab/dtm +DTM Paper: https://www.cs.princeton.edu/~blei/papers/BleiLafferty2006a.pdf + + +TODO: +The next steps to take this forward would be: + + 1) Include DIM mode. Most of the infrastructure for this is in place. + 2) See if LdaPost can be replaced by LdaModel completely without breaking anything. + 3) Heavy lifting going on in the sslm class - efforts can be made to cythonise mathematical methods. + - in particular, update_obs and the optimization takes a lot time. + 4) Try and make it distributed, especially around the E and M step. + +""" + +from gensim import utils, matutils +from gensim.models import ldamodel +import numpy +from scipy.special import digamma, gammaln +from scipy import optimize +import logging + +logger = logging.getLogger('gensim.models.ldaseqmodel') + + +class seq_corpus(utils.SaveLoad): + + """ + `seq_corpus` is basically a wrapper class which contains information about the corpus. + `vocab_len` is the length of the vocabulary. + `max_doc_len` is the maximum number of terms a single document has. + `num_time_slices` is the number of sequences, i.e number of time-slices. + `corpus_len` is the number of documents present. + `time_slice` is a list or numpy array which the user must provide which contains the number of documents in each time-slice. + `corpus` is any iterable gensim corpus. + + """ + def __init__(self, corpus=None, time_slice=None, id2word=None): + self.id2word = id2word + if corpus is None and self.id2word is None: + raise ValueError('at least one of corpus/id2word must be specified, to establish input space dimensionality') + + if self.id2word is None: + logger.warning("no word id mapping provided; initializing from corpus, assuming identity") + self.id2word = utils.dict_from_corpus(corpus) + self.vocab_len = len(self.id2word) + elif len(self.id2word) > 0: + self.vocab_len = len(self.id2word) + else: + self.vocab_len = 0 + + self.corpus = corpus + if self.corpus is not None: + try: + self.corpus_len = len(corpus) + except: + logger.warning("input corpus stream has no len(); counting documents") + self.corpus_len = sum(1 for _ in corpus) + + self.time_slice = time_slice + if self.time_slice is not None: + self.num_time_slices = len(time_slice) + + max_doc_len = 0 + for line_no, line in enumerate(corpus): + if len(line) > max_doc_len: + max_doc_len = len(line) + self.max_doc_len = max_doc_len + +# endclass seq_corpus + + +class LdaSeqModel(utils.SaveLoad): + """ + The constructor estimates Dynamic Topic Model parameters based + on a training corpus. 
+ If we have 30 documents, with 5 in the first time-slice, 10 in the second, and 15 in the third, we would + set up our model like this: + + >>> ldaseq = LdaSeqModel(corpus=corpus, time_slice= [5, 10, 15], num_topics=5) + + Model persistency is achieved through inheriting utils.SaveLoad. + + >>> ldaseq.save("ldaseq") + + saves the model to disk. + """ + + def __init__(self, corpus=None, time_slice=None, id2word=None, alphas=0.01, num_topics=10, + initialize='gensim', sstats=None, lda_model=None, obs_variance=0.5, chain_variance=0.005, passes=10, + random_state=None, lda_inference_max_iter=25, em_min_iter=6, em_max_iter=20): + """ + `corpus` is any iterable gensim corpus + + `time_slice` as described above is a list which contains the number of documents in each time-slice + + `id2word` is a mapping from word ids (integers) to words (strings). It is used to determine the vocabulary size and printing topics. + + `alphas` is a prior of your choice and should be a double or float value. default is 0.01 + + `num_topics` is the number of requested latent topics to be extracted from the training corpus. + + `initalize` allows the user to decide how he wants to initialise the DTM model. Default is through gensim LDA. + You can use your own sstats of an LDA model previously trained as well by specifying 'own' and passing a numpy matrix through sstats. + If you wish to just pass a previously used LDA model, pass it through `lda_model` + Shape of sstats is (vocab_len, num_topics) + + `chain_variance` is a constant which dictates how the beta values evolve - it is a gaussian parameter defined in the + beta distribution. + + `passes` is the number of passes of the initial LdaModel. + + `random_state` can be a numpy.random.RandomState object or the seed for one, for the LdaModel. + """ + if corpus is not None: + self.corpus = seq_corpus(corpus=corpus, id2word=id2word, time_slice=time_slice) + self.vocab_len = len(self.corpus.id2word) + + self.time_slice = time_slice + self.num_topics = num_topics + self.num_time_slices = len(time_slice) + self.alphas = numpy.full(num_topics, alphas) + + # topic_chains contains for each topic a 'state space language model' object which in turn has information about each topic + # the sslm class is described below and contains information on topic-word probabilities and doc-topic probabilities. + self.topic_chains = [] + for topic in range(0, num_topics): + sslm_ = sslm(num_time_slices=self.num_time_slices, vocab_len=self.vocab_len, num_topics=self.num_topics, chain_variance=chain_variance, obs_variance=obs_variance) + self.topic_chains.append(sslm_) + + # the following are class variables which are to be integrated during Document Influence Model + self.top_doc_phis = None + self.influence = None + self.renormalized_influence = None + self.influence_sum_lgl = None + + # if a corpus and time_slice is provided, depending on the user choice of initializing LDA, we start DTM. 
+ if self.corpus is not None and time_slice is not None: + if initialize == 'gensim': + lda_model = ldamodel.LdaModel(corpus, id2word=self.corpus.id2word, num_topics=self.num_topics, passes=passes, alpha=self.alphas, random_state=random_state) + self.sstats = numpy.transpose(lda_model.state.sstats) + if initialize == 'ldamodel': + self.sstats = numpy.transpose(lda_model.state.sstats) + if initialize == 'own': + self.sstats = sstats + + # initialize model from sstats + self.init_ldaseq_ss(chain_variance, obs_variance, self.alphas, self.sstats) + + # fit DTM + self.fit_lda_seq(self.corpus, lda_inference_max_iter, em_min_iter, em_max_iter) + + + def init_ldaseq_ss(self, topic_chain_variance, topic_obs_variance, alpha, init_suffstats): + """ + Method to initialize State Space Language Model, topic wise. + """ + self.alphas = alpha + for k, chain in enumerate(self.topic_chains): + sstats = init_suffstats[:, k] + sslm.sslm_counts_init(chain, topic_obs_variance, topic_chain_variance, sstats) + + # initialize the below matrices only if running DIM + # ldaseq.topic_chains[k].w_phi_l = numpy.zeros((ldaseq.vocab_len, ldaseq.num_time_slices)) + # ldaseq.topic_chains[k].w_phi_sum = numpy.zeros((ldaseq.vocab_len, ldaseq.num_time_slices)) + # ldaseq.topic_chains[k].w_phi_sq = numpy.zeros((ldaseq.vocab_len, ldaseq.num_time_slices)) + + + def fit_lda_seq(self, seq_corpus, lda_inference_max_iter, em_min_iter, em_max_iter): + """ + fit an lda sequence model: + + for each time period + set up lda model with E[log p(w|z)] and \alpha + for each document + perform posterior inference + update sufficient statistics/likelihood + + maximize topics + + """ + LDASQE_EM_THRESHOLD = 1e-4 + # if bound is low, then we increase iterations. + LOWER_ITER = 10 + ITER_MULT_LOW = 2 + MAX_ITER = 500 + + num_topics = self.num_topics + vocab_len = self.vocab_len + data_len = seq_corpus.num_time_slices + corpus_len = seq_corpus.corpus_len + + bound = 0 + convergence = LDASQE_EM_THRESHOLD + 1 + iter_ = 0 + + while iter_ < em_min_iter or ((convergence > LDASQE_EM_THRESHOLD) and iter_ <= em_max_iter): + + logger.info(" EM iter ", iter_) + logger.info("E Step") + # TODO: bound is initialized to 0 + old_bound = bound + + # initiate sufficient statistics + topic_suffstats = [] + for topic in range(0, num_topics): + topic_suffstats.append(numpy.resize(numpy.zeros(vocab_len * data_len), (vocab_len, data_len))) + + # set up variables + gammas = numpy.resize(numpy.zeros(corpus_len * num_topics), (corpus_len, num_topics)) + lhoods = numpy.resize(numpy.zeros(corpus_len * num_topics + 1), (corpus_len, num_topics + 1)) + # compute the likelihood of a sequential corpus under an LDA + # seq model and find the evidence lower bound. This is the E - Step + bound, gammas = self.lda_seq_infer(seq_corpus, topic_suffstats, gammas, lhoods, iter_, lda_inference_max_iter) + self.gammas = gammas + + logger.info("M Step") + + # fit the variational distribution. This is the M - Step + topic_bound = self.fit_lda_seq_topics(topic_suffstats) + bound += topic_bound + + if ((bound - old_bound) < 0): + # if max_iter is too low, increase iterations. 
+ if lda_inference_max_iter < LOWER_ITER: + lda_inference_max_iter *= ITER_MULT_LOW + logger.info("Bound went down, increasing iterations to", lda_inference_max_iter) + + # check for convergence + convergence = numpy.fabs((bound - old_bound) / old_bound) + + if convergence < LDASQE_EM_THRESHOLD: + + lda_inference_max_iter = MAX_ITER + logger.info("Starting final iterations, max iter is", lda_inference_max_iter) + convergence = 1.0 + + logger.info(iter_, "iteration lda seq bound is", bound, ", convergence is ", convergence) + + iter_ += 1 + + return bound + + + def lda_seq_infer(self, seq_corpus, topic_suffstats, gammas, lhoods, iter_, lda_inference_max_iter): + """ + Inference or E- Step. + This is used to set up the gensim LdaModel to be used for each time-slice. + It also allows for Document Influence Model code to be written in. + """ + num_topics = self.num_topics + vocab_len = self.vocab_len + bound = 0.0 + + lda = ldamodel.LdaModel(num_topics=num_topics, alpha=self.alphas, id2word=seq_corpus.id2word) + lda.topics = numpy.array(numpy.split(numpy.zeros(vocab_len * num_topics), vocab_len)) + ldapost = LdaPost(max_doc_len=seq_corpus.max_doc_len, num_topics=num_topics, lda=lda) + + model = "DTM" + if model == "DTM": + bound, gammas = self.inferDTMseq(seq_corpus, topic_suffstats, gammas, lhoods, lda, ldapost, iter_, bound, lda_inference_max_iter) + elif model == "DIM": + self.InfluenceTotalFixed(seq_corpus) + bound, gammas = self.inferDIMseq(seq_corpus, topic_suffstats, gammas, lhoods, lda, ldapost, iter_, bound, lda_inference_max_iter) + + return bound, gammas + + + def inferDTMseq(self, seq_corpus, topic_suffstats, gammas, lhoods, lda, ldapost, iter_, bound, lda_inference_max_iter): + """ + Computes the likelihood of a sequential corpus under an LDA seq model, and return the likelihood bound. + Need to pass the LdaSeq model, seq_corpus, sufficient stats, gammas and lhoods matrices previously created, + and LdaModel and LdaPost class objects. + """ + doc_index = 0 # overall doc_index in corpus + time = 0 # current time-slice + doc_num = 0 # doc-index in current time-lice + num_topics = self.num_topics + lda = self.make_lda_seq_slice(lda, time) # create lda_seq slice + + time_slice = numpy.cumsum(numpy.array(self.time_slice)) + + for line_no, line in enumerate(seq_corpus.corpus): + # this is used to update the time_slice and create a new lda_seq slice every new time_slice + if doc_index > time_slice[time]: + time += 1 + lda = self.make_lda_seq_slice(lda, time) # create lda_seq slice + doc_num = 0 + + gam = gammas[doc_index] + lhood = lhoods[doc_index] + + ldapost.gamma = gam + ldapost.lhood = lhood + ldapost.doc = line + + # TODO: replace fit_lda_post with appropriate ldamodel functions, if possible. + if iter_ == 0: + doc_lhood = LdaPost.fit_lda_post(ldapost, doc_num, time, None, lda_inference_max_iter=lda_inference_max_iter) + else: + doc_lhood = LdaPost.fit_lda_post(ldapost, doc_num, time, self, lda_inference_max_iter=lda_inference_max_iter) + + if topic_suffstats is not None: + topic_suffstats = LdaPost.update_lda_seq_ss(ldapost, time, line, topic_suffstats) + + gammas[doc_index] = ldapost.gamma + bound += doc_lhood + doc_index += 1 + doc_num += 1 + + return bound, gammas + + + def make_lda_seq_slice(self, lda, time): + """ + set up the LDA model topic-word values with that of ldaseq. 
+ """ + for k in range(0, self.num_topics): + lda.topics[:, k] = numpy.copy(self.topic_chains[k].e_log_prob[:, time]) + + lda.alpha = numpy.copy(self.alphas) + return lda + + + def fit_lda_seq_topics(self, topic_suffstats): + """ + Fit lda sequence topic wise. + """ + lhood = 0 + lhood_term = 0 + + for k, chain in enumerate(self.topic_chains): + logger.info("Fitting topic number", k) + lhood_term = sslm.fit_sslm(chain, topic_suffstats[k]) + lhood += lhood_term + + return lhood + + + def print_topic_times(self, topic, top_terms=20): + """ + Prints one topic showing each time-slice. + """ + topics = [] + for time in range(0, self.num_time_slices): + topics.append(self.print_topic(topic, time, top_terms)) + + return topics + + + def print_topics(self, time=0, top_terms=20): + """ + Prints all topics in a particular time-slice. + """ + topics =[] + for topic in range(0, self.num_topics): + topics.append(self.print_topic(topic, time, top_terms)) + return topics + + + def print_topic(self, topic, time=0, top_terms=20): + """ + Topic is the topic number + Time is for a particular time_slice + top_terms is the number of terms to display + """ + topic = self.topic_chains[topic].e_log_prob + topic = numpy.transpose(topic) + topic = numpy.exp(topic[time]) + topic = topic / topic.sum() + bestn = matutils.argsort(topic, top_terms, reverse=True) + beststr = [(round(topic[id_], 3), self.corpus.id2word[id_]) for id_ in bestn] + return beststr + + + def doc_topics(self, doc_number): + """ + On passing the LdaSeqModel trained ldaseq object, the doc_number of your document in the corpus, + it returns the doc-topic probabilities of that document. + """ + doc_topic = numpy.copy(self.gammas) + doc_topic /= doc_topic.sum(axis=1)[:, numpy.newaxis] + return doc_topic[doc_number] + + + def __getitem__(self, doc): + """ + Similar to the LdaModel __getitem__ function, it returns topic proportions of a document passed. + """ + lda_model = ldamodel.LdaModel(num_topics=self.num_topics, alpha=self.alphas, id2word=self.corpus.id2word) + lda_model.topics = numpy.array(numpy.split(numpy.zeros(self.vocab_len * self.num_topics), self.vocab_len)) + ldapost = LdaPost(num_topics=self.num_topics, max_doc_len=len(doc), lda=lda_model, doc=doc) + + time_lhoods = [] + for time in range(0, self.num_time_slices): + lda_model = self.make_lda_seq_slice(lda_model, time) # create lda_seq slice + lhood = LdaPost.fit_lda_post(ldapost, 0, time, self) + time_lhoods.append(lhood) + + doc_topic = ldapost.gamma / ldapost.gamma.sum() + # should even the likelihoods be returned? + return doc_topic + +# endclass LdaSeqModel + + +class sslm(utils.SaveLoad): + """ + The sslm class is the State Space Language Model for DTM and contains the following information: + `obs` values contain the doc - topic ratios + `e_log_prob` contains topic - word ratios + `mean`, `fwd_mean` contains the mean values to be used for inference for each word for a time_slice + `variance`, `fwd_variance` contains the variance values to be used for inference for each word in a time_slice + `fwd_mean`, `fwd_variance` are the forward posterior values. 
+ `zeta` is an extra variational parameter with a value for each time-slice + """ + def __init__(self, vocab_len=None, num_time_slices=None, num_topics=None, obs_variance=0.5, chain_variance=0.005): + self.vocab_len = vocab_len + self.num_time_slices = num_time_slices + self.obs_variance = obs_variance + self.chain_variance= chain_variance + self.num_topics = num_topics + + # setting up matrices + self.obs = numpy.array(numpy.split(numpy.zeros(num_time_slices * vocab_len), vocab_len)) + self.e_log_prob = numpy.array(numpy.split(numpy.zeros(num_time_slices * vocab_len), vocab_len)) + self.mean = numpy.array(numpy.split(numpy.zeros((num_time_slices + 1) * vocab_len), vocab_len)) + self.fwd_mean = numpy.array(numpy.split(numpy.zeros((num_time_slices + 1) * vocab_len), vocab_len)) + self.fwd_variance = numpy.array(numpy.split(numpy.zeros((num_time_slices + 1) * vocab_len), vocab_len)) + self.variance = numpy.array(numpy.split(numpy.zeros((num_time_slices + 1) * vocab_len), vocab_len)) + self.zeta = numpy.zeros(num_time_slices) + + # the following are class variables which are to be integrated during Document Influence Model + self.m_update_coeff = None + self.mean_t = None + self.variance_t = None + self.influence_sum_lgl = None + self.w_phi_l = None + self.w_phi_sum = None + self.w_phi_l_sq = None + self.m_update_coeff_g = None + + + def update_zeta(self): + """ + Updates the Zeta Variational Parameter. + Zeta is described in the appendix and is equal to sum (exp(mean[word] + Variance[word] / 2)), over every time-slice. + It is the value of variational parameter zeta which maximizes the lower bound. + """ + for j, val in enumerate(self.zeta): + self.zeta[j] = numpy.sum(numpy.exp(self.mean[:, j + 1] + self.variance[:, j + 1] / 2)) + return self.zeta + + + def compute_post_variance(self, word, chain_variance): + """ + Based on the Variational Kalman Filtering approach for Approximate Inference [https://www.cs.princeton.edu/~blei/papers/BleiLafferty2006a.pdf] + This function accepts the word to compute variance for, along with the associated sslm class object, and returns variance and fwd_variance + Computes Var[\beta_{t,w}] for t = 1:T + + Fwd_Variance(t) ≡ E((beta_{t,w} − mean_{t,w})^2 |beta_{t} for 1:t) + = (obs_variance / fwd_variance[t - 1] + chain_variance + obs_variance ) * (fwd_variance[t - 1] + obs_variance) + + Variance(t) ≡ E((beta_{t,w} − mean_cap{t,w})^2 |beta_cap{t} for 1:t) + = fwd_variance[t - 1] + (fwd_variance[t - 1] / fwd_variance[t - 1] + obs_variance)^2 * (variance[t - 1] - (fwd_variance[t-1] + obs_variance)) + + """ + INIT_VARIANCE_CONST = 1000 + + T = self.num_time_slices + variance = self.variance[word] + fwd_variance = self.fwd_variance[word] + # forward pass. 
Set initial variance very high + fwd_variance[0] = chain_variance * INIT_VARIANCE_CONST + for t in range(1, T + 1): + if self.obs_variance: + c = self.obs_variance / (fwd_variance[t - 1] + chain_variance + self.obs_variance) + else: + c = 0 + fwd_variance[t] = c * (fwd_variance[t - 1] + chain_variance) + + # backward pass + variance[T] = fwd_variance[T] + for t in range(T - 1, -1, -1): + if fwd_variance[t] > 0.0: + c = numpy.power((fwd_variance[t] / (fwd_variance[t] + chain_variance)), 2) + else: + c = 0 + variance[t] = (c * (variance[t + 1] - chain_variance)) + ((1 - c) * fwd_variance[t]) + + return variance, fwd_variance + + + def compute_post_mean(self, word, chain_variance): + """ + Based on the Variational Kalman Filtering approach for Approximate Inference [https://www.cs.princeton.edu/~blei/papers/BleiLafferty2006a.pdf] + This function accepts the word to compute mean for, along with the associated sslm class object, and returns mean and fwd_mean + Essentially a forward-backward to compute E[\beta_{t,w}] for t = 1:T. + + Fwd_Mean(t) ≡ E(beta_{t,w} | beta_ˆ 1:t ) + = (obs_variance / fwd_variance[t - 1] + chain_variance + obs_variance ) * fwd_mean[t - 1] + (1 - (obs_variance / fwd_variance[t - 1] + chain_variance + obs_variance)) * beta + + Mean(t) ≡ E(beta_{t,w} | beta_ˆ 1:T ) + = fwd_mean[t - 1] + (obs_variance / fwd_variance[t - 1] + obs_variance) + (1 - obs_variance / fwd_variance[t - 1] + obs_variance)) * mean[t] + + """ + T = self.num_time_slices + obs = self.obs[word] + fwd_variance = self.fwd_variance[word] + mean = self.mean[word] + fwd_mean = self.fwd_mean[word] + + # forward + fwd_mean[0] = 0 + for t in range(1, T + 1): + c = self.obs_variance / (fwd_variance[t - 1] + chain_variance + self.obs_variance) + fwd_mean[t] = c * fwd_mean[t - 1] + (1 - c) * obs[t - 1] + + # backward pass + mean[T] = fwd_mean[T] + for t in range(T - 1, -1, -1): + if chain_variance == 0.0: + c = 0.0 + else: + c = chain_variance / (fwd_variance[t] + chain_variance) + mean[t] = c * fwd_mean[t] + (1 - c) * mean[t + 1] + return mean, fwd_mean + + + def compute_expected_log_prob(self): + """ + Compute the expected log probability given values of m. + The appendix describes the Expectation of log-probabilities in equation 5 of the DTM paper; + The below implementation is the result of solving the equation and is as implemented in the original Blei DTM code. + """ + for (w, t), val in numpy.ndenumerate(self.e_log_prob): + self.e_log_prob[w][t] = self.mean[w][t + 1] - numpy.log(self.zeta[t]) + return self.e_log_prob + + + def sslm_counts_init(self, obs_variance, chain_variance, sstats): + """ + Initialize State Space Language Model with LDA sufficient statistics. + Called for each topic-chain and initializes intial mean, variance and Topic-Word probabilities for the first time-slice. 
+ """ + W = self.vocab_len + T = self.num_time_slices + + log_norm_counts = numpy.copy(sstats) + log_norm_counts = log_norm_counts / sum(log_norm_counts) + log_norm_counts = log_norm_counts + 1.0 / W + log_norm_counts = log_norm_counts / sum(log_norm_counts) + log_norm_counts = numpy.log(log_norm_counts) + + # setting variational observations to transformed counts + self.obs = (numpy.repeat(log_norm_counts, T, axis=0)).reshape(W, T) + # set variational parameters + self.obs_variance = obs_variance + self.chain_variance = chain_variance + + # compute post variance, mean + for w in range(0, W): + self.variance[w], self.fwd_variance[w] = self.compute_post_variance(w, self.chain_variance) + self.mean[w], self.fwd_mean[w] = self.compute_post_mean(w, self.chain_variance) + + self.zeta = self.update_zeta() + self.e_log_prob = self.compute_expected_log_prob() + + + def fit_sslm(self, sstats): + """ + Fits variational distribution. + This is essentially the m-step. + Accepts the sstats for a particular topic for input and maximizes values for that topic. + Updates the values in the update_obs() and compute_expected_log_prob methods. + """ + W = self.vocab_len + bound = 0 + old_bound = 0 + sslm_fit_threshold = 1e-6 + sslm_max_iter = 2 + converged = sslm_fit_threshold + 1 + + totals = numpy.zeros(sstats.shape[1]) + + # computing variance, fwd_variance + self.variance, self.fwd_variance = map(numpy.array, list(zip(*[self.compute_post_variance(w, self.chain_variance) for w in range(0, W)]))) + + # column sum of sstats + totals = sstats.sum(axis=0) + iter_ = 0 + + model = "DTM" + if model == "DTM": + bound = self.compute_bound(sstats, totals) + if model == "DIM": + bound = self.compute_bound_fixed(sstats, totals) + + logger.info("initial sslm bound is ", bound) + + while converged > sslm_fit_threshold and iter_ < sslm_max_iter: + iter_ += 1 + old_bound = bound + self.obs, self.zeta = self.update_obs(sstats, totals) + + if model == "DTM": + bound = self.compute_bound(sstats, totals) + if model == "DIM": + bound = self.compute_bound_fixed(sstats, totals) + + converged = numpy.fabs((bound - old_bound) / old_bound) + logger.info(iter_, " iteration lda seq bound is ", bound, " convergence is", converged) + + self.e_log_prob = self.compute_expected_log_prob() + return bound + + + def compute_bound(self, sstats, totals): + """ + Compute log probability bound. + Forumula is as described in appendix of DTM by Blei. (formula no. 
5) + """ + W = self.vocab_len + T = self.num_time_slices + + term_1 = 0 + term_2 = 0 + term_3 = 0 + + val = 0 + ent = 0 + + chain_variance = self.chain_variance + # computing mean, fwd_mean + self.mean, self.fwd_mean = map(numpy.array, (zip(*[self.compute_post_mean(w, self.chain_variance) for w in range(0, W)]))) + self.zeta = self.update_zeta() + + for w in range(0, W): + val += (self.variance[w][0] - self.variance[w][T]) / 2 * chain_variance + + logger.info("Computing bound, all times") + + for t in range(1, T + 1): + term_1 = 0.0 + term_2 = 0.0 + ent = 0.0 + for w in range(0, W): + + m = self.mean[w][t] + prev_m = self.mean[w][t - 1] + + v = self.variance[w][t] + + # w_phi_l is only used in Document Influence Model; the values are aleays zero in this case + # w_phi_l = sslm.w_phi_l[w][t - 1] + # exp_i = numpy.exp(-prev_m) + # term_1 += (numpy.power(m - prev_m - (w_phi_l * exp_i), 2) / (2 * chain_variance)) - (v / chain_variance) - numpy.log(chain_variance) + + term_1 += (numpy.power(m - prev_m, 2) / (2 * chain_variance)) - (v / chain_variance) - numpy.log(chain_variance) + term_2 += sstats[w][t - 1] * m + ent += numpy.log(v) / 2 # note the 2pi's cancel with term1 (see doc) + + term_3 = -totals[t - 1] * numpy.log(self.zeta[t - 1]) + val += term_2 + term_3 + ent - term_1 + + return val + + + def update_obs(self, sstats, totals): + """ + Function to perform optimization of obs. Parameters are suff_stats set up in the fit_sslm method. + + TODO: + This is by far the slowest function in the whole algorithm. + Replacing or improving the performance of this would greatly speed things up. + """ + + OBS_NORM_CUTOFF = 2 + STEP_SIZE = 0.01 + TOL = 1e-3 + + W = self.vocab_len + T = self.num_time_slices + + runs = 0 + mean_deriv_mtx = numpy.resize(numpy.zeros(T * (T + 1)), (T, T + 1)) + + norm_cutoff_obs = None + for w in range(0, W): + w_counts = sstats[w] + counts_norm = 0 + # now we find L2 norm of w_counts + for i in range(0, len(w_counts)): + counts_norm += w_counts[i] * w_counts[i] + + counts_norm = numpy.sqrt(counts_norm) + + if counts_norm < OBS_NORM_CUTOFF and norm_cutoff_obs is not None: + obs = self.obs[w] + norm_cutoff_obs = numpy.copy(obs) + else: + if counts_norm < OBS_NORM_CUTOFF: + w_counts = numpy.zeros(len(w_counts)) + + # TODO: apply lambda function + for t in range(0, T): + mean_deriv = mean_deriv_mtx[t] + mean_deriv = self.compute_mean_deriv(w, t, mean_deriv) + mean_deriv_mtx[t] = mean_deriv + + deriv = numpy.zeros(T) + args = self, w_counts, totals, mean_deriv_mtx, w, deriv + obs = self.obs[w] + model = "DTM" + + if model == "DTM": + # slowest part of method + obs = optimize.fmin_cg(f=f_obs, fprime=df_obs, x0=obs, gtol=TOL, args=args, epsilon=STEP_SIZE, disp=0) + if model == "DIM": + pass + runs += 1 + + if counts_norm < OBS_NORM_CUTOFF: + norm_cutoff_obs = obs + + self.obs[w] = obs + + self.zeta = self.update_zeta() + + return self.obs, self.zeta + + + def compute_mean_deriv(self, word, time, deriv): + """ + Used in helping find the optimum function. + computes derivative of E[\beta_{t,w}]/d obs_{s,w} for t = 1:T. 
+ put the result in deriv, allocated T+1 vector + """ + + T = self.num_time_slices + fwd_variance = self.variance[word] + + deriv[0] = 0 + + # forward pass + for t in range(1, T + 1): + if self.obs_variance > 0.0: + w = self.obs_variance / (fwd_variance[t - 1] + self.chain_variance + self.obs_variance) + else: + w = 0.0 + val = w * deriv[t - 1] + if time == t - 1: + val += (1 - w) + deriv[t] = val + + for t in range(T - 1, -1, -1): + if self.chain_variance == 0.0: + w = 0.0 + else: + w = self.chain_variance / (fwd_variance[t] + self.chain_variance) + deriv[t] = w * deriv[t] + (1 - w) * deriv[t + 1] + + return deriv + + + def compute_obs_deriv(self, word, word_counts, totals, mean_deriv_mtx, deriv): + """ + Derivation of obs which is used in derivative function [df_obs] while optimizing. + """ + + # flag + init_mult = 1000 + + T = self.num_time_slices + + mean = self.mean[word] + variance = self.variance[word] + + # only used for DIM mode + # w_phi_l = self.w_phi_l[word] + # m_update_coeff = self.m_update_coeff[word] + + # temp_vector holds temporary zeta values + self.temp_vect = numpy.zeros(T) + + for u in range(0, T): + self.temp_vect[u] = numpy.exp(mean[u + 1] + variance[u + 1] / 2) + + for t in range(0, T): + mean_deriv = mean_deriv_mtx[t] + term1 = 0 + term2 = 0 + term3 = 0 + term4 = 0 + + for u in range(1, T + 1): + mean_u = mean[u] + variance_u_prev = variance[u - 1] + mean_u_prev = mean[u - 1] + dmean_u = mean_deriv[u] + dmean_u_prev = mean_deriv[u - 1] + + term1 += (mean_u - mean_u_prev) * (dmean_u - dmean_u_prev) + term2 += (word_counts[u - 1] - (totals[u - 1] * self.temp_vect[u - 1] / self.zeta[u - 1])) * dmean_u + + model = "DTM" + if model == "DIM": + # do some stuff + pass + + if self.chain_variance: + term1 = - (term1 / self.chain_variance) + term1 = term1 - (mean[0] * mean_deriv[0]) / (init_mult * self.chain_variance) + else: + term1 = 0.0 + + deriv[t] = term1 + term2 + term3 + term4 + + return deriv +# endclass sslm + +class LdaPost(utils.SaveLoad): + + """ + Posterior values associated with each set of documents. + TODO: use **Hoffman, Blei, Bach: Online Learning for Latent Dirichlet Allocation, NIPS 2010.** + to update phi, gamma. End game would be to somehow replace LdaPost entirely with LdaModel. + """ + + def __init__(self, doc=None, lda=None, max_doc_len=None, num_topics=None, gamma=None, lhood=None): + + self.doc = doc + self.lda = lda + self.gamma = gamma + self.lhood = lhood + if self.gamma is None: + self.gamma = numpy.zeros(num_topics) + if self.lhood is None: + self.lhood = numpy.zeros(num_topics + 1) + + if max_doc_len is not None and num_topics is not None: + self.phi = numpy.resize(numpy.zeros(max_doc_len * num_topics), (max_doc_len, num_topics)) + self.log_phi = numpy.resize(numpy.zeros(max_doc_len * num_topics), (max_doc_len, num_topics)) + + # the following are class variables which are to be integrated during Document Influence Model + + self.doc_weight = None + self.renormalized_doc_weight = None + + + def update_phi(self, doc_number, time): + """ + Update variational multinomial parameters, based on a document and a time-slice. + This is done based on the original Blei-LDA paper, where: + log_phi := beta * exp(Ψ(gamma)), over every topic for every word. + + TODO: incorporate lee-sueng trick used in **Lee, Seung: Algorithms for non-negative matrix factorization, NIPS 2001**. 
+ """ + num_topics = self.lda.num_topics + # digamma values + dig = numpy.zeros(num_topics) + + for k in range(0, num_topics): + dig[k] = digamma(self.gamma[k]) + + n = 0 # keep track of iterations for phi, log_phi + for word_id, count in self.doc: + for k in range(0, num_topics): + self.log_phi[n][k] = dig[k] + self.lda.topics[word_id][k] + + log_phi_row = self.log_phi[n] + phi_row = self.phi[n] + + # log normalize + v = log_phi_row[0] + for i in range(1, len(log_phi_row)): + v = numpy.logaddexp(v, log_phi_row[i]) + + # subtract every element by v + log_phi_row = log_phi_row - v + phi_row = numpy.exp(log_phi_row) + self.log_phi[n] = log_phi_row + self.phi[n] = phi_row + n +=1 # increase iteration + + return self.phi, self.log_phi + + + def update_gamma(self): + """ + update variational dirichlet parameters as described in the original Blei LDA paper: + gamma = alpha + sum(phi), over every topic for every word. + """ + self.gamma = numpy.copy(self.lda.alpha) + n = 0 # keep track of number of iterations for phi, log_phi + for word_id, count in self.doc: + phi_row = self.phi[n] + for k in range(0, self.lda.num_topics): + self.gamma[k] += phi_row[k] * count + n += 1 + + return self.gamma + + + def init_lda_post(self): + """ + Initialize variational posterior, does not return anything. + """ + total = sum(count for word_id, count in self.doc) + self.gamma.fill(self.lda.alpha[0] + float(total) / self.lda.num_topics) + self.phi[:len(self.doc),:] = 1.0 / self.lda.num_topics + # doc_weight used during DIM + # ldapost.doc_weight = None + + + def compute_lda_lhood(self): + """ + compute the likelihood bound + """ + num_topics = self.lda.num_topics + gamma_sum = numpy.sum(self.gamma) + + # to be used in DIM + # sigma_l = 0 + # sigma_d = 0 + + lhood = gammaln(numpy.sum(self.lda.alpha)) - gammaln(gamma_sum) + self.lhood[num_topics] = lhood + + # influence_term = 0 + digsum = digamma(gamma_sum) + + model = "DTM" + for k in range(0, num_topics): + # below code only to be used in DIM mode + # if ldapost.doc_weight is not None and (model == "DIM" or model == "fixed"): + # influence_topic = ldapost.doc_weight[k] + # influence_term = - ((influence_topic * influence_topic + sigma_l * sigma_l) / 2.0 / (sigma_d * sigma_d)) + + e_log_theta_k = digamma(self.gamma[k]) - digsum + lhood_term = (self.lda.alpha[k] - self.gamma[k]) * e_log_theta_k + gammaln(self.gamma[k]) - gammaln(self.lda.alpha[k]) + # TODO: check why there's an IF + n = 0 + for word_id, count in self.doc: + if self.phi[n][k] > 0: + lhood_term += count * self.phi[n][k] * (e_log_theta_k + self.lda.topics[word_id][k] - self.log_phi[n][k]) + n += 1 + self.lhood[k] = lhood_term + lhood += lhood_term + # in case of DIM add influence term + # lhood += influence_term + + return lhood + + def fit_lda_post(self, doc_number, time, ldaseq, LDA_INFERENCE_CONVERGED = 1e-8, + lda_inference_max_iter = 25, g=None, g3_matrix=None, g4_matrix=None, g5_matrix=None): + """ + Posterior inference for lda. + g, g3, g4 and g5 are matrices used in Document Influence Model and not used currently. 
+ """ + + self.init_lda_post() + # sum of counts in a doc + total = sum(count for word_id, count in self.doc) + + model = "DTM" + if model == "DIM": + # if in DIM then we initialise some variables here + pass + + lhood = self.compute_lda_lhood() + lhood_old = 0 + converged = 0 + iter_ = 0 + + # first iteration starts here + iter_ += 1 + lhood_old = lhood + self.gamma = self.update_gamma() + + model = "DTM" + + if model == "DTM" or sslm is None: + self.phi, self.log_phi = self.update_phi(doc_number, time) + elif model == "DIM" and sslm is not None: + self.phi, self.log_phi = self.update_phi_fixed(doc_number, time, sslm, g3_matrix, g4_matrix, g5_matrix) + + lhood = self.compute_lda_lhood() + converged = numpy.fabs((lhood_old - lhood) / (lhood_old * total)) + + while converged > LDA_INFERENCE_CONVERGED and iter_ <= lda_inference_max_iter: + + iter_ += 1 + lhood_old = lhood + self.gamma = self.update_gamma() + model = "DTM" + + if model == "DTM" or sslm is None: + self.phi, self.log_phi = self.update_phi(doc_number, time) + elif model == "DIM" and sslm is not None: + self.phi, self.log_phi = self.update_phi_fixed(doc_number, time, sslm, g3_matrix, g4_matrix, g5_matrix) + + lhood = self.compute_lda_lhood() + converged = numpy.fabs((lhood_old - lhood) / (lhood_old * total)) + + return lhood + + + def update_lda_seq_ss(self, time, doc, topic_suffstats): + """ + Update lda sequence sufficient statistics from an lda posterior. + This is very similar to the update_gamma method and uses the same formula. + """ + num_topics = self.lda.num_topics + + for k in range(0, num_topics): + topic_ss = topic_suffstats[k] + n = 0 + for word_id, count in self.doc: + topic_ss[word_id][time] += count * self.phi[n][k] + n += 1 + topic_suffstats[k] = topic_ss + + return topic_suffstats +# endclass LdaPost + + +# the following functions are used in update_obs as the function to optimize +def f_obs(x, *args): + """ + Function which we are optimising for minimizing obs. + """ + sslm, word_counts, totals, mean_deriv_mtx, word, deriv = args + # flag + init_mult = 1000 + + T = len(x) + val = 0 + term1 = 0 + term2 = 0 + + # term 3 and 4 for DIM + term3 = 0 + term4 = 0 + + sslm.obs[word] = x + sslm.mean[word], sslm.fwd_mean[word] = sslm.compute_post_mean(word, sslm.chain_variance) + + mean = sslm.mean[word] + variance = sslm.variance[word] + + # only used for DIM mode + # w_phi_l = sslm.w_phi_l[word] + # m_update_coeff = sslm.m_update_coeff[word] + + for t in range(1, T + 1): + mean_t = mean[t] + mean_t_prev = mean[t - 1] + var_t_prev = variance[t - 1] + + val = mean_t - mean_t_prev + term1 += val * val + term2 += word_counts[t - 1] * mean_t - totals[t - 1] * numpy.exp(mean_t + variance[t] / 2) / sslm.zeta[t - 1] + + model = "DTM" + if model == "DIM": + # stuff happens + pass + + if sslm.chain_variance > 0.0: + + term1 = - (term1 / (2 * sslm.chain_variance)) + term1 = term1 - mean[0] * mean[0] / (2 * init_mult * sslm.chain_variance) + else: + term1 = 0.0 + + final = -(term1 + term2 + term3 + term4) + + return final + +def df_obs(x, *args): + + """ + Derivative of function which optimises obs. 
+ """ + sslm, word_counts, totals, mean_deriv_mtx, word, deriv = args + + sslm.obs[word] = x + sslm.mean[word], sslm.fwd_mean[word] = sslm.compute_post_mean(word, sslm.chain_variance) + + model = "DTM" + if model == "DTM": + deriv = sslm.compute_obs_deriv(word, word_counts, totals, mean_deriv_mtx, deriv) + elif model == "DIM": + deriv = sslm.compute_obs_deriv_fixed(p.word, p.word_counts, p.totals, p.sslm, p.mean_deriv_mtx, deriv) + + return numpy.negative(deriv) + \ No newline at end of file diff --git a/gensim/test/test_data/DTM/sstats_test.txt b/gensim/test/test_data/DTM/sstats_test.txt new file mode 100644 index 0000000000..48166e31ef --- /dev/null +++ b/gensim/test/test_data/DTM/sstats_test.txt @@ -0,0 +1,562 @@ +1.092443654001555575e-02 2.989075563459984597e+00 +3.383108460155363154e-03 1.996616891539844563e+00 +3.738432410516782326e-03 9.962615675894831435e-01 +1.002704902166056566e+00 9.972950978339434336e-01 +3.340271287680170176e-03 9.966597287123197813e-01 +1.002449871314430752e+00 9.975501286855691374e-01 +8.356695257481134426e-03 1.991643304742518605e+00 +1.309370676418351864e-02 3.986906293235815468e+00 +3.137260791273940256e-03 9.968627392087261452e-01 +1.200823017708829016e-02 1.987991769822911703e+00 +1.000000000000000000e+00 1.209863911587779773e-46 +3.000194513455103351e+00 1.999805486544896427e+00 +3.000987461615826746e+00 9.990125383841731432e-01 +1.000000000000000000e+00 1.215915071808483999e-46 +1.700757844563687016e+01 7.992421554363129843e+00 +1.000000000000000000e+00 1.240309436526453858e-46 +6.001785619555818130e+00 1.998214380444180982e+00 +1.000000000000000000e+00 1.224962141876083178e-46 +1.999820777049937215e+00 1.792229500627864848e-04 +4.998437621410796616e+00 1.562378589203469552e-03 +4.037905031533501443e+00 1.962094968466498557e+00 +1.000000000000000000e+00 1.233294072712612592e-46 +2.000000000000000000e+00 1.550659367209450689e-46 +1.000000000000000000e+00 1.239631097852270464e-46 +1.000000000000000000e+00 1.223654131971970673e-46 +3.003013029394103306e+00 9.969869706058966941e-01 +2.000000000000000000e+00 1.511039413594772737e-46 +1.000000000000000000e+00 1.238883267240051435e-46 +1.000000000000000000e+00 1.237196475910151841e-46 +1.000000000000000000e+00 1.224125316488377365e-46 +1.206512985692477002e+01 3.934870143075228199e+00 +3.005250132594285528e+00 1.994749867405714028e+00 +9.002107740412895964e+00 9.978922595871050349e-01 +8.002402847402054320e+00 2.997597152597945680e+00 +4.018809036764009690e+00 4.981190963235990310e+00 +2.000000000000000000e+00 1.714352301734856346e-46 +1.000000000000000000e+00 1.638741129099750070e-46 +2.002009519148556294e+00 1.997990480851443706e+00 +1.000000000000000000e+00 1.623460541769694160e-46 +1.000000000000000000e+00 1.645868712016503705e-46 +1.000000000000000000e+00 1.653105814070478824e-46 +3.998077162205545321e+00 1.922837794454700534e-03 +1.000000000000000000e+00 1.675036050505256413e-46 +1.000000000000000000e+00 1.658581341998854429e-46 +2.000000000000000000e+00 1.169979922256087353e-46 +2.997957644510762520e+00 2.042355489237303486e-03 +1.000000000000000000e+00 1.676560014349381248e-46 +1.000000000000000000e+00 1.639198649887964965e-46 +1.000000000000000000e+00 1.652487861203685329e-46 +4.998311867891332305e+00 1.688132108667998385e-03 +2.999086364543500860e+00 9.136354564987795223e-04 +2.999696536462834739e+00 3.034635371649604257e-04 +9.998559688699326653e-01 1.440311300672572885e-04 +2.999867773599241172e+00 1.322264007587989668e-04 +2.999905448278212816e+00 9.455172178710631711e-05 
+9.998521873013139771e-01 1.478126986858749846e-04 +9.998305871043033921e-01 1.694128956965717321e-04 +1.997932596192755206e+00 2.067403807244707487e-03 +9.994339718569504427e+00 3.005660281430495573e+00 +9.998322674507920116e-01 1.000167732549207988e+00 +1.212725111214719220e+01 4.872748887852806021e+00 +6.012609552263404211e+00 2.987390447736595345e+00 +9.998383748897964329e-01 1.616251102034746401e-04 +2.995950087636213954e+00 4.049912363786161047e-03 +3.000715515130669253e+00 9.992844848693303028e-01 +9.998291157956640252e-01 1.708842043359251324e-04 +9.998157099037094930e-01 1.842900962904826258e-04 +2.999726897549725102e+00 2.731024502749187523e-04 +2.015504479342879574e+00 9.844955206571202044e-01 +2.410596059386410417e+01 2.894039406135898940e+00 +9.998033418007618023e-01 1.966581992381829722e-04 +9.998075232710741389e-01 1.924767289258012479e-04 +1.002279354386198751e+00 9.977206456138011381e-01 +2.997811512683630220e+00 1.002188487316369780e+00 +2.998195806597341306e+00 1.804193402658541435e-03 +1.003555434723103712e+00 9.964445652768963990e-01 +1.997218555430012144e+00 2.781444569988119741e-03 +9.991620258750940620e-01 8.379741249060272635e-04 +1.099080248193437903e+01 1.009197518065621413e+00 +2.997894538018985866e+00 2.105461981013825173e-03 +3.010092129008588913e+00 3.989907870991411087e+00 +2.998046681907480693e+00 1.953318092519324995e-03 +9.986430903228902256e-01 1.356909677109647541e-03 +3.026085152492925978e+00 1.973914847507074466e+00 +9.988352858478318774e-01 1.164714152168156401e-03 +9.992409606975610759e-01 7.590393024389190503e-04 +1.999369355722667718e+00 6.306442773322752833e-04 +9.991072509110334732e-01 8.927490889664690390e-04 +4.004260648291602109e+00 4.995739351708398779e+00 +9.991127146583672625e-01 8.872853416327262585e-04 +9.989154885617119728e-01 1.001084511438288027e+00 +9.991522429675332440e-01 8.477570324667618989e-04 +2.997091644364889618e+00 2.908355635110812520e-03 +1.997821385771725078e+00 2.178614228274795058e-03 +2.010225552882992872e+00 9.897744471170069058e-01 +1.998360628077472168e+00 1.639371922528253152e-03 +9.991882677434840154e-01 8.117322565158860198e-04 +9.987033966946465835e-01 1.296603305353423007e-03 +9.989458158997651660e-01 1.054184100234897316e-03 +7.997583342350329083e+00 4.002416657649670917e+00 +9.988825874049864773e-01 1.117412595013351660e-03 +9.992304256320166944e-01 7.695743679833144755e-04 +9.989870394258032471e-01 1.012960574196889937e-03 +1.999795425757386003e+00 1.000204574242614219e+00 +1.997464347179166433e+00 2.535652820833518128e-03 +2.996957541224438692e+00 3.042458775561324418e-03 +9.989816162438666103e-01 1.018383756133388199e-03 +9.988789964070644567e-01 1.121003592935605319e-03 +1.010270267365846353e+00 3.989729732634153425e+00 +9.991509091699821710e-01 8.490908300177801371e-04 +9.992072230496233942e-01 7.927769503766631367e-04 +9.989062297695342485e-01 1.093770230465713339e-03 +1.002269332445513861e+00 9.977306675544861392e-01 +1.997659007869493131e+00 2.340992130507234432e-03 +9.991784968889194651e-01 8.215031110804089888e-04 +9.987818824810549279e-01 1.218117518945004271e-03 +1.998051884363636788e+00 1.948115636363505557e-03 +3.005398990266651804e+00 3.994601009733348640e+00 +3.003419706440591863e+00 1.996580293559408137e+00 +1.998359014771899034e+00 1.640985228101033908e-03 +1.998244530367741856e+00 1.755469632257978130e-03 +9.992617521686939508e-01 7.382478313059710360e-04 +1.001271884915868160e+00 9.987281150841316180e-01 +1.999070885254491836e+00 9.291147455081163759e-04 +9.991080417663374957e-01 
8.919582336625818918e-04 +2.006132525740382810e+00 1.993867474259616746e+00 +9.986405151355165488e-01 1.359484864483574999e-03 +9.990291850212226210e-01 9.708149787774489206e-04 +9.988537506281531808e-01 1.146249371846780590e-03 +9.991260523976819297e-01 8.739476023179571472e-04 +1.997620655387410737e+00 2.379344612589276431e-03 +9.992993661068275690e-01 7.006338931723443382e-04 +2.001700032948530605e+00 9.982999670514691726e-01 +9.993122663589073529e-01 6.877336410926577440e-04 +2.999220191240262068e+00 7.798087597375051106e-04 +9.990037551581054664e-01 9.962448418943342741e-04 +1.997550498324772494e+00 2.449501675227426685e-03 +2.999536614256048672e+00 1.000463385743951550e+00 +9.991215414369554182e-01 8.784585630446485952e-04 +1.996911509760880499e+00 3.088490239119342975e-03 +6.000264083607260268e+00 2.999735916392738844e+00 +1.000000000000000000e+00 1.387618972278752214e-46 +2.998581530259968098e+00 2.001418469740031902e+00 +1.000000000000000000e+00 1.420455547333084642e-46 +1.000000000000000000e+00 1.420379826638228227e-46 +4.999817091764811927e+00 2.000182908235187629e+00 +1.000000000000000000e+00 1.404916727954038099e-46 +1.000000000000000000e+00 1.400554365624505746e-46 +1.999223834776728825e+00 7.761652232711166487e-04 +1.000000000000000000e+00 1.358499241355122662e-46 +1.002841383840783918e+00 1.997158616159216304e+00 +2.000000000000000000e+00 1.416372644731821492e-46 +1.000000000000000000e+00 1.347835421818372223e-46 +1.997397650137396674e+00 2.602349862603311591e-03 +1.000000000000000000e+00 1.357524041501837515e-46 +2.997559296873980372e+00 2.440703126019304597e-03 +2.017532287072577812e+00 2.982467712927422632e+00 +1.000000000000000000e+00 1.363035240757054315e-46 +1.000000000000000000e+00 1.359283448019270130e-46 +2.000000000000000000e+00 1.453770399299675476e-46 +1.997956170495806649e+00 2.043829504193442684e-03 +2.997711068867576767e+00 2.288931132422806786e-03 +1.000000000000000000e+00 1.335003685771777313e-46 +1.000000000000000000e+00 1.376725117858755571e-46 +1.000000000000000000e+00 1.355030402701817305e-46 +4.819585856337125843e-47 1.000000000000000000e+00 +4.896954104916352142e-47 1.000000000000000000e+00 +5.976086637921076451e-03 1.994023913362079181e+00 +4.957962279200116587e-47 1.000000000000000000e+00 +9.974629116581600341e-01 1.002537088341839855e+00 +5.062172221104585915e-47 1.000000000000000000e+00 +9.981405376604133295e-01 1.001859462339586671e+00 +1.997222386591746668e+00 2.002777613408253110e+00 +4.930127088692983176e-47 1.000000000000000000e+00 +4.827835426303613263e-47 1.000000000000000000e+00 +4.908450396540412422e-47 1.000000000000000000e+00 +4.993233582324611675e+00 6.006766417675388325e+00 +4.923654511710189096e-47 1.000000000000000000e+00 +4.906420793186469672e-47 1.000000000000000000e+00 +4.779414971434191071e-47 1.000000000000000000e+00 +4.794191602963843442e-47 1.000000000000000000e+00 +5.121519036399629698e-47 1.000000000000000000e+00 +5.701818989978316328e-03 1.994298181010021320e+00 +9.975608629196323074e-01 1.002439137080367804e+00 +2.905237757825851301e-03 1.997094762242173971e+00 +4.830744238879148703e-47 1.000000000000000000e+00 +1.145125243547011167e-02 3.988548747564530039e+00 +5.484152132770268104e-47 2.000000000000000000e+00 +2.020922527590056639e+00 4.979077472409944249e+00 +4.852325547515752104e-47 1.000000000000000000e+00 +4.883797664649919460e-47 1.000000000000000000e+00 +4.895435610470865504e-47 1.000000000000000000e+00 +9.979528025302841776e-01 2.047197469715753004e-03 +9.976058991610120552e-01 2.394100838987878004e-03 
+1.012224561675880263e+00 9.877754383241197367e-01 +1.997242196611644482e+00 2.757803388355355330e-03 +9.975977767936994312e-01 2.402223206300585761e-03 +9.981794185130173913e-01 1.820581486982481836e-03 +2.014808977471377105e+00 9.851910225286228950e-01 +9.980602612774225335e-01 1.939738722577383657e-03 +9.977277673365404498e-01 2.272232663459596187e-03 +9.982401773252256305e-01 1.759822674774371223e-03 +9.982123358105545741e-01 1.787664189445417693e-03 +1.013763454563903599e+00 9.862365454360961792e-01 +9.971409172847610636e-01 2.859082715238980225e-03 +9.977559958046933186e-01 2.244004195306692224e-03 +4.011053488697122305e+00 9.889465113028771404e-01 +2.008628622686274845e+00 2.991371377313724711e+00 +9.978965062968028210e-01 2.103493703197112201e-03 +9.975352903754417522e-01 2.464709624558053969e-03 +1.000090979962959814e+00 9.999090200370399639e-01 +9.977543743073555849e-01 2.245625692644432486e-03 +2.013128536014390235e+00 9.868714639856094317e-01 +9.975838750047251180e-01 2.416124995274857284e-03 +1.997703208088337234e+00 2.296791911662399068e-03 +1.997777184502102399e+00 2.222815497897675727e-03 +9.979922615901534177e-01 2.007738409846590490e-03 +4.016810793723678330e+00 2.983189206276321670e+00 +9.978053104526063422e-01 2.194689547393774507e-03 +9.979734582853977409e-01 2.026541714602358817e-03 +2.001062580134930435e+00 9.989374198650693426e-01 +9.975179947676557912e-01 2.482005232344199671e-03 +3.177941307754205465e-03 9.968220586922457160e-01 +3.241529793316107622e-03 9.967584702066838442e-01 +3.148692154952284755e-03 9.968513078450476073e-01 +7.014590228179245698e-03 1.992985409771820704e+00 +3.515906595744435620e-03 9.964840934042553666e-01 +1.300132154876339555e-02 2.986998678451236344e+00 +4.015295704149533407e+00 3.984704295850466149e+00 +4.233913924621668899e-03 1.995766086075378531e+00 +3.661909428208910688e-03 9.963380905717911240e-01 +3.237449006591419567e-03 9.967625509934086026e-01 +1.238745828603949120e-02 2.987612541713960290e+00 +3.720746028236795750e-03 9.962792539717632012e-01 +4.002976063709749788e+00 9.970239362902499902e-01 +2.871126402875520157e-03 9.971288735971244321e-01 +3.034963761052275784e-03 9.969650362389474996e-01 +3.212557152518188616e-03 9.967874428474818860e-01 +4.636700930167085392e-03 9.953632990698328964e-01 +8.373120050646094478e-03 2.991626879949354390e+00 +3.989185382953831099e-03 9.960108146170461030e-01 +4.054296999805730162e-03 9.959457030001941380e-01 +4.357748680200410583e-03 9.956422513197996649e-01 +3.032102883750231043e-03 9.969678971162497794e-01 +1.003115194952157507e+00 9.968848050478422707e-01 +1.666777387072908290e-02 4.983332226129271625e+00 +3.065554092629207538e-03 9.969344459073707920e-01 +1.002468322020088953e+00 9.975316779799109357e-01 +4.038668513787060238e-02 1.195961331486212842e+01 +5.819445709624944223e-03 1.994180554290374996e+00 +3.983358162066140233e-03 9.960166418379338138e-01 +3.198285415576220935e-03 9.968017145844237925e-01 +3.436851425771414968e-03 9.965631485742285633e-01 +9.496364742518654395e-03 2.990503635257481196e+00 +2.016062808450457489e+00 2.983937191549542511e+00 +3.009567162320287093e+00 2.990432837679713352e+00 +5.918374496911796116e-03 1.994081625503087984e+00 +3.518290440879322505e-03 9.964817095591206320e-01 +3.418144033343013810e-03 9.965818559666570486e-01 +3.300151342547691033e-03 9.966998486574522031e-01 +3.994295494790671246e-03 9.960057045052093105e-01 +3.416599850905048009e-03 9.965834001490949667e-01 +3.356972329386859102e-03 9.966430276706130797e-01 +3.000282421225842988e+00 
9.997175787741569009e-01 +9.056865706284751918e-03 2.990943134293714945e+00 +1.002290675026068945e+00 9.977093249739310554e-01 +4.091145708739654663e-03 9.959088542912601927e-01 +1.825093218694189348e-02 4.981749067813057152e+00 +3.737685726650237083e-03 9.962623142733496584e-01 +2.921751731890475951e-03 9.970782482681095349e-01 +9.988588190527059041e-01 1.141180947293989447e-03 +9.986304567521199216e-01 1.369543247879973901e-03 +9.983973920083657472e-01 1.602607991634340213e-03 +9.984063285210268424e-01 1.593671478973150485e-03 +9.988059316171471469e-01 1.194068382852883897e-03 +1.998153730342258960e+00 1.846269657741023097e-03 +3.997137010232037380e+00 2.862989767962439546e-03 +1.007497053635540762e+00 1.992502946364459238e+00 +9.986892185605520389e-01 1.310781439447969562e-03 +9.984024870204479818e-01 1.597512979552127713e-03 +9.988601421668458213e-01 1.139857833154103413e-03 +1.002961670357248813e+00 1.997038329642751187e+00 +2.998590417856644486e+00 1.409582143355437049e-03 +9.989936241531106598e-01 1.006375846889253928e-03 +9.987357785037558333e-01 1.264221496244111176e-03 +1.010251714808831469e+00 9.897482851911687529e-01 +2.019245977681841886e+00 9.807540223181581140e-01 +9.986060605219938546e-01 1.393939478006202184e-03 +9.988702864233018897e-01 1.129713576698181902e-03 +9.987525141849326049e-01 1.247485815067243516e-03 +9.984886613853292125e-01 1.511338614670673175e-03 +1.998929883890762538e+00 1.070116109237299841e-03 +2.000000000000000000e+00 5.628010851213472823e-47 +1.013463113493671797e+00 9.865368865063280923e-01 +1.000000000000000000e+00 5.363905181271526340e-47 +1.000000000000000000e+00 5.483435919657967408e-47 +1.000000000000000000e+00 5.341923528071615718e-47 +1.000000000000000000e+00 5.432709336625410595e-47 +1.000000000000000000e+00 5.410109792678980254e-47 +1.000000000000000000e+00 5.450042465975079661e-47 +1.000000000000000000e+00 5.463461012836395391e-47 +1.000000000000000000e+00 5.407659123115096433e-47 +1.000000000000000000e+00 5.450800604772544493e-47 +1.999394891263923224e+00 6.051087360771560762e-04 +3.994734362347510004e+00 5.265637652490104577e-03 +1.000000000000000000e+00 5.385704670449697648e-47 +1.000000000000000000e+00 5.589677981264196073e-47 +1.000000000000000000e+00 5.392808800592590294e-47 +1.000000000000000000e+00 5.414534751603699097e-47 +1.000000000000000000e+00 5.409020057246302525e-47 +1.010963858471109145e+00 1.989036141528890855e+00 +1.999311430984608284e+00 6.885690153915402425e-04 +1.000000000000000000e+00 5.427873615985144791e-47 +1.000000000000000000e+00 5.348045157754681066e-47 +1.000000000000000000e+00 5.433217651490795879e-47 +1.010426404982677440e+00 1.989573595017322560e+00 +1.000000000000000000e+00 1.000000000000000000e+00 +1.000000000000000000e+00 5.382636235299113255e-47 +1.000000000000000000e+00 1.000000000000000000e+00 +1.000000000000000000e+00 5.385033285053781332e-47 +1.010498272852458745e+00 3.989501727147541477e+00 +1.000000000000000000e+00 5.394008653170593334e-47 +1.000000000000000000e+00 5.387208176351058129e-47 +1.000000000000000000e+00 5.439990202240452791e-47 +1.960581881448665459e-46 1.000000000000000000e+00 +1.997460327654513534e+00 3.002539672345486466e+00 +9.980752103877377213e-01 2.001924789612262057e+00 +1.863707642336268002e-46 1.000000000000000000e+00 +1.000838855344041933e+00 2.999161144655957845e+00 +9.984465434956540930e-01 1.553456504345977459e-03 +9.986978174077280057e-01 1.302182592271954606e-03 +9.986232508512252748e-01 1.376749148774716967e-03 +1.996970602889334589e+00 3.029397110665502883e-03 
+9.985400140970224214e-01 1.459985902977557976e-03 +2.996357944442667343e+00 3.642055557332484796e-03 +2.995298496900147711e+00 4.701503099852073764e-03 +9.983716599614101961e-01 1.628340038589744286e-03 +1.996814394392947722e+00 3.185605607052280179e-03 +9.985897780413758307e-01 1.410221958624125740e-03 +2.995809728073926426e+00 4.190271926073793791e-03 +9.984915176024338201e-01 1.508482397566071869e-03 +9.986699387751998280e-01 1.330061224800234421e-03 +9.988104709496933298e-01 1.189529050306513663e-03 +1.998350893725822974e+00 1.649106274176976054e-03 +9.984727329034832621e-01 1.527267096516674331e-03 +9.984991720224165590e-01 1.500827977583340384e-03 +9.989044041420798159e-01 1.095595857920101703e-03 +9.984825625803541715e-01 1.517437419645837563e-03 +9.983241729394628505e-01 1.675827060537104802e-03 +9.982357489776417792e-01 1.764251022358196124e-03 +2.995538813998040784e+00 1.004461186001958994e+00 +2.995561176756564858e+00 4.438823243434481501e-03 +9.989790828844368198e-01 1.020917115563336553e-03 +9.985410910444760813e-01 1.458908955523777803e-03 +1.997780803959834817e+00 2.219196040165207200e-03 +9.984759327326686584e-01 1.524067267331346339e-03 +9.982303674093269130e-01 1.769632590673108881e-03 +9.988461372671993965e-01 1.153862732800751576e-03 +9.985386844384706029e-01 1.461315561529347683e-03 +1.997781004590848397e+00 2.218995409151277610e-03 +9.989859222686811036e-01 1.014077731318940682e-03 +1.001688456353303280e+00 9.983115436466967196e-01 +3.993756504800621165e+00 6.243495199379020186e-03 +9.989724561328680030e-01 1.027543867131981790e-03 +9.981531498636873057e-01 1.846850136312674397e-03 +9.124262078922782543e-03 9.908757379210773042e-01 +7.087473507272424103e-03 9.929125264927276540e-01 +1.730200752596639316e-02 1.982697992474033555e+00 +7.588796850684373721e-03 9.924112031493156350e-01 +1.076144606437997621e-02 2.989238553935619880e+00 +8.008370332056095192e-03 9.919916296679439638e-01 +1.249727078667985797e-02 2.987502729213320052e+00 +1.823585461741900754e-02 1.981764145382580722e+00 +7.111670412860333218e-03 9.928883295871396841e-01 +6.993936802907376078e-03 9.930060631970927254e-01 +8.293679746253801791e-03 9.917063202537461652e-01 +9.161805086955949681e-03 9.908381949130440347e-01 +3.065101947232996837e-02 3.969348980527670268e+00 +8.996362109145786962e-03 9.910036378908541055e-01 +1.027766605423131342e-02 1.989722333945768362e+00 +1.157429833189310003e-02 9.884257016681069485e-01 +7.989614994016056015e-03 1.992010385005983597e+00 +7.459103172914599220e-03 9.925408968270854615e-01 +7.497664122163057686e-03 9.925023358778368721e-01 +9.103021068668826904e-03 9.908969789313312182e-01 +9.497825192457736918e-03 9.905021748075422128e-01 +1.012458955947394657e+00 9.875410440526054545e-01 +8.494679242149262219e-03 9.915053207578506944e-01 +1.060311791197339829e-02 9.893968820880265636e-01 +7.979970752170486778e-03 9.920200292478296156e-01 +9.365781290287800848e-03 9.906342187097122443e-01 +7.015553563935422558e-03 9.929844464360646494e-01 +7.697529303921076503e-03 9.923024706960789088e-01 +7.539541571406690491e-03 9.924604584285933173e-01 +7.141331812382661036e-03 9.928586681876174769e-01 +7.948552723310436940e-03 9.920514472766894798e-01 +9.269741192890723971e-03 9.907302588071091876e-01 +7.972647097131777688e-03 9.920273529028682535e-01 +1.091404959510783312e-02 1.989085950404892111e+00 +8.334431604320010484e-03 9.916655683956800971e-01 +8.457411168914697949e-03 9.915425888310851477e-01 +9.629292334126326666e-03 9.903707076658737618e-01 +1.011513214585687326e+00 
1.988486785414312896e+00 +8.953959286828473349e-03 9.910460407131714833e-01 +7.207789166760603626e-03 1.992792210833239430e+00 +5.270782711207388985e-47 1.000000000000000000e+00 +5.580180805816032638e-47 1.000000000000000000e+00 +5.539960863442175868e-47 1.000000000000000000e+00 +5.691519543662593657e-47 2.000000000000000000e+00 +5.417042488228900129e-47 1.000000000000000000e+00 +5.376860156923605548e-47 1.000000000000000000e+00 +5.368213776930869781e-47 1.000000000000000000e+00 +5.337987571368773632e-47 1.000000000000000000e+00 +9.994057524713002572e-01 1.000594247528699743e+00 +5.538325688887702500e-47 1.000000000000000000e+00 +5.344364807494659848e-47 1.000000000000000000e+00 +9.992721247892204506e-01 1.000727875210779327e+00 +5.286778760675359903e-47 1.000000000000000000e+00 +5.381208512537048970e-47 1.000000000000000000e+00 +5.171411498455629607e-47 1.000000000000000000e+00 +6.336942015015505992e-47 4.000000000000000000e+00 +9.992723456755825406e-01 1.000727654324417237e+00 +5.334341324232295843e-47 1.000000000000000000e+00 +5.383783829044589455e-47 1.000000000000000000e+00 +5.565471155797933313e-47 1.000000000000000000e+00 +5.324307053655013612e-47 1.000000000000000000e+00 +5.228269958744074204e-47 1.000000000000000000e+00 +1.003102967912408383e+00 1.996897032087591617e+00 +9.995490328857724593e-01 4.509671142275224406e-04 +9.995225344305616044e-01 4.774655694384753735e-04 +1.998866682191447142e+00 1.001133317808552858e+00 +9.991725717322195166e-01 8.274282677804393766e-04 +9.993633249109644678e-01 6.366750890355846355e-04 +9.995339818580941671e-01 4.660181419059424642e-04 +9.992364801078539305e-01 7.635198921461642824e-04 +9.994344571626554430e-01 5.655428373444534732e-04 +9.994986620746204586e-01 5.013379253795498171e-04 +9.992273366927377776e-01 7.726633072622512716e-04 +9.993652102928332059e-01 6.347897071668354759e-04 +9.992759733702318847e-01 7.240266297679713971e-04 +1.001118980243569379e+00 9.988810197564306215e-01 +9.994047252781812496e-01 5.952747218187558285e-04 +9.994816003166400176e-01 5.183996833599074060e-04 +9.994108499861265038e-01 5.891500138734384042e-04 +9.993989580413302765e-01 6.010419586697927138e-04 +9.994081723367678194e-01 5.918276632321647825e-04 +9.994396780351958443e-01 5.603219648041953211e-04 +9.993219663376287087e-01 6.780336623713374439e-04 +9.993167569313433640e-01 6.832430686565646363e-04 +9.994480592357621873e-01 5.519407642377641143e-04 +1.003460368011329873e+00 9.965396319886701271e-01 +1.002490683532780258e+00 9.975093164672195201e-01 +9.993309914674171068e-01 6.690085325830703259e-04 +9.992383878696635691e-01 7.616121303364277928e-04 +9.994928197467700670e-01 5.071802532299573814e-04 +9.994224054751338349e-01 5.775945248661254559e-04 +9.994789045065237687e-01 5.210954934762590529e-04 +9.989684961544569308e-01 1.031503845543107773e-03 +5.999491990721200096e+00 1.000508009278799904e+00 +9.992614413621229152e-01 7.385586378769903220e-04 +1.998856260006944607e+00 1.143739993055435580e-03 +1.000789404142109706e+00 9.992105958578900715e-01 +9.993129316067721479e-01 6.870683932278783077e-04 +9.993520857532155466e-01 6.479142467843919439e-04 +5.996220310589613689e+00 3.779689410385914626e-03 +9.989645520736561979e-01 1.035447926343787805e-03 +9.995352578513273523e-01 4.647421486725714090e-04 +1.998638312675599993e+00 1.361687324399918266e-03 +1.998864286532588297e+00 1.135713467411604647e-03 +9.994154047211341041e-01 5.845952788659619344e-04 +9.994449882555606068e-01 5.550117444394945350e-04 +1.004137072738518821e+00 1.995862927261481179e+00 
+9.994160422634119634e-01 5.839577365879889195e-04 +1.998675550529696032e+00 1.324449470303806862e-03 +1.001907049658891324e+00 9.980929503411086756e-01 +1.998218291976068706e+00 1.781708023931410792e-03 +4.930581962142418541e-03 1.995069418037857201e+00 +2.636058468195731468e-03 9.973639415318041879e-01 +2.433040392811023755e-03 9.975669596071889966e-01 +2.820563199857915263e-03 9.971794368001420938e-01 +2.364862873475574983e-03 9.976351371265241852e-01 +1.371805593927369121e-03 9.986281944060726445e-01 +2.835273207999831468e-03 9.971647267920000779e-01 +2.260716511374811707e-03 9.977392834886251727e-01 +3.042134693785419897e-03 9.969578653062145523e-01 +2.873398356045693894e-03 9.971266016439543911e-01 +2.421500574930267498e-03 9.975784994250697091e-01 +2.573601563316223109e-03 9.974263984366836810e-01 +2.335470311578421131e-03 9.976645296884214531e-01 +2.219121651938276481e-03 9.977808783480617283e-01 +2.548969841397718747e-03 9.974510301586021477e-01 +3.045830669696215388e-03 9.969541693303036745e-01 +5.077092028267345993e-03 1.994922907971732773e+00 +2.588340605777835618e-03 9.974116593942221609e-01 +2.504906238912875480e-03 9.974950937610871371e-01 +2.519205975746906072e-03 9.974807940242530480e-01 +2.993347521930599803e-03 9.970066524780692996e-01 +3.185515102900045130e-03 9.968144848970998950e-01 +2.182956532876598535e-03 9.978170434671232991e-01 +3.065421006780637281e-03 9.969345789932193558e-01 +2.972365071207913514e-03 9.970276349287918727e-01 +2.841359360156174747e-03 9.971586406398438296e-01 +2.522531781310038100e-03 9.974774682186899888e-01 +2.290827384413663282e-03 9.977091726155863372e-01 +3.256928425392405409e-03 9.967430715746075087e-01 +2.028693019353658249e-03 9.979713069806462888e-01 +3.706639797699594167e-03 9.962933602023004154e-01 +3.703432968304423614e-03 9.962965670316955569e-01 +2.770442014570877187e-03 9.972295579854291159e-01 +7.425282838760093561e-03 2.992574717161239573e+00 +2.835176877096028201e-03 9.971648231229038872e-01 +2.471562687491314376e-03 9.975284373125087312e-01 +5.060098271923050769e-03 1.994939901728076892e+00 +2.678929902378660542e-03 9.973210700976212761e-01 +2.970006811260944134e-03 9.970299931887389722e-01 +2.483213847836801373e-03 9.975167861521629931e-01 +2.493776758408526150e-03 9.975062232415915497e-01 +2.467263830340076132e-03 9.975327361696598727e-01 +1.897689543945637377e-03 9.981023104560543535e-01 +3.476952182203667252e-03 9.965230478177963258e-01 +3.337193371273448200e-03 9.966628066287264165e-01 +2.544823469388982420e-03 9.974551765306109985e-01 +2.462580271568771614e-03 9.975374197284312405e-01 +2.340380882327547808e-03 9.976596191176724249e-01 +2.185335770818864522e-03 9.978146642291810808e-01 +2.517370852873779524e-03 9.974826291471261541e-01 +3.130489262019590675e-03 9.968695107379803577e-01 +2.535715351468906080e-03 9.974642846485310965e-01 +2.299316940934231172e-03 9.977006830590657849e-01 +2.513145727260615060e-03 9.974868542727393095e-01 +2.588866498291569462e-03 9.974111335017085134e-01 +2.915920093548339444e-03 9.970840799064515370e-01 +3.699671171343620578e-03 9.963003288286563786e-01 +2.859505812405921214e-03 9.971404941875940953e-01 +2.642112090443233478e-03 9.973578879095565952e-01 +2.919155537061107926e-03 9.970808444629389866e-01 +3.189411004639295961e-03 9.968105889953605692e-01 +4.770987700952431486e-03 1.995229012299047255e+00 +2.101050329724805758e-03 9.978989496702751483e-01 +2.909288793848420051e-03 9.970907112061515587e-01 +2.981809299405411683e-03 9.970181907005943867e-01 +2.300351021618102232e-03 
9.976996489783818145e-01 +2.200811262226234216e-03 9.977991887377737346e-01 +3.394828342689515562e-03 9.966051716573105512e-01 +5.219029400262317075e-46 1.000000000000000000e+00 +5.303499110905681753e-46 1.000000000000000000e+00 +3.000000000000000000e+00 2.000000000000000000e+00 +1.000000000000000000e+00 1.085636710501089580e-45 +1.999999999999999778e+00 2.000000000000000000e+00 +5.000000000000000000e+00 3.999999999999999556e+00 +1.999999999999999778e+00 1.025578214001572587e-45 +1.000000000000000000e+00 1.000000000000000000e+00 +1.000000000000000000e+00 8.561152365335069377e-46 +1.000000000000000000e+00 1.115636710663362352e-45 +5.282891002260288514e-46 1.000000000000000000e+00 +6.639354372121164232e-46 1.000000000000000000e+00 +2.000000000000000000e+00 1.000000000000000000e+00 +6.891953067090714591e-46 1.000000000000000000e+00 +1.000000000000000000e+00 2.000000000000000000e+00 +2.000000000000000000e+00 2.117499853764445525e-45 +3.000000000000000000e+00 1.911144203457866006e-45 +1.000000000000000000e+00 1.536432633516627284e-45 diff --git a/gensim/test/test_ldaseqmodel.py b/gensim/test/test_ldaseqmodel.py new file mode 100644 index 0000000000..0df3198286 --- /dev/null +++ b/gensim/test/test_ldaseqmodel.py @@ -0,0 +1,49 @@ +""" + +Tests to check DTM math functions and Topic-Word, Doc-Topic proportions. + +""" + +import numpy # for arrays, array broadcasting etc. +from gensim.models import ldaseqmodel, ldamodel +from gensim.corpora import Dictionary +import os.path +import unittest +import logging + + +module_path = os.path.dirname(__file__) # needed because sample data files are located in the same folder +datapath = lambda fname: os.path.join(module_path, 'test_data/DTM', fname) + + +class TestLdaSeq(unittest.TestCase): + # we are setting up a DTM model and fitting it, and checking topic-word and doc-topic results. 
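+    # A few hedged notes on the fixture built in setUp below: the documents are
+    # split into three time-slices of 10, 10 and 11 via time_slice=[10, 10, 11],
+    # and the pre-computed sufficient statistics loaded from sstats_test.txt
+    # (562 rows by 2 columns, i.e. one row per dictionary term and one column
+    # per topic) are passed with initialize='own' so that the LDA initialization
+    # is deterministic and the expected values asserted below stay reproducible.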
+ def setUp(self): + texts = [[u'senior', u'studios', u'studios', u'studios', u'creators', u'award', u'mobile', u'currently', u'challenges', u'senior', u'summary', u'senior', u'motivated', u'creative', u'senior'],[u'performs', u'engineering', u'tasks', u'infrastructure', u'focusing', u'primarily', u'programming', u'interaction', u'designers', u'engineers', u'leadership', u'teams', u'teams', u'crews', u'responsibilities', u'engineering', u'quality', u'functional', u'functional', u'teams', u'organizing', u'prioritizing', u'technical', u'decisions', u'engineering', u'participates', u'participates', u'reviews', u'participates', u'hiring', u'conducting', u'interviews'],[u'feedback', u'departments', u'define', u'focusing', u'engineering', u'teams', u'crews', u'facilitate', u'engineering', u'departments', u'deadlines', u'milestones', u'typically', u'spends', u'designing', u'developing', u'updating', u'bugs', u'mentoring', u'engineers', u'define', u'schedules', u'milestones', u'participating'],[ u'reviews', u'interviews', u'sized', u'teams', u'interacts', u'disciplines', u'knowledge', u'skills', u'knowledge', u'knowledge', u'xcode', u'scripting', u'debugging', u'skills', u'skills', u'knowledge', u'disciplines', u'animation', u'networking', u'expertise', u'competencies', u'oral', u'skills', u'management', u'skills', u'proven', u'effectively', u'teams', u'deadline', u'environment', u'bachelor', u'minimum', u'shipped', u'leadership', u'teams', u'location', u'resumes', u'jobs', u'candidates', u'openings', u'jobs'], + [u'maryland', u'client', u'producers', u'electricity', u'operates', u'storage', u'utility', u'retail', u'customers', u'engineering', u'consultant', u'maryland', u'summary', u'technical', u'technology', u'departments', u'expertise', u'maximizing', u'output', u'reduces', u'operating', u'participates', u'areas', u'engineering', u'conducts', u'testing', u'solve', u'supports', u'environmental', u'understands', u'objectives', u'operates', u'responsibilities', u'handles', u'complex', u'engineering', u'aspects', u'monitors', u'quality', u'proficiency', u'optimization', u'recommendations', u'supports', u'personnel', u'troubleshooting', u'commissioning', u'startup', u'shutdown', u'supports', u'procedure', u'operating', u'units', u'develops', u'simulations', u'troubleshooting', u'tests', u'enhancing', u'solving', u'develops', u'estimates', u'schedules', u'scopes', u'understands', u'technical', u'management', u'utilize', u'routine', u'conducts', u'hazards', u'utilizing', u'hazard', u'operability', u'methodologies', u'participates', u'startup', u'reviews', u'pssr', u'participate', u'teams', u'participate', u'regulatory', u'audits', u'define', u'scopes', u'budgets', u'schedules', u'technical', u'management', u'environmental', u'awareness', u'interfacing', u'personnel', u'interacts', u'regulatory', u'departments', u'input', u'objectives', u'identifying', u'introducing', u'concepts', u'solutions', u'peers', u'customers', u'coworkers', u'knowledge', u'skills', u'engineering', u'quality', u'engineering'], [u'commissioning', u'startup', u'knowledge', u'simulators', u'technologies', u'knowledge', u'engineering', u'techniques', u'disciplines', u'leadership', u'skills', u'proven', u'engineers', u'oral', u'skills', u'technical', u'skills', u'analytically', u'solve', u'complex', u'interpret', u'proficiency', u'simulation', u'knowledge', u'applications', u'manipulate', u'applications', u'engineering'],[u'calculations', u'programs', u'matlab', u'excel', u'independently', u'environment', u'proven', u'skills', 
u'effectively', u'multiple', u'tasks', u'planning', u'organizational', u'management', u'skills', u'rigzone', u'jobs', u'developer', u'exceptional', u'strategies', u'junction', u'exceptional', u'strategies', u'solutions', u'solutions', u'biggest', u'insurers', u'operates', u'investment'], [u'vegas', u'tasks', u'electrical', u'contracting', u'expertise', u'virtually', u'electrical', u'developments', u'institutional', u'utilities', u'technical', u'experts', u'relationships', u'credibility', u'contractors', u'utility', u'customers', u'customer', u'relationships', u'consistently', u'innovations', u'profile', u'construct', u'envision', u'dynamic', u'complex', u'electrical', u'management', u'grad', u'internship', u'electrical', u'engineering', u'infrastructures', u'engineers', u'documented', u'management', u'engineering', u'quality', u'engineering', u'electrical', u'engineers', u'complex', u'distribution', u'grounding', u'estimation', u'testing', u'procedures', u'voltage', u'engineering'],[u'troubleshooting', u'installation', u'documentation', u'bsee', u'certification', u'electrical', u'voltage', u'cabling', u'electrical', u'engineering', u'candidates', u'electrical', u'internships', u'oral', u'skills', u'organizational', u'prioritization', u'skills', u'skills', u'excel', u'cadd', u'calculation', u'autocad', u'mathcad', u'skills', u'skills', u'customer', u'relationships', u'solving', u'ethic', u'motivation', u'tasks', u'budget', u'affirmative', u'diversity', u'workforce', u'gender', u'orientation', u'disability', u'disabled', u'veteran', u'vietnam', u'veteran', u'qualifying', u'veteran', u'diverse', u'candidates', u'respond', u'developing', u'workplace', u'reflects', u'diversity', u'communities', u'reviews', u'electrical', u'contracting', u'southwest', u'electrical', u'contractors'], [u'intern', u'electrical', u'engineering', u'idexx', u'laboratories', u'validating', u'idexx', u'integrated', u'hardware', u'entails', u'planning', u'debug', u'validation', u'engineers', u'validation', u'methodologies', u'healthcare', u'platforms', u'brightest', u'solve', u'challenges', u'innovation', u'technology', u'idexx', u'intern', u'idexx', u'interns', u'supplement', u'interns', u'teams', u'roles', u'competitive', u'interns', u'idexx', u'interns', u'participate', u'internships', u'mentors', u'seminars', u'topics', u'leadership', u'workshops', u'relevant', u'planning', u'topics', u'intern', u'presentations', u'mixers', u'applicants', u'ineligible', u'laboratory', u'compliant', u'idexx', u'laboratories', u'healthcare', u'innovation', u'practicing', u'veterinarians', u'diagnostic', u'technology', u'idexx', u'enhance', u'veterinarians', u'efficiency', u'economically', u'idexx', u'worldwide', u'diagnostic', u'tests', u'tests', u'quality', u'headquartered', u'idexx', u'laboratories', u'employs', u'customers', u'qualifications', u'applicants', u'idexx', u'interns', u'potential', u'demonstrated', u'portfolio', u'recommendation', u'resumes', u'marketing', u'location', u'americas', u'verification', u'validation', u'schedule', u'overtime', u'idexx', u'laboratories', u'reviews', u'idexx', u'laboratories', u'nasdaq', u'healthcare', u'innovation', u'practicing', u'veterinarians'], [u'location', u'duration', u'temp', u'verification', u'validation', u'tester', u'verification', u'validation', u'middleware', u'specifically', u'testing', u'applications', u'clinical', u'laboratory', u'regulated', u'environment', u'responsibilities', u'complex', u'hardware', u'testing', u'clinical', u'analyzers', u'laboratory', u'graphical', 
u'interfaces', u'complex', u'sample', u'sequencing', u'protocols', u'developers', u'correction', u'tracking', u'tool', u'timely', u'troubleshoot', u'testing', u'functional', u'manual', u'automated', u'participate', u'ongoing'],[u'testing', u'coverage', u'planning', u'documentation', u'testing', u'validation', u'corrections', u'monitor', u'implementation', u'recurrence', u'operating', u'statistical', u'quality', u'testing', u'global', u'multi', u'teams', u'travel', u'skills', u'concepts', u'waterfall', u'agile', u'methodologies', u'debugging', u'skills', u'complex', u'automated', u'instrumentation', u'environment', u'hardware', u'mechanical', u'components', u'tracking', u'lifecycle', u'management', u'quality', u'organize', u'define', u'priorities', u'organize', u'supervision', u'aggressive', u'deadlines', u'ambiguity', u'analyze', u'complex', u'situations', u'concepts', u'technologies', u'verbal', u'skills', u'effectively', u'technical', u'clinical', u'diverse', u'strategy', u'clinical', u'chemistry', u'analyzer', u'laboratory', u'middleware', u'basic', u'automated', u'testing', u'biomedical', u'engineering', u'technologists', u'laboratory', u'technology', u'availability', u'click', u'attach'], [u'scientist', u'linux', u'asrc', u'scientist', u'linux', u'asrc', u'technology', u'solutions', u'subsidiary', u'asrc', u'engineering', u'technology', u'contracts'], [u'multiple', u'agencies', u'scientists', u'engineers', u'management', u'personnel', u'allows', u'solutions', u'complex', u'aeronautics', u'aviation', u'management', u'aviation', u'engineering', u'hughes', u'technical', u'technical', u'aviation', u'evaluation', u'engineering', u'management', u'technical', u'terminal', u'surveillance', u'programs', u'currently', u'scientist', u'travel', u'responsibilities', u'develops', u'technology', u'modifies', u'technical', u'complex', u'reviews', u'draft', u'conformity', u'completeness', u'testing', u'interface', u'hardware', u'regression', u'impact', u'reliability', u'maintainability', u'factors', u'standardization', u'skills', u'travel', u'programming', u'linux', u'environment', u'cisco', u'knowledge', u'terminal', u'environment', u'clearance', u'clearance', u'input', u'output', u'digital', u'automatic', u'terminal', u'management', u'controller', u'termination', u'testing', u'evaluating', u'policies', u'procedure', u'interface', u'installation', u'verification', u'certification', u'core', u'avionic', u'programs', u'knowledge', u'procedural', u'testing', u'interfacing', u'hardware', u'regression', u'impact', u'reliability', u'maintainability', u'factors', u'standardization', u'missions', u'asrc', u'subsidiaries', u'affirmative', u'employers', u'applicants', u'disability', u'veteran', u'technology', u'location', u'airport', u'bachelor', u'schedule', u'travel', u'contributor', u'management', u'asrc', u'reviews'], [u'technical', u'solarcity', u'niche', u'vegas', u'overview', u'resolving', u'customer', u'clients', u'expanding', u'engineers', u'developers', u'responsibilities', u'knowledge', u'planning', u'adapt', u'dynamic', u'environment', u'inventive', u'creative', u'solarcity', u'lifecycle', u'responsibilities', u'technical', u'analyzing', u'diagnosing', u'troubleshooting', u'customers', u'ticketing', u'console', u'escalate', u'knowledge', u'engineering', u'timely', u'basic', u'phone', u'functionality', u'customer', u'tracking', u'knowledgebase', u'rotation', u'configure', u'deployment', u'sccm', u'technical', u'deployment', u'deploy', u'hardware', u'solarcity', u'bachelor', u'knowledge', u'dell', 
u'laptops', u'analytical', u'troubleshooting', u'solving', u'skills', u'knowledge', u'databases', u'preferably', u'server', u'preferably', u'monitoring', u'suites', u'documentation', u'procedures', u'knowledge', u'entries', u'verbal', u'skills', u'customer', u'skills', u'competitive', u'solar', u'package', u'insurance', u'vacation', u'savings', u'referral', u'eligibility', u'equity', u'performers', u'solarcity', u'affirmative', u'diversity', u'workplace', u'applicants', u'orientation', u'disability', u'veteran', u'careerrookie'], [u'embedded', u'exelis', u'junction', u'exelis', u'embedded', u'acquisition', u'networking', u'capabilities', u'classified', u'customer', u'motivated', u'develops', u'tests', u'innovative', u'solutions', u'minimal', u'supervision', u'paced', u'environment', u'enjoys', u'assignments', u'interact', u'multi', u'disciplined', u'challenging', u'focused', u'embedded', u'developments', u'spanning', u'engineering', u'lifecycle', u'specification', u'enhancement', u'applications', u'embedded', u'freescale', u'applications', u'android', u'platforms', u'interface', u'customers', u'developers', u'refine', u'specifications', u'architectures'],[u'java', u'programming', u'scripts', u'python', u'debug', u'debugging', u'emulators', u'regression', u'revisions', u'specialized', u'setups', u'capabilities', u'subversion', u'technical', u'documentation', u'multiple', u'engineering', u'techexpousa', u'reviews'], [u'modeler', u'semantic', u'modeling', u'models', u'skills', u'ontology', u'resource', u'framework', u'schema', u'technologies', u'hadoop', u'warehouse', u'oracle', u'relational', u'artifacts', u'models', u'dictionaries', u'models', u'interface', u'specifications', u'documentation', u'harmonization', u'mappings', u'aligned', u'coordinate', u'technical', u'peer', u'reviews', u'stakeholder', u'communities', u'impact', u'domains', u'relationships', u'interdependencies', u'models', u'define', u'analyze', u'legacy', u'models', u'corporate', u'databases', u'architectural', u'alignment', u'customer', u'expertise', u'harmonization', u'modeling', u'modeling', u'consulting', u'stakeholders', u'quality', u'models', u'storage', u'agile', u'specifically', u'focus', u'modeling', u'qualifications', u'bachelors', u'accredited', u'modeler', u'encompass', u'evaluation', u'skills', u'knowledge', u'modeling', u'techniques', u'resource', u'framework', u'schema', u'technologies', u'unified', u'modeling', u'technologies', u'schemas', u'ontologies', u'sybase', u'knowledge', u'skills', u'interpersonal', u'skills', u'customers', u'clearance', u'applicants', u'eligibility', u'classified', u'clearance', u'polygraph', u'techexpousa', u'solutions', u'partnership', u'solutions', u'integration'], [u'technologies', u'junction', u'develops', u'maintains', u'enhances', u'complex', u'diverse', u'intensive', u'analytics', u'algorithm', u'manipulation', u'management', u'documented', u'individually', u'reviews', u'tests', u'components', u'adherence', u'resolves', u'utilizes', u'methodologies', u'environment', u'input', u'components', u'hardware', u'offs', u'reuse', u'cots', u'gots', u'synthesis', u'components', u'tasks', u'individually', u'analyzes', u'modifies', u'debugs', u'corrects', u'integrates', u'operating', u'environments', u'develops', u'queries', u'databases', u'repositories', u'recommendations', u'improving', u'documentation', u'develops', u'implements', u'algorithms', u'functional', u'assists', u'developing', u'executing', u'procedures', u'components', u'reviews', u'documentation', u'solutions', 
u'analyzing', u'conferring', u'users', u'engineers', u'analyzing', u'investigating', u'areas', u'adapt', u'hardware', u'mathematical', u'models', u'predict', u'outcome', u'implement', u'complex', u'database', u'repository', u'interfaces', u'queries', u'bachelors', u'accredited', u'substituted', u'bachelors', u'firewalls', u'ipsec', u'vpns', u'technology', u'administering', u'servers', u'apache', u'jboss', u'tomcat', u'developing', u'interfaces', u'firefox', u'internet', u'explorer', u'operating', u'mainframe', u'linux', u'solaris', u'virtual', u'scripting', u'programming', u'oriented', u'programming', u'ajax', u'script', u'procedures', u'cobol', u'cognos', u'fusion', u'focus', u'html', u'java', u'java', u'script', u'jquery', u'perl', u'visual', u'basic', u'powershell', u'cots', u'cots', u'oracle', u'apex', u'integration', u'competitive', u'package', u'bonus', u'corporate', u'equity', u'tuition', u'reimbursement', u'referral', u'bonus', u'holidays', u'insurance', u'flexible', u'disability', u'insurance'], [u'technologies', u'disability', u'accommodation', u'recruiter', u'techexpousa'], + ['bank','river','shore','water'],['river','water','flow','fast','tree'],['bank','water','fall','flow'],['bank','bank','water','rain','river'], + ['river','water','mud','tree'],['money','transaction','bank','finance'], + ['bank','borrow','money'], ['bank','finance'], ['finance','money','sell','bank'],['borrow','sell'],['bank','loan','sell']] + # initializing using own LDA sufficient statistics so that we get same results each time. + sstats = numpy.loadtxt(datapath('sstats_test.txt')) + dictionary = Dictionary(texts) + corpus = [dictionary.doc2bow(text) for text in texts] + self.ldaseq = ldaseqmodel.LdaSeqModel(corpus = corpus , id2word= dictionary, num_topics=2, time_slice=[10, 10, 11], initialize='own', sstats=sstats) + + # testing topic word proportions + def testTopicWord(self): + + topics = self.ldaseq.print_topics(0) + expected_topic_word = [( 0.035999999999999997, 'skills')] + self.assertAlmostEqual(topics[0][0][0], expected_topic_word[0][0], places=2) + self.assertEqual(topics[0][0][1], expected_topic_word[0][1]) + + # testing document-topic proportions + def testDocTopic(self): + doc_topic = self.ldaseq.doc_topics(0) + expected_doc_topic = 0.00066577896138482028 + self.assertAlmostEqual(doc_topic[0], expected_doc_topic, places=2) + +if __name__ == '__main__': + logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.DEBUG) + unittest.main()
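+
+
+# A minimal usage sketch, not exercised by the test suite above and named
+# illustratively (example_inspect_ldaseq is not part of gensim): it assumes
+# only the accessors already used in TestLdaSeq (print_topics, doc_topics)
+# plus gensim.matutils.hellinger, and shows how a fitted LdaSeqModel could be
+# queried to compare two documents' topic distributions.
+def example_inspect_ldaseq(ldaseq):
+    from gensim.matutils import hellinger
+    # top words per topic in the first time-slice
+    print(ldaseq.print_topics(0))
+    # topic proportions of the first two documents in the corpus
+    doc_0 = ldaseq.doc_topics(0)
+    doc_1 = ldaseq.doc_topics(1)
+    # Hellinger distance between the two topic distributions
+    print(hellinger(doc_0, doc_1))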