[MRG] Lda training visualization in visdom #1399

Merged: 36 commits, Aug 30, 2017

Conversation

@parulsethi (Contributor) commented Jun 7, 2017:

This PR adds an option to visualize LDA evaluation parameters in real time, during or after training, using visdom.

  • Add notebook

@parulsethi changed the title from "[WIP] Lda logging for Tensorboard" to "[WIP] Lda training visualization in visdom" on Jun 16, 2017
@parulsethi (Contributor, Author):

Updated this PR to use visdom, removed tensorboard code.

There are a few parameters (distance, coherence, texts, window_size, topn) in ldamodel that are required for two different functionalities (one for visualization in this PR and the other for logging in #1381).

Maybe we can define a dict-based input for the visualization-related parameters, for example Viz={param1: value1, param2: value2, ..}. Otherwise, what alternatives could be used?

@tmylk (Contributor) commented Jun 20, 2017:

Please add the ipynb with instructions on how to set up visdom to show the live stats.

@parulsethi changed the title from "[WIP] Lda training visualization in visdom" to "Lda training visualization in visdom" on Jun 27, 2017
alpha='symmetric', eta=None, decay=0.5, offset=1.0, eval_every=10,
iterations=50, gamma_threshold=0.001, minimum_probability=0.01,
random_state=None, ns_conf={}, minimum_phi_value=0.01,
per_word_topics=False, viz=False, env=None, distance="kulback_leibler",
@menshikh-iv (Contributor) commented Jun 28, 2017:

Move all of these parameters into metrics. By default, metrics={}; otherwise, metrics contains the parameters needed for visualizing things, for example:

metrics = {
   'diff_distance': 'jaccard',
   'coherence_windows_size': 10,
   ...
}

Contributor:

And don't forget about argument validation (write tests).

Contributor (Author):

Validation of every argument's object type? Or of the valid values for 'diff_distance' and 'coherence' (for which there are already tests in the respective Diff and CoherenceModel functions)?

" Now, this graph along with the others explained below, can be used to decide if it's time to stop the training. We can see if the value stops changing after some epochs and that we are able to get the highest possible coherence of our model. \n",
"\n",
"\n",
"2. **Perplexity**\n",
Contributor:

The numbers will not be rendered correctly, please remove them


`env` defines the environment to use in visdom browser

`distance` measure to be used for Diff plot visualization
Contributor:

Edit docstring in accordance with comment about viz

gamma_threshold=None, chunks_as_numpy=False):
def update(self, corpus, chunksize=None, decay=None, offset=None, passes=None, update_every=None,
eval_every=None, iterations=None, gamma_threshold=None, chunks_as_numpy=False,
viz=None, env=None, distance=None, coherence=None, texts=None, window_size=None, topn=None):
@menshikh-iv (Contributor), Jun 28, 2017:

Please use the metrics argument for all of these metric-related args as well.

if self.viz:
# calculate coherence
cm = gensim.models.CoherenceModel(model=self, corpus=corpus, texts=texts, coherence=coherence, window_size=window_size, topn=topn)
Coherence = np.array([cm.get_coherence()])
Contributor:

The variable name should be lowercase (here and everywhere).

if reallen != lencorpus:
raise RuntimeError("input corpus size changed during training (don't use generators as input)")

if self.viz:
# calculate coherence
Contributor:

Move this block into a separate function (together with the stat calculation).

@EmilStenstrom:

The error from Travis:

======================================================================

ERROR: testIdentity (gensim.test.test_tmdiff.TestLdaDiff)

----------------------------------------------------------------------

Traceback (most recent call last):

  File "/home/travis/build/RaRe-Technologies/gensim/gensim/test/test_tmdiff.py", line 42, in testIdentity

    mdiff, annotation = self.model.diff(self.model, n_ann_terms=self.n_ann_terms, distance=dist_name)

  File "/home/travis/build/RaRe-Technologies/gensim/gensim/models/ldamodel.py", line 1106, in diff

    raise ValueError("Incorrect distance, valid only {}".format(valid_keys))

ValueError: Incorrect distance, valid only `jaccard`, `kullback_leibler`, `hellinger`

Seems like a real error.

@parulsethi (Contributor, Author):

There's a typo in "kullback_leibler" in the diff function here. I replaced it with the correct spelling in this PR but forgot to make the same replacement in the tests.

This PR might get a bit delayed due to API decisions, so I'll make a separate PR to correct the typo.

@parulsethi (Contributor, Author):

@tmylk @menshikh-iv updated the API structure as discussed recently

alpha='symmetric', eta=None, decay=0.5, offset=1.0, eval_every=10,
iterations=50, gamma_threshold=0.001, minimum_probability=0.01,
random_state=None, ns_conf={}, minimum_phi_value=0.01,
per_word_topics=False, callbacks=None):
Contributor:

Add description for callbacks in docstring

if any(isinstance(metric, (DiffMetric, ConvergenceMetric)) for metric in self.metrics):
self.previous = copy.deepcopy(model)
# store diff diagnols of previous epochs
self.diff_mat = Queue()
Contributor:

What's the reason to use a queue (not a list)?

@parulsethi (Contributor, Author), Jul 17, 2017:

There could be any number of diff/convergence metric inputs (for example, with different distance measures), so by using a queue we don't need to keep track of their count and the epoch count; we simply rely on the sequence order.
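
A minimal sketch of this idea (the helper name and attribute are assumptions, not the PR's exact code): diagonals are enqueued at the end of each epoch and dequeued in the same FIFO order, so no per-metric index or epoch counter is needed.

from queue import Queue

diff_mat = Queue()

def on_epoch_end_sketch(current_diagonals):
    # retrieve last epoch's diagonals in the same order they were stored
    previous = [diff_mat.get() for _ in range(diff_mat.qsize())]
    # store this epoch's diagonals for the next call
    for diagonal in current_diagonals:
        diff_mat.put(diagonal)
    return previous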



class CoherenceMetric(Metric):
def __init__(self, corpus=None, texts=None, dictionary=None, coherence=None, window_size=None, topn=None, logger=None, viz_env=None, title=None):
Contributor:

Add a docstring with a description of all parameters (here and everywhere).

Contributor:

Maybe it's a good idea to use logger="shell" by default for all scalar metrics?

@@ -0,0 +1,189 @@
{
Contributor:

Please add an example with logger="shell" to the notebook (and show the logging output there).

@piskvorky (Owner) left a comment:

Finally have some time for review, wrapping my head around the classes and callbacks here :) Your help will be appreciated.


def get_value(self, **parameters):
"""
Set the parameters
Owner:

Isn't get_value a misnomer for setting parameters?

@parulsethi (Contributor, Author), Jul 27, 2017:

Yes, I'll replace with set_parameters

"# define other remaining metrics available\n",
"ch_umass = CoherenceMetric(corpus=training_corpus, coherence=\"u_mass\", logger=\"visdom\", viz_env=\"LdaModel\", title=\"Coherence (u_mass)\")\n",
"diff_kl = DiffMetric(distance=\"kullback_leibler\", logger=\"visdom\", viz_env=\"LdaModel\", title=\"Diff (kullback_leibler)\")\n",
"convergence_jc = ConvergenceMetric(distance=\"hellinger\", logger=\"visdom\", viz_env=\"LdaModel\", title=\"Convergence (jaccard)\")\n",
Owner:

Docs say jaccard, but distance is hellinger.

"source": [
"# Training Logs\n",
"\n",
"We can also log the metric values after every epoch to the shell apart from visualizing them in Visdom. The only change is to define `logger=\"shell\"` instead of `\"visdom\"` in the input callbacks."
Owner:

Can you add an example of how to get the values out programmatically (no logging or plotting, just using them from arbitrary Python code)?

Contributor (Author):

It can be done by using a loop to manually iterate over the model's training and calling the metric classes after each pass to store the value:

# sketch: corpus, dictionary and epochs are defined elsewhere
model = LdaModel(corpus=corpus, id2word=dictionary, passes=1)
perplexity = []
for epoch in range(epochs):
    model.update(corpus, passes=1)
    pl = PerplexityMetric(corpus=corpus).get_value(model=model)
    perplexity.append(pl)

and sure, I'll add an example for this in notebook.

@piskvorky (Owner), Jul 27, 2017:

Thanks. Is there a way to use the callbacks in a way that they collect this info?

I'm thinking of a type of "logger" that, instead of logging, appends the value to some internal list, which other parts of the app can read from.

The idea is the interface would be the same as for logger=visdom/shell, without a need for an explicit outer loop like in your example.

Is that possible?

Contributor (Author):

Yes, we can store them just after they are calculated in this step.

Maybe in a dict which could be an attribute of LdaModel? The structure could be:

metrics = {'PerplexityMetric': [val1, val2, ...], 'DiffMetric': [val1, val2, ...]}

on_epoch_end() can be made to return the metric values from the current epoch, which could then be appended to the metrics dict after this step.
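
A rough sketch of this idea (the helpers train_one_pass and callback are placeholders, not part of the PR): the training loop appends whatever on_epoch_end() returns into a dict kept on the model, so arbitrary Python code can read model.metrics afterwards.

from collections import defaultdict

metrics = defaultdict(list)
for epoch in range(num_passes):
    train_one_pass(model, corpus)                # placeholder for one pass of LdaModel.update
    epoch_values = callback.on_epoch_end(epoch)  # e.g. {'Perplexity': 495.9, 'Coherence': -22.4}
    for name, value in epoch_values.items():
        metrics[name].append(value)
model.metrics = metrics                          # readable later, e.g. model.metrics['Coherence']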

Contributor (Author):

Used metrics dict to save values as described above

"INFO:gensim.models.ldamodel:topic #3 (0.200): 0.041*\"the\" + 0.024*\"of\" + 0.021*\"to\" + 0.016*\"a\" + 0.016*\"and\" + 0.012*\"s\" + 0.010*\"in\" + 0.007*\"as\" + 0.006*\"with\" + 0.006*\"party\"\n",
"INFO:gensim.models.ldamodel:topic #4 (0.200): 0.034*\"the\" + 0.023*\"of\" + 0.016*\"a\" + 0.015*\"and\" + 0.012*\"in\" + 0.011*\"to\" + 0.008*\"as\" + 0.007*\"is\" + 0.007*\"with\" + 0.007*\"that\"\n",
"INFO:gensim.models.ldamodel:topic diff=1.878616, rho=1.000000\n",
"INFO:LdaModel:Epoch 0: Perplexity estimate: 495.948922721\n",
Contributor:

Please use the same logging format as in LdaModel (INFO:gensim.models.ldamodel).

Contributor (Author):

Done

"INFO:gensim.models.ldamodel:topic #4 (0.200): 0.034*\"the\" + 0.023*\"of\" + 0.016*\"a\" + 0.015*\"and\" + 0.012*\"in\" + 0.011*\"to\" + 0.008*\"as\" + 0.007*\"is\" + 0.007*\"with\" + 0.007*\"that\"\n",
"INFO:gensim.models.ldamodel:topic diff=1.878616, rho=1.000000\n",
"INFO:LdaModel:Epoch 0: Perplexity estimate: 495.948922721\n",
"INFO:LdaModel:Epoch 0: Perplexity estimate: 609.481073631\n",
Contributor:

Why is perplexity shown twice (and with different values)?

Contributor (Author):

One is for the holdout corpus and the other for the test corpus. I've updated the log statement to take the metric name from the input parameter title.
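
For illustration, a hedged sketch of what such a setup could look like (import path and title strings are assumptions): one PerplexityMetric per evaluation corpus, told apart by its title in the shell log.

from gensim.models.callbacks import PerplexityMetric  # module path assumed

pl_holdout = PerplexityMetric(corpus=holdout_corpus, logger="shell", title="Perplexity (hold_out)")
pl_test = PerplexityMetric(corpus=test_corpus, logger="shell", title="Perplexity (test)")
callbacks = [pl_holdout, pl_test]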

"INFO:LdaModel:Epoch 0: Perplexity estimate: 609.481073631\n",
"INFO:LdaModel:Epoch 0: Coherence estimate: -22.4407925538\n",
"INFO:LdaModel:Epoch 0: Diff estimate: [ 0.44194712 0.96670853 0.8036907 0.72372737 0.63141336]\n",
"INFO:LdaModel:Epoch 0: Convergence estimate: 0.0\n",
Contributor:

Why is it always zero?

Contributor (Author):

The previous state of the model (needed to calculate convergence) was being saved in the wrong place. Corrected it.

"INFO:LdaModel:Epoch 0: Coherence estimate: -22.4407925538\n",
"INFO:LdaModel:Epoch 0: Diff estimate: [ 0.44194712 0.96670853 0.8036907 0.72372737 0.63141336]\n",
"INFO:LdaModel:Epoch 0: Convergence estimate: 0.0\n",
"INFO:gensim.models.ldamodel:-7.121 per-word bound, 139.2 perplexity estimate based on a held-out corpus of 25 documents with 2182 words\n",
Contributor:

Perplexity is calculated twice, wdyt @parulsethi? You should disable the evaluation flag in the notebook.

Contributor (Author):

Disabled eval_every
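
A sketch of the resulting call (parameter values are illustrative only): eval_every=None turns off LdaModel's built-in perplexity logging, so perplexity is computed only by the callbacks.

from gensim.models import LdaModel

model = LdaModel(corpus=training_corpus, id2word=training_dictionary, num_topics=5,
                 passes=10, eval_every=None, callbacks=callbacks)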

['graph', 'trees', 'binary', 'widths']]
"""
# only one of the model or topic would be defined
self.model = None
Contributor:

Why do you need this assignment (only in the current callback)?

Contributor (Author):

Both model and topics can be used to calculate coherence, but only one of them will be defined in **kwargs. This assignment is just to avoid a "name not defined" error for the other variable, the one not passed in **kwargs.
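
A minimal sketch of the pattern being described (the method body is an assumption; only the idea comes from the discussion): pre-setting both attributes to None means whichever of model/topics is missing from **kwargs still exists as an attribute instead of raising an error later.

class MetricSketch(object):
    def set_parameters(self, **kwargs):
        # only one of model/topics will actually be passed in
        self.model = None
        self.topics = None
        for key, value in kwargs.items():
            setattr(self, key, value)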

other_model : second topic model instance to calculate the difference from
"""
super(DiffMetric, self).set_parameters(**kwargs)
diff_matrix, _ = self.model.diff(self.other_model, self.distance, self.num_words, self.n_ann_terms, self.normed)
Contributor:

Now you can use the new version of diff (with the diagonal and annotation flags).

Contributor (Author):

Updated
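
A sketch of what the updated call could look like, assuming gensim's newer diff() signature with the diagonal and annotation flags; requesting only the diagonal avoids building the full matrix and the annotations.

diff_diagonal, _ = self.model.diff(self.other_model, distance=self.distance,
                                   num_words=self.num_words, n_ann_terms=self.n_ann_terms,
                                   normed=self.normed, diagonal=True, annotation=False)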

@piskvorky (Owner) left a comment:

Minor code style comments.

"test_data_dir = '{}'.format(os.sep).join([gensim.__path__[0], 'test', 'test_data'])\n",
"lee_corpus = test_data_dir + os.sep + 'lee.cor'\n",
"lee_train_file = test_data_dir + os.sep + 'lee_background.cor'\n",
@piskvorky (Owner), Aug 3, 2017:

os.path.join is more standard.

Contributor (Author):

Updated

callback = Callback(self.callbacks)
callback.set_model(self)
# initialize metrics dict to store metric values after every epoch
self.metrics = {}
for metric in self.callbacks:
Owner:

Dict comprehension more readable?

Also, defaultdict might make the logic a little simpler.

@parulsethi (Contributor, Author), Aug 9, 2017:

Updated to use defaultdict

# plot all metrics in current epoch
for i, metric in enumerate(self.metrics):
value = metric.get_value(topics=topics, model=self.model, other_model=self.previous)
if metric.title is not None:
Contributor:

Please remove this if/else; instead, define a __str__ or __repr__ method for each metric class, after which you can use label = str(metric).
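
For illustration, a sketch of the suggested simplification (variable names taken from the surrounding hunk, the rest assumed):

# plot all metrics in current epoch
for metric in self.metrics:
    value = metric.get_value(topics=topics, model=self.model, other_model=self.previous)
    label = str(metric)  # each metric class provides __str__, so no if/else on title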

"""
Base Metric class for topic model evaluation metrics
"""
def __init__(self):
Contributor:

No need to define empty __init__ in base class

@menshikh-iv (Contributor):

LGTM I think, last minor changes and I'll merge it

@parulsethi changed the title from "Lda training visualization in visdom" to "[MRG] Lda training visualization in visdom" on Aug 10, 2017
@@ -87,6 +93,9 @@ def __init__(self, corpus=None, texts=None, dictionary=None, coherence=None, win
self.viz_env = viz_env
self.title = title

def __str__(self):
Contributor:

By default, if the child class has no such method, the parent class's method is called, so there is no need to define __str__ in each callback class explicitly.
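
A small sketch of the inheritance point (the title fallback is an assumption, not the PR's exact code):

class Metric(object):
    def __str__(self):
        # child classes inherit this; str(metric) works without redefining __str__ in each one
        return self.title if getattr(self, 'title', None) else type(self).__name__

class CoherenceMetric(Metric):
    pass  # no __str__ needed here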

"\n",
"def read_corpus(fname):\n",
" texts = []\n",
" with open(fname, encoding=\"ISO-8859-1\") as f:\n",
Contributor:

This doesn't work on Python 2 (because the encoding argument isn't supported there); replace it with smart_open.
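
A sketch of the suggested replacement, assuming the smart_open() helper from the smart_open package available at the time; it mirrors the notebook's read_corpus and decodes bytes explicitly so the same code runs on Python 2 and 3.

from smart_open import smart_open

def read_corpus(fname):
    texts = []
    with smart_open(fname, 'rb') as f:
        for line in f:
            texts.append(line.decode("ISO-8859-1").split())
    return texts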

"\n",
"\n",
"# Set file names for train and test data\n",
"test_data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data')\n",
Contributor:

Can you use a larger dataset for this (downloaded from a link in the notebook)?

"test_dictionary = Dictionary(test_texts)\n",
"\n",
"training_corpus = [training_dictionary.doc2bow(text) for text in training_texts]\n",
"holdout_corpus = [holdout_dictionary.doc2bow(text) for text in holdout_texts]\n",
Contributor:

These are 3 different mappings, which is a mistake.

You should fit your Dictionary on training_texts and use it for all conversions (for holdout/test too).
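
A sketch of the requested fix (variable names taken from the notebook): a single Dictionary fit on the training texts, reused so all corpora share one id mapping.

from gensim.corpora import Dictionary

training_dictionary = Dictionary(training_texts)
training_corpus = [training_dictionary.doc2bow(text) for text in training_texts]
holdout_corpus = [training_dictionary.doc2bow(text) for text in holdout_texts]
test_corpus = [training_dictionary.doc2bow(text) for text in test_texts]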

@tmylk (Contributor) commented Aug 17, 2017:

Could you please add to the ipynb an actual screenshot of the entire Visdom visualization that your code generates, with both train and hold-out perplexity in particular?

@menshikh-iv (Contributor):

Congratulations @parulsethi, this summer you did a lot of visualization work for gensim.
This was the last PR, and now you have finished your GSoC project 🔥 🥇 🔥.

@menshikh-iv merged commit ae31c0c into piskvorky:develop on Aug 30, 2017