[MRG] Lda training visualization in visdom #1399

Merged
merged 36 commits into from
Aug 30, 2017
Changes from 2 commits
bb65439
save log params in a dict
parulsethi Jun 7, 2017
9d2e78d
remove redundant line
parulsethi Jun 7, 2017
33818ec
add diff log
parulsethi Jun 7, 2017
281222c
remove diff log
parulsethi Jun 8, 2017
c507bbb
write params to log directory
parulsethi Jun 8, 2017
6f75ccc
add convergence, remove alpha
parulsethi Jun 9, 2017
d9db4e2
calculate perplexity/diff instead of using log function
parulsethi Jun 9, 2017
cd5f822
add docstrings and comments
parulsethi Jun 9, 2017
f4728e0
add coherence/diff labels in graphs
parulsethi Jun 12, 2017
40cf092
Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
parulsethi Jun 16, 2017
d4f69f5
optional measures for viz
parulsethi Jun 16, 2017
fde7d4d
add coherence params to lda init
parulsethi Jun 16, 2017
3f18076
added Lda Visdom viz notebook
parulsethi Jun 26, 2017
546908e
add option to specify env
parulsethi Jun 26, 2017
651a61a
made requested changes
parulsethi Jun 28, 2017
13dfddc
Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
parulsethi Jul 8, 2017
1376d90
add generic callback API
parulsethi Jul 8, 2017
44c8e58
modified Notebook for new API
parulsethi Jul 8, 2017
92949a3
fix flake8
parulsethi Jul 8, 2017
5b22e4d
correct lee corpus division
parulsethi Jul 12, 2017
c369fc5
added docstrings
parulsethi Jul 17, 2017
a32960d
fix flake8
parulsethi Jul 18, 2017
48526d9
add shell example
parulsethi Jul 18, 2017
adf2a60
fix queue import for both py2/py3
parulsethi Jul 19, 2017
a272090
store metrics in model instance
parulsethi Aug 2, 2017
d3389bb
add nb example for getting metrics after train
parulsethi Aug 3, 2017
96949f7
merge develop
parulsethi Aug 8, 2017
7d0f0ec
made requested changes
parulsethi Aug 8, 2017
dcc64a1
use dict for saving metrics
parulsethi Aug 9, 2017
47434f9
use str method for metric classes
parulsethi Aug 10, 2017
30c9b64
correct a notebook description
parulsethi Aug 10, 2017
e55af47
remove child-classes str method
parulsethi Aug 10, 2017
df5e01f
made requested changes
parulsethi Aug 23, 2017
b334c50
Merge branch 'develop' into tensorboard_logs
parulsethi Aug 24, 2017
c54e6bf
add visdom screenshot
parulsethi Aug 24, 2017
5f3d902
Merge branch 'tensorboard_logs' of https://github.com/parulsethi/gens…
parulsethi Aug 24, 2017
71 changes: 54 additions & 17 deletions docs/notebooks/Training_visualizations.ipynb
@@ -44,9 +44,11 @@
"from gensim.models import ldamodel\n",
"from gensim.corpora.dictionary import Dictionary\n",
"\n",
"# Set file names for train data\n",
"\n",
"# Set file names for train and test data\n",
"test_data_dir = '{}'.format(os.sep).join([gensim.__path__[0], 'test', 'test_data'])\n",
"lee_corpus = test_data_dir + os.sep + 'lee.cor'\n",
"lee_train_file = test_data_dir + os.sep + 'lee_background.cor'\n",
Owner

@piskvorky piskvorky Aug 3, 2017

os.path.join more standard.

Contributor Author

Updated
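The suggestion in this thread could look roughly like the sketch below. It builds the same three paths with `os.path.join` instead of concatenating with `os.sep`. `package_root` is a hypothetical stand-in for `gensim.__path__[0]`, assumed here only so the snippet runs without gensim installed.

```python
import os

# Hypothetical stand-in for gensim.__path__[0] (an assumption, not the real install path).
package_root = os.path.join("site-packages", "gensim")

# os.path.join inserts the platform separator between components.
test_data_dir = os.path.join(package_root, "test", "test_data")
lee_train_file = os.path.join(test_data_dir, "lee_background.cor")
lee_test_file = os.path.join(test_data_dir, "lee.cor")
```

The result is the same string as the `os.sep` concatenation, but the intent is clearer and there is no separator to forget.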

"lee_test_file = test_data_dir + os.sep + 'lee.cor'\n",
"\n",
"def read_corpus(fname):\n",
" texts = []\n",
@@ -59,12 +61,12 @@
" texts.append(words)\n",
" return texts\n",
"\n",
"texts = read_corpus(lee_corpus)\n",
"training_texts = read_corpus(lee_train_file)\n",
"eval_texts = read_corpus(lee_test_file)\n",
"\n",
"# Split test data into hold_out and test corpus\n",
"training_texts = texts[:25]\n",
"holdout_texts = texts[25:40]\n",
"test_texts = texts[40:50]\n",
"holdout_texts = eval_texts[:25]\n",
"test_texts = eval_texts[25:]\n",
"\n",
"training_dictionary = Dictionary(training_texts)\n",
"holdout_dictionary = Dictionary(holdout_texts)\n",
@@ -78,26 +80,25 @@
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"metadata": {},
"outputs": [],
"source": [
"from gensim.models.callbacks import CoherenceMetric, DiffMetric, PerplexityMetric, ConvergenceMetric\n",
"\n",
"# define perplexity callback for hold_out and test corpus\n",
"pl_holdout = PerplexityMetric(corpus=holdout_corpus, logger=\"visdom\", viz_env=\"LdaModel\", title=\"Perplexity (hold_out)\")\n",
"pl_test = PerplexityMetric(corpus=test_corpus, logger=\"visdom\", viz_env=\"LdaModel\", title=\"Perplexity (test)\")\n",
"pl_holdout = PerplexityMetric(corpus=holdout_corpus, logger=\"visdom\", title=\"Perplexity (hold_out)\")\n",
"pl_test = PerplexityMetric(corpus=test_corpus, logger=\"visdom\", title=\"Perplexity (test)\")\n",
"\n",
"# define other remaining metrics available\n",
"ch_umass = CoherenceMetric(corpus=training_corpus, coherence=\"u_mass\", logger=\"visdom\", viz_env=\"LdaModel\", title=\"Coherence (u_mass)\")\n",
"diff_kl = DiffMetric(distance=\"kullback_leibler\", logger=\"visdom\", viz_env=\"LdaModel\", title=\"Diff (kullback_leibler)\")\n",
"convergence_jc = ConvergenceMetric(distance=\"hellinger\", logger=\"visdom\", viz_env=\"LdaModel\", title=\"Convergence (jaccard)\")\n",
"ch_umass = CoherenceMetric(corpus=training_corpus, coherence=\"u_mass\", logger=\"visdom\", title=\"Coherence (u_mass)\")\n",
"ch_cv = CoherenceMetric(corpus=training_corpus, texts=training_texts, coherence=\"c_v\", logger=\"visdom\", title=\"Coherence (c_v)\")\n",
"diff_kl = DiffMetric(distance=\"kullback_leibler\", logger=\"visdom\", title=\"Diff (kullback_leibler)\")\n",
"convergence_hl = ConvergenceMetric(distance=\"hellinger\", logger=\"visdom\", title=\"Convergence (hellinger)\")\n",
"\n",
"callbacks = [pl_holdout, pl_test, ch_umass, diff_kl, convergence_jc]\n",
"callbacks = [pl_holdout, pl_test, ch_umass, ch_cv, diff_kl, convergence_hl]\n",
"\n",
"# training LDA model\n",
"model = ldamodel.LdaModel(corpus=training_corpus, id2word=training_dictionary, passes=5, num_topics=5, callbacks=callbacks)"
"model = ldamodel.LdaModel(corpus=training_corpus, id2word=training_dictionary, passes=3, num_topics=5, callbacks=callbacks)"
]
},
{
@@ -116,7 +117,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"-22.4298221364\n"
"-0.259766196856\n"
]
}
],
@@ -255,6 +256,42 @@
"# training LDA model\n",
"model = ldamodel.LdaModel(corpus=training_corpus, id2word=training_dictionary, passes=3, num_topics=5, callbacks=callbacks)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The metric values can also be accessed from the model instance for custom uses."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'CoherenceMetric': [0.33266605793626819,\n",
" 0.3316839843742313,\n",
" 0.33237246830927009],\n",
" 'ConvergenceMetric': [0.0, 0.0, 0.0],\n",
" 'DiffMetric': [array([ 0.92795546, 0.83166895, 0.8926528 , 0.96382424, 0.98886188]),\n",
" array([ 0.1486518 , 0.16031907, 0.18798994, 0.13619778, 0.11326997]),\n",
" array([ 0.02155673, 0.03477041, 0.03180156, 0.02133546, 0.01840971])],\n",
" 'PerplexityMetric': [2374469.2517599338,\n",
" 1708181.2721127137,\n",
" 1485456.3900059697]}"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.metrics"
]
}
],
"metadata": {
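As one possible "custom use" of the stored values, the sketch below inspects the per-epoch lists after training. `metrics` is a stand-in for `model.metrics`, filled with the coherence and perplexity numbers copied from the notebook output in this diff; the variable names are assumptions for illustration.

```python
# Stand-in for model.metrics, using values copied from the notebook output in this diff.
metrics = {
    "CoherenceMetric": [0.33266605793626819, 0.3316839843742313, 0.33237246830927009],
    "PerplexityMetric": [2374469.2517599338, 1708181.2721127137, 1485456.3900059697],
}

# Epoch (0-based) with the lowest perplexity.
perplexities = metrics["PerplexityMetric"]
best_epoch = min(range(len(perplexities)), key=perplexities.__getitem__)

# Change in perplexity between consecutive epochs; negative means improvement.
deltas = [later - earlier for earlier, later in zip(perplexities, perplexities[1:])]
```

Each list holds one entry per pass, so its length should match the `passes` argument used for training.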
40 changes: 26 additions & 14 deletions gensim/models/callbacks.py
@@ -24,19 +24,22 @@ class Metric(object):
def __init__(self):
Contributor

No need to define empty __init__ in base class

pass

def get_value(self, **parameters):
def set_parameters(self, **parameters):
"""
Set the parameters
Owner

Isn't get_value a misnomer for setting parameters?

Contributor Author

@parulsethi parulsethi Jul 27, 2017

Yes, I'll replace with set_parameters

"""
for parameter, value in parameters.items():
setattr(self, parameter, value)

def get_value(self):
pass
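The split discussed in this thread, with `set_parameters` storing keyword arguments and `get_value` computing the result, can be sketched as follows. `WordCountMetric` and its `document` argument are hypothetical and not part of the PR; they only show how a subclass might use the base class.

```python
class Metric(object):
    """Base class: subclasses override get_value()."""

    def set_parameters(self, **parameters):
        """Store each keyword argument as an attribute on the instance."""
        for parameter, value in parameters.items():
            setattr(self, parameter, value)

    def get_value(self):
        pass


class WordCountMetric(Metric):
    """Hypothetical metric, for illustration only (not part of the PR)."""

    def get_value(self, **kwargs):
        # Copy the keyword arguments onto the instance, then compute from them.
        self.set_parameters(**kwargs)
        return sum(count for _, count in self.document)


metric = WordCountMetric()
value = metric.get_value(document=[(0, 2), (1, 3)])
```

With the rename, the method that mutates state no longer shares a name with the method that returns the metric.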


class CoherenceMetric(Metric):
"""
Metric class for coherence evaluation
"""
def __init__(self, corpus=None, texts=None, dictionary=None, coherence=None, window_size=None, topn=None, logger="shell", viz_env=None, title=None):
def __init__(self, corpus=None, texts=None, dictionary=None, coherence=None, window_size=None, topn=10, logger=None, viz_env=None, title=None):
"""
Args:
corpus : Gensim document corpus.
@@ -98,7 +101,7 @@ def get_value(self, **kwargs):
# only one of the model or topic would be defined
self.model = None
Contributor

Why should you do this assignment? (only in current Callback)

Contributor Author

Both model and topics can be used to calculate coherence, but only one of them is defined in **kwargs. This assignment just avoids a "name not defined" error for the other variable, the one that is not in **kwargs.

self.topics = None
super(CoherenceMetric, self).get_value(**kwargs)
super(CoherenceMetric, self).set_parameters(**kwargs)
cm = gensim.models.CoherenceModel(self.model, self.topics, self.texts, self.corpus, self.dictionary, self.window_size, self.coherence, self.topn)
return cm.get_coherence()
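The reasoning behind resetting `self.model` and `self.topics` can be illustrated with a small sketch. `EitherSourceMetric` is a hypothetical class invented for this example; it assumes, as the reply in this thread describes, that a caller passes only one of the two keyword arguments.

```python
class Metric(object):
    def set_parameters(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)


class EitherSourceMetric(Metric):
    """Hypothetical metric that accepts either `model` or `topics`."""

    def get_value(self, **kwargs):
        # Reset both so the attribute missing from kwargs is None
        # instead of raising AttributeError or keeping a stale value.
        self.model = None
        self.topics = None
        self.set_parameters(**kwargs)
        return "model" if self.model is not None else "topics"


metric = EitherSourceMetric()
from_topics = metric.get_value(topics=[["cat", "dog"]])
from_model = metric.get_value(model=object())
```

The second call also shows that the reset clears the `topics` value left over from the first call.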

@@ -107,7 +110,7 @@ class PerplexityMetric(Metric):
"""
Metric class for perplexity evaluation
"""
def __init__(self, corpus=None, logger="shell", viz_env=None, title=None):
def __init__(self, corpus=None, logger=None, viz_env=None, title=None):
"""
Args:
corpus : Gensim document corpus
@@ -127,7 +130,7 @@ def get_value(self, **kwargs):
Args:
model : Trained topic model
"""
super(PerplexityMetric, self).get_value(**kwargs)
super(PerplexityMetric, self).set_parameters(**kwargs)
corpus_words = sum(cnt for document in self.corpus for _, cnt in document)
perwordbound = self.model.bound(self.corpus) / corpus_words
return np.exp2(-perwordbound)
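The three lines above compute perplexity as 2 raised to the negative per-word bound. A worked example with toy numbers may make the arithmetic easier to follow; the corpus and the `bound` value are assumptions, with `bound` standing in for whatever `model.bound(corpus)` would return.

```python
import numpy as np

# Toy bag-of-words corpus: each document is a list of (word_id, count) pairs.
corpus = [[(0, 2), (1, 1)], [(1, 3), (2, 2)]]

# Hypothetical stand-in for model.bound(corpus); the real value comes from the trained model.
bound = -16.0

# Total number of word occurrences in the corpus: 2 + 1 + 3 + 2 = 8.
corpus_words = sum(cnt for document in corpus for _, cnt in document)

# Bound per word is -16 / 8 = -2, so perplexity is 2 ** 2 = 4.
perwordbound = bound / corpus_words
perplexity = np.exp2(-perwordbound)
```

A bound closer to zero gives a lower perplexity, which is why the plotted values should fall as training progresses.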
@@ -137,7 +140,7 @@ class DiffMetric(Metric):
"""
Metric class for topic difference evaluation
"""
def __init__(self, distance="jaccard", num_words=100, n_ann_terms=10, normed=True, logger="shell", viz_env=None, title=None):
def __init__(self, distance="jaccard", num_words=100, n_ann_terms=10, normed=True, logger=None, viz_env=None, title=None):
"""
Args:
distance : measure used to calculate difference between any topic pair. Available values:
@@ -167,7 +170,7 @@ def get_value(self, **kwargs):
model : Trained topic model
other_model : second topic model instance to calculate the difference from
"""
super(DiffMetric, self).get_value(**kwargs)
super(DiffMetric, self).set_parameters(**kwargs)
diff_matrix, _ = self.model.diff(self.other_model, self.distance, self.num_words, self.n_ann_terms, self.normed)
Contributor

Now you can use new version for diff (with diagonal and annotation flags)

Contributor Author

Updated

return np.diagonal(diff_matrix)
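Only the diagonal of the difference matrix is kept here, which is presumably why the reviewer points at a diagonal-only variant of `diff`. The NumPy sketch below illustrates the idea and is not gensim's implementation: the `hellinger` helper and the two small topic matrices are assumptions made up for the example.

```python
import numpy as np


def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))


# Hypothetical topic-word distributions from two consecutive epochs (rows are topics).
previous = np.array([[0.5, 0.5, 0.0], [0.1, 0.2, 0.7]])
current = np.array([[0.5, 0.5, 0.0], [0.2, 0.2, 0.6]])

# Full pairwise matrix: every topic in `current` against every topic in `previous`.
full = np.array([[hellinger(c, p) for p in previous] for c in current])

# Diagonal only: topic i against its own earlier state.
diagonal_only = np.array([hellinger(c, p) for c, p in zip(current, previous)])
```

For `n` topics the diagonal-only version needs `n` distance computations instead of `n * n`, and the sum of those `n` values is the quantity `ConvergenceMetric` reports.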

@@ -176,7 +179,7 @@ class ConvergenceMetric(Metric):
"""
Metric class for convergence evaluation
"""
def __init__(self, distance="jaccard", num_words=100, n_ann_terms=10, normed=True, logger="shell", viz_env=None, title=None):
def __init__(self, distance="jaccard", num_words=100, n_ann_terms=10, normed=True, logger=None, viz_env=None, title=None):
"""
Args:
distance : measure used to calculate difference between any topic pair. Available values:
@@ -206,7 +209,7 @@ def get_value(self, **kwargs):
model : Trained topic model
other_model : second topic model instance to calculate the difference from
"""
super(ConvergenceMetric, self).get_value(**kwargs)
super(ConvergenceMetric, self).set_parameters(**kwargs)
diff_matrix, _ = self.model.diff(self.other_model, self.distance, self.num_words, self.n_ann_terms, self.normed)
return np.sum(np.diagonal(diff_matrix))

@@ -257,10 +260,16 @@ def on_epoch_end(self, epoch, topics=None):
epoch : current epoch no.
topics : topic distribution from current epoch (required for coherence of unsupported topic models)
"""
# stores current epoch's metric values
current_metrics = {}

# plot all metrics in current epoch
for i, metric in enumerate(self.metrics):
value = metric.get_value(topics=topics, model=self.model, other_model=self.previous)
metric_label = type(metric).__name__[:-6]
metric_label = type(metric).__name__

current_metrics[metric_label] = value

# check for any metric which need model state from previous epoch
if isinstance(metric, (DiffMetric, ConvergenceMetric)):
self.previous = copy.deepcopy(self.model)
@@ -269,24 +278,27 @@ def on_epoch_end(self, epoch, topics=None):
if epoch == 0:
if value.ndim > 0:
diff_mat = np.array([value])
viz_metric = self.viz.heatmap(X=diff_mat.T, env=metric.viz_env, opts=dict(xlabel='Epochs', ylabel=metric_label, title=metric.title))
viz_metric = self.viz.heatmap(X=diff_mat.T, env=metric.viz_env, opts=dict(xlabel='Epochs', ylabel=metric_label[:-6], title=metric.title))
# store current epoch's diff diagonal
self.diff_mat.put(diff_mat)
# saving initial plot window
self.windows.append(copy.deepcopy(viz_metric))
else:
viz_metric = self.viz.line(Y=np.array([value]), X=np.array([epoch]), env=metric.viz_env, opts=dict(xlabel='Epochs', ylabel=metric_label, title=metric.title))
viz_metric = self.viz.line(Y=np.array([value]), X=np.array([epoch]), env=metric.viz_env, opts=dict(xlabel='Epochs', ylabel=metric_label[:-6], title=metric.title))
# saving initial plot window
self.windows.append(copy.deepcopy(viz_metric))
else:
if value.ndim > 0:
# concatenate with previous epoch's diff diagonals
diff_mat = np.concatenate((self.diff_mat.get(), np.array([value])))
self.viz.heatmap(X=diff_mat.T, env=metric.viz_env, win=self.windows[i], opts=dict(xlabel='Epochs', ylabel=metric_label, title=metric.title))
self.viz.heatmap(X=diff_mat.T, env=metric.viz_env, win=self.windows[i], opts=dict(xlabel='Epochs', ylabel=metric_label[:-6], title=metric.title))
self.diff_mat.put(diff_mat)
else:
self.viz.updateTrace(Y=np.array([value]), X=np.array([epoch]), env=metric.viz_env, win=self.windows[i])

if metric.logger == "shell":
statement = "".join(("Epoch ", str(epoch), ": ", metric_label, " estimate: ", str(value)))
statement = "".join(("Epoch ", str(epoch), ": ", metric_label[:-6], " estimate: ", str(value)))
self.log_type.info(statement)

return current_metrics
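Stripped of the plotting and logging, the bookkeeping in `on_epoch_end` reduces to the pattern sketched below: the full class name is the dictionary key, and the `[:-6]` slice drops the trailing "Metric" for display labels. The two stub classes and their fixed return values are hypothetical, standing in for the real metric classes.

```python
class CoherenceMetric(object):
    """Hypothetical stub returning a fixed value, for illustration only."""

    def get_value(self, **kwargs):
        return 0.33


class PerplexityMetric(object):
    """Hypothetical stub returning a fixed value, for illustration only."""

    def get_value(self, **kwargs):
        return 1485456.39


def on_epoch_end(metrics):
    current_metrics = {}
    labels = []
    for metric in metrics:
        # Full class name is the storage key, e.g. "CoherenceMetric".
        metric_label = type(metric).__name__
        current_metrics[metric_label] = metric.get_value()
        # Dropping the last six characters ("Metric") gives the axis label.
        labels.append(metric_label[:-6])
    return current_metrics, labels


current_metrics, labels = on_epoch_end([CoherenceMetric(), PerplexityMetric()])
```

One consequence of keying by class name, at least in this version of the diff, is that two callbacks of the same class (such as `pl_holdout` and `pl_test`) write to the same key, so only the later value survives in the returned dict.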

10 changes: 9 additions & 1 deletion gensim/models/ldamodel.py
@@ -631,8 +631,13 @@ def rho():
return pow(offset + pass_ + (self.num_updates / chunksize), -decay)

if self.callbacks:
# pass the list of input callbacks to Callback class
callback = Callback(self.callbacks)
callback.set_model(self)
# initialize metrics dict to store metric values after every epoch
self.metrics = {}
for metric in self.callbacks:
Owner

Dict comprehension more readable?

Also, defaultdict might make the logic a little simpler.

Contributor Author

@parulsethi parulsethi Aug 9, 2017

Updated to use defaultdict

self.metrics[type(metric).__name__] = []
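The `defaultdict` variant mentioned in this thread might look like the sketch below. The per-epoch dictionaries are hypothetical values shaped like the return value of `on_epoch_end`; the point is only that the explicit initialization loop becomes unnecessary.

```python
from collections import defaultdict

# Missing keys start as an empty list, so no explicit initialization loop is needed.
metrics = defaultdict(list)

# Hypothetical per-epoch results, shaped like the dict returned by on_epoch_end.
epoch_results = [
    {"PerplexityMetric": 2374469.25, "CoherenceMetric": 0.3327},
    {"PerplexityMetric": 1708181.27, "CoherenceMetric": 0.3317},
]

for current_metrics in epoch_results:
    for metric, value in current_metrics.items():
        metrics[metric].append(value)
```

A side effect worth keeping in mind is that reading a missing key from a `defaultdict` creates it, so a typo in a metric name would return an empty list rather than raise `KeyError`.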

for pass_ in xrange(passes):
if self.dispatcher:
@@ -686,8 +691,11 @@ def rho():
if reallen != lencorpus:
raise RuntimeError("input corpus size changed during training (don't use generators as input)")

# append current epoch's metric values
if self.callbacks:
callback.on_epoch_end(pass_)
current_metrics = callback.on_epoch_end(pass_)
for metric, value in current_metrics.items():
self.metrics[metric].append(value)

if dirty:
# finish any remaining updates