
[MRG] Added 'score' function for LdaModel's sklearn wrapper #1445

Merged (14 commits) on Aug 3, 2017

Conversation

chinmayapancholi13 (Contributor)

This PR adds a score function for the scikit-learn wrappers, which would also be used implicitly by utilities such as GridSearchCV.
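As a rough illustration of that claim, here is a minimal sketch of how the wrapper could be dropped into GridSearchCV once it has a score method. The import path and constructor arguments are assumptions based on the SklLdaModel class discussed in this PR, not a verbatim copy of the wrapper's API.

```python
from sklearn.model_selection import GridSearchCV
from gensim.corpora import Dictionary
# assumed import path for the wrapper under review
from gensim.sklearn_integration.sklearn_wrapper_gensim_ldamodel import SklLdaModel

texts = [
    ["graph", "minors", "survey"],
    ["graph", "trees", "computer"],
    ["human", "interface", "computer"],
]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# With a score method defined on the wrapper, GridSearchCV can rank parameter
# settings without an explicit `scoring` argument.
model = SklLdaModel(id2word=dictionary, num_topics=2)
search = GridSearchCV(model, param_grid={"num_topics": [2, 3]}, cv=2)
search.fit(corpus)
print(search.best_params_)
```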

"""
Sklearn wrapper for LDA model. derived class for gensim.model.LdaModel .
Sklearn wrapper for LDA model.
Contributor:

Why did you remove the original class name (gensim.model.LdaModel)? It will make it harder for users to find the documentation.

Contributor:

Also, add a description for the scorer parameter.

Contributor:

  • add perplexity

Contributor (author):

@menshikh-iv I removed the comment "derived class for gensim.model.LdaModel" because the class SklLdaModel is no longer a derived class of the LdaModel class (we are now using composition instead of inheritance).

Contributor:

Sorry, I phrased that inaccurately. I meant that it's a good idea to keep the full class path (gensim.model.LdaModel) in the docstring.
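For readers following along, here is a minimal sketch of the composition-based design mentioned in this exchange (an assumed shape only, not the exact wrapper code): the sklearn-style class no longer subclasses LdaModel, it creates and holds a gensim.models.LdaModel instance internally, while the full class path can still be referenced in the docstring.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from gensim import models


class SklLdaModel(BaseEstimator, TransformerMixin):
    """Sklearn wrapper for LDA model (wraps gensim.models.LdaModel via composition)."""

    def __init__(self, num_topics=100, id2word=None):
        self.num_topics = num_topics
        self.id2word = id2word
        self.gensim_model = None  # the wrapped LdaModel is created in fit()

    def fit(self, X, y=None):
        # X is a gensim-style bag-of-words corpus
        self.gensim_model = models.LdaModel(
            corpus=X, num_topics=self.num_topics, id2word=self.id2word
        )
        return self
```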


    def score(self, X, y=None):
        """
        Compute score reflecting how well the model has fit for the input data.
Contributor:

Describe the change of the perplexity sign in the docstring.

Contributor (author):

Done.
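Continuing the sketch above, the score method under discussion might look roughly like the following. The scorer values and the exact perplexity transform are assumptions inferred from this thread; the point requested in the review is the sign flip, so that a better fit yields a higher score, as sklearn's model selection expects.

```python
import numpy as np
from gensim import models


def score(self, X, y=None):
    """Compute a score reflecting how well the model fits corpus X (higher is better)."""
    if self.scorer == 'perplexity':
        # Per-word likelihood bound, converted to perplexity and negated so that
        # lower perplexity (better fit) gives a higher score.
        perwordbound = self.gensim_model.log_perplexity(X)
        return -1 * np.exp2(-perwordbound)
    elif self.scorer == 'u_mass':
        cm = models.CoherenceModel(
            model=self.gensim_model, corpus=X, coherence='u_mass', topn=3
        )
        return cm.get_coherence()
    else:
        raise ValueError("Unsupported value for the 'scorer' parameter")
```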

@chinmayapancholi13 (Contributor, author)

@dsquareindia @macks22 I have been trying to check the scoring options available in CoherenceModel and seem to have got stuck on a problem. My code snippet and error log can be found here: https://gist.github.com/chinmayapancholi13/5ebda7b4d1f44968012e9edb9db9b0e1

As you can see in the error log, the error is caused by w_star_count being set to 0, which leads to a division-by-zero error here: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/topic_coherence/direct_confirmation_measure.py#L42. I should add that in the scorer function, if I replace corpus=X with corpus=corpus (where corpus is defined globally above), the error doesn't occur. As you can see in the ipynb tutorial (https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/sklearn_wrapper.ipynb), something similar is done in the scorer function right now (i.e. texts=texts), which doesn't seem right to me. I have been going through the relevant code in direct_confirmation_measure.py, but I am having a hard time understanding the exact meaning and usage of parameters like accumulator, w_star_count and co_occur_count, which seem to be causing the problem.

Since both of you have worked with topic coherence in the past, I thought you might be able to help me resolve this issue. :)
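For context, here is a hedged reconstruction of the kind of scorer being described (not the gist itself). Passing the CV split X as the coherence corpus can leave some topn topic words with zero counts, which is what triggers the division by zero; passing the full training corpus avoids it.

```python
from gensim.models import CoherenceModel


def coherence_scorer(estimator, X, y=None):
    # `estimator.gensim_model` assumes the composition-based wrapper from this PR.
    cm = CoherenceModel(
        model=estimator.gensim_model,
        corpus=X,          # the failing variant; corpus=corpus (the full data) works
        coherence='u_mass',
        topn=10,
    )
    return cm.get_coherence()
```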

@macks22 (Contributor) commented on Jul 3, 2017

@chinmayapancholi13 The naming conventions used throughout the coherence code generally reflect the naming used in the original paper, "Exploring the Space of Topic Coherence Measures". w_star and w_prime are word indices paired together during a topic segmentation; this pairing is used for pairwise word-similarity calculations. w_star_count is the number of times the word w_star was observed in the corpus, and co_occur_count is the number of times w_star and w_prime co-occurred in the corpus. These counts are stored in the accumulator. If w_star is never observed, its count will be 0. The same goes for w_prime (but not for their co-occurrence, which is handled by adding a small value EPSILON to avoid taking the log of zero).

The solution to this is to make sure that all words in your topn topic lists are present in the corpus you are using to calculate coherence. However, it would also be useful to add some sort of error handling for zero division and include a useful message that explains the issue and re-raises.
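To make the measure concrete, here is a small sketch of the log-conditional-probability confirmation it computes, together with the kind of zero-count guard suggested above. This mirrors the idea behind direct_confirmation_measure.py rather than reproducing its code, and the EPSILON value here is an assumption.

```python
import numpy as np

EPSILON = 1e-12  # keeps the numerator away from log(0) when co_occur_count is 0


def log_conditional_probability(co_occur_count, w_star_count, num_docs):
    """m_lc = log((P(w_prime, w_star) + EPSILON) / P(w_star))."""
    if w_star_count == 0:
        # the error handling suggested above: explain the issue instead of a bare ZeroDivisionError
        raise ValueError(
            "w_star never occurs in the corpus used for coherence; "
            "make sure all topn topic words are present in that corpus."
        )
    return np.log(((co_occur_count / num_docs) + EPSILON) / (w_star_count / num_docs))
```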

@chinmayapancholi13 changed the title from "[WIP] Added 'score' function for sklearn wrappers" to "[WIP] Added 'score' function for LdaModel's sklearn wrapper" on Jul 6, 2017
@chinmayapancholi13 changed the title from "[WIP] Added 'score' function for LdaModel's sklearn wrapper" to "[MRG] Added 'score' function for LdaModel's sklearn wrapper" on Jul 14, 2017
@chinmayapancholi13 changed the title from "[MRG] Added 'score' function for LdaModel's sklearn wrapper" to "[WIP] Added 'score' function for LdaModel's sklearn wrapper" on Jul 14, 2017
@chinmayapancholi13 changed the title from "[WIP] Added 'score' function for LdaModel's sklearn wrapper" to "[MRG] Added 'score' function for LdaModel's sklearn wrapper" on Jul 28, 2017
@menshikh-iv (Contributor)

Thank you @chinmayapancholi13
