
[MRG] Added 'score' function for LdaModel's sklearn wrapper #1445

Merged (14 commits) on Aug 3, 2017

Conversation

chinmayapancholi13 (Contributor)

This PR adds a score function for the scikit-learn wrappers, which would also be used implicitly by utilities such as GridSearchCV.
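As a rough illustration of that claim, here is a minimal sketch of how the wrapper could be dropped into GridSearchCV once it has a score method. The import path and constructor arguments are assumptions based on the SklLdaModel class discussed in this PR, not a verbatim copy of the wrapper's API.

```python
from sklearn.model_selection import GridSearchCV
from gensim.corpora import Dictionary
# assumed import path for the wrapper under review
from gensim.sklearn_integration.sklearn_wrapper_gensim_ldamodel import SklLdaModel

texts = [
    ["graph", "minors", "survey"],
    ["graph", "trees", "computer"],
    ["human", "interface", "computer"],
]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# With a score method defined on the wrapper, GridSearchCV can rank parameter
# settings without an explicit `scoring` argument.
model = SklLdaModel(id2word=dictionary, num_topics=2)
search = GridSearchCV(model, param_grid={"num_topics": [2, 3]}, cv=2)
search.fit(corpus)
print(search.best_params_)
```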

"""
Sklearn wrapper for LDA model. derived class for gensim.model.LdaModel .
Sklearn wrapper for LDA model.
Contributor:

Why did you remove the original class name (gensim.model.LdaModel)? It will make it harder for users to find the documentation.

Contributor:

Also, add a description for the scorer parameter.

Contributor:

  • add perplexity

Contributor (author):

@menshikh-iv I removed the comment "derived class for gensim.model.LdaModel" because the class SklLdaModel is no longer a derived class of the LdaModel class (we are now using composition instead of inheritance).

Contributor:

Sorry, I phrased that inaccurately. I meant that it's a good idea to keep the full class path (gensim.model.LdaModel) in the docstring.
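For readers following along, here is a minimal sketch of the composition-based design mentioned in this exchange (an assumed shape only, not the exact wrapper code): the sklearn-style class no longer subclasses LdaModel, it creates and holds a gensim.models.LdaModel instance internally, while the full class path can still be referenced in the docstring.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from gensim import models


class SklLdaModel(BaseEstimator, TransformerMixin):
    """Sklearn wrapper for LDA model (wraps gensim.models.LdaModel via composition)."""

    def __init__(self, num_topics=100, id2word=None):
        self.num_topics = num_topics
        self.id2word = id2word
        self.gensim_model = None  # the wrapped LdaModel is created in fit()

    def fit(self, X, y=None):
        # X is a gensim-style bag-of-words corpus
        self.gensim_model = models.LdaModel(
            corpus=X, num_topics=self.num_topics, id2word=self.id2word
        )
        return self
```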


    def score(self, X, y=None):
        """
        Compute score reflecting how well the model has fit for the input data.
Contributor:

Describe the change of the perplexity sign in the docstring.

Contributor (author):

Done.
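Continuing the sketch above, the score method under discussion might look roughly like the following. The scorer values and the exact perplexity transform are assumptions inferred from this thread; the point requested in the review is the sign flip, so that a better fit yields a higher score, as sklearn's model selection expects.

```python
import numpy as np
from gensim import models


def score(self, X, y=None):
    """Compute a score reflecting how well the model fits corpus X (higher is better)."""
    if self.scorer == 'perplexity':
        # Per-word likelihood bound, converted to perplexity and negated so that
        # lower perplexity (better fit) gives a higher score.
        perwordbound = self.gensim_model.log_perplexity(X)
        return -1 * np.exp2(-perwordbound)
    elif self.scorer == 'u_mass':
        cm = models.CoherenceModel(
            model=self.gensim_model, corpus=X, coherence='u_mass', topn=3
        )
        return cm.get_coherence()
    else:
        raise ValueError("Unsupported value for the 'scorer' parameter")
```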

@chinmayapancholi13 (Contributor, author)

@dsquareindia @macks22 I have been trying to check the scoring options available in CoherenceModel and seem to have got stuck on a problem. My code snippet and error log can be found here: https://gist.github.com/chinmayapancholi13/5ebda7b4d1f44968012e9edb9db9b0e1

As you can see in the error log, the error is caused by w_star_count being set to 0, which leads to a division-by-zero error here: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/topic_coherence/direct_confirmation_measure.py#L42. I should add that in the scorer function, if I replace corpus=X with corpus=corpus (where corpus is defined globally above), the error doesn't occur. As you can see in the ipynb tutorial (https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/sklearn_wrapper.ipynb), something similar is done in the scorer function right now (i.e. texts=texts), which doesn't seem right to me. I have been going through the relevant code in direct_confirmation_measure.py, but I am having a hard time understanding the exact meaning and usage of parameters like accumulator, w_star_count and co_occur_count, which seem to be causing the problem.

Since both of you have worked with topic coherence in the past, I thought you might be able to help me resolve this issue. :)
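For context, here is a hedged reconstruction of the kind of scorer being described (not the gist itself). Passing the CV split X as the coherence corpus can leave some topn topic words with zero counts, which is what triggers the division by zero; passing the full training corpus avoids it.

```python
from gensim.models import CoherenceModel


def coherence_scorer(estimator, X, y=None):
    # `estimator.gensim_model` assumes the composition-based wrapper from this PR.
    cm = CoherenceModel(
        model=estimator.gensim_model,
        corpus=X,          # the failing variant; corpus=corpus (the full data) works
        coherence='u_mass',
        topn=10,
    )
    return cm.get_coherence()
```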

@macks22 (Contributor) commented on Jul 3, 2017

@chinmayapancholi13 The naming conventions used throughout the coherence code generally reflect the naming used in the original paper, "Exploring the Space of Topic Coherence Measures". w_star and w_prime are word indices paired together during a topic segmentation; this pairing is used for pairwise word-similarity calculations. w_star_count is the number of times the word w_star was observed in the corpus, and co_occur_count is the number of times w_star and w_prime co-occurred in the corpus. These counts are stored in the accumulator. If w_star is never observed, its count will be 0. The same goes for w_prime (but not for their co-occurrence, which is handled by adding a small value EPSILON to avoid taking the log of zero).

The solution to this is to make sure that all words in your topn topic lists are present in the corpus you are using to calculate coherence. However, it would also be useful to add some sort of error handling for zero division and include a useful message that explains the issue and re-raises.
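To make the measure concrete, here is a small sketch of the log-conditional-probability confirmation it computes, together with the kind of zero-count guard suggested above. This mirrors the idea behind direct_confirmation_measure.py rather than reproducing its code, and the EPSILON value here is an assumption.

```python
import numpy as np

EPSILON = 1e-12  # keeps the numerator away from log(0) when co_occur_count is 0


def log_conditional_probability(co_occur_count, w_star_count, num_docs):
    """m_lc = log((P(w_prime, w_star) + EPSILON) / P(w_star))."""
    if w_star_count == 0:
        # the error handling suggested above: explain the issue instead of a bare ZeroDivisionError
        raise ValueError(
            "w_star never occurs in the corpus used for coherence; "
            "make sure all topn topic words are present in that corpus."
        )
    return np.log(((co_occur_count / num_docs) + EPSILON) / (w_star_count / num_docs))
```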

@chinmayapancholi13 changed the title from "[WIP] Added 'score' function for sklearn wrappers" to "[WIP] Added 'score' function for LdaModel's sklearn wrapper" on Jul 6, 2017
@chinmayapancholi13 changed the title from "[WIP] Added 'score' function for LdaModel's sklearn wrapper" to "[MRG] Added 'score' function for LdaModel's sklearn wrapper" on Jul 14, 2017
@chinmayapancholi13 changed the title from "[MRG] Added 'score' function for LdaModel's sklearn wrapper" to "[WIP] Added 'score' function for LdaModel's sklearn wrapper" on Jul 14, 2017
@chinmayapancholi13 changed the title from "[WIP] Added 'score' function for LdaModel's sklearn wrapper" to "[MRG] Added 'score' function for LdaModel's sklearn wrapper" on Jul 28, 2017
@menshikh-iv (Contributor)

Thank you @chinmayapancholi13
