Fixed KeyError in coherence model #2830
Conversation
Thanks. Can you add a test too? Are there other places in the coherence model with the same issue (poor input validation / preprocessing)? Let's fix them all at once.
I ran some tests, and the code fails when passing topic IDs instead of tokens.
The idea behind the issue is to allow evaluation of topics containing new tokens and/or IDs that are not present in the texts, by excluding them from the computation. To explain, I will refer to the issue example with a simple test. Given:
The following code
returns
an incorrect value instead of the expected metric score. With the last update I get the right result from this test, but it raises some other issues, because in the initial version the token or ID format was detected by handling an exception. Original code:
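The original snippet is not reproduced in this thread, so here is a minimal sketch of the exception-based detection it describes. `token2id` stands in for gensim's `Dictionary.token2id` mapping, and the helper name `topic_to_ids` is hypothetical, not the actual gensim code:

```python
# Hypothetical sketch of the exception-based format detection;
# not the actual gensim implementation.
token2id = {"human": 0, "interface": 1, "computer": 2}
id2token = {v: k for k, v in token2id.items()}

def topic_to_ids(topic):
    try:
        # Assume the topic is given as tokens and map them to IDs.
        return [token2id[token] for token in topic]
    except KeyError:
        # Any lookup failure is taken as a hint that the topic is
        # already expressed as IDs.
        return [i for i in topic if i in id2token]

topic_to_ids(["human", "computer"])  # token form: resolved to [0, 2]
topic_to_ids([0, 2])                 # ID form: KeyError path keeps [0, 2]
topic_to_ids(["human", "unseen"])    # unseen token: wrongly treated as IDs -> []
```

The last call shows the bug: a single unseen token sends the whole topic down the ID branch, where real tokens match nothing.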
To determine whether topics are in token or ID form, this code treats an exception as a hint that the topic lists are expressed as IDs, and proceeds to retrieve the topics from their IDs. But if a topic contains a token not present in the texts, the code raises the exception and computes the topics as if they were in ID form. Let's try to add a check that excludes tokens which are not present in the texts:
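The check itself is not shown in the thread; a hedged sketch of such a membership filter might look like this (same stand-in `token2id` as above, hypothetical helper name):

```python
# Hypothetical sketch of filtering out entries missing from the
# dictionary before the ID lookup; not the actual PR code.
token2id = {"human": 0, "interface": 1, "computer": 2}

def filter_topic(topic):
    # Keep only entries that exist as keys in token2id.
    return [t for t in topic if t in token2id]

filter_topic(["human", "unseen"])  # ["human"]: unseen token excluded
filter_topic([0, 2])               # []: integer IDs are never token keys
```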
The test in the example with tokens is now covered, but if the input consists of IDs rather than tokens, the code will not raise an exception: no ID is a key in the token2id dictionary, so every entry is excluded, resulting in a list of 0 topics instead of the exception that triggers the correct code path. To handle this problem, I tried the following code:
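Again the snippet is missing from the thread; a sketch of the match-count heuristic described below, under the same stand-in `token2id`, could be:

```python
# Hypothetical sketch: pick the interpretation (tokens vs. IDs)
# with more dictionary matches; not the actual PR code.
token2id = {"human": 0, "interface": 1, "computer": 2}
id2token = {v: k for k, v in token2id.items()}

def resolve_topic(topic):
    token_hits = sum(1 for t in topic if t in token2id)
    id_hits = sum(1 for t in topic if t in id2token)
    if token_hits >= id_hits:
        # Treat as tokens, silently dropping the unknown ones.
        return [token2id[t] for t in topic if t in token2id]
    # Otherwise treat as IDs, dropping the unknown ones.
    return [t for t in topic if t in id2token]

resolve_topic(["human", "unseen"])  # [0]: token form wins, unknown dropped
resolve_topic([0, 2])               # [0, 2]: ID form wins
```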
With this code, we assume that the correct formatting is the one with the higher number of matches against the token2id or id2token dictionaries. If anyone can help me solve the problem, I will gladly work on it again and fix it once and for all.
Picked up a code formatting nitpick.
Is this code covered by tests? If not, please add them.
gensim/models/coherencemodel.py (outdated)
@@ -120,6 +120,7 @@ class CoherenceModel(interfaces.TransformationABC):
>>> coherence = cm.get_coherence()  # get coherence value

"""
Why did you add this blank line?
I think this would make a good unit test. It would demonstrate the usefulness of your fix and prevent future regressions.
Looks like we dropped the ball on this PR. People still keep tripping over the same issue. @pietrotrope @mpenkov do you think we could finish this & merge? Thanks.
Hi, I updated the code according to the review requests.
Thanks @pietrotrope .
That's fine – better than an exception at any rate. Is this fact clearly documented? @mpenkov can you please review & merge?
pietrotrope - just wanted to say thank you for taking the time to propose this fix! I'm new to gensim and quickly ran into this problem; this fix will cut down on a lot of preprocessing. Following closely!
Left some comments, please have a look. Sorry it took me so long to review this.
Hi! I updated the code.
Can you merge the origin/develop branch into your PR's branch? Looks like there's some kind of conflict.
Note to self: resolve conflicts, check tests.
Merged. Thank you @pietrotrope !
When the token list of a topic contains a token that is not present in the dictionary's token2id mapping, the framework raised an error instead of excluding the token and continuing to compute the metric score.
Excluding the topic tokens that are not present in the dictionary allows the metric to be computed on texts that do not contain every representative word of the topics.