Fixed KeyError in coherence model #2830
Conversation
Thanks. Can you add a test too? Are there other places in the coherence model with the same issue (poor input validation / preprocessing)? Let's fix them all at once.
I ran some tests, and the code fails when passing topic IDs instead of tokens.
The idea behind the issue is to allow evaluation of topics containing new tokens and/or IDs that are not present in the texts, by excluding them from the computation. To explain, I will refer to the issue example with a simple test. Given:
The following code
returns
an incorrect value instead of the expected metric score. With the last update I get the right result from this test, but it raises some other issues, because in the initial version the token or ID format was detected by handling an exception. Original code:
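The original snippet is not reproduced in this thread, so here is a minimal sketch of the exception-based detection it describes. `token2id` stands in for gensim's `Dictionary.token2id` mapping, and the helper name `topic_to_ids` is hypothetical, not the actual gensim code:

```python
# Hypothetical sketch of the exception-based format detection;
# not the actual gensim implementation.
token2id = {"human": 0, "interface": 1, "computer": 2}
id2token = {v: k for k, v in token2id.items()}

def topic_to_ids(topic):
    try:
        # Assume the topic is given as tokens and map them to IDs.
        return [token2id[token] for token in topic]
    except KeyError:
        # Any lookup failure is taken as a hint that the topic is
        # already expressed as IDs.
        return [i for i in topic if i in id2token]

topic_to_ids(["human", "computer"])  # token form: resolved to [0, 2]
topic_to_ids([0, 2])                 # ID form: KeyError path keeps [0, 2]
topic_to_ids(["human", "unseen"])    # unseen token: wrongly treated as IDs -> []
```

The last call shows the bug: a single unseen token sends the whole topic down the ID branch, where real tokens match nothing.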
To determine whether topics are in token or ID form, this code treats an exception as a hint that the topic lists are expressed as IDs, and proceeds to retrieve the topics from their IDs. But if a topic contains a token not present in the texts, the code raises the exception and computes the topics as if they were in ID form. Let's try to add a check that excludes tokens which are not present in the texts:
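The check itself is not shown in the thread; a hedged sketch of such a membership filter might look like this (same stand-in `token2id` as above, hypothetical helper name):

```python
# Hypothetical sketch of filtering out entries missing from the
# dictionary before the ID lookup; not the actual PR code.
token2id = {"human": 0, "interface": 1, "computer": 2}

def filter_topic(topic):
    # Keep only entries that exist as keys in token2id.
    return [t for t in topic if t in token2id]

filter_topic(["human", "unseen"])  # ["human"]: unseen token excluded
filter_topic([0, 2])               # []: integer IDs are never token keys
```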
The test in the example with tokens is now covered, but if the input consists of IDs rather than tokens, the code will not raise an exception: no ID is a key in the token2id dictionary, so every entry is excluded, resulting in a list of 0 topics instead of the exception that triggers the correct code path. To handle this problem, I tried the following code:
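Again the snippet is missing from the thread; a sketch of the match-count heuristic described below, under the same stand-in `token2id`, could be:

```python
# Hypothetical sketch: pick the interpretation (tokens vs. IDs)
# with more dictionary matches; not the actual PR code.
token2id = {"human": 0, "interface": 1, "computer": 2}
id2token = {v: k for k, v in token2id.items()}

def resolve_topic(topic):
    token_hits = sum(1 for t in topic if t in token2id)
    id_hits = sum(1 for t in topic if t in id2token)
    if token_hits >= id_hits:
        # Treat as tokens, silently dropping the unknown ones.
        return [token2id[t] for t in topic if t in token2id]
    # Otherwise treat as IDs, dropping the unknown ones.
    return [t for t in topic if t in id2token]

resolve_topic(["human", "unseen"])  # [0]: token form wins, unknown dropped
resolve_topic([0, 2])               # [0, 2]: ID form wins
```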
With this code, we assume that the correct formatting is the one with the higher number of matches against the token2id or id2token dictionaries. If anyone can help me solve the problem, I will gladly work on it again and fix it once and for all.
Picked up a code formatting nitpick.
Is this code covered by tests? If not, please add them.
gensim/models/coherencemodel.py (outdated)
@@ -120,6 +120,7 @@ class CoherenceModel(interfaces.TransformationABC):
>>> coherence = cm.get_coherence()  # get coherence value

"""
Why did you add this blank line?
I think this would make a good unit test. It would demonstrate the usefulness of your fix and prevent future regressions.
Looks like we dropped the ball on this PR. People still keep tripping over the same issue. @pietrotrope @mpenkov do you think we could finish this & merge? Thanks.
Hi, I updated the code according to the review requests.
Thanks @pietrotrope .
That's fine – better than an exception at any rate. Is this fact clearly documented? @mpenkov can you please review & merge?
pietrotrope - just wanted to say thank you for taking the time to propose this fix! I'm new to gensim and quickly ran into this problem; this fix will cut down on a lot of preprocessing. Following closely!
Left some comments, please have a look. Sorry it took me so long to review this.
Hi! I updated the code.
Can you merge the origin/develop branch into your PR's branch? Looks like there's some kind of conflict.
Note to self: resolve conflicts, check tests.
Merged. Thank you @pietrotrope !
When the token list of a topic contains a token that is not present in the dictionary's token2id mapping, the framework raised an error instead of excluding the token and continuing to compute the metric score.
Excluding the topic tokens that are not present in the dictionary allows the metric to be computed on texts that do not contain every representative word of the topics.