models.Phrases multiple scoring methods (#1363) #1464
Conversation
Now with a scoring parameter to initialize a Phrases object. It defaults to the scoring from the Mikolov paper, but is also switchable to 'npmi' (normalized pointwise mutual information). The scoring calculation was moved into a function call; scoring functions are now top-level functions in models.Phrases, called when calculating scores in models.Phrases.export_phrases.
Fixed some bugs with the pluggable scoring that were causing tests to fail.
count_a = float(vocab[word_a])
count_b = float(vocab[word_b])
count_ab = float(vocab[bigram_word])
score = scoring_function(count_a, count_b, count_ab)
A pluggable scoring function would have to be called with all corpus constants and Phrases settings used in any scoring function. Right now that would look like:
score = scoring_function(count_a, count_b, count_ab, min_count, len_vocab, corpus_word_count)
.
And the call would grow as the universe of variables considered by all scoring functions grows.
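One way to keep that call from growing (a sketch under the assumption that scorers accept keyword arguments; not what this PR implements) is to bundle every corpus constant once and have each scorer swallow the ones it doesn't use:

```python
# Sketch: the caller passes every corpus/settings constant it knows about;
# each scorer declares the ones it needs and ignores the rest via **unused.
def original_scorer(worda_count, wordb_count, bigram_count,
                    len_vocab=0.0, min_count=0.0, **unused):
    return (bigram_count - min_count) / worda_count / wordb_count * len_vocab

corpus_constants = dict(len_vocab=1000.0, min_count=5.0, corpus_word_count=50000.0)

# The call site stays fixed even as new constants are added later:
score = original_scorer(30.0, 40.0, 20.0, **corpus_constants)
```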
I think that's still preferable. This string-passing seems inflexible.
We could support some common use-cases by passing a string, but the code underneath should simply translate that string into a scoring_function and work with that underneath. Custom scoring_functions should be supported IMO.
In other words, we could support both string and callable as param. If string, gensim converts that to a known callable (for easy-to-use common cases).
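A minimal sketch of that dispatch (the helper name `resolve_scoring` and the placeholder formula are hypothetical, not code from this PR):

```python
def resolve_scoring(scoring, known_scorers):
    """Return a scoring callable: translate a known string into one of the
    built-in scorers, or accept a user-supplied callable directly."""
    if callable(scoring):
        return scoring
    try:
        return known_scorers[scoring]
    except KeyError:
        raise ValueError('scoring must be a callable or one of %s, got %r'
                         % (sorted(known_scorers), scoring))

# Easy common case via string, custom case via callable:
known = {'default': lambda a, b, ab: (ab - 5.0) / a / b * 1000.0}
default_fn = resolve_scoring('default', known)
custom_fn = resolve_scoring(lambda a, b, ab: ab / (a * b), known)
```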
I will make this change, hopefully before the end of the week, and make it part of a PR.
Looks good, thank you @michaelwsherman 👍
I see this was already merged, but some changes are necessary.
total vocabulary size.
`threshold` represents a score threshold for forming the phrases (higher means
fewer phrases). A phrase of words `a` followed by `b` is accepted if the score of the
phrase is greater than threshold. see the `scoring' setting
Capitalize first word in sentence, end in full stop.
@@ -197,8 +221,10 @@ def add_vocab(self, sentences):
    # directly, but gives the new sentences a fighting chance to collect
    # sufficient counts, before being pruned out by the (large) accummulated
    # counts collected in previous learn_vocab runs.
    min_reduce, vocab = self.learn_vocab(sentences, self.max_vocab_size, self.delimiter, self.progress_per)
    min_reduce, vocab, total_words = \
        self.learn_vocab(sentences, self.max_vocab_size, self.delimiter, self.progress_per)
Code style: bad indentation (unneeded line break).
What's the number of columns we cap at? I thought it was 100, which I believe this exceeded.
There's no hard limit; if the line becomes hard to read, we break it.
If the break would be even harder to read than the original (for semantic/visual/clarity reasons), we don't break it.
Line continuations are indented at one extra level (4 spaces to the right).
if scoring == 'default':
    scoring_function = \
        partial(self.original_scorer, len_vocab=float(len(vocab)), min_count=float(min_count))
Indentation (unneeded line break).
        partial(self.original_scorer, len_vocab=float(len(vocab)), min_count=float(min_count))
elif scoring == 'npmi':
    scoring_function = \
        partial(self.npmi_scorer, corpus_word_count=corpus_word_count)
Indentation (unneeded line break).
if as_tuples:
    yield ((word_a, word_b), score)
else:
    yield (out_delimiter.join((word_a, word_b)), score)
last_bigram = True
continue
last_bigram = False
last_bigram = False
Is this on purpose? What is this change about?
Yes, this is on purpose. Matches up to line 277. If that test fails we have to set last_bigram to false. This positioning sets it to false always; the only time it gets set to true is in line 293, when a passing bigram is found.
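A stripped-down sketch of that control flow (illustrative only, not gensim's actual loop; `find_bigrams` and `passes` are hypothetical names):

```python
def find_bigrams(tokens, passes):
    """Yield adjacent pairs that pass the score test. last_bigram resets
    to False on every iteration that doesn't emit, so a token is never
    consumed by two overlapping bigrams."""
    last_bigram = False
    for word_a, word_b in zip(tokens, tokens[1:]):
        if not last_bigram and passes(word_a, word_b):
            yield (word_a, word_b)
            last_bigram = True
            continue
        last_bigram = False

# ('york', 'city') also passes, but 'york' was consumed by ('new', 'york'):
pairs = list(find_bigrams(['new', 'york', 'city', 'hall'],
                          lambda a, b: (a, b) in {('new', 'york'), ('york', 'city')}))
```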
Aha, so this is a bug fix at the same time. Thanks! CC @menshikh-iv
# len_vocab and min_count set so functools.partial works
@staticmethod
def original_scorer(worda_count, wordb_count, bigram_count, len_vocab=0.0, min_count=0.0):
    return (bigram_count - min_count) / worda_count / wordb_count * len_vocab
Beware of integer divisions - this code is brittle.
I didn't fix this in PR #1573. Rather, I just cast everything before calling the scoring method in Phrases and Phraser. I think that's the better place to do the casting, since it then fixes the problem for all custom scorers as well.
Of course, I can do the casting in the scoring methods too. Let me know if you still think I need it here and in npmi_scorer, and I'll update PR #1573. It's extra steps, but I'd assume the performance hit is infinitesimal.
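For context, the brittleness comes from Python 2 (which gensim supported at the time), where `/` between ints truncated the way Python 3's `//` does. A tiny illustration of why an uncast score can silently become 0, and of the call-site cast:

```python
# Python 2's `/` on ints truncated like Python 3's `//` does:
truncated = (20 - 5) // 30 // 40 * 1000       # integer division: score lost
exact = (20.0 - 5.0) / 30.0 / 40.0 * 1000.0   # float division keeps the signal

# Casting the counts once at the call site (the approach described above)
# protects built-in and custom scorers alike:
count_a, count_b, count_ab = float(30), float(40), float(20)
```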
# normalized PMI, requires corpus size
@staticmethod
def npmi_scorer(worda_count, wordb_count, bigram_count, corpus_word_count=0.0):
    pa = worda_count / corpus_word_count
Is this meant to be an integer or float division? (ditto below)
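Under Python 2 semantics that line truncates to 0 whenever the counts are ints. A hedged sketch of the scorer with explicit casts added (the casts are mine, not the PR's code):

```python
import math

def npmi_scorer(worda_count, wordb_count, bigram_count, corpus_word_count=1.0):
    # Explicit float casts so integer counts can't truncate the probabilities.
    pa = float(worda_count) / corpus_word_count
    pb = float(wordb_count) / corpus_word_count
    pab = float(bigram_count) / corpus_word_count
    # NPMI = PMI / -log p(a, b), bounded in [-1, 1].
    return math.log(pab / (pa * pb)) / -math.log(pab)

score = npmi_scorer(30, 40, 20, corpus_word_count=50000)
```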
@piskvorky Thank you for your comments. I'll go through these and respond with some specifics sometime next week; I'm on vacation now.
Sounds good. Enjoy your vacation :)
Question @piskvorky: what's the best way to make these changes (and the changes to the other PR)? Submit another PR? Or is there a way to update this PR even though it has already been merged?
@michaelwsherman Please submit as another PR.
I haven't forgotten about this, just really swamped right now at work. Do you have an expected date for the next release? I'll do my best to get these fixes (and the fixes from #1423) into a new PR before then. Sorry.
No problem @michaelwsherman, I think the next release will be in the first week of September.
Fixes for @piskvorky's comments are in PR #1573.
First attempt at alternative scoring methods, with tests, based on issue #1363
Currently uses a string to specify the scoring. That can be changed, but I'd like some feedback first.
Different scoring methods require different inputs, for example the default scoring requires the min_count setting and the length of the vocabulary, and npmi requires a count of all the words in the corpus (which is now counted in learn_vocab, and presumably slows learn_vocab down a tiny bit).
My main argument against pluggable scoring functions is that since different scoring methods require different corpus-based or settings-based constants, all pluggable functions would have to take all these parameters even if they weren't used. Further, if you wanted a scoring function that used a different corpus-based constant, you'd have to implement the determination/counting of that constant somewhere like learn_vocab. New corpus-based constants could also cause backwards compatibility issues if they become part of the standard scoring function call (although there might be a clever way around this by requiring all scoring functions to inherit from a parent function, which could then be updated to fill in missing parameters in existing scoring functions as new ones are created). See the remarks in the code for more about how pluggable functions might work.
My second argument against pluggable functions is that the current implementation as a string setting is more straightforward, and that it might be overkill to create cool pluggable scoring functionality that may never/rarely get used. And even if there is demand for more scoring options later, after a few other ad-hoc scoring functions are implemented the best design of pluggable scoring functions will be more obvious. Backwards compatibility could easily be kept by creating a new scoring_function input to phrases that overrides scorer.
It's very possible I'm overthinking this and pluggable scoring isn't so complicated. I'm happy to implement something if the feeling is there's a clear best way to do it.
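The "parent function" idea floated above could look something like this base-class sketch (purely hypothetical names, nothing gensim ships): new corpus constants can be added to the call later without breaking scorers written before those constants existed.

```python
class Scorer(object):
    """Hypothetical base class: subclasses declare which corpus constants
    they consume, so the caller can always pass its full, growing set."""
    required_constants = ()

    def __call__(self, worda_count, wordb_count, bigram_count, **constants):
        # Keep only the constants this scorer declared; ignore newer ones.
        picked = {name: constants[name] for name in self.required_constants}
        return self.score(worda_count, wordb_count, bigram_count, **picked)

    def score(self, worda_count, wordb_count, bigram_count, **kwargs):
        raise NotImplementedError

class OriginalScorer(Scorer):
    required_constants = ('len_vocab', 'min_count')

    def score(self, worda_count, wordb_count, bigram_count, len_vocab, min_count):
        return (bigram_count - min_count) / worda_count / wordb_count * len_vocab

# The caller passes everything it knows; the scorer keeps only what it declared:
s = OriginalScorer()(30.0, 40.0, 20.0,
                     len_vocab=1000.0, min_count=5.0, corpus_word_count=50000.0)
```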