models.Phrases multiple scoring methods (#1363) #1464
Conversation
Now with a scoring parameter to initialize a Phrases object. It defaults to the scoring from the Mikolov paper, but is also switchable to 'npmi' (normalized pointwise mutual information). The scoring calculation was moved into a function call; scoring functions are now top-level functions in models.Phrases, called when calculating scores in models.Phrases.export_phrases.
Fixed some bugs with the pluggable scoring that were causing tests to fail.
count_a = float(vocab[word_a])
count_b = float(vocab[word_b])
count_ab = float(vocab[bigram_word])
score = scoring_function(count_a, count_b, count_ab)
A pluggable scoring function would have to be called with all corpus constants and Phrases settings used in any scoring function. Right now that would look like:
score = scoring_function(count_a, count_b, count_ab, min_count, len_vocab, corpus_word_count)
.
And the call would grow as the universe of variables considered by all scoring functions grows.
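One way to keep that call from growing (a sketch under the assumption that scorers accept keyword arguments; not what this PR implements) is to bundle every corpus constant once and have each scorer swallow the ones it doesn't use:

```python
# Sketch: the caller passes every corpus/settings constant it knows about;
# each scorer declares the ones it needs and ignores the rest via **unused.
def original_scorer(worda_count, wordb_count, bigram_count,
                    len_vocab=0.0, min_count=0.0, **unused):
    return (bigram_count - min_count) / worda_count / wordb_count * len_vocab

corpus_constants = dict(len_vocab=1000.0, min_count=5.0, corpus_word_count=50000.0)

# The call site stays fixed even as new constants are added later:
score = original_scorer(30.0, 40.0, 20.0, **corpus_constants)
```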
I think that's still preferable. This string-passing seems inflexible.
We could support some common use-cases by passing a string, but the code underneath should simply translate that string into a scoring_function and work with that underneath. Custom scoring_functions should be supported IMO.
In other words, we could support both string and callable as param. If string, gensim converts that to a known callable (for easy-to-use common cases).
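A minimal sketch of that dispatch (the helper name `resolve_scoring` and the placeholder formula are hypothetical, not code from this PR):

```python
def resolve_scoring(scoring, known_scorers):
    """Return a scoring callable: translate a known string into one of the
    built-in scorers, or accept a user-supplied callable directly."""
    if callable(scoring):
        return scoring
    try:
        return known_scorers[scoring]
    except KeyError:
        raise ValueError('scoring must be a callable or one of %s, got %r'
                         % (sorted(known_scorers), scoring))

# Easy common case via string, custom case via callable:
known = {'default': lambda a, b, ab: (ab - 5.0) / a / b * 1000.0}
default_fn = resolve_scoring('default', known)
custom_fn = resolve_scoring(lambda a, b, ab: ab / (a * b), known)
```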
I will make this change, hopefully before the end of the week, and make it part of a PR.
Looks good, thank you @michaelwsherman 👍
I see this was already merged, but some changes are necessary.
total vocabulary size.
`threshold` represents a score threshold for forming the phrases (higher means
fewer phrases). A phrase of words `a` followed by `b` is accepted if the score of the
phrase is greater than threshold. see the `scoring' setting
Capitalize first word in sentence, end in full stop.
@@ -197,8 +221,10 @@ def add_vocab(self, sentences):
    # directly, but gives the new sentences a fighting chance to collect
    # sufficient counts, before being pruned out by the (large) accummulated
    # counts collected in previous learn_vocab runs.
    min_reduce, vocab = self.learn_vocab(sentences, self.max_vocab_size, self.delimiter, self.progress_per)
    min_reduce, vocab, total_words = \
        self.learn_vocab(sentences, self.max_vocab_size, self.delimiter, self.progress_per)
Code style: bad indentation (unneeded line break).
What's the number of columns we cap at? I thought it was 100, which I believe this exceeded.
There's no hard limit; if the line becomes hard to read, we break it.
If the break would be even harder to read than the original (for semantic/visual/clarity reasons), we don't break it.
Line continuations are indented at one extra level (4 spaces to the right).
if scoring == 'default':
    scoring_function = \
        partial(self.original_scorer, len_vocab=float(len(vocab)), min_count=float(min_count))
Indentation (unneeded line break).
        partial(self.original_scorer, len_vocab=float(len(vocab)), min_count=float(min_count))
elif scoring == 'npmi':
    scoring_function = \
        partial(self.npmi_scorer, corpus_word_count=corpus_word_count)
Indentation (unneeded line break).
if as_tuples:
    yield ((word_a, word_b), score)
else:
    yield (out_delimiter.join((word_a, word_b)), score)
last_bigram = True
continue
last_bigram = False
last_bigram = False
Is this on purpose? What is this change about?
Yes, this is on purpose. Matches up to line 277. If that test fails we have to set last_bigram to false. This positioning sets it to false always; the only time it gets set to true is in line 293, when a passing bigram is found.
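A stripped-down sketch of that control flow (illustrative only, not gensim's actual loop; `find_bigrams` and `passes` are hypothetical names):

```python
def find_bigrams(tokens, passes):
    """Yield adjacent pairs that pass the score test. last_bigram resets
    to False on every iteration that doesn't emit, so a token is never
    consumed by two overlapping bigrams."""
    last_bigram = False
    for word_a, word_b in zip(tokens, tokens[1:]):
        if not last_bigram and passes(word_a, word_b):
            yield (word_a, word_b)
            last_bigram = True
            continue
        last_bigram = False

# ('york', 'city') also passes, but 'york' was consumed by ('new', 'york'):
pairs = list(find_bigrams(['new', 'york', 'city', 'hall'],
                          lambda a, b: (a, b) in {('new', 'york'), ('york', 'city')}))
```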
Aha, so this is a bug fix at the same time. Thanks! CC @menshikh-iv
# len_vocab and min_count set so functools.partial works
@staticmethod
def original_scorer(worda_count, wordb_count, bigram_count, len_vocab=0.0, min_count=0.0):
    return (bigram_count - min_count) / worda_count / wordb_count * len_vocab
Beware of integer divisions - this code is brittle.
I didn't fix this in PR #1573. Rather, I just cast everything before calling the scoring method in Phrases and Phraser. I think that's the better place to do the casting, since it then fixes the problem for all custom scorers as well.
Of course, I can do the casting in the scoring methods too. Let me know if you still think I need it here and in npmi_scorer, and I'll update PR #1573. It's extra steps, but I'd assume the performance hit is infinitesimal.
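For context, the brittleness comes from Python 2 (which gensim supported at the time), where `/` between ints truncated the way Python 3's `//` does. A tiny illustration of why an uncast score can silently become 0, and of the call-site cast:

```python
# Python 2's `/` on ints truncated like Python 3's `//` does:
truncated = (20 - 5) // 30 // 40 * 1000       # integer division: score lost
exact = (20.0 - 5.0) / 30.0 / 40.0 * 1000.0   # float division keeps the signal

# Casting the counts once at the call site (the approach described above)
# protects built-in and custom scorers alike:
count_a, count_b, count_ab = float(30), float(40), float(20)
```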
# normalized PMI, requires corpus size
@staticmethod
def npmi_scorer(worda_count, wordb_count, bigram_count, corpus_word_count=0.0):
    pa = worda_count / corpus_word_count
Is this meant to be an integer or float division? (ditto below)
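Under Python 2 semantics that line truncates to 0 whenever the counts are ints. A hedged sketch of the scorer with explicit casts added (the casts are mine, not the PR's code):

```python
import math

def npmi_scorer(worda_count, wordb_count, bigram_count, corpus_word_count=1.0):
    # Explicit float casts so integer counts can't truncate the probabilities.
    pa = float(worda_count) / corpus_word_count
    pb = float(wordb_count) / corpus_word_count
    pab = float(bigram_count) / corpus_word_count
    # NPMI = PMI / -log p(a, b), bounded in [-1, 1].
    return math.log(pab / (pa * pb)) / -math.log(pab)

score = npmi_scorer(30, 40, 20, corpus_word_count=50000)
```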
@piskvorky Thank you for your comments. I'll go through these and respond with some specifics sometime next week; I'm on vacation now.
Sounds good. Enjoy your vacation :)
Question @piskvorky: what's the best way to make these changes (and the changes to the other PR)? Submit another PR? Or is there a way to update this PR even though it has already been merged?
@michaelwsherman Please submit as another PR.
I haven't forgotten about this, just really swamped right now at work. Do you have an expected date for the next release? I'll do my best to get these fixes (and the fixes from #1423) into a new PR before then. Sorry.
No problem @michaelwsherman, I think the next release will be in the first week of September.
Fixes for @piskvorky's comments are in PR #1573.
First attempt at alternative scoring methods, with tests, based on issue #1363
Currently uses a string to specify the scoring. That can be changed, but I'd like some feedback first.
Different scoring methods require different inputs, for example the default scoring requires the min_count setting and the length of the vocabulary, and npmi requires a count of all the words in the corpus (which is now counted in learn_vocab, and presumably slows learn_vocab down a tiny bit).
My main argument against pluggable scoring functions is that since different scoring methods require different corpus-based or settings-based constants, all pluggable functions would have to take all these parameters even if they weren't used. Further, if you wanted a scoring function that used a different corpus-based constant, you'd have to implement the determination/counting of that constant somewhere like learn_vocab. New corpus-based constants could also cause backwards compatibility issues if they become part of the standard scoring function call (although there might be a clever way around this by requiring all scoring functions to inherit from a parent function, which could then be updated to fill in missing parameters in existing scoring functions as new ones are created). See the remarks in the code for more about how pluggable functions might work.
My second argument against pluggable functions is that the current implementation as a string setting is more straightforward, and that it might be overkill to create cool pluggable scoring functionality that may never/rarely get used. And even if there is demand for more scoring options later, after a few other ad-hoc scoring functions are implemented the best design of pluggable scoring functions will be more obvious. Backwards compatibility could easily be kept by creating a new scoring_function input to phrases that overrides scorer.
It's very possible I'm overthinking this and pluggable scoring isn't so complicated. I'm happy to implement something if the feeling is there's a clear best way to do it.
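The "parent function" idea floated above could look something like this base-class sketch (purely hypothetical names, nothing gensim ships): new corpus constants can be added to the call later without breaking scorers written before those constants existed.

```python
class Scorer(object):
    """Hypothetical base class: subclasses declare which corpus constants
    they consume, so the caller can always pass its full, growing set."""
    required_constants = ()

    def __call__(self, worda_count, wordb_count, bigram_count, **constants):
        # Keep only the constants this scorer declared; ignore newer ones.
        picked = {name: constants[name] for name in self.required_constants}
        return self.score(worda_count, wordb_count, bigram_count, **picked)

    def score(self, worda_count, wordb_count, bigram_count, **kwargs):
        raise NotImplementedError

class OriginalScorer(Scorer):
    required_constants = ('len_vocab', 'min_count')

    def score(self, worda_count, wordb_count, bigram_count, len_vocab, min_count):
        return (bigram_count - min_count) / worda_count / wordb_count * len_vocab

# The caller passes everything it knows; the scorer keeps only what it declared:
s = OriginalScorer()(30.0, 40.0, 20.0,
                     len_vocab=1000.0, min_count=5.0, corpus_word_count=50000.0)
```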