Implementation of Montemurro and Zanette's entropy based keyword extraction algorithm #1738
Conversation
Please also add tests for this functionality + update https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/summarization_tutorial.ipynb
gensim/summarization/mz_entropy.py
Outdated
import scipy

def mz_keywords(text,blocksize=1024,scores=False,split=False,weighted=True,threshold=0.0):
    """Extract keywords from text using the Montemurro and Zanette entropy algorithm.
Please use numpy-style format for docstrings (here and everywhere).
Given that the return type varies according to the scores and split arguments, what should I put in the Returns section?
Enumerate them all, and add a description of the condition under which each type is returned.
Can you point me to an example in the existing docs?
@PeteBleackley sorry, but I have no example for this case.
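For what it's worth, a numpy-style `Returns` section that enumerates conditional return types could look like the toy sketch below. The function and its behaviour here are hypothetical stand-ins, not gensim's actual API:

```python
def keywords_stub(text, scores=False, split=False):
    """Toy extractor illustrating a conditional ``Returns`` section.

    Returns
    -------
    result : str
        Keywords joined by newlines, if ``scores=False`` and ``split=False``.
    result : list of str
        Keywords as a list, if ``split=True`` and ``scores=False``.
    result : list of (str, float)
        ``(keyword, score)`` pairs, if ``scores=True``.
    """
    words = sorted(set(text.split()))
    if scores:
        return [(w, float(len(w))) for w in words]  # dummy scores
    if split:
        return words
    return "\n".join(words)
```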
gensim/summarization/mz_entropy.py
Outdated
    'auto' calculates the threshold as
    nblocks/(nblocks+1.0)
    Use 'auto' with weighted=False)"""
    text=to_unicode(text)
Many PEP8 issues (on almost every line); see the checker log.
gensim/summarization/mz_entropy.py
Outdated
logp=numpy.log2(p)
H=numpy.nan_to_num((p*logp),0.0).sum(axis=0)

def log_combinations(n,m):
Better to move the definitions of log_combinations, marginal_prob and marginal outside the function, and start their names with _.
Refactoring the core functionality into a private class to address this issue
Yes, that's a possible option.
gensim/summarization/mz_entropy.py
Outdated
text=to_unicode(text)
words=[word for word in _tokenize_by_word(text)]
vocab=sorted(set(words))
wordcounts=numpy.array([[words[i:i+blocksize].count(word) for word in vocab]
Use hanging indents (instead of vertical), here and everywhere.
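For reference, the difference the reviewer is pointing at, shown on a made-up snippet of the same shape as the code above:

```python
import numpy

words = "the cat sat on the mat".split()
vocab = sorted(set(words))
blocksize = 3

# vertical indent: continuation lines aligned under the opening bracket
wordcounts_v = numpy.array([[words[i:i + blocksize].count(word)
                             for word in vocab]
                            for i in range(0, len(words), blocksize)])

# hanging indent: open the bracket, then indent the continuation one level
wordcounts_h = numpy.array([
    [words[i:i + blocksize].count(word) for word in vocab]
    for i in range(0, len(words), blocksize)
])

assert (wordcounts_v == wordcounts_h).all()  # same result either way
```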
ping @PeteBleackley
I have fixed the style issues, refactored the inner functions and written a test. I just need to update the tutorial now.
Don't forget to push your commits @PeteBleackley :) (sorry, misclick)
Bother, I've got a nasty merge conflict in the tutorial now.
Oh, resolving merge conflicts in notebooks is painful; I advise you to take the versions from gensim and make your changes again.
More or less fixed now, except that all the newlines are double-escaped.
Pushed changes.
Please continue the improvements; also look at https://travis-ci.org/RaRe-Technologies/gensim/jobs/307965705#L833 (something has gone wrong with your algorithm).
@@ -0,0 +1,512 @@
{
Incorrect file, please remove it.
"cell_type": "markdown",
"metadata": {},
"source": [
"By default, the algorithm weights the entropy by the overall frequency of the word in the document. We can remove this weighting by setting weighted=False"
Where is the output of your code? Descriptions are needed too, of course, but users want to see the result (how it works).
gensim/test/test_summarization.py
Outdated
@@ -147,6 +147,22 @@ def test_keywords_runs(self):

kwds_lst = keywords(text, split=True)
self.assertTrue(len(kwds_lst))

def test_mz_keywords(self):
Need to add more concrete tests (not only a "sanity check"); I mean tests with concrete expected words.
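A "concrete" test in the sense requested might look like this. A toy extractor stands in for mz_keywords so the pattern is runnable here; both the stand-in and the expected words are placeholders:

```python
from collections import Counter

def toy_keywords(text, split=False):
    # hypothetical stand-in for mz_keywords, just so the pattern runs:
    # the two most frequent tokens are treated as keywords
    counts = Counter(text.lower().split())
    kwds = [w for w, _ in counts.most_common(2)]
    return kwds if split else "\n".join(kwds)

def test_keywords_concrete():
    text = "bank loan bank interest bank river rain loan"
    kwds = toy_keywords(text, split=True)
    # assert on concrete expected words, not merely len(kwds) > 0
    assert "bank" in kwds
    assert "loan" in kwds

test_keywords_concrete()
```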
gensim/summarization/mz_entropy.py
Outdated
text = to_unicode(text)
words = [word for word in _tokenize_by_word(text)]
vocab = sorted(set(words))
wordcounts = numpy.array([[words[i:i+blocksize].count(word)
This line isn't very long, so no line break is needed here (here and below, for range and weights).
@@ -0,0 +1,118 @@
#!/usr/bin/env python
The coding style is better now, but it still contains some problems; please have a look at the flake8 log - https://travis-ci.org/RaRe-Technologies/gensim/jobs/307965704#L517
I have fixed the last set of requested changes, but there are some issues with a set of tests that I didn't write.
All Flake8 issues are fixed now, but the tests are timing out. What do you suggest?
gensim/test/test_summarization.py
Outdated
self.assertTrue(len(kwds_lst))
kwds_auto = mz_keywords(text, scores=True, weighted=False,
                        threshold='auto')
self.assertTrue(kwds_auto[-1][1] > 329.0 / 330.0)
What does this "magic" 329.0 / 330.0 mean?
The document I'm using for testing divides into 329 blocks, when the default block size of 1024 words is used. When threshold='auto', the threshold is calculated as nblocks/(nblocks+1). This is in the docstring for mz_entropy, but I'll add a comment to the test.
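Following that explanation, the magic number could be replaced by a named, commented constant in the test. A sketch:

```python
# the test document splits into 329 blocks at the default blocksize of
# 1024 words, so threshold='auto' resolves to nblocks / (nblocks + 1.0)
NBLOCKS = 329
AUTO_THRESHOLD = NBLOCKS / (NBLOCKS + 1.0)  # == 329.0 / 330.0
```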
@@ -148,6 +148,26 @@ def test_keywords_runs(self):
kwds_lst = keywords(text, split=True)
self.assertTrue(len(kwds_lst))

def test_mz_keywords(self):
Travis hangs because your implementation is really slow (the test doesn't finish within 10 minutes). You can make several changes:
- Try to optimize mz_keywords
- Use a really small corpus (a micro-sample from the current dataset, for example)
I'll try using the first 10240 words.
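Trimming the corpus as described might be as simple as the sketch below; the tokenisation here is plain whitespace splitting, not gensim's tokenizer:

```python
def head_words(text, n=10240):
    # keep only the first n whitespace-separated tokens so the
    # test corpus stays small enough for Travis
    return " ".join(text.split()[:n])
```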
All tests passing now.
Thanks @PeteBleackley, nice first contribution 🔥 👍 !
Implemented as per #665