Implementation of Montemurro and Zanette's entropy based keyword extraction algorithm #1738
Conversation
Please also add tests for this functionality + update https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/summarization_tutorial.ipynb
gensim/summarization/mz_entropy.py
Outdated
import scipy

def mz_keywords(text,blocksize=1024,scores=False,split=False,weighted=True,threshold=0.0):
    """Extract keywords from text using the Montemurro and Zanette entropy algorithm.
Please use numpy-style format for docstrings (here and everywhere).
Given that the return type varies according to the scores and split arguments, what should I put in the Returns section?
Enumerate them all, and add a description of the condition under which each type is returned.
Can you point me to an example in the existing docs?
@PeteBleackley sorry, but I have no example for this case.
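For what it's worth, a numpy-style `Returns` section that enumerates conditional return types could look like the toy sketch below. The function and its behaviour here are hypothetical stand-ins, not gensim's actual API:

```python
def keywords_stub(text, scores=False, split=False):
    """Toy extractor illustrating a conditional ``Returns`` section.

    Returns
    -------
    result : str
        Keywords joined by newlines, if ``scores=False`` and ``split=False``.
    result : list of str
        Keywords as a list, if ``split=True`` and ``scores=False``.
    result : list of (str, float)
        ``(keyword, score)`` pairs, if ``scores=True``.
    """
    words = sorted(set(text.split()))
    if scores:
        return [(w, float(len(w))) for w in words]  # dummy scores
    if split:
        return words
    return "\n".join(words)
```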
gensim/summarization/mz_entropy.py
Outdated
    'auto' calculates the threshold as
    nblocks/(nblocks+1.0)
    Use 'auto' with weighted=False)"""
    text=to_unicode(text)
Many PEP8 issues (on almost every line); see the checker log.
gensim/summarization/mz_entropy.py
Outdated
logp=numpy.log2(p)
H=numpy.nan_to_num((p*logp),0.0).sum(axis=0)

def log_combinations(n,m):
Better to move the definitions of log_combinations, marginal_prob and marginal outside the function, and start their names with _.
Refactoring the core functionality into a private class to address this issue
Yes, that's a possible option.
gensim/summarization/mz_entropy.py
Outdated
text=to_unicode(text)
words=[word for word in _tokenize_by_word(text)]
vocab=sorted(set(words))
wordcounts=numpy.array([[words[i:i+blocksize].count(word) for word in vocab]
Use hanging indents (instead of vertical), here and everywhere.
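For reference, the difference the reviewer is pointing at, shown on a made-up snippet of the same shape as the code above:

```python
import numpy

words = "the cat sat on the mat".split()
vocab = sorted(set(words))
blocksize = 3

# vertical indent: continuation lines aligned under the opening bracket
wordcounts_v = numpy.array([[words[i:i + blocksize].count(word)
                             for word in vocab]
                            for i in range(0, len(words), blocksize)])

# hanging indent: open the bracket, then indent the continuation one level
wordcounts_h = numpy.array([
    [words[i:i + blocksize].count(word) for word in vocab]
    for i in range(0, len(words), blocksize)
])

assert (wordcounts_v == wordcounts_h).all()  # same result either way
```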
ping @PeteBleackley
I have fixed the style issues, refactored the inner functions and written a test. I just need to update the tutorial now.
Don't forget to push your commits @PeteBleackley :) (sorry, misclick)
Bother, I've got a nasty merge conflict in the tutorial now.
Oh, resolving merge conflicts in notebooks is painful; I advise you to take the versions from gensim and make your changes again.
More or less fixed now, except that all the newlines are double-escaped.
Pushed changes.
Please continue the improvements; also look at https://travis-ci.org/RaRe-Technologies/gensim/jobs/307965705#L833 (something has gone wrong with your algorithm).
@@ -0,0 +1,512 @@
{
Incorrect file, please remove it.
"cell_type": "markdown",
"metadata": {},
"source": [
"By default, the algorithm weights the entropy by the overall frequency of the word in the document. We can remove this weighting by setting weighted=False"
Where is the output of your code? Descriptions are needed too, of course, but users want to see the result (how it works).
gensim/test/test_summarization.py
Outdated
@@ -147,6 +147,22 @@ def test_keywords_runs(self):

kwds_lst = keywords(text, split=True)
self.assertTrue(len(kwds_lst))

def test_mz_keywords(self):
Need to add more concrete tests (not only a "sanity check"); I mean tests with concrete expected words.
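A "concrete" test in the sense requested might look like this. A toy extractor stands in for mz_keywords so the pattern is runnable here; both the stand-in and the expected words are placeholders:

```python
from collections import Counter

def toy_keywords(text, split=False):
    # hypothetical stand-in for mz_keywords, just so the pattern runs:
    # the two most frequent tokens are treated as keywords
    counts = Counter(text.lower().split())
    kwds = [w for w, _ in counts.most_common(2)]
    return kwds if split else "\n".join(kwds)

def test_keywords_concrete():
    text = "bank loan bank interest bank river rain loan"
    kwds = toy_keywords(text, split=True)
    # assert on concrete expected words, not merely len(kwds) > 0
    assert "bank" in kwds
    assert "loan" in kwds

test_keywords_concrete()
```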
gensim/summarization/mz_entropy.py
Outdated
text = to_unicode(text)
words = [word for word in _tokenize_by_word(text)]
vocab = sorted(set(words))
wordcounts = numpy.array([[words[i:i+blocksize].count(word)
This line isn't very long, so no line break is needed here (here and below, for range and weights).
@@ -0,0 +1,118 @@
#!/usr/bin/env python
The coding style is better now, but it still contains some problems; please have a look at the flake8 log - https://travis-ci.org/RaRe-Technologies/gensim/jobs/307965704#L517
I have fixed the last set of requested changes, but there are some issues with a set of tests that I didn't write.
All Flake8 issues are fixed now, but the tests are timing out. What do you suggest?
gensim/test/test_summarization.py
Outdated
self.assertTrue(len(kwds_lst))
kwds_auto = mz_keywords(text, scores=True, weighted=False,
                        threshold='auto')
self.assertTrue(kwds_auto[-1][1] > 329.0 / 330.0)
What does this "magic" 329.0 / 330.0 mean?
The document I'm using for testing divides into 329 blocks, when the default block size of 1024 words is used. When threshold='auto', the threshold is calculated as nblocks/(nblocks+1). This is in the docstring for mz_entropy, but I'll add a comment to the test.
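Following that explanation, the magic number could be replaced by a named, commented constant in the test. A sketch:

```python
# the test document splits into 329 blocks at the default blocksize of
# 1024 words, so threshold='auto' resolves to nblocks / (nblocks + 1.0)
NBLOCKS = 329
AUTO_THRESHOLD = NBLOCKS / (NBLOCKS + 1.0)  # == 329.0 / 330.0
```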
@@ -148,6 +148,26 @@ def test_keywords_runs(self):
kwds_lst = keywords(text, split=True)
self.assertTrue(len(kwds_lst))

def test_mz_keywords(self):
Travis hangs because your implementation is really slow (the test doesn't finish within 10 minutes). You can make several changes:
- Try to optimize mz_keywords
- Use a really small corpus (a micro-sample from the current dataset, for example)
I'll try using the first 10240 words.
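Trimming the corpus as described might be as simple as the sketch below; the tokenisation here is plain whitespace splitting, not gensim's tokenizer:

```python
def head_words(text, n=10240):
    # keep only the first n whitespace-separated tokens so the
    # test corpus stays small enough for Travis
    return " ".join(text.split()[:n])
```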
All tests passing now.
Thanks @PeteBleackley, nice first contribution 🔥 👍 !
Implemented as per #665