Find keywords using entropy with Montemurro and Zanette algorithm #665

Closed
tmylk opened this issue Apr 12, 2016 · 12 comments
Labels
difficulty medium Medium issue: required good gensim understanding & python skills feature Issue described a new feature wishlist Feature request

Comments

@tmylk
Contributor

tmylk commented Apr 12, 2016

The algorithm identifies words that are significant to the structure of the document - these often correspond to the major themes. It does so independently of a corpus.

Dr Peter J. Bleackley has kindly offered his implementation.

@tmylk tmylk added feature Issue described a new feature wishlist Feature request labels Apr 12, 2016
@PeteBleackley
Contributor

However, I feel that it will need a few modifications to its API to fit in nicely with the rest of Gensim, and I would like some advice from the core Gensim developers.
My questions are:

  1. Where would be the best place to fit this algorithm into the Gensim project structure?
  2. In what format should the algorithm ingest data? The current implementation is designed for XML, mainly for historic reasons.
  3. In what format should the algorithm return its results?

Once I have answers to these questions, it shouldn't take too long to modify my code accordingly.

@tmylk
Contributor Author

tmylk commented Jul 6, 2016

Nice meeting you again yesterday. I will put this algo on our student page.

@PeteBleackley
Contributor

I'll be happy to advise any student who takes this project on.

@bhargavvader
Contributor

If there is interest in this and no one else wishes to take it up, I would like to give it a shot. :)

@piskvorky
Owner

@bhargavvader sounds good, thanks!

@tmylk can you add some context to this ticket? What is "Montemurro and Zanette algorithm"?

@tmylk tmylk changed the title from "Add Montemurro and Zanette algorithm" to "Find keywords using entropy with Montemurro and Zanette algorithm" on Nov 8, 2016
@PeteBleackley
Contributor

Here's a link to a paper describing the algorithm.

https://arxiv.org/abs/0907.1558

@piskvorky
Owner

@tmylk ticket context still missing, update.

@tmylk
Contributor Author

tmylk commented Dec 28, 2016

@piskvorky Could you please suggest a way to add context? The context is clear to me, with relevant links. There is even a volunteer contributor.

@piskvorky
Owner

piskvorky commented Jan 6, 2017

Sure -- something along the lines of "Here's a problem / motivation; here's what we could do to solve it".

The first part is missing -- from the link it's not apparent to me what "Montemurro and Zanette algorithm" does, and the linked implementation doesn't explain it either (that I can see).

If this is implemented in gensim, what will it actually do? Who is it for?

@PeteBleackley
Contributor

The algorithm identifies words that are significant to the structure of the document - these often correspond to the major themes. It does so independently of a corpus.
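To make that concrete, here is a minimal sketch of the idea as I read the linked paper, not the implementation proposed in this ticket. The text is cut into equal-sized blocks, and each word is scored by how much lower the entropy of its distribution over blocks is than in a shuffled copy of the same text, weighted by the word's relative frequency. The function names, the `blocksize` and `n_shuffles` parameters, and the Monte Carlo estimate of the shuffled baseline are all assumptions made for illustration; the paper uses a closed-form expression for the baseline.

```python
import math
import random
from collections import Counter


def block_entropy(counts):
    """Shannon entropy (in bits) of one word's counts across blocks."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_keywords(tokens, blocksize=100, n_shuffles=20, seed=0):
    """Rank words by how much more clustered they are than in shuffled text.

    Simplified illustration: the shuffled baseline is estimated by
    repeated shuffling rather than computed analytically.
    """
    n_blocks = len(tokens) // blocksize
    if n_blocks < 2:
        raise ValueError("need at least two full blocks of text")
    tokens = tokens[: n_blocks * blocksize]  # drop the incomplete final block

    def word_entropies(seq):
        # Count every word inside each block, then take the entropy of
        # each word's per-block counts.
        per_block = [
            Counter(seq[i * blocksize:(i + 1) * blocksize])
            for i in range(n_blocks)
        ]
        return {w: block_entropy([b[w] for b in per_block]) for w in set(seq)}

    actual = word_entropies(tokens)

    # Average entropy of each word over several random shuffles.
    rng = random.Random(seed)
    baseline = Counter()
    shuffled = list(tokens)
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        for w, h in word_entropies(shuffled).items():
            baseline[w] += h / n_shuffles

    # Words whose real distribution is much less uniform than the
    # shuffled one get high scores; evenly spread words score near zero.
    freq = Counter(tokens)
    total = len(tokens)
    scores = {w: (freq[w] / total) * (baseline[w] - actual[w]) for w in freq}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Called on a tokenised document, for example `entropy_keywords(text.split(), blocksize=1024)`, this returns `(word, score)` pairs in descending score order. Only the document itself is needed, which is what "independently of a corpus" means here.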

@piskvorky
Owner

piskvorky commented Jan 8, 2017

Aha, thanks @PeteBleackley. So this is a candidate to replace the summarization.keywords package, if I understand correctly @tmylk .

It would be interesting to compare them side-by-side, see which algo works better (and deprecate the other one -- we don't want to maintain dead weight in gensim).

Or if the algorithms have non-overlapping strengths/weaknesses, document what they are. When should users use one or the other? Is there a standard benchmark? (@tmylk Qs for the incubator project)
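A side-by-side comparison could start from something as simple as the sketch below. It assumes each extractor returns its keywords as a ranked list of strings and that a hand-made reference set of keywords exists for the test document. Both helper names are hypothetical, and no standard benchmark has been named in this thread.

```python
def top_k_overlap(ranked_a, ranked_b, k=10):
    """Jaccard overlap between the top-k words of two ranked keyword lists."""
    top_a, top_b = set(ranked_a[:k]), set(ranked_b[:k])
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 1.0


def precision_at_k(ranked, gold, k=10):
    """Fraction of the top-k extracted words found in a reference set."""
    top = ranked[:k]
    return sum(1 for w in top if w in gold) / len(top) if top else 0.0
```

A low overlap between the two algorithms, with similar precision for each, would suggest non-overlapping strengths worth documenting rather than deprecating either one.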

@menshikh-iv menshikh-iv added the difficulty medium Medium issue: required good gensim understanding & python skills label Oct 2, 2017
@PeteBleackley
Contributor

I've implemented this in #1738. However, there is a merge conflict in summarization/__init__.py that needs to be resolved.
