Discussion: discard "gensim.summarization"? #2592

Closed
gojomo opened this issue Sep 3, 2019 · 23 comments · Fixed by #2958

gojomo (Collaborator) commented Sep 3, 2019

In the course of considering the list question at https://groups.google.com/d/msg/gensim/v24RI3-oUq0/NYlPpif1AQAJ, I took a slightly-deeper look at gensim.summarization than before.

From that look, my opinion is that its presence is more likely to waste people's time than help them. It's fairly rudimentary functionality, but spread across many files, with its own non-configurable regex-based word and sentence tokenization, and a lot of hard-to-follow steps. None of the doc/tutorial examples show impressive results.

I even find it hard to imagine anyone getting satisfactory results from this approach, so I expect most people's interaction with this code is: (1) "I need summarization – and cool, gensim has a summarization feature!" (2) View its docs/tutorial and try it on some real data. (3) "This is nowhere near what I need, nor is it customizable/fixable enough to be tweaked into service." (4) They look for something else entirely.

I'd suggest marking the whole module 'deprecated' with an eye towards eventual removal. And, if summarization is an important thing to truly support, soliciting someone to work-up a better algorithm or implementation, one that can actually demo some useful results in a tutorial/demo, and that also mixes well with other corpus-format/tokenization practices in gensim. (It might even be TextRank-based – but with configurable tokenization & sentence-similarity/graph-building steps.)

piskvorky (Owner) commented Sep 3, 2019

+1 on that. IIRC, the algo is actually OK / standard, but the technical execution (engineering, design) was poor.

One of the (several) modules in Gensim I'd be scared to use myself, and consequently never did.

Discussions go on the mailing list though, why did you open it here?

gojomo (Collaborator, Author) commented Sep 3, 2019

Opened this here because this seemed to me more like a committer-level discussion regarding quality/standards/policies. Also, it'd ideally yield tangible issue-like followup steps, if there was agreement, for which the issue could then record the motivating reasoning & decisions. That's a bit like the prior GH-issue to discuss when/whether Python2-support should be dropped, or the GH-issue asking whether issues themselves should auto-close after deadlines. It's essentially a "feature request" in reverse: a "de-feature request". But happy to discuss there instead or also, as appropriate.

I've generally not been too impressed with "extractive summarization" – it seems to only be useful when the original text was already well authored, in a hierarchical & expository "reference" style. There, extractive summarization has a fair chance of finding the inherently-summarizing sentences/passages the author already included. (Elsewhere, it stumbles hard – as on some of the winding-plot-narratives that some of the tutorial code for this feature has inexplicably chosen to highlight.)

So to the extent TextRank or some other extractive method survives, it'd be helpful to more specifically set expectations. For example, get the name of algorithm (textrank) into the module or function-name, or the type of summarization (extractive), or the essential limitation on its kind of output (sentences_subset).

And, docs/tutorials could highlight some kinds of texts on which it works well, and others where it doesn't. (One potential evaluative method, for a method that's not order-dependent in its choice of sentences: shuffle all the sentences in a Wikipedia article together, run the algorithm, consider those algorithms that choose more sentences from the article's actual 'summary' section-number-0 to be better.)
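That shuffle-based evaluation idea can be sketched in a few lines. The `summarize(sentences, k)` callable below is a hypothetical stand-in for whatever extractive method is under test, not an existing gensim API:

```python
import random

def lead_section_overlap(article_sections, summarize, k=5, seed=0):
    """Evaluate an order-independent extractive summarizer.

    article_sections: list of lists of sentences; section 0 is the
    article's own human-written 'summary' (lead) section.
    summarize: hypothetical callable taking (sentences, k) and
    returning k chosen sentences.
    Returns the fraction of chosen sentences drawn from section 0.
    """
    lead = set(article_sections[0])
    sentences = [s for section in article_sections for s in section]
    random.Random(seed).shuffle(sentences)  # destroy positional cues
    chosen = summarize(sentences, k)
    return sum(1 for s in chosen if s in lead) / max(len(chosen), 1)
```

An algorithm that scores higher here more often recovers the sentences the article's own author considered summary-worthy.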

From what I've read of TextRank, it seems its method of calculating sentence-to-sentence similarity (and thus the edges on its sentence-to-sentence graph) could be pluggable, and methods based on average-of-word-vectors, or doc-vectors, or WMD-similarity might work quite well compared to the current code (which if I've read right just checks nearly-exact-word overlap).
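A TextRank-style ranker with that pluggable similarity might look like the following sketch (a plain power-iteration PageRank over a sentence-similarity graph; the `similarity` callable is the swappable piece, and nothing here is an existing gensim API):

```python
def textrank_sentences(sentences, similarity, top_n=2, damping=0.85, iters=50):
    """Rank sentences by PageRank over a similarity graph.

    similarity(a, b) -> float edge weight between two sentences; swap in
    word-overlap, average-of-word-vectors cosine, WMD-similarity, etc.
    """
    n = len(sentences)
    w = [[similarity(a, b) if i != j else 0.0
          for j, b in enumerate(sentences)] for i, a in enumerate(sentences)]
    out = [sum(row) or 1.0 for row in w]   # out-degree weights (avoid /0)
    scores = [1.0 / n] * n
    for _ in range(iters):                 # power iteration
        scores = [(1 - damping) / n + damping *
                  sum(w[j][i] / out[j] * scores[j] for j in range(n))
                  for i in range(n)]
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:top_n])]  # original order
```

With `similarity = lambda a, b: len(set(a.split()) & set(b.split()))` this roughly mimics the old exact-word-overlap behavior; passing a word-vector or doc-vector cosine instead is the kind of pluggability suggested above.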

mpenkov (Collaborator) commented Sep 7, 2019

+1 for deprecation and eventual removal.

Perhaps this is something we should do in the next major release?

fredzannarbor commented

There are still a lot of places on the web that recommend using gensim.summarization, so this was not super helpful.

gojomo (Collaborator, Author) commented Apr 20, 2021

> There are still a lot of places on the web that recommend using gensim.summarization, so this was not super helpful.

@fredzannarbor It'd be helpful if you let those places know they now need to make some other better recommendation!

ismailhammounou commented

Do you have any recommendation for bm25? There is a tutorial that I want to replicate in my use case, and it still uses BM25.

gojomo (Collaborator, Author) commented May 12, 2021

> Do you have any recommendation for bm25? There is a tutorial that I want to replicate in my use case, and it still uses BM25.

If a tutorial/approach worked well with the older Gensim version, you can always choose to install & use that older version, for example in an isolated, project-specific virtual environment. Only if you also need closely-integrated later-version features or fixes would there be any complications.

(And, if you really like some of the removed code, & are sure it meets your needs, you can always copy the source code into your own project, adapting names/prerequisites lightly as necessary. Just remember that the choice to remove things has usually been driven by an assessment that the code had limitations that made it hard to officially support, often including no one active in the project with the knowledge/interest to answer questions or investigate issues.)
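That isolated-environment option can be sketched in a few shell commands; gensim 3.8.3 was the last release that still shipped gensim.summarization (including its bm25 module):

```shell
# Pin the final 3.x release in an isolated virtual environment,
# so newer projects on the same machine are unaffected.
python -m venv gensim3-env
source gensim3-env/bin/activate
pip install "gensim==3.8.3"
python -c "from gensim.summarization import summarize, bm25"  # sanity check
```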

bwindsor22 commented

:(
@gojomo so if I need, e.g., text split into sentences, I need a dependency on something like NLTK?

Would having more maintainers help in a decision like this?

gojomo (Collaborator, Author) commented Jun 16, 2021

Yes, if you need a text split by sentences, using a project that has well-maintained code for doing that is wise.

That's what Gensim itself would want to do, if any of its current algorithms needed to split text into sentences. (In general, they don't.)

The prior code for this in gensim.summarization.textcleaner.get_sentences() wasn't very good, given other better options just a pip install away.

But also, it was about 2 lines of crude regex-based string splitting. If that's all you need, it's easy to copy. See:

https://github.com/RaRe-Technologies/gensim/blob/release-3.8.3/gensim/summarization/textcleaner.py#L37

https://github.com/RaRe-Technologies/gensim/blob/release-3.8.3/gensim/summarization/textcleaner.py#L147-L173
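For anyone who only needs that crude splitting, a regex splitter in the spirit of the linked textcleaner code (an approximation, not a verbatim copy) is only a few lines:

```python
import re

# Split on sentence-ending punctuation followed by whitespace.
# Note: abbreviations like "Mr." split wrongly -- one reason
# better-maintained sentence splitters exist.
_SENT_RE = re.compile(r'(\S.+?[.!?])(?=\s+|$)', re.UNICODE)

def get_sentences(text):
    """Yield crude regex-delimited sentences from text."""
    for match in _SENT_RE.finditer(text):
        yield match.group(1)
```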

Witiko (Contributor) commented Jun 23, 2021

Although I agree with the removal of the gensim.summarization module, Okapi BM25 is the standard baseline for question answering and information retrieval, which outperforms TF-IDF and Log-Entropy even with parameter tuning.

Is there any suitable replacement for gensim.summarization in the context of information retrieval at the moment? I am aware of the rank-bm25 library, which is fast and easy to set up, but also incompatible with Gensim's Dictionary and with query-expansion techniques such as SoftCosineSimilarity. If not, would there be any objections to creating a gensim.models.bm25 module, which would provide a model with an interface similar to gensim.models.tfidf? It is sorely missed.
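For reference, the standard Okapi BM25 scoring such a module would implement can be written in a few self-contained lines. This is the textbook formula only, not the interface of the proposed gensim.models.bm25:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized doc in docs against a tokenized query.

    Textbook Okapi BM25: sum over query terms of
    IDF(t) * tf * (k1+1) / (tf + k1 * (1 - b + b * |D|/avgdl)).
    """
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))   # document frequencies
    idf = {t: math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5)) for t in df}
    scores = []
    for d in docs:
        tf = Counter(d)
        s = sum(idf.get(t, 0.0) * tf[t] * (k1 + 1) /
                (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
                for t in query)
        scores.append(s)
    return scores
```

Length normalization (the `b` term) and term-frequency saturation (the `k1` term) are what give BM25 its edge over plain TF-IDF in retrieval benchmarks.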

piskvorky (Owner) commented

+1 on including BM25 in Gensim. We'll just need to vet the code better.

But I don't expect it will be a problem with your code.

Witiko (Contributor) commented Apr 27, 2022

I implemented BM25 and opened PR #3304 on April 2. Quantitative results on information retrieval show a marked improvement over TF-IDF, and the implementation is compatible with existing ones such as rank-bm25. I would appreciate your comments and reviews.

fredzannarbor commented

How can one now accomplish summarization with gensim?

gojomo (Collaborator, Author) commented Apr 27, 2022

> How can one now accomplish summarization with gensim?

There's no summarization functionality in current versions. You could try a 3.x version, & if the results work well for you, keep using that old version, or copy its source-code into your project.

If you want state-of-the-art summarization – including potentially abstractive (paraphrasing) summarization not just a crude selection of some subset of guessed-important sentences that the previous Gensim extractive summarization provided – and have sufficient resources, you could look at newer, deeper large language models, like BERT/etc.

dogayagcizeybek commented

Artificial intelligence has been evolving rapidly, and we could enhance the functionality of a simple open-source library both algorithmically and with a database-based approach. Since it hasn't been explicitly stated that summarization must be algorithm-based, I would like to request bringing this idea back. What is your perspective on starting a pull request for it? We could add a warning during the development phase so that it doesn't consume people's time until satisfactory results are achieved.

gojomo (Collaborator, Author) commented Jul 5, 2023

> Since it hasn't been explicitly stated that summarization must be algorithm-based, I would like to request bringing back this idea. What is your perspective on starting a pull request for this thought?

Do you mean a PR to restore exactly the old code?

I think that'd be silly - it was bad code, poorly maintained, without any public examples of it providing good results, that as far as I could tell wasted the time of most people who tried it. A mere documentation or code-comment or even printed-to-console warning that the code is likely to disappoint people doesn't, in my experience, provide enough discouragement. They're still tempted by the label, or misleading old examples online – & thus waste their time, & ours.

But still, if people really want it, maybe they have one of the rare tasks where this technique has good results. (I've seen people report this, but never seen any working demo of this code, on even contrived/cherry-picked data, showing useful results.)

In that case, they can fetch the code out of the old versions. It's easy to get, it's not that long, it's open-source.

As mentioned in the initial 2019 discussion, if someone wanted to make a more-generalizable and more-maintainable implementation of the 'TextRank' algorithm on which this gensim-summarization was based, that might have a case.

With pluggable word/sentence tokenization, & pluggable/configurable sentence-centrality-ranking options, this kind of early extractive text summarization algorithm might still be useful against some well-written texts, or interesting didactically about the limits of summarization capabilities before deep neural networks.

But here in 2023+, even an excellent & flexible implementation of TextRank-style, sentence-excerpts summarization will be far worse than what's cheap & easy with modern LLMs.

fredzannarbor commented Jul 5, 2023 via email

gojomo (Collaborator, Author) commented Jul 6, 2023

You didn't clarify whether your proposal is to bring back the old code, but your allusion to 'free' suggests that might be what you are suggesting.

> 1. “Cheap and easy” is not free. Useful to have free summarization built into the package.

There was never any truly 'free' summarization in the past, nor is any possible in the future. The prior code was low quality. Users wasted time & effort, which is not free, trying to get it to work. Maintainers faced questions from frustrated users, which impose costs even when the answer is, "no help is available".

(And, with compact & open-source LLMs, those options are potentially as close to 'free' as anything else.)

> 2. Extractive summarization is an important alternative because you can rely on the words in the summary being the same as the original source. For some applications that’s essential.

I am unfamiliar with applications where using the exact same words is essential. Can you provide some links to representative applications where that's better than high-quality abstractive summarization?

As I've mentioned, I've never seen any texts on which the old code delivered good results. (Our own demo notebook showed only poor/nonsense results.) If you know of cases where this has been shown to work well, can you provide links?

To the extent someone really wanted to retrieve the "most representative" verbatim sentences from a longer work, as a sort of IR task, I suspect that applying other algorithms better-supported in Gensim – LDA, average-of-word-vectors, WMD, Doc2Vec, etc – would select better excerpts than the prior crude gensim.summarization code.

If such selection-of-verbatim-excerpts is a real need driving your request, I suggest trying some of those other algorithms.
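One way to sketch that suggestion: rank sentences by the cosine between each sentence's average word vector and the whole document's average. The `vectors` mapping below stands in for a trained model such as a gensim KeyedVectors; here it is a plain dict, and the whole function is an illustration, not an existing gensim API:

```python
import math

def top_excerpts(sentences, vectors, top_n=1):
    """Return the top_n sentences whose average word vector is most
    cosine-similar to the document's overall average vector."""
    def avg(words):
        vs = [vectors[w] for w in words if w in vectors]
        if not vs:
            return None
        return [sum(col) / len(vs) for col in zip(*vs)]

    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    doc_vec = avg([w for s in sentences for w in s.split()])
    scored = [(cos(avg(s.split()) or [0.0] * len(doc_vec), doc_vec), s)
              for s in sentences]
    scored.sort(key=lambda p: p[0], reverse=True)
    return [s for _, s in scored[:top_n]]
```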

But also, if you have any published or private evaluations showing the old gensim.summarization code doing better than extant alternatives, that would be useful to see. It doesn't seem likely, from my read & tests. (On what texts have you applied the code & reviewed its results?)

> 3. Modern LLMs still struggle with context window size. It’s crucial to have at least one tool that can summarize very long documents as a whole, ideally not constrained by memory size.

A tool that could effectively summarize arbitrarily long documents would be useful!

I've seen no evidence the old code could serve as that tool.

Among its other substandard aspects: it required entire documents in memory, and its analysis required a massive expansion in memory use. Even after the fixes in #2298, it was reported to fail with a MemoryError on a 16GB RAM machine when trying to summarize a text under 4MB in size (Tolstoy's full 'War and Peace').

If you think you've found an extractive-summarization technique that could outcompete an LLM due to an LLM's window-size limitations, I'd want to see some credible evaluations demonstrating that, including that it outperforms the most simple plausible LLM workaround: summarize acceptably-sized chunks, concatenate those summaries, repeat. It doesn't seem likely to me that any extractive approach would be competitive, but I'd enjoy being surprised if that can be shown!
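That chunk-and-recurse workaround fits in a few lines. The `summarize` callable below is a hypothetical stand-in for any LLM call, and it is assumed to actually shorten its input (otherwise the recursion would not terminate):

```python
def summarize_long(text, summarize, max_words=500):
    """Recursively summarize text longer than a model's context window.

    summarize: hypothetical callable mapping a chunk of at most
    max_words words to a shorter summary (e.g. an LLM call).
    Summarize chunks, concatenate the summaries, and repeat until
    the whole thing fits in one call.
    """
    words = text.split()
    if len(words) <= max_words:
        return summarize(text)
    chunks = [" ".join(words[i:i + max_words])
              for i in range(0, len(words), max_words)]
    partial = " ".join(summarize(c) for c in chunks)
    return summarize_long(partial, summarize, max_words)
```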

fredzannarbor commented Jul 6, 2023 via email

gojomo (Collaborator, Author) commented Jul 6, 2023

The old code is still there, free to use if it works well for your needs. (You can install an older version of Gensim, or copy & paste the relevant source code into your projects.)

And I'm still interested for actual viewable examples where it worked well – I've still never seen one.

I sympathize if your historic use might be too private/proprietary to share details, but in the absence of any public examples of this particular code working well, it's hard to justify any cost of maintenance/user-frustration.

By my understanding, quoting (to support a specific point) is very different from summarization. And trusting the old code's excerpts to reflect the original faithfully would be unwise: its technique couldn't tell whether a sentence was the document's own claim, a quote of arguments the main document was refuting, or something held up to ridicule.

And my other point remains: the other (stronger, better documented, test-case-covered, better-coded, easier-to-demonstrate) similarity-algorithms can likely find representative excerpts, to quote verbatim if that is necessary, even better than the very fragile/crude/underpowered/inefficient/idiosyncratic gensim.summarization did. Anyone who needs such functionality should try them in that role.

Simple concatenation & recursion can easily be bundled in a single function call in user code. The claim "LLMs can't do this – unless you put their operations into a simple loop of a few lines of code" isn't really the same as "LLMs can't do this".

fredzannarbor commented Jul 6, 2023 via email

gojomo (Collaborator, Author) commented Jul 6, 2023

That's helpful to know, even as anecdotal spot testing.

Can you say any more about these texts' sizes in words or sentences, and their domain/style? (EG, were they fiction/non-fiction, academic/popular/governmental, etc?)

I ask because I'm still curious where oldsummarization was providing value – none of our documentation/demo/tutorial examples showed good results, and it may be possible to match/exceed its value with a few dozen lines of other code using better-supported remaining algorithms (& more-standard tokenization functions/libraries).

So the sort of "single function or command line call" functionality you'd like might still be possible, if there were a few more hints about what reference set of texts, & baseline performance, were worth optimizing around.

fredzannarbor commented Jul 6, 2023 via email
