A transformer-based model to enhance text summarization, with a focus on rare and infrequently used words, built on the CNN/DailyMail dataset using PyTorch, NLTK, and Bidirectional and Auto-Regressive Transformers (BART).

apoorvwankar/Text-Summarization-With-Rare-Words

Text Summarization With Rare Words

With the explosive growth of internet content, automatic text summarization has become a vital tool. Handling rare and infrequently used words remains a challenge for summarization systems. This project looks at extractive, abstractive, and hybrid approaches that use transformer models and attention mechanisms, and compares the results using the well-known ROUGE metric.

Dataset

The CNN / DailyMail Dataset is a collection of English-language news articles, comprising more than 300,000 unique pieces written by journalists at CNN and the Daily Mail. Originally designed for machine reading comprehension and abstractive question answering, the dataset now supports both extractive and abstractive summarization tasks.

CNN, Daily Mail Dataset

The dataset has the following columns:

  • id: URL (in string)
  • article: Body of the news article
  • highlights: Highlight of the article as written by the author

The article and highlights features are used as 'source_text' and 'target_summary', respectively, in the training data. The target_summary feature also serves as the reference when comparing the model-generated summary of the source_text against the author-written summary.
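The column mapping described above can be sketched in a few lines of Python. This is a minimal illustration rather than the repository's actual preprocessing code: the helper name `to_training_pairs`, the whitespace stripping, and the sample record are assumptions made for the example.

```python
from typing import Dict, Iterable, List


def to_training_pairs(records: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Map dataset rows to the field names used in this README.

    Hypothetical helper: 'article' becomes 'source_text' and
    'highlights' becomes 'target_summary'. The 'id' column is dropped.
    """
    pairs = []
    for record in records:
        pairs.append(
            {
                # Stripping surrounding whitespace is an assumption, not a documented step.
                "source_text": record["article"].strip(),
                "target_summary": record["highlights"].strip(),
            }
        )
    return pairs


# Made-up record with the three columns listed above (not real dataset content).
sample_records = [
    {
        "id": "example-id",
        "article": " Some article body. ",
        "highlights": "A highlight.\n",
    }
]

training_pairs = to_training_pairs(sample_records)
print(training_pairs[0])
```

If the data is loaded through a library such as Hugging Face `datasets` (an assumption, since the README does not say how the data is loaded), the same function could be applied to the rows of a split, because each row is a dictionary keyed by column name.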

Results

ROUGE scores are used to measure the performance of the model.
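To make the metric concrete, here is a simplified sketch of ROUGE-N computed from clipped n-gram overlap between a generated summary and a reference summary. It is not the scoring code used for the reported results: the function name `rouge_n` is hypothetical, tokenization is plain lowercased whitespace splitting, and no stemming is applied, so values may differ from those of established ROUGE packages.

```python
from collections import Counter
from typing import Dict


def rouge_n(candidate: str, reference: str, n: int = 1) -> Dict[str, float]:
    """Simplified ROUGE-N: precision, recall, and F1 over n-gram overlap."""

    def ngram_counts(text: str) -> Counter:
        # Assumption: lowercased whitespace tokens, no stemming or punctuation handling.
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand_counts = ngram_counts(candidate)
    ref_counts = ngram_counts(reference)

    # Counter intersection keeps the minimum count per n-gram (clipped overlap).
    overlap = sum((cand_counts & ref_counts).values())

    cand_total = sum(cand_counts.values())
    ref_total = sum(ref_counts.values())

    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up sentences for illustration only.
generated = "the cat sat on the mat"
reference = "the cat lay on the mat"

rouge1 = rouge_n(generated, reference, n=1)
rouge2 = rouge_n(generated, reference, n=2)
print(rouge1)
print(rouge2)
```

In this toy example, five of the six unigrams and three of the five bigrams in the generated sentence also appear in the reference, so ROUGE-1 is higher than ROUGE-2. A rare word that the model drops or replaces lowers recall in the same way as any other missing word.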

Scores of the BART model:

[Image: Score]

References

Rare words in text summarization https://www.sciencedirect.com/science/article/pii/S2949719123000110
