OrangeSum Dataset

What is this repo for?

This repository provides the French summarization dataset introduced in the paper BARThez: a Skilled Pretrained French Sequence-to-Sequence Model (Kamal Eddine, Tixier, and Vazirgiannis, 2020), with the train, development, and test splits used in the paper. It also provides the code that was used to build the dataset, as well as the test set summaries generated by the BARThez, mBART, mBARThez, and CamemBERT2CamemBERT models (in ./summaries/), to make comparisons with future work as easy as possible.

Note: this repository is dedicated to the OrangeSum dataset. The main repository of the paper is https://github.com/moussaKam/BARThez.

OrangeSum

The OrangeSum dataset was inspired by the XSum dataset. It was created by scraping the "Orange Actu" website: https://actu.orange.fr/. Orange S.A. is a large French multinational telecommunications corporation, with 266M customers worldwide. Scraped pages cover almost a decade from Feb 2011 to Sep 2020. They belong to five main categories: France, world, politics, automotive, and society. The society category is itself divided into 8 subcategories: health, environment, people, culture, media, high-tech, unusual ("insolite" in French), and miscellaneous.

Each article featured a single-sentence title as well as a very brief abstract, both professionally written by the author of the article. These two fields were extracted from each page, thus creating two summarization tasks: OrangeSum Title and OrangeSum Abstract.
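
For illustration, a single scraped page could be represented roughly as follows. The field names below are hypothetical (the released code uses "heading" for the abstract, see the Notes section); the sketch is only meant to show how one article yields two document/summary pairs.

article = {
    "title": "Single-sentence title written by the journalist",
    "heading": "Very brief professional abstract of the article.",  # the "abstract"
    "text": "Full body of the article ...",
}

# OrangeSum Title:    document = article["text"], summary = article["title"]
# OrangeSum Abstract: document = article["text"], summary = article["heading"]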

As a post-processing step, we removed all empty articles and articles whose titles were shorter than 5 words. For OrangeSum Abstract, we also removed the top 10% of articles in terms of proportion of novel unigrams in the abstract, as we observed that such abstracts tended to be introductions rather than true abstracts. This corresponded to a threshold of 57% novel unigrams.
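
A minimal sketch of this filtering step in Python (illustrative only; the function and variable names are not those of the released scripts, and the article dicts are assumed to look like the example above):

def novel_ngram_fraction(summary, document, n=1):
    """Fraction of the summary's n-grams that never appear in the document."""
    def ngrams(words):
        return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    doc_ngrams = set(ngrams(document.lower().split()))
    sum_ngrams = ngrams(summary.lower().split())
    if not sum_ngrams:
        return 0.0
    return sum(g not in doc_ngrams for g in sum_ngrams) / len(sum_ngrams)

def filter_pairs(pairs):
    """Apply the post-processing filters described above (illustrative)."""
    # Drop empty articles and articles whose title has fewer than 5 words.
    pairs = [p for p in pairs if p["text"].strip() and len(p["title"].split()) >= 5]
    # Abstract task only: drop abstracts with more than 57% novel unigrams (the top ~10%).
    return [p for p in pairs if novel_ngram_fraction(p["heading"], p["text"]) <= 0.57]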

For both OrangeSum Title and OrangeSum Abstract, we set aside 1500 pairs for testing, 1500 for validation, and used all the remaining ones for training.
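
A sketch of the split (the exact selection procedure, seed, and ordering used in the paper may differ):

import random

def split_pairs(pairs, seed=0):
    """Set aside 1500 pairs for test, 1500 for validation, and keep the rest for training."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)  # illustrative seed, not necessarily the one used in the paper
    return pairs[3000:], pairs[1500:3000], pairs[:1500]  # train, valid, test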

In the table below, sizes (column 2) are given in thousands of documents, document and summary lengths in words, and vocabulary sizes in thousands of tokens.

[Table: OrangeSum statistics (sizes, document and summary lengths, vocabulary sizes) for the Title and Abstract tasks]

In the table below, it can be observed that OrangeSum offers approximately the same degree of abstractivity as XSum, and that both of them are more abstractive than traditional summarization datasets.

[Table: abstractivity comparison between OrangeSum, XSum, and traditional summarization datasets]
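
Abstractivity here refers to the proportion of novel n-grams, i.e. summary n-grams that never appear in the source document. Reusing the article record and the novel_ngram_fraction helper sketched above, per-pair numbers could be computed roughly as:

# Proportions of novel 1- to 4-grams for one Abstract pair (illustrative).
scores = {n: novel_ngram_fraction(article["heading"], article["text"], n=n) for n in range(1, 5)}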

Steps to create the dataset

Starting from an empty directory structure, run the following scripts, in this order (a minimal driver sketch follows the list):

  1. get_urls.py
  2. scrape_urls.py
  3. parse_urls.py
  4. compute_overlap.py
  5. filter_split.py
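
A minimal driver running the whole pipeline (a sketch; it assumes the scripts sit in the current directory and take no command-line arguments):

import subprocess

for script in ["get_urls.py", "scrape_urls.py", "parse_urls.py",
               "compute_overlap.py", "filter_split.py"]:
    subprocess.run(["python", script], check=True)  # stop on the first failing step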

Notes

  1. Some of the scraped articles may no longer be online. However, the raw HTML files were saved and are released here.
  2. The code and the repository sometimes use "heading"; it corresponds to the Abstract task in the paper.
  3. The dataset was augmented with a second round of scraping about two months after the initial one, to collect new articles. During this process, a line of ======== was appended to the end of urls.txt, and the indexes of the new documents start from the line that follows it (in urls.txt). These indexes were passed to the scrape_one_url() function that writes the documents, but the new URLs were appended directly to the end of urls_final.txt. This created an index gap, because urls_final.txt did not have the same number of lines as urls.txt at the start of the process. To sum up: the line numbers of the URLs map one-to-one to the final .json files from 0 to 31134; beyond that, one must add 236, i.e., URL index + 236 = .json index (see the helper sketch below).
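
A small helper reflecting this mapping (illustrative, not part of the released code):

def url_index_to_json_index(url_index):
    """Map a line number of urls_final.txt to the index of the matching .json file."""
    # Indexes 0 to 31134 map one-to-one; later documents (added in the second
    # round of scraping) are shifted by 236, per the note above.
    return url_index if url_index <= 31134 else url_index + 236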

Cite

If you use our code or dataset, please cite:

BibTeX

@article{eddine2020barthez,
  title={BARThez: a Skilled Pretrained French Sequence-to-Sequence Model},
  author={Eddine, Moussa Kamal and Tixier, Antoine J-P and Vazirgiannis, Michalis},
  journal={arXiv preprint arXiv:2010.12321},
  year={2020}
}

MLA

Eddine, Moussa Kamal, Antoine J-P. Tixier, and Michalis Vazirgiannis. "BARThez: a Skilled Pretrained French Sequence-to-Sequence Model." arXiv preprint arXiv:2010.12321 (2020).