The hMDS Corpus

The hMDS corpus is a heterogeneous multi-document summarization corpus built with a novel corpus construction approach. It comprises 91 topics from 3 different domains. The guidelines the annotators followed to create the corpus are in the Guidelines.md file.

Reference

If you refer to hMDS in your publications, please cite the corresponding COLING 2016 paper:

@InProceedings{Zopf2016hMDS,
  author    = {Zopf, Markus and Peyrard, Maxime and Eckle-Kohler, Judith},
  title     = {The Next Step for Multi-Document Summarization: A Heterogeneous Multi-Genre Corpus Built with a Novel Construction Approach},
  booktitle = {Proceedings of the 26th International Conference on Computational Linguistics (COLING 2016)},
  month     = {December},
  year      = {2016},
  address   = {Osaka, Japan},
  publisher = {Association for Computational Linguistics},
  pages     = {1535--1545},
  url       = {https://www.ukp.tu-darmstadt.de/fileadmin/user_upload/Group_AIPHES/publications/2016/2016_COLING_hMDS_cameraReady.pdf},
  website   = {https://github.com/AIPHES/hMDS}
}

Obtaining the Corpus

The public parts of the corpus can be found in the hMDS file. Due to copyright restrictions, we cannot make the full corpus directly available: the subfolder "input", described in the readme.txt in the hMDS archive files, is missing. To mitigate this, we added link lists that reference the web pages included in the corpus (see Guidelines.md, step 6 for details), which allow the corpus to be crawled automatically.
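
As a rough illustration of crawling from a link list, the sketch below downloads every listed page into a local folder. It is not part of the hMDS tooling. It assumes a link list is a plain text file with one URL per line, which may not match the actual format described in Guidelines.md, and the function names and the `<index>.html` naming scheme are hypothetical.

```python
from pathlib import Path
from urllib.request import urlopen


def read_link_list(path):
    """Return the non-empty lines of a link list file.

    Assumption: one URL per line; the real hMDS link lists may differ.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def fetch_url(url, timeout=30):
    """Download one page and return its raw bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def crawl(link_list, out_dir, fetch=fetch_url):
    """Save each page from the link list as out_dir/<index>.html.

    Pages that cannot be fetched are reported and skipped, since some
    of the referenced web pages may no longer be online.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, url in enumerate(read_link_list(link_list)):
        try:
            data = fetch(url)
        except (OSError, ValueError) as err:
            # URLError, HTTPError and timeouts are OSError subclasses;
            # ValueError covers malformed URLs.
            print(f"skipped {url}: {err}")
            continue
        target = out / f"{index}.html"
        target.write_bytes(data)
        saved.append(target)
    return saved
```

With a hypothetical link list named `links.txt`, calling `crawl("links.txt", "input")` would write the reachable pages into an `input` folder. The `fetch` parameter exists only so the download step can be swapped out, for example to add rate limiting or to test without network access.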
