
Ground truth data for Page Layout Analysis of Historical Classical Commentaries.


AjaxMultiCommentary/GT-commentaries-layout


Layout Ground Truth for Historical Commentaries

License: CC BY 4.0

Description

GT4HistCommentLayout contains layout annotations for ca. 370 pages sampled from 8 public domain classical commentaries, published in the 19th century in English, German and Latin. The commentaries concern Ancient Greek and Latin works in both prose and poetry (caveat: Ancient Greek poetry is slightly over-represented).

The dataset comes in two formats, alto_xml and json, which contain the same information. For each format, there is one file per page (named after the image's id). Each json file is a single list of region dictionaries, like so:

[
  # The first region : 
  {"label": "MainZone:primaryText", # The segmonto-style label of the region
   "bbox": [ [100, 100], [300, 300] ] # The region's bbox in xyxy format (upperleft x-y, bottomright x-y)
  },
  # The second region:
  # ...
] 
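As a minimal sketch of how such a page file might be consumed, the snippet below parses a list of regions and computes a label count and the area of each bbox. It assumes that the actual files are plain JSON (the `#` comments in the example above being annotations for the reader only). The helper names `load_regions` and `bbox_area` and the second label, `MainZone:commentary`, are hypothetical and used for illustration only.

```python
import json
from collections import Counter


def load_regions(text):
    """Parse one page's JSON text into a list of (label, (x1, y1, x2, y2)) tuples."""
    regions = []
    for region in json.loads(text):
        # bbox is [[upper-left x, y], [bottom-right x, y]], i.e. xyxy format
        (x1, y1), (x2, y2) = region["bbox"]
        regions.append((region["label"], (x1, y1, x2, y2)))
    return regions


def bbox_area(bbox):
    """Return the area of an xyxy bbox in square pixels."""
    x1, y1, x2, y2 = bbox
    return max(0, x2 - x1) * max(0, y2 - y1)


# Hypothetical page content; in practice, read the text from one of the json files.
sample = """
[
  {"label": "MainZone:primaryText", "bbox": [[100, 100], [300, 300]]},
  {"label": "MainZone:commentary", "bbox": [[100, 320], [300, 500]]}
]
"""

regions = load_regions(sample)
label_counts = Counter(label for label, _ in regions)
areas = [bbox_area(bbox) for _, bbox in regions]

for (label, bbox), area in zip(regions, areas):
    print(label, bbox, area)
```

Flattening each bbox into a single `(x1, y1, x2, y2)` tuple is only a convenience; many detection toolkits expect this layout, but the nested form in the files can be kept as-is if preferred.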

⚠️ NB: this dataset contains layout annotations only. A separate dataset with OCR ground truth data exists for classical commentaries, but the sources and annotated pages differ between the two.

An example of an annotated commentary page is shown in the image below. Pages were annotated according to a taxonomy mapped to the SegmOnto controlled vocabulary.

Sources

| ID | Genre | Lang. | Bibl. Ref. |
|---|---|---|---|
| annalsoftacitusp00taci | prose | Latin | H. Furneaux. 1896. The annals of Tacitus. P. Cornelii Taciti annalium ab excessu Divi Augusti libri. Oxford |
| bsb10234118 | poetry | AGreek | Lobeck, Christian August. 1835. Sophoclis Aiax. Leipzig: Weidmann |
| cu31924087948174 | poetry | AGreek | Lewis Campbell. 1881. Sophocles. Oxford : Clarendon Press |
| pvergiliusmaroa00virggoog | poetry | Latin | Norden, E. Aen. 6. 1903. Leipzig: Teubner |
| sophoclesplaysa05campgoog | poetry | AGreek | Jebb, Richard Claverhouse. 1896. Sophocles: The Plays and Fragments. London: Cambridge University Press. |
| sophokle1v3soph | poetry | AGreek | Schneidewin, Friedrich Wilhelm. 1853. Sophokles Erklaert von F. W. Schneidewin. Leipzig: Weidmann |
| thukydides02thuc | prose | AGreek | J. Classen & J. Steup. 1889. Thukydides erklärt, 7., Weidmann: Berlin |
| Wecklein1894 | poetry | AGreek | Wecklein, Nikolaus. 1894. Sophokleus Aias. München: Lindauer |

Statistics

Detailed statistics about the annotated dataset can be found in stats.txt (automatically generated by humGenerator).

Citation

If you use this dataset in your research, please cite the following publication:

@inproceedings{najem-meyer_page-layout-analysis_2022,
  title = {Page {{Layout Analysis}} of {{Text-heavy Historical Documents}}: A {{Comparison}} of {{Textual}} and {{Visual Approaches}}},
  booktitle = {Proceedings of the {{Conference}} on {{Computational Humanities Research}} 2022},
  author = {{Najem-Meyer}, Sven and Romanello, Matteo},
  year = {2022},
  publisher = {{CEUR-WS}},
  address = {{Antwerp}}
}

License

This dataset is released under a Creative Commons Attribution license (CC BY).

Acknowledgements

Data in this repository were produced in the context of the Ajax Multi-Commentary project, funded by the Swiss National Science Foundation under an Ambizione grant PZ00P1_186033.

Contributors: Carla Amaya (UNIL), Sven Najem-Meyer (EPFL), Matteo Romanello (UNIL).