The ELTE Novel Corpus is a continuously expanding database developed by the Department of Digital Humanities at Eötvös Loránd University. Currently, the corpus contains 400 Hungarian novels. In addition to the texts, it provides annotation of structural units and of the grammatical features of words, encoded in TEI XML. The novels date from the 19th century and the first half of the 20th century.
The figures below describe the level2 novels. The set of level1 novels is currently being expanded.
- number of novels: 400
- number of authors: 119
- number of tokens: 26.8 million
- number of words: 21.4 million
The level1_metadata.tsv file contains the main metadata for the level1 novels, and the level2_metadata.tsv file contains the main metadata for the level2 novels. WARNING: Since the set of level1 novels is currently being expanded, the level1_metadata.tsv file may not be up to date; that is, some recently added novels may be missing from it.
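Because the level1 metadata may lag behind the files, it can help to check which novels have no metadata row. The following is a minimal sketch, not part of the corpus tooling. The column name `id` and the sample rows are hypothetical, since the actual header of the TSV files is not described here.

```python
import csv
import io


def read_metadata(handle):
    """Read a tab-separated metadata file into a list of dicts (one per novel)."""
    return list(csv.DictReader(handle, delimiter="\t"))


def missing_from_metadata(file_ids, rows, id_column):
    """Return the identifiers that have a file but no row in the metadata."""
    known = {row[id_column] for row in rows}
    return sorted(set(file_ids) - known)


# Hypothetical sample; the real files may use different column names.
sample = "id\tauthor\ttitle\nX1\tAuthor A\tTitle A\n"
rows = read_metadata(io.StringIO(sample))
missing = missing_from_metadata(["X1", "X2"], rows, id_column="id")
```

For a real file, `read_metadata(open("level1_metadata.tsv", encoding="utf-8", newline=""))` would be the corresponding call, assuming the file is UTF-8 encoded.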
The source of the corpus was the collection of the Hungarian Electronic Library.
- The texts from the Hungarian Electronic Library were converted into TEI XML, the format defined by the Text Encoding Initiative. The TEI XML files contain the annotation of structural units and the metadata of the novels. The conversion was partly done manually (level1).
- Then, we tokenized the novels and annotated the grammatical features of words using e-magyar, an NLP toolchain for Hungarian texts (level2).
<ns1:authorGender/>
: sex of the author
M
: male
F
: female
<ns1:size/>
: size of the novel
short
: 10 000 -- 49 999 words
medium
: 50 000 -- 99 999 words
long
: more than 100 000 words
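The size categories can be expressed as a small lookup. This is only a sketch of the thresholds, with two assumptions: the medium band starts at 50 000 words, and a novel of exactly 100 000 words counts as long. Texts under 10 000 words fall outside the listed categories, so the function returns `None` for them.

```python
def size_category(word_count):
    """Map a word count to the size label used in the header."""
    if word_count < 10_000:
        return None  # below the smallest listed category
    if word_count < 50_000:
        return "short"
    if word_count < 100_000:
        return "medium"
    return "long"  # assumption: exactly 100 000 words counts as long
```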
<ns1:canonicity/>
: canonicity level of the novel
low
: 0 or 1 edition after 1979
high
: 2 or more editions after 1979
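The canonicity rule reduces to a single comparison. The sketch below assumes the number of editions published after 1979 is already known. The function name is illustrative only.

```python
def canonicity(editions_after_1979):
    """Return 'low' for 0 or 1 edition after 1979, 'high' for 2 or more."""
    if editions_after_1979 < 0:
        raise ValueError("edition count cannot be negative")
    return "high" if editions_after_1979 >= 2 else "low"
```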
<ns1:timeSlot/>
: time period of the first edition of the novel
T0
: before 1840
T1
: 1840--1860
T2
: 1860--1880
T3
: 1880--1900
T4
: 1900--1920
T5
: after 1920
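The listed periods share their boundary years (1860 appears in both T1 and T2, for example), so any implementation has to choose a convention. The sketch below assumes that a boundary year belongs to the later slot, which also places 1920 in T5. This is an assumption of the example, not a documented rule of the corpus.

```python
from bisect import bisect_right

# Lower bounds of T1..T5; years before 1840 fall into T0.
SLOT_BOUNDS = [1840, 1860, 1880, 1900, 1920]


def time_slot(year):
    """Map the year of the first edition to a slot label T0..T5.

    Assumption: a boundary year is assigned to the later slot.
    """
    return f"T{bisect_right(SLOT_BOUNDS, year)}"
```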
<head>
: title
<div>
: part, chapter
<milestone>
: delimiter of subchapters
<p>
: paragraph
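As an illustration of how these structural elements can be read, the sketch below lists each `<div>` with its title and the number of paragraphs directly inside it. It assumes the files use the standard TEI namespace. The sample document and the `unit` value on `<milestone>` are invented for the example and are not taken from the corpus.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample document, not an excerpt from the corpus.
SAMPLE_LEVEL1 = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div><head>Chapter 1</head><p>First.</p><milestone unit="subchapter"/><p>Second.</p></div>
<div><head>Chapter 2</head><p>Third.</p></div>
</body></text></TEI>"""


def chapter_outline(root):
    """Return (title, paragraph count) for every <div> in document order."""
    outline = []
    for div in root.iterfind(".//tei:div", TEI_NS):
        head = div.find("tei:head", TEI_NS)
        title = "".join(head.itertext()).strip() if head is not None else None
        # Count only paragraphs that are direct children of this <div>.
        n_paragraphs = len(div.findall("tei:p", TEI_NS))
        outline.append((title, n_paragraphs))
    return outline


outline = chapter_outline(ET.fromstring(SAMPLE_LEVEL1))
```

With nested divisions (a part containing chapters), the outer `<div>` is reported with only its own direct paragraphs.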
<s>
: sentence
<w>
: word
<pc>
: punctuation mark
@lemma
: lemma
@pos
: part of speech
@msd
: morphosyntactic features (Universal Dependencies)
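The token-level annotation can be read with the standard library alone. The following sketch walks the sentences and yields one record per `<w>` or `<pc>` child. It assumes the standard TEI namespace and that `lemma`, `pos` and `msd` are attributes of `<w>`. The sample sentence and its attribute values are made up for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample sentence, not an excerpt from the corpus.
SAMPLE_LEVEL2 = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><div><p>
<s>
  <w lemma="kutya" pos="NOUN" msd="Case=Nom|Number=Sing">kutya</w>
  <w lemma="ugat" pos="VERB" msd="Number=Sing|Person=3|Tense=Pres">ugat</w>
  <pc>.</pc>
</s>
</p></div></body></text></TEI>"""


def iter_tokens(root):
    """Yield a dict for every <w> and <pc> that is a direct child of an <s>."""
    for sentence in root.iterfind(".//tei:s", TEI_NS):
        for child in sentence:
            kind = child.tag.rsplit("}", 1)[-1]  # strip the namespace part
            if kind not in ("w", "pc"):
                continue
            yield {
                "kind": kind,
                "form": child.text,
                "lemma": child.get("lemma"),  # None for <pc> in this sample
                "pos": child.get("pos"),
                "msd": child.get("msd"),
            }


tokens = list(iter_tokens(ET.fromstring(SAMPLE_LEVEL2)))
words = [t for t in tokens if t["kind"] == "w"]
```

Keeping `<w>` and `<pc>` in one stream with a `kind` field makes it easy to count words and punctuation marks separately.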
The folder contains the level1 and level2 files with headers in the ELTeC format. These files are not valid TEI, so we do not recommend using them.
- Gábor Palkó
- Tímea Borbála Bajzát
- Emma Takács
- Bence Vétek
- Zsófia Fellegi
- Péter Horváth
- Balázs Indig
- Bence Vida
- Botond Szemes
- Eszter Szlávich
The content of the repository is licensed under the CC BY-NC-ND license.
All texts of the corpus are in the public domain.