The folder sources contains the source files of the lexica used; for more information, see the README.md in that folder.
The folder scripts contains the Python scripts used to generate all files.
The repository has the following files at the moment:

Bailly2020.4a.json, LSJ.json, Pape.json, TBESG.json:
- Greek dictionaries compiled from the respective source files in sources.

pta_lexicon_grc.json:
- compiled from LSJ, TBESG, and Pape
- has: lemma – grc_eng – grc_eng2 – grc_deu
- grc_eng = LSJ, grc_eng2 = TBESG, grc_deu = Pape
- grc_eng and grc_deu are lists, as there are homonymous lemmata.
- If a lemma has no entry in one of the dictionaries, the corresponding field is empty.
- The folder pta_lexicon_grc contains an XML version of the above.

wordlemma_grc_cltk.json:
- result of lemmatizing all Greek texts in pta_data; it currently has 133,438 entries. Lemmatization was done using the Classical Language Toolkit (CLTK).
- has word - lemma - POS - morphology (according to the Universal Dependencies (UD) project)
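
A minimal sketch of how a word-to-lemma file of this kind might be queried. The layout used here (a JSON object keyed by word form, each value holding `lemma`, `pos`, and `morph`) is an assumption made for illustration only; the actual keys and nesting in wordlemma_grc_cltk.json may differ, and the sample entry is invented.

```python
import json

# Hypothetical sample in the ASSUMED layout; the real file may be structured differently.
SAMPLE = """
{
  "λόγου": {"lemma": "λόγος", "pos": "NOUN", "morph": "Case=Gen|Gender=Masc|Number=Sing"}
}
"""


def lookup(wordlemma, word):
    """Return (lemma, pos, morph) for a word form, or None if the form is not listed."""
    entry = wordlemma.get(word)
    if entry is None:
        return None
    return entry["lemma"], entry["pos"], entry["morph"]


wordlemma = json.loads(SAMPLE)
known_form = next(iter(wordlemma))        # the single sample word form
print(lookup(wordlemma, known_form))      # analysis of a listed form
print(lookup(wordlemma, "not-in-file"))   # forms missing from the file yield None
```

Returning None for unlisted forms leaves it to the caller to decide how to treat words the lemmatizer did not cover.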

wordlemma_grc.json (outdated):
- result of lemmatizing part of the texts in pta_data; it has 42,346 entries. Lemmatization was done using the Morpheus morphological analysis engine used at morph.perseids.org.
- has word – lemma – morphology
- words that have not been lemmatized (for whatever reason) are not in the file.

wordlemma_grc.xml (outdated):
- XML version of the file above

wordlemma_grc_diogenes.json:
- morphology data from Diogenes; Greek is converted to UTF-8 (from Betacode).
- has word - lemma (list of possible morphology)
- JSON versions of the lexica in the sources folder, adapted for use in PTA; the folder pta_dictionaries contains XML versions of these.
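
To make the combined layout of pta_lexicon_grc.json described earlier (lemma, grc_eng, grc_eng2, grc_deu) more concrete, here is a hedged sketch of how one record could be assembled from the three dictionaries. The function name `build_entry`, the in-memory dictionary shapes, and the choice of an empty list or empty string for missing entries are all assumptions; the scripts in this repository may do this differently. The sample values are placeholders, not real dictionary text.

```python
def build_entry(lemma, lsj, tbesg, pape):
    """Combine per-dictionary lookups into one record (ASSUMED layout)."""
    return {
        "lemma": lemma,
        "grc_eng": lsj.get(lemma, []),     # LSJ: a list, since lemmata can be homonymous
        "grc_eng2": tbesg.get(lemma, ""),  # TBESG: left empty when there is no entry
        "grc_deu": pape.get(lemma, []),    # Pape: a list, since lemmata can be homonymous
    }


# Hypothetical, abbreviated inputs with placeholder text instead of real entries.
lsj = {"λόγος": ["<LSJ entry text>"]}
tbesg = {}  # no TBESG entry for this lemma
pape = {"λόγος": ["<Pape entry text>"]}

lemma = next(iter(lsj))
entry = build_entry(lemma, lsj, tbesg, pape)
print(entry)
```

Keeping every field present, even when empty, means consumers can read all four keys without checking which dictionaries covered a given lemma.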

georges_lat.json:
- compiled from the respective source file in sources.

LewisShort.json:
- tbd

TLL.json:
- built from https://publikationen.badw.de/de/api/thesaurus/html-xml/thesaurus/index.json
- has lemma - URL of entry in THESAVRVS LINGVAE LATINAE Open Access

wordlemma_lat_cltk.json:
- result of lemmatizing all Latin texts in pta_data; it currently has 3022 entries. Lemmatization was done using the Classical Language Toolkit (CLTK).
- has word - lemma - POS - morphology (according to the Universal Dependencies (UD) project)

wordlemma_lat_diogenes.json:
- morphology data from Diogenes
- has word - lemma (list of possible morphology)