Word-Lemma-Lists and Lexica used by PTA

The folder sources contains the source files of the lexica used; for more information, see the README.md in that folder.

The folder scripts contains the Python scripts used to generate all files.

Greek

The repository has the following files at the moment:

Bailly2020.4a.json, LSJ.json, Pape.json, TBESG.json: Greek dictionaries compiled from the respective source files in sources.
pta_lexicon_grc.json:
- compiled from LSJ, TBESG, and Pape
- has: lemma – grc_eng – grc_eng2 – grc_deu
- grc_eng = LSJ, grc_eng2 = TBESG, grc_deu = Pape
- grc_eng and grc_deu are lists, as there are homonymous lemmata.
- If a lemma has no entry in one of the dictionaries, the corresponding field is empty.
The folder pta_lexicon_grc contains an XML version of the above.
wordlemma_grc_cltk.json:
- result of lemmatizing all Greek texts in pta_data; it currently has 133,438 entries. Lemmatization was done using the Classical Language Toolkit (CLTK).
- has word - lemma - POS - morphology (according to the Universal Dependencies (UD) project)
wordlemma_grc.json (outdated):
- result of lemmatizing part of the texts in pta_data; it has 42,346 entries. Lemmatization was done using the Morpheus morphological analysis engine used at morph.perseids.org.
- has word – lemma – morphology
- words that could not be lemmatized (for whatever reason) are not in the file.
wordlemma_grc.xml (outdated):
- XML version of the file above
wordlemma_grc_diogenes.json:
- morphology data from Diogenes; Greek is converted from Betacode to UTF-8.
- has word - lemma (list of possible morphology)
The dictionary files are JSON versions of the lexica in the sources folder, adapted for use in PTA; the folder pta_dictionaries contains XML versions of these.
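
To illustrate the record structure described for pta_lexicon_grc.json, here is a minimal Python sketch. It is not taken from the scripts folder. The field names (lemma, grc_eng, grc_eng2, grc_deu) come from the description above; the top-level layout (a list of records), the sample values, and the helper names build_index and glosses are assumptions made for illustration only.

```python
import json

# Hypothetical sample records; the field names follow the description above,
# but the top-level layout (a list of records) and all values are assumptions.
SAMPLE_JSON = """
[
  {"lemma": "lemma_a",
   "grc_eng": ["LSJ text, homonym 1", "LSJ text, homonym 2"],
   "grc_eng2": "TBESG text",
   "grc_deu": ["Pape text"]},
  {"lemma": "lemma_b",
   "grc_eng": [],
   "grc_eng2": "",
   "grc_deu": ["Pape text, homonym 1", "Pape text, homonym 2"]}
]
"""

# Field name -> dictionary it was compiled from (as stated above).
FIELD_SOURCES = {"grc_eng": "LSJ", "grc_eng2": "TBESG", "grc_deu": "Pape"}


def build_index(records):
    """Map each lemma to its record for direct lookup."""
    return {record["lemma"]: record for record in records}


def glosses(index, lemma):
    """Return {dictionary name: list of entries}, skipping empty fields."""
    record = index.get(lemma, {})
    found = {}
    for field, source in FIELD_SOURCES.items():
        value = record.get(field)
        if not value:  # empty string or empty list: no entry in that dictionary
            continue
        # Normalize to a list so homonyms and single entries look the same.
        found[source] = value if isinstance(value, list) else [value]
    return found


index = build_index(json.loads(SAMPLE_JSON))
print(glosses(index, "lemma_a"))
print(glosses(index, "lemma_b"))
```

Normalizing every field to a list means a caller does not have to distinguish the list-valued fields (grc_eng, grc_deu) from a plain string; if the real file stores grc_eng2 differently, only the normalization line needs to change.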

Latin

georges_lat.json: compiled from the respective source file in sources.
LewisShort.json: tbd
TLL.json:
- built from https://publikationen.badw.de/de/api/thesaurus/html-xml/thesaurus/index.json
- has lemma - URL of the entry in THESAVRVS LINGVAE LATINAE Open Access
wordlemma_lat_cltk.json:
- result of lemmatizing all Latin texts in pta_data; it currently has 3022 entries. Lemmatization was done using the Classical Language Toolkit (CLTK).
- has word - lemma - POS - morphology (according to the Universal Dependencies (UD) project)
wordlemma_lat_diogenes.json:
- morphology data from Diogenes
- has word - lemma (list of possible morphology)
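
As a rough sketch of how a word-lemma list and TLL.json might be combined, the Python example below resolves a word form to its lemma, part of speech, and morphology, and then looks up the lemma's TLL URL. The field names word, lemma, POS, and morphology come from the description above. Everything else is an assumption: the JSON layout, the morphology being stored as a UD feature string of the form Case=Acc|Number=Sing, the sample entries, the placeholder URL, and the helper names.

```python
from collections import defaultdict

# Hypothetical sample data: field names follow the description above, but
# the exact JSON layout, the feature-string format and the URL are assumptions.
WORDLEMMA = [
    {"word": "amat", "lemma": "amo", "POS": "VERB",
     "morphology": "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act"},
    {"word": "rosam", "lemma": "rosa", "POS": "NOUN",
     "morphology": "Case=Acc|Gender=Fem|Number=Sing"},
]
TLL = {"amo": "https://example.org/tll/amo"}  # placeholder, not a real TLL URL


def parse_features(features):
    """Split a UD-style string such as 'Case=Acc|Number=Sing' into a dict."""
    if not features or features == "_":
        return {}
    return dict(pair.split("=", 1) for pair in features.split("|"))


def build_word_index(entries):
    """Group entries by word form; one form may have several analyses."""
    by_word = defaultdict(list)
    for entry in entries:
        by_word[entry["word"]].append(entry)
    return by_word


def analyses(word, by_word, tll):
    """Return lemma, POS, parsed features and TLL URL (or None) per analysis."""
    return [
        {
            "lemma": entry["lemma"],
            "pos": entry["POS"],
            "features": parse_features(entry["morphology"]),
            "tll_url": tll.get(entry["lemma"]),
        }
        for entry in by_word.get(word, [])
    ]


word_index = build_word_index(WORDLEMMA)
print(analyses("rosam", word_index, TLL))
```

Grouping by word form keeps ambiguous forms intact, which also matches the Diogenes files, where one word can carry a list of possible analyses.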
