A curated list of resources dedicated to Biblical Natural Language Processing.
Contribute your favorite Biblical NLP resource by raising a pull request! Please read the contribution guidelines first.
- ebible Parallel Data: Curated corpus of parallel data derived from translations of the Bible provided by eBible.org.
- Biblical Humanities Corpus | on HuggingFace: Collection of open-licensed sentence-level (verse) aligned bi-text in several languages.
- Vachan Data Corpus: Collection of 12 minority language New Testament translations from Northern India as sentence-level (verse) aligned bi-text.
- OPUS Bible Corpus (bible-uedin): Collection of aligned bi-texts based on the Bible in 102 languages. [Paper]
- Snow Mountain Dataset: Open-licensed and formatted dataset of audio recordings of the Bible in low-resource Indian languages.
- Macula Hebrew | Greek: Open-licensed and curated dataset of the Bible in Hebrew and Greek with various connected meta resources (e.g., syntax trees, glosses, semantic roles).
- utoken: Universal tokenizer, available as a Python package and a CLI, that is also tested on Biblical text.
- uroman: Universal Romanizer that converts text in any Unicode script to the Roman (Latin) script.
- SIL Machine | Python version | JavaScript Version: Toolkit for various NLP operations on Biblical content, with particular support for Paratext projects.
- Wildebeest: Investigates, repairs, and normalizes text for a wide range of character-level issues. Tested especially on Biblical content.