🤗 Wiktionary-sourced vocabulary stored in graph database as rich vocabulary visualization data source

QubitPi/wilhelm-graphdb

license: apache-2.0
pretty_name: English Wiktionary in JSONL
language: en, de, grc, la
configs:
  config_name: Languages
  data_files:
    split: German, path: german-wiktextract-data.jsonl
    split: Latin, path: latin-wiktextract-data.jsonl
    split: AncientGreek, path: ancient-greek-wiktextract-data.jsonl
tags: Wiktionary, German, Ancient Greek, Latin, Vocabulary
size_categories: 1M<n<10M

Wilhelm GraphDB - Visualizing Wiktionary in Graph Database

wilhelm-graphdb is a Docker image that hosts Wiktionary language data in a Neo4j graph database. It is part of the effort to scale project Wilhelm.

Currently, the following languages are loaded into the container database:

  1. German
  2. Latin
  3. Ancient Greek

🤗 Hugging Face Datasets

If a graph database is not the storage format you need, the Wiktionary language data is also available on 🤗 Hugging Face Datasets.

from datasets import load_dataset
dataset = load_dataset("QubitPi/wilhelm-graphdb", split="German")

The available splits are

  • German
  • Latin
  • AncientGreek
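
To guard against a mistyped split name before starting a large download, the call above can be wrapped in a small helper. This is only a sketch: `SPLITS` and `load_split` are hypothetical names introduced here, and it assumes the `datasets` package is installed.

```python
# Split names as listed in this README.
SPLITS = ("German", "Latin", "AncientGreek")


def load_split(name):
    """Load one split of the dataset, rejecting names not listed in this README."""
    if name not in SPLITS:
        raise ValueError(f"Unknown split {name!r}; expected one of {SPLITS}")
    # Imported lazily so the name check works even without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("QubitPi/wilhelm-graphdb", split=name)
```

For example, `load_split("Latin")` downloads the Latin split, while `load_split("French")` raises a `ValueError` without touching the network.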

Development

Docker

This section describes how the Docker image is built. It is recommended to run this process on a remote machine such as an AWS EC2 instance, because loading the data completely takes more than a day.

Although the original Wiktionary dump is available, parsing it from scratch is a rather complicated process, which we may take on in the future. For now, we simply build on the excellent work by tatuylonen, who has already processed the dump and published it in JSON format. wilhelm-graphdb uses the raw Wiktextract data (JSONL, one object per line) option.
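
Because the data is JSONL, each line can be parsed on its own without reading the whole file into memory. The following is a minimal sketch using only the Python standard library, not the loader this project actually uses. The `pos` field name is an assumption about the Wiktextract record layout, and `iter_entries` and `count_by_pos` are hypothetical helper names.

```python
import json
from collections import Counter
from typing import Iterator


def iter_entries(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def count_by_pos(path: str) -> Counter:
    """Count entries per part of speech ("pos" is an assumed field name)."""
    return Counter(entry.get("pos", "unknown") for entry in iter_entries(path))
```

Calling `count_by_pos("german-wiktextract-data.jsonl")` would then give a rough overview of the German file before any of it is loaded into the database.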

VM Setup

The setup installs 2 packages:

  • Docker Engine
  • Python3 virtualenv

./docker-setup.sh

Then log out of the remote machine and back in.

Caution

Python 3.10+ must be installed separately. If the remote server runs Ubuntu 22.04 or above, this requirement should already be satisfied.

Creating Docker Image

Prerequisite:

  • Docker
  • Python 3.10

git clone git@github.com:QubitPi/wilhelm-graphdb.git
cd wilhelm-graphdb

Create the following environment variables:

Caution

The script overwrites the following environment variables with the specified values if they are already defined locally:

  • NEO4J_URI: neo4j://localhost:7687
  • NEO4J_USERNAME: neo4j

Loading takes several hours; please be patient.

nohup ./docker-load.sh > load.log &

  • The loading log can be found at ./load.log
  • The database UI can now be accessed at http://localhost:7474 which shows how much data has been loaded
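
Besides the browser UI, loading progress can be checked from a script by counting the nodes currently in the database. This is a sketch under stated assumptions: it uses the official `neo4j` Python driver (`pip install neo4j`), which this README does not mention, and `NEO4J_PASSWORD` is a hypothetical variable name. The defaults for `NEO4J_URI` and `NEO4J_USERNAME` are the values listed above.

```python
import os


def neo4j_settings(env=os.environ):
    """Collect connection settings, defaulting to the values in this README."""
    return {
        "uri": env.get("NEO4J_URI", "neo4j://localhost:7687"),
        "username": env.get("NEO4J_USERNAME", "neo4j"),
        # Hypothetical variable name; adjust to however the password is supplied.
        "password": env.get("NEO4J_PASSWORD", ""),
    }


def count_nodes(settings):
    """Return the total number of nodes currently stored in the database."""
    # Imported lazily so the settings helper works without the driver installed.
    from neo4j import GraphDatabase

    auth = (settings["username"], settings["password"])
    with GraphDatabase.driver(settings["uri"], auth=auth) as driver:
        with driver.session() as session:
            record = session.run("MATCH (n) RETURN count(n) AS total").single()
            return record["total"]
```

Running `count_nodes(neo4j_settings())` at intervals while `docker-load.sh` is still running should show the total growing.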
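
Once loading has finished, copy the database files out of the neo4j-loader container and build the image:

docker cp neo4j-loader:/data .
docker build -t jack20191124/wilhelm-graphdb:neo4j .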

echo "$DOCKERHUB_PASSWORD" | docker login -u "$DOCKERHUB_USERNAME" --password-stdin
docker push jack20191124/wilhelm-graphdb:neo4j

Troubleshooting

Reinstalling wilhelm_python_sdk

pip3 uninstall wilhelm_python_sdk
pip3 install --upgrade --force-reinstall wilhelm-python-sdk

License

The use and distribution terms for wilhelm-graphdb are covered by the Apache License, Version 2.0.
