license | pretty_name
---|---
apache-2.0 | English Wiktionary in JSONL
wilhelm-graphdb is a Docker image that hosts Wiktionary language data in a Neo4j graph database. It is part of the effort to scale the Wilhelm project.
Currently, the following languages are loaded into the container database:
- German
- Latin
- Ancient Greek
If a graph database is not the storage format you need, the Wiktionary language data is also available on 🤗 Hugging Face Datasets.
from datasets import load_dataset
dataset = load_dataset("QubitPi/wilhelm-graphdb", split="German")
The available splits are:
- German
- Latin
- AncientGreek
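Note that the split name for Ancient Greek drops the space. A minimal sketch of a lookup that turns a language name into its split name before loading; `SPLITS`, `split_for`, and `load_language` are hypothetical names introduced here for illustration, not part of the `datasets` library or of this repository.

```python
# Language names as listed in this README, mapped to their split names.
SPLITS = {
    "German": "German",
    "Latin": "Latin",
    "Ancient Greek": "AncientGreek",
}


def split_for(language: str) -> str:
    """Return the split name for a language (hypothetical helper)."""
    try:
        return SPLITS[language]
    except KeyError:
        raise ValueError(
            f"No split for {language!r}; choose from {sorted(SPLITS)}"
        ) from None


def load_language(language: str):
    """Load one language's split from Hugging Face (needs network access)."""
    # Imported lazily so that split_for can be used without the library.
    from datasets import load_dataset

    return load_dataset("QubitPi/wilhelm-graphdb", split=split_for(language))
```

Calling `load_language("Ancient Greek")` would then request the `AncientGreek` split.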
This section describes how the Docker image is made. It is recommended to run this process on a remote machine such as AWS EC2, because loading the data completely takes more than a day.
Although the original Wiktionary dump is available, parsing it from scratch is a rather complicated process, which we may take on in the future. For now, we rely on the excellent work of tatuylonen, who has already processed the dump and published it in JSON format. wilhelm-graphdb uses the raw Wiktextract data option (JSONL, one object per line).
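Because JSONL stores one JSON object per line, the data can be streamed line by line instead of being loaded in one piece. The sketch below illustrates the idea on two made-up records; the field names `word`, `lang`, and `pos` are assumed here for illustration and should be checked against the actual Wiktextract output, and `iter_entries` is a hypothetical helper.

```python
import io
import json
from typing import Iterator, TextIO


def iter_entries(lines: TextIO) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Two made-up records standing in for a real dump file (assumed field names).
sample = io.StringIO(
    '{"word": "Haus", "lang": "German", "pos": "noun"}\n'
    '{"word": "domus", "lang": "Latin", "pos": "noun"}\n'
)

# Keep only the German entries.
german_words = [e["word"] for e in iter_entries(sample) if e.get("lang") == "German"]
print(german_words)  # prints ['Haus']
```

For a real file, `iter_entries(open(path, encoding="utf-8"))` would work the same way, reading one entry at a time.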
The setup installs 2 packages:
- Docker Engine
- Python3 virtualenv
./docker-setup.sh
Then log out of the remote machine and back in.
Caution
Python 3.10+ must be installed separately. If the remote server runs Ubuntu 22.04 or above, this requirement should already be satisfied.
Prerequisites:
- Docker
- Python 3.10
git clone git@github.com:QubitPi/wilhelm-graphdb.git
cd wilhelm-graphdb
Create the following environment variables:
- DOCKERHUB_USERNAME: The value for the -u argument as used in the Docker login command
- DOCKERHUB_TOKEN: The value for the -p argument as used in the Docker login command
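Since the load runs for a long time, it can help to confirm that both variables are set before starting. A minimal pre-flight sketch; `REQUIRED` and `missing_variables` are hypothetical names, and this check is not part of the repository's scripts.

```python
import os

# The two variables this README asks for.
REQUIRED = ("DOCKERHUB_USERNAME", "DOCKERHUB_TOKEN")


def missing_variables(env=None) -> list:
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]


missing = missing_variables()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```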
Caution
The script overwrites the following environment variables with the specified values if they are already defined locally:
- NEO4J_URI: neo4j://localhost:7687
- NEO4J_USERNAME: neo4j
The loading takes several hours; please be patient.
nohup ./docker-load.sh > load.log &
- The loading log can be found at ./load.log
- The database UI can now be accessed at http://localhost:7474, which shows how much data has been loaded
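Progress can also be checked from a script rather than the UI. The following is a sketch under stated assumptions: it uses the official `neo4j` Python driver (installed separately), falls back to the URI and username given above, and reads the password from a `NEO4J_PASSWORD` variable, which is an assumption and not something this README defines. `connection_settings` and `count_nodes` are hypothetical helper names.

```python
import os


def connection_settings(env=None) -> dict:
    """Collect connection settings, falling back to the values listed above."""
    env = os.environ if env is None else env
    return {
        "uri": env.get("NEO4J_URI", "neo4j://localhost:7687"),
        "user": env.get("NEO4J_USERNAME", "neo4j"),
        # Assumption: the password is supplied via NEO4J_PASSWORD.
        "password": env.get("NEO4J_PASSWORD", ""),
    }


def count_nodes(settings: dict) -> int:
    """Return the number of nodes currently stored in the database."""
    # Imported lazily so the settings helper works without the driver installed.
    from neo4j import GraphDatabase

    auth = (settings["user"], settings["password"])
    with GraphDatabase.driver(settings["uri"], auth=auth) as driver:
        with driver.session() as session:
            record = session.run("MATCH (n) RETURN count(n) AS nodes").single()
            return record["nodes"]
```

Running `count_nodes(connection_settings())` at intervals while the loader is active gives a rough sense of how quickly the node count grows.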
docker cp neo4j-loader:/data .
docker build -t jack20191124/wilhelm-graphdb:neo4j .
docker login -u $DOCKERHUB_USERNAME -p $DOCKERHUB_TOKEN
docker push jack20191124/wilhelm-graphdb:neo4j
pip3 uninstall wilhelm_python_sdk
pip3 install --upgrade --force-reinstall wilhelm-python-sdk
The use and distribution terms for wilhelm-graphdb are covered by the Apache License, Version 2.0.