Commit

… into documentation
sshivam95 committed Mar 20, 2024
2 parents 0bd7f22 + 2fa2dcc commit ddcd192
Showing 52 changed files with 3,016 additions and 1,698 deletions.
9 changes: 5 additions & 4 deletions .github/workflows/github-actions-python-package.yml
@@ -7,7 +7,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.9.18"]
python-version: ["3.10.13"]

steps:
- uses: actions/checkout@v3
@@ -18,11 +18,12 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip3 install .
pip install -e .["dev"]
- name: Lint with ruff
run: |
ruff dicee/
ruff --select=E501 --line-length=200 dicee/
- name: Test with pytest
run: |
wget https://files.dice-research.org/datasets/dice-embeddings/KGs.zip --no-check-certificate && unzip KGs.zip
pytest -p no:warnings -x
python -m pytest -p no:warnings -x
79 changes: 51 additions & 28 deletions .github/workflows/sphinx.yml
@@ -1,33 +1,56 @@
# This is a basic workflow to help you get started with Actions
name: Build docs

name: Build-sphinx-docs
on:
push:
branches:
- main
- develop
- documentation # just for testing
pull_request:

on: [push,pull_request]

# A workflow run is made up of one or more jobs that can run sequentially or in parallel
jobs:
# This workflow contains a single job called "build"
build:
# The type of runner that the job will run on
jobs:
docs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: [ "3.10.11" ]
max-parallel: 5

# Steps represent a sequence of tasks that will be executed as part of the job
steps:
# Checks-out your repository under $GITHUB_WORKSPACE, so your job can access it
- uses: actions/checkout@v2

- name: Set up Python 3.9.18
uses: actions/setup-python@v2
with:
python-version: "3.9.18"

- name: Install dependencies
run: |
python -m pip install --upgrade pip
# # pip install -r requirements.txt

- name: Build HTML and import
run: |
# sphinx-apidoc -o docs dicee/ && make -C docs/ html && ghp-import -n -p -f docs/_build/html


- uses: actions/checkout@v3
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}

- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -r requirements.txt
- name: Prepare required software
run: |
# epstopdf & dot & noto-fonts
sudo apt update && sudo apt install texlive-font-utils graphviz fonts-noto\
- name: Build docs
run: |
sphinx-build -M html docs/ docs/_build/
- name: Build LaTeX docs
run: |
sphinx-build -M latex docs/ docs/_build/
- name: Compile LaTeX document
uses: docker://texlive/texlive:latest
with:
args: make -C docs/_build/latex
- name: Copy Latex pdf to ./html
run: |
cp docs/_build/latex/diceembeddings.pdf docs/_build/html/
- name: Deploy to GitHub Pages
uses: peaceiris/actions-gh-pages@v3
with:
github_token: ${{ secrets.GITHUB_TOKEN }}
publish_dir: 'docs/_build/html'
176 changes: 63 additions & 113 deletions README.md
@@ -32,25 +32,25 @@ Deploy a pre-trained embedding model without writing a single line of code.
## Installation
<details><summary> Click me! </summary>

### Instalation from Source
### Installation from Source
``` bash
git clone https://github.com/dice-group/dice-embeddings.git
conda create -n dice python=3.9.18 --no-default-packages && conda activate dice && cd dice-embeddings &&
pip3 install .
conda create -n dice python=3.10.13 --no-default-packages && conda activate dice && cd dice-embeddings &&
pip3 install -e .
```
or
```bash
pip install dicee==0.1.3
pip install dicee
```
## Download Knowledge Graphs
```bash
wget https://files.dice-research.org/datasets/dice-embeddings/KGs.zip --no-check-certificate && unzip KGs.zip
```
To test the installation
```bash
pytest -p no:warnings -x # Runs >114 tests leading to > 15 mins
pytest -p no:warnings --lf # run only the last failed test
pytest -p no:warnings --ff # to run the failures first and then the rest of the tests.
python -m pytest -p no:warnings -x # Runs >114 tests leading to > 15 mins
python -m pytest -p no:warnings --lf # run only the last failed test
python -m pytest -p no:warnings --ff # to run the failures first and then the rest of the tests.
```

</details>
@@ -95,24 +95,36 @@ A KGE model can also be trained from the command line
```bash
dicee --dataset_dir "KGs/UMLS" --model Keci --eval_model "train_val_test"
```
Models can be easily trained in a single-node multi-GPU setting with pytorch-lightning.
dicee automatically detects available GPUs and trains a model with the distributed data parallel technique.
```bash
dicee --accelerator "gpu" --strategy "ddp" --dataset_dir "KGs/UMLS" --model Keci --eval_model "train_val_test"
# Train a model by only using the GPU-0
CUDA_VISIBLE_DEVICES=0 dicee --dataset_dir "KGs/UMLS" --model Keci --eval_model "train_val_test"
# Train a model by only using GPU-1
CUDA_VISIBLE_DEVICES=1 dicee --dataset_dir "KGs/UMLS" --model Keci --eval_model "train_val_test"
# Train a model by using all available GPUs
dicee --dataset_dir "KGs/UMLS" --model Keci --eval_model "train_val_test"
```
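The same run can also be started from Python. Below is a minimal sketch; it assumes that `dicee.config.Namespace` accepts the same attribute names as the CLI flags (`dataset_dir`, `trainer`, `eval_model`), which is not shown in this diff.
```python
# Minimal sketch: multi-GPU training via the Python API.
# Assumption: Namespace mirrors the CLI flags used above (dataset_dir, trainer, eval_model).
from dicee.executer import Execute
from dicee.config import Namespace

args = Namespace()
args.model = 'Keci'
args.dataset_dir = 'KGs/UMLS'
args.eval_model = 'train_val_test'
args.trainer = 'PL'        # lightning trainer; picks up all visible GPUs, as in the CLI examples
args.embedding_dim = 32
args.num_epochs = 100
result = Execute(args).start()
print(result['path_experiment_folder'])
```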
Under the hood, dicee executes the run.py script and uses [lightning](https://lightning.ai/) as a default trainer.
```bash
# Two equivalent executions
# (1)
dicee --dataset_dir "KGs/UMLS" --model Keci --eval_model "train_val_test"
# (2)
CUDA_VISIBLE_DEVICES=0,1 python dicee/scripts/run.py --trainer PL --dataset_dir "KGs/UMLS" --model Keci --eval_model "train_val_test"
```

Similarly, models can be easily trained with torchrun
```bash
torchrun --standalone --nnodes=1 --nproc_per_node=gpu main.py
torchrun --standalone --nnodes=1 --nproc_per_node=gpu dicee/scripts/run.py --trainer torchDDP --dataset_dir "KGs/UMLS" --model Keci --eval_model "train_val_test"
```
You can also train a model in a multi-node multi-GPU setting.
```bash
torchrun --nnodes 2 --nproc_per_node=gpu --node_rank 0 --rdzv_id 455 --rdzv_backend c10d --rdzv_endpoint=nebula -m dicee.run --trainer torchDDP --dataset_dir KGs/UMLS
torchrun --nnodes 2 --nproc_per_node=gpu --node_rank 1 --rdzv_id 455 --rdzv_backend c10d --rdzv_endpoint=nebula -m dicee.run --trainer torchDDP --dataset_dir KGs/UMLS
torchrun --nnodes 2 --nproc_per_node=gpu --node_rank 0 --rdzv_id 455 --rdzv_backend c10d --rdzv_endpoint=nebula dicee/scripts/run.py --trainer torchDDP --dataset_dir KGs/UMLS
torchrun --nnodes 2 --nproc_per_node=gpu --node_rank 1 --rdzv_id 455 --rdzv_backend c10d --rdzv_endpoint=nebula dicee/scripts/run.py --trainer torchDDP --dataset_dir KGs/UMLS
```
Train a KGE model by providing the path of a single file and store all parameters under a newly created directory
called `KeciFamilyRun`.
```bash
dicee --path_single_kg "KGs/Family/family-benchmark_rich_background.owl" --model Keci --path_to_store_single_run KeciFamilyRun --backend rdflib
dicee --path_single_kg "KGs/Family/family-benchmark_rich_background.owl" --model Keci --path_to_store_single_run KeciFamilyRun --backend rdflib --eval_model None
```
where the data is in the following form
```bash
@@ -121,6 +133,11 @@ _:1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07
<http://www.benchmark.org/family#hasChild> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#ObjectProperty> .
<http://www.benchmark.org/family#hasParent> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#ObjectProperty> .
```
**Continual Training:** the training phase of a pretrained model can be resumed.
```bash
dicee --continual_learning KeciFamilyRun --path_single_kg "KGs/Family/family-benchmark_rich_background.owl" --model Keci --path_to_store_single_run KeciFamilyRun --backend rdflib --eval_model None
```

**Apart from n-triples or standard link prediction dataset formats, we support ["owl", "nt", "turtle", "rdf/xml", "n3"].**
Moreover, a KGE model can also be trained by providing **an endpoint of a triple store**.
```bash
@@ -129,114 +146,41 @@ dicee --sparql_endpoint "http://localhost:3030/mutagenesis/" --model Keci
For more, please refer to `examples`.
</details>

## Embedding Vector Database
## Creating an Embedding Vector Database
<details> <summary> To see a code snippet </summary>

#### Train an embedding model

##### Learning Embeddings
```bash
# Train an embedding model
dicee --dataset_dir KGs/Countries-S1 --path_to_store_single_run CountryEmbeddings --model Keci --p 0 --q 1 --embedding_dim 32 --adaptive_swa
Evaluate Keci on Train set: Evaluate Keci on Train set
{'H@1': 0.7110711071107111, 'H@3': 0.8937893789378938, 'H@10': 0.9657965796579658, 'MRR': 0.8083741625024974}
Evaluate Keci on Validation set: Evaluate Keci on Validation set
{'H@1': 0.2916666666666667, 'H@3': 0.5208333333333334, 'H@10': 0.75, 'MRR': 0.43778750756550605}
Evaluate Keci on Test set: Evaluate Keci on Test set
{'H@1': 0.4166666666666667, 'H@3': 0.5833333333333334, 'H@10': 0.8125, 'MRR': 0.5345117321073071}
Total Runtime: 16.738 seconds

## Create a qdrant vector database
diceeindex --path_to_store_single_run CountryEmbeddings --path_model CountryEmbeddings --collection_name "dummy" --location "localhost"
```
#### Create Embedding Vector database

#### Loading Embeddings into Qdrant Vector Database
```bash
# Install Qdrant
docker pull qdrant/qdrant
docker run -p 6333:6333 -p 6334:6334 -v $(pwd)/qdrant_storage:/qdrant/storage:z qdrant/qdrant
# pip install qdrant-client
diceeindex --path_model CountryEmbeddings --collection_name "dummy" --location "localhost"
# Ensure that Qdrant is available
# docker pull qdrant/qdrant && docker run -p 6333:6333 -p 6334:6334 -v $(pwd)/qdrant_storage:/qdrant/storage:z qdrant/qdrant
diceeindex --path_model "CountryEmbeddings" --collection_name "dummy" --location "localhost"
```
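Once `diceeindex` has finished, the collection can be inspected from Python with the `qdrant-client` package. This is a sketch under the assumption that Qdrant runs on the default `localhost:6333` used above.
```python
# Sketch: verify that the "dummy" collection was created (assumes Qdrant on localhost:6333).
from qdrant_client import QdrantClient

client = QdrantClient(host="localhost", port=6333)
info = client.get_collection(collection_name="dummy")
print(f"indexed vectors: {info.points_count}")
```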

#### Run Webservice
#### Launching Webservice
```bash
diceeserve --path_model CountryEmbeddings --collection_name "dummy" --collection_location "localhost"
diceeserve --path_model "CountryEmbeddings" --collection_name "dummy" --collection_location "localhost"
```
##### Retrieve and Search

#### Query

Most similar countries to germany
Get embedding of germany
```bash
curl -X 'GET' 'http://0.0.0.0:8000/api/search?q=germany' -H 'accept: application/json'
{"result":[{"hit":"germany","score":1.0},
{"hit":"netherlands","score":0.8340942},
{"hit":"luxembourg","score":0.7828385},
{"hit":"france","score":0.70330715},
{"hit":"belgium","score":0.6233973}]}
curl -X 'GET' 'http://0.0.0.0:8000/api/get?q=germany' -H 'accept: application/json'
```


```python
# pip install dicee
# wget https://files.dice-research.org/datasets/dice-embeddings/KGs.zip --no-check-certificate & unzip KGs.zip
from dicee.executer import Execute
from dicee.config import Namespace
from dicee.knowledge_graph_embeddings import KGE
# (1) Train a KGE model
args = Namespace()
args.model = 'Keci'
args.p=0
args.q=1
args.optim = 'Adam'
args.scoring_technique = "AllvsAll"
args.path_single_kg = "KGs/Family/family-benchmark_rich_background.owl"
args.backend = "rdflib"
args.num_epochs = 200
args.batch_size = 1024
args.lr = 0.1
args.embedding_dim = 512
result = Execute(args).start()
# (2) Load the pre-trained model
pre_trained_kge = KGE(path=result['path_experiment_folder'])
# (3) Single-hop query answering
# Query: ?E : \exist E.hasSibling(E, F9M167)
# Question: Who are the siblings of F9M167?
# Answer: [F9M157, F9F141], as (F9M167, hasSibling, F9M157) and (F9M167, hasSibling, F9F141)
predictions = pre_trained_kge.answer_multi_hop_query(query_type="1p",
query=('http://www.benchmark.org/family#F9M167',
('http://www.benchmark.org/family#hasSibling',)),
tnorm="min", k=3)
top_entities = [topk_entity for topk_entity, query_score in predictions]
assert "http://www.benchmark.org/family#F9F141" in top_entities
assert "http://www.benchmark.org/family#F9M157" in top_entities
# (2) Two-hop query answering
# Query: ?D : \exist E.Married(D, E) \land hasSibling(E, F9M167)
# Question: To whom a sibling of F9M167 is married to?
# Answer: [F9F158, F9M142] as (F9M157 #married F9F158) and (F9F141 #married F9M142)
predictions = pre_trained_kge.answer_multi_hop_query(query_type="2p",
query=("http://www.benchmark.org/family#F9M167",
("http://www.benchmark.org/family#hasSibling",
"http://www.benchmark.org/family#married")),
tnorm="min", k=3)
top_entities = [topk_entity for topk_entity, query_score in predictions]
assert "http://www.benchmark.org/family#F9M142" in top_entities
assert "http://www.benchmark.org/family#F9F158" in top_entities
# (3) Three-hop query answering
# Query: ?T : \exist D.type(D,T) \land Married(D,E) \land hasSibling(E, F9M167)
# Question: What are the type of people who are married to a sibling of F9M167?
# (3) Answer: [Person, Male, Father] since F9M157 is [Brother Father Grandfather Male] and F9M142 is [Male Grandfather Father]

predictions = pre_trained_kge.answer_multi_hop_query(query_type="3p", query=("http://www.benchmark.org/family#F9M167",
("http://www.benchmark.org/family#hasSibling",
"http://www.benchmark.org/family#married",
"http://www.w3.org/1999/02/22-rdf-syntax-ns#type")),
tnorm="min", k=5)
top_entities = [topk_entity for topk_entity, query_score in predictions]
print(top_entities)
assert "http://www.benchmark.org/family#Person" in top_entities
assert "http://www.benchmark.org/family#Father" in top_entities
assert "http://www.benchmark.org/family#Male" in top_entities
Get most similar things to europe
```bash
curl -X 'GET' 'http://0.0.0.0:8000/api/search?q=europe' -H 'accept: application/json'
{"result":[{"hit":"europe","score":1.0},
{"hit":"northern_europe","score":0.67126536},
{"hit":"western_europe","score":0.6010134},
{"hit":"puerto_rico","score":0.5051694},
{"hit":"southern_europe","score":0.4829831}]}
```
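The same endpoints can be queried from Python instead of curl; a small sketch with `requests`, assuming the webservice runs on `0.0.0.0:8000` as above.
```python
# Sketch: call the dicee webservice from Python (assumes it listens on 0.0.0.0:8000).
import requests

# Nearest neighbours of "germany" in the embedding space.
hits = requests.get("http://0.0.0.0:8000/api/search", params={"q": "germany"}).json()["result"]
for hit in hits:
    print(hit["hit"], hit["score"])

# Raw embedding vector of "germany".
embedding = requests.get("http://0.0.0.0:8000/api/get", params={"q": "germany"}).json()
```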
For more, please refer to `examples/multi_hop_query_answering`.

</details>


@@ -327,16 +271,22 @@ pre_trained_kge.predict_topk(r=[".."],t=[".."],topk=10)

## Downloading Pretrained Models

We provide plenty of pretrained knowledge graph embedding models at [dice-research.org/projects/DiceEmbeddings/](https://files.dice-research.org/projects/DiceEmbeddings/).
<details> <summary> To see a code snippet </summary>

```python
from dicee import KGE
# (1) Load pretrained models trained on KINSHIP and YAGO3-10
model = KGE(url="https://files.dice-research.org/projects/DiceEmbeddings/KINSHIP-Keci-dim128-epoch256-KvsAll")
mure = KGE(url="https://files.dice-research.org/projects/DiceEmbeddings/YAGO3-10-Pykeen_MuRE-dim128-epoch256-KvsAll")
quate = KGE(url="https://files.dice-research.org/projects/DiceEmbeddings/YAGO3-10-Pykeen_QuatE-dim128-epoch256-KvsAll")
keci = KGE(url="https://files.dice-research.org/projects/DiceEmbeddings/YAGO3-10-Keci-dim128-epoch256-KvsAll")
quate.predict_topk(h=["Mongolia"],r=["isLocatedIn"],topk=3)
# [('Asia', 0.9894362688064575), ('Europe', 0.01575559377670288), ('Tadanari_Lee', 0.012544365599751472)]
keci.predict_topk(h=["Mongolia"],r=["isLocatedIn"],topk=3)
# [('Asia', 0.6522021293640137), ('Chinggis_Khaan_International_Airport', 0.36563414335250854), ('Democratic_Party_(Mongolia)', 0.19600993394851685)]
mure.predict_topk(h=["Mongolia"],r=["isLocatedIn"],topk=3)
# [('Asia', 0.9996906518936157), ('Ulan_Bator', 0.0009907372295856476), ('Philippines', 0.0003116439620498568)]
```

- For more, please look at [dice-research.org/projects/DiceEmbeddings/](https://files.dice-research.org/projects/DiceEmbeddings/)

</details>

## How to Deploy
@@ -434,5 +384,5 @@ url={https://openreview.net/forum?id=6T45-4TFqaX}}
year={2021},
organization={IEEE}
```
For any questions or wishes, please contact: ```caglar.demir@upb.de``` or ```caglardemir8@gmail.com```
For any questions or wishes, please contact: ```caglar.demir@upb.de```

2 changes: 1 addition & 1 deletion dicee/__init__.py
@@ -4,4 +4,4 @@
from .executer import Execute # noqa
from .dataset_classes import * # noqa
from .query_generator import QueryGenerator # noqa
__version__ = '0.1.3'
__version__ = '0.1.4'
