Commit

… into documentation
sshivam95 committed Mar 20, 2024
2 parents 0bd7f22 + 2fa2dcc commit ddcd192
Showing 52 changed files with 3,016 additions and 1,698 deletions.
9 changes: 5 additions & 4 deletions .github/workflows/github-actions-python-package.yml
@@ -7,7 +7,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.9.18"]
python-version: ["3.10.13"]

steps:
- uses: actions/checkout@v3
@@ -18,11 +18,12 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip3 install .
pip install -e .["dev"]
- name: Lint with ruff
run: |
ruff dicee/
ruff --select=E501 --line-length=200 dicee/
- name: Test with pytest
run: |
wget https://files.dice-research.org/datasets/dice-embeddings/KGs.zip --no-check-certificate && unzip KGs.zip
pytest -p no:warnings -x
python -m pytest -p no:warnings -x
79 changes: 51 additions & 28 deletions .github/workflows/sphinx.yml
@@ -1,33 +1,56 @@
# This is a basic workflow to help you get started with Actions
name: Build docs

name: Build-sphinx-docs
on:
push:
branches:
- main
- develop
- documentation # just for testing
pull_request:

on: [push,pull_request]

# A workflow run is made up of one or more jobs that can run sequentially or in parallel
jobs:
# This workflow contains a single job called "build"
build:
# The type of runner that the job will run on
jobs:
docs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: [ "3.10.11" ]
max-parallel: 5

# Steps represent a sequence of tasks that will be executed as part of the job
steps:
# Checks-out your repository under $GITHUB_WORKSPACE, so your job can access it
- uses: actions/checkout@v2

- name: Set up Python 3.9.18
uses: actions/setup-python@v2
with:
python-version: "3.9.18"

- name: Install dependencies
run: |
python -m pip install --upgrade pip
# # pip install -r requirements.txt

- name: Build HTML and import
run: |
# sphinx-apidoc -o docs dicee/ && make -C docs/ html && ghp-import -n -p -f docs/_build/html


- uses: actions/checkout@v3
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}

- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -r requirements.txt
- name: Prepare required software
run: |
# epstopdf & dot & noto-fonts
sudo apt update && sudo apt install texlive-font-utils graphviz fonts-noto\
- name: Build docs
run: |
sphinx-build -M html docs/ docs/_build/
- name: Build LaTeX docs
run: |
sphinx-build -M latex docs/ docs/_build/
- name: Compile LaTeX document
uses: docker://texlive/texlive:latest
with:
args: make -C docs/_build/latex
- name: Copy Latex pdf to ./html
run: |
cp docs/_build/latex/diceembeddings.pdf docs/_build/html/
- name: Deploy to GitHub Pages
uses: peaceiris/actions-gh-pages@v3
with:
github_token: ${{ secrets.GITHUB_TOKEN }}
publish_dir: 'docs/_build/html'
176 changes: 63 additions & 113 deletions README.md
@@ -32,25 +32,25 @@ Deploy a pre-trained embedding model without writing a single line of code.
## Installation
<details><summary> Click me! </summary>

### Instalation from Source
### Installation from Source
``` bash
git clone https://github.com/dice-group/dice-embeddings.git
conda create -n dice python=3.9.18 --no-default-packages && conda activate dice && cd dice-embeddings &&
pip3 install .
conda create -n dice python=3.10.13 --no-default-packages && conda activate dice && cd dice-embeddings &&
pip3 install -e .
```
or
```bash
pip install dicee==0.1.3
pip install dicee
```
## Download Knowledge Graphs
```bash
wget https://files.dice-research.org/datasets/dice-embeddings/KGs.zip --no-check-certificate && unzip KGs.zip
```
To test the installation
```bash
pytest -p no:warnings -x # Runs >114 tests leading to > 15 mins
pytest -p no:warnings --lf # run only the last failed test
pytest -p no:warnings --ff # to run the failures first and then the rest of the tests.
python -m pytest -p no:warnings -x # Runs >114 tests leading to > 15 mins
python -m pytest -p no:warnings --lf # run only the last failed test
python -m pytest -p no:warnings --ff # to run the failures first and then the rest of the tests.
```

</details>
@@ -95,24 +95,36 @@ A KGE model can also be trained from the command line
```bash
dicee --dataset_dir "KGs/UMLS" --model Keci --eval_model "train_val_test"
```
Models can be easily trained in a single-node multi-GPU setting with pytorch-lightning.
dicee automatically detects available GPUs and trains a model with the distributed data parallel technique.
```bash
dicee --accelerator "gpu" --strategy "ddp" --dataset_dir "KGs/UMLS" --model Keci --eval_model "train_val_test"
# Train a model by only using the GPU-0
CUDA_VISIBLE_DEVICES=0 dicee --dataset_dir "KGs/UMLS" --model Keci --eval_model "train_val_test"
# Train a model by only using GPU-1
CUDA_VISIBLE_DEVICES=1 dicee --dataset_dir "KGs/UMLS" --model Keci --eval_model "train_val_test"
# Train a model by using all available GPUs
dicee --dataset_dir "KGs/UMLS" --model Keci --eval_model "train_val_test"
```
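The same run can also be started from Python. Below is a minimal sketch; it assumes that `dicee.config.Namespace` accepts the same attribute names as the CLI flags (`dataset_dir`, `trainer`, `eval_model`), which is not shown in this diff.
```python
# Minimal sketch: multi-GPU training via the Python API.
# Assumption: Namespace mirrors the CLI flags used above (dataset_dir, trainer, eval_model).
from dicee.executer import Execute
from dicee.config import Namespace

args = Namespace()
args.model = 'Keci'
args.dataset_dir = 'KGs/UMLS'
args.eval_model = 'train_val_test'
args.trainer = 'PL'        # lightning trainer; picks up all visible GPUs, as in the CLI examples
args.embedding_dim = 32
args.num_epochs = 100
result = Execute(args).start()
print(result['path_experiment_folder'])
```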
Under the hood, dicee executes the run.py script and uses [lightning](https://lightning.ai/) as a default trainer.
```bash
# Two equivalent executions
# (1)
dicee --dataset_dir "KGs/UMLS" --model Keci --eval_model "train_val_test"
# (2)
CUDA_VISIBLE_DEVICES=0,1 python dicee/scripts/run.py --trainer PL --dataset_dir "KGs/UMLS" --model Keci --eval_model "train_val_test"
```

Similarly, models can be easily trained with torchrun
```bash
torchrun --standalone --nnodes=1 --nproc_per_node=gpu main.py
torchrun --standalone --nnodes=1 --nproc_per_node=gpu dicee/scripts/run.py --trainer torchDDP --dataset_dir "KGs/UMLS" --model Keci --eval_model "train_val_test"
```
You can also train a model in a multi-node multi-GPU setting.
```bash
torchrun --nnodes 2 --nproc_per_node=gpu --node_rank 0 --rdzv_id 455 --rdzv_backend c10d --rdzv_endpoint=nebula -m dicee.run --trainer torchDDP --dataset_dir KGs/UMLS
torchrun --nnodes 2 --nproc_per_node=gpu --node_rank 1 --rdzv_id 455 --rdzv_backend c10d --rdzv_endpoint=nebula -m dicee.run --trainer torchDDP --dataset_dir KGs/UMLS
torchrun --nnodes 2 --nproc_per_node=gpu --node_rank 0 --rdzv_id 455 --rdzv_backend c10d --rdzv_endpoint=nebula dicee/scripts/run.py --trainer torchDDP --dataset_dir KGs/UMLS
torchrun --nnodes 2 --nproc_per_node=gpu --node_rank 1 --rdzv_id 455 --rdzv_backend c10d --rdzv_endpoint=nebula dicee/scripts/run.py --trainer torchDDP --dataset_dir KGs/UMLS
```
Train a KGE model by providing the path of a single file and store all parameters under a newly created directory
called `KeciFamilyRun`.
```bash
dicee --path_single_kg "KGs/Family/family-benchmark_rich_background.owl" --model Keci --path_to_store_single_run KeciFamilyRun --backend rdflib
dicee --path_single_kg "KGs/Family/family-benchmark_rich_background.owl" --model Keci --path_to_store_single_run KeciFamilyRun --backend rdflib --eval_model None
```
where the data is in the following form
```bash
@@ -121,6 +133,11 @@ _:1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07
<http://www.benchmark.org/family#hasChild> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#ObjectProperty> .
<http://www.benchmark.org/family#hasParent> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#ObjectProperty> .
```
**Continual Training:** the training phase of a pretrained model can be resumed.
```bash
dicee --continual_learning KeciFamilyRun --path_single_kg "KGs/Family/family-benchmark_rich_background.owl" --model Keci --path_to_store_single_run KeciFamilyRun --backend rdflib --eval_model None
```

**Apart from n-triples or standard link prediction dataset formats, we support ["owl", "nt", "turtle", "rdf/xml", "n3"].**
Moreover, a KGE model can also be trained by providing **an endpoint of a triple store**.
```bash
@@ -129,114 +146,41 @@ dicee --sparql_endpoint "http://localhost:3030/mutagenesis/" --model Keci
For more, please refer to `examples`.
</details>

## Embedding Vector Database
## Creating an Embedding Vector Database
<details> <summary> To see a code snippet </summary>

#### Train an embedding model

##### Learning Embeddings
```bash
# Train an embedding model
dicee --dataset_dir KGs/Countries-S1 --path_to_store_single_run CountryEmbeddings --model Keci --p 0 --q 1 --embedding_dim 32 --adaptive_swa
Evaluate Keci on Train set: Evaluate Keci on Train set
{'H@1': 0.7110711071107111, 'H@3': 0.8937893789378938, 'H@10': 0.9657965796579658, 'MRR': 0.8083741625024974}
Evaluate Keci on Validation set: Evaluate Keci on Validation set
{'H@1': 0.2916666666666667, 'H@3': 0.5208333333333334, 'H@10': 0.75, 'MRR': 0.43778750756550605}
Evaluate Keci on Test set: Evaluate Keci on Test set
{'H@1': 0.4166666666666667, 'H@3': 0.5833333333333334, 'H@10': 0.8125, 'MRR': 0.5345117321073071}
Total Runtime: 16.738 seconds

## Create a qdrant vector database
diceeindex --path_to_store_single_run CountryEmbeddings --path_model CountryEmbeddings --collection_name "dummy" --location "localhost"
```
#### Create Embedding Vector database

#### Loading Embeddings into Qdrant Vector Database
```bash
# Install Qdrant
docker pull qdrant/qdrant
docker run -p 6333:6333 -p 6334:6334 -v $(pwd)/qdrant_storage:/qdrant/storage:z qdrant/qdrant
# pip install qdrant-client
diceeindex --path_model CountryEmbeddings --collection_name "dummy" --location "localhost"
# Ensure that Qdrant is available
# docker pull qdrant/qdrant && docker run -p 6333:6333 -p 6334:6334 -v $(pwd)/qdrant_storage:/qdrant/storage:z qdrant/qdrant
diceeindex --path_model "CountryEmbeddings" --collection_name "dummy" --location "localhost"
```
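Once `diceeindex` has finished, the collection can be inspected from Python with the `qdrant-client` package. This is a sketch under the assumption that Qdrant runs on the default `localhost:6333` used above.
```python
# Sketch: verify that the "dummy" collection was created (assumes Qdrant on localhost:6333).
from qdrant_client import QdrantClient

client = QdrantClient(host="localhost", port=6333)
info = client.get_collection(collection_name="dummy")
print(f"indexed vectors: {info.points_count}")
```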

#### Run Webservice
#### Launching Webservice
```bash
diceeserve --path_model CountryEmbeddings --collection_name "dummy" --collection_location "localhost"
diceeserve --path_model "CountryEmbeddings" --collection_name "dummy" --collection_location "localhost"
```
##### Retrieve and Search

#### Query

Most similar countries to germany
Get embedding of germany
```bash
curl -X 'GET' 'http://0.0.0.0:8000/api/search?q=germany' -H 'accept: application/json'
{"result":[{"hit":"germany","score":1.0},
{"hit":"netherlands","score":0.8340942},
{"hit":"luxembourg","score":0.7828385},
{"hit":"france","score":0.70330715},
{"hit":"belgium","score":0.6233973}]}
curl -X 'GET' 'http://0.0.0.0:8000/api/get?q=germany' -H 'accept: application/json'
```


```python
# pip install dicee
# wget https://files.dice-research.org/datasets/dice-embeddings/KGs.zip --no-check-certificate & unzip KGs.zip
from dicee.executer import Execute
from dicee.config import Namespace
from dicee.knowledge_graph_embeddings import KGE
# (1) Train a KGE model
args = Namespace()
args.model = 'Keci'
args.p=0
args.q=1
args.optim = 'Adam'
args.scoring_technique = "AllvsAll"
args.path_single_kg = "KGs/Family/family-benchmark_rich_background.owl"
args.backend = "rdflib"
args.num_epochs = 200
args.batch_size = 1024
args.lr = 0.1
args.embedding_dim = 512
result = Execute(args).start()
# (2) Load the pre-trained model
pre_trained_kge = KGE(path=result['path_experiment_folder'])
# (3) Single-hop query answering
# Query: ?E : \exist E.hasSibling(E, F9M167)
# Question: Who are the siblings of F9M167?
# Answer: [F9M157, F9F141], as (F9M167, hasSibling, F9M157) and (F9M167, hasSibling, F9F141)
predictions = pre_trained_kge.answer_multi_hop_query(query_type="1p",
query=('http://www.benchmark.org/family#F9M167',
('http://www.benchmark.org/family#hasSibling',)),
tnorm="min", k=3)
top_entities = [topk_entity for topk_entity, query_score in predictions]
assert "http://www.benchmark.org/family#F9F141" in top_entities
assert "http://www.benchmark.org/family#F9M157" in top_entities
# (2) Two-hop query answering
# Query: ?D : \exist E.Married(D, E) \land hasSibling(E, F9M167)
# Question: To whom a sibling of F9M167 is married to?
# Answer: [F9F158, F9M142] as (F9M157 #married F9F158) and (F9F141 #married F9M142)
predictions = pre_trained_kge.answer_multi_hop_query(query_type="2p",
query=("http://www.benchmark.org/family#F9M167",
("http://www.benchmark.org/family#hasSibling",
"http://www.benchmark.org/family#married")),
tnorm="min", k=3)
top_entities = [topk_entity for topk_entity, query_score in predictions]
assert "http://www.benchmark.org/family#F9M142" in top_entities
assert "http://www.benchmark.org/family#F9F158" in top_entities
# (3) Three-hop query answering
# Query: ?T : \exist D.type(D,T) \land Married(D,E) \land hasSibling(E, F9M167)
# Question: What are the type of people who are married to a sibling of F9M167?
# (3) Answer: [Person, Male, Father] since F9M157 is [Brother Father Grandfather Male] and F9M142 is [Male Grandfather Father]

predictions = pre_trained_kge.answer_multi_hop_query(query_type="3p", query=("http://www.benchmark.org/family#F9M167",
("http://www.benchmark.org/family#hasSibling",
"http://www.benchmark.org/family#married",
"http://www.w3.org/1999/02/22-rdf-syntax-ns#type")),
tnorm="min", k=5)
top_entities = [topk_entity for topk_entity, query_score in predictions]
print(top_entities)
assert "http://www.benchmark.org/family#Person" in top_entities
assert "http://www.benchmark.org/family#Father" in top_entities
assert "http://www.benchmark.org/family#Male" in top_entities
Get most similar things to europe
```bash
curl -X 'GET' 'http://0.0.0.0:8000/api/search?q=europe' -H 'accept: application/json'
{"result":[{"hit":"europe","score":1.0},
{"hit":"northern_europe","score":0.67126536},
{"hit":"western_europe","score":0.6010134},
{"hit":"puerto_rico","score":0.5051694},
{"hit":"southern_europe","score":0.4829831}]}
```
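The same endpoints can be queried from Python instead of curl; a small sketch with `requests`, assuming the webservice runs on `0.0.0.0:8000` as above.
```python
# Sketch: call the dicee webservice from Python (assumes it listens on 0.0.0.0:8000).
import requests

# Nearest neighbours of "germany" in the embedding space.
hits = requests.get("http://0.0.0.0:8000/api/search", params={"q": "germany"}).json()["result"]
for hit in hits:
    print(hit["hit"], hit["score"])

# Raw embedding vector of "germany".
embedding = requests.get("http://0.0.0.0:8000/api/get", params={"q": "germany"}).json()
```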
For more, please refer to `examples/multi_hop_query_answering`.

</details>


@@ -327,16 +271,22 @@ pre_trained_kge.predict_topk(r=[".."],t=[".."],topk=10)

## Downloading Pretrained Models

We provide plenty of pretrained knowledge graph embedding models at [dice-research.org/projects/DiceEmbeddings/](https://files.dice-research.org/projects/DiceEmbeddings/).
<details> <summary> To see a code snippet </summary>

```python
from dicee import KGE
# (1) Load pretrained models trained on KINSHIP and YAGO3-10
model = KGE(url="https://files.dice-research.org/projects/DiceEmbeddings/KINSHIP-Keci-dim128-epoch256-KvsAll")
mure = KGE(url="https://files.dice-research.org/projects/DiceEmbeddings/YAGO3-10-Pykeen_MuRE-dim128-epoch256-KvsAll")
quate = KGE(url="https://files.dice-research.org/projects/DiceEmbeddings/YAGO3-10-Pykeen_QuatE-dim128-epoch256-KvsAll")
keci = KGE(url="https://files.dice-research.org/projects/DiceEmbeddings/YAGO3-10-Keci-dim128-epoch256-KvsAll")
quate.predict_topk(h=["Mongolia"],r=["isLocatedIn"],topk=3)
# [('Asia', 0.9894362688064575), ('Europe', 0.01575559377670288), ('Tadanari_Lee', 0.012544365599751472)]
keci.predict_topk(h=["Mongolia"],r=["isLocatedIn"],topk=3)
# [('Asia', 0.6522021293640137), ('Chinggis_Khaan_International_Airport', 0.36563414335250854), ('Democratic_Party_(Mongolia)', 0.19600993394851685)]
mure.predict_topk(h=["Mongolia"],r=["isLocatedIn"],topk=3)
# [('Asia', 0.9996906518936157), ('Ulan_Bator', 0.0009907372295856476), ('Philippines', 0.0003116439620498568)]
```

- For more, please look at [dice-research.org/projects/DiceEmbeddings/](https://files.dice-research.org/projects/DiceEmbeddings/)

</details>

## How to Deploy
@@ -434,5 +384,5 @@ url={https://openreview.net/forum?id=6T45-4TFqaX}}
year={2021},
organization={IEEE}
```
For any questions or wishes, please contact: ```caglar.demir@upb.de``` or ```caglardemir8@gmail.com```
For any questions or wishes, please contact: ```caglar.demir@upb.de```

2 changes: 1 addition & 1 deletion dicee/__init__.py
@@ -4,4 +4,4 @@
from .executer import Execute # noqa
from .dataset_classes import * # noqa
from .query_generator import QueryGenerator # noqa
__version__ = '0.1.3'
__version__ = '0.1.4'
