Skip to content

geokats/schema-linking-embeddings

Repository files navigation

Locally Aligned Embeddings for Schema Linking

Introduction

This project was completed for the Database Systems class taught by Dr. Georgia Koutrika, as part of the Data Science and Information Technologies MSc programme at the University of Athens.

The goal of the project is to construct word embedding representations from a given table in order to perform better schema linking.

Instructions on how to download the necessary files and how to run the code can be found bellow. It is also possible to run the experiments on Google Colab, without downloading anything, by using the notebook in this repository. To do so click here: Open In Colab

Local embedding creation overview

Running the program

To run our programs you can use the following commands.

For creating the schema linking dataset:

python create_data.py -i INPUT_FILE -t TABLE_FILE -o OUTPUT_FILE

arguments:
  -i INPUT_FILE, --input_file INPUT_FILE
                        Input WikiSQL .jsonl file
  -t TABLE_FILE, --table_file TABLE_FILE
                        Input WikiSQL .tables.jsonl file
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
                        Output file to save the updated examples with schema links

For creating the table embeddings and alignment matrices:

python main.py -t TABLES_FILE -e EMBEDDINGS_FILE -d EMBEDDINGS_DIMENSION -a {word2vec,fasttext} -o OUTPUT_DIR

optional arguments:
  -t TABLES_FILE, --tables_file TABLES_FILE
                        Input WikiSQL .tables.jsonl file
  -e EMBEDDINGS_FILE, --embeddings_file EMBEDDINGS_FILE
                        Pre-trained embeddings file
  -d EMBEDDINGS_DIMENSION, --embeddings_dimension EMBEDDINGS_DIMENSION
                        Dimension of pre-trained embeddings
  -a {word2vec,fasttext}, --embeddings_algorithm {word2vec,fasttext}
                        Algorithm to use for training local embeddings
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        Output directory to save the table embeddings and alignment matrices

Data

WikiSQL Dataset

We use the WikiSQL dataset, which has been copied in our repository from the authors' original repository. To unpack use:

tar xvjf data.tar.bz2

Pre-trained Word embeddings

A set of pre-trained word embedding vectors is needed to run the program. These can be downloaded from the following links:

Acknowledgments

This project is built on two earlier works:

  • EmbDI [1]

    @article{Cappuzzo_2020,
       title={Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks},
       ISBN={9781450367356},
       url={http://dx.doi.org/10.1145/3318464.3389742},
       DOI={10.1145/3318464.3389742},
       journal={Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data},
       publisher={ACM},
       author={Cappuzzo, Riccardo and Papotti, Paolo and Thirumuruganathan, Saravanan},
       year={2020},
       month={May}
    }
    
  • FastText alignment package [2]

    @InProceedings{joulin2018loss,
        title={Loss in Translation: Learning Bilingual Word Mapping with a Retrieval Criterion},
        author={Joulin, Armand and Bojanowski, Piotr and Mikolov, Tomas and J\'egou, Herv\'e and Grave, Edouard},
        year={2018},
        booktitle={Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
    }
    

If you use them in your work, please acknowledge the authors' contributions by citing them, using the bibtex citations above.

About

Creating locally aligned embeddings for schema linking

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published