SynQL

Data | Models | Paper | Citation | Getting Started

SynQL is a method for synthetically generating Text-to-SQL Question-Query Pairs (QQPs). This project is based on the SynQL paper, which we recommend reading for a detailed explanation of the methodology. A high-level overview of the methodology is shown below.

Because manual data generation is both expensive and labor-intensive, alternative methods based on Synthetic Data Generation (SDG) and data augmentation for the text-to-SQL domain (Yu et al., 2021; Wu et al., 2022; Hu et al., 2023) have become increasingly important. In this paper, we focus on this problem and propose a new method for generating synthetic text-to-SQL data. The contributions of this paper are as follows:

We present a systematic comparison of text-to-SQL SDG methods and highlight the need for more diverse data.
To address this, we propose SynQL, a novel synthetic data generation method for Text-to-SQL that leverages in-context learning and introduces "Topics", a new type of contextual information that enhances the diversity of generated QQPs.
To assess the quality of our method, we experiment with KaggleDBQA (Lee et al., 2021), an established low-resource benchmark, and demonstrate that models trained on SynQL-KaggleDBQA exceed the performance of those trained on the original data.
Additionally, to better understand the properties of SynQL data, we generate a synthetic equivalent of the Spider dataset (Yu et al., 2018) and analyze model performance when trained on both original and synthetic data.
Finally, we open-source both the method and the synthetic datasets for further research.
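
To make the role of Topics and in-context learning more concrete, here is a minimal sketch of how a generation prompt could be assembled from a database schema, a topic, and a few example QQPs. It is an illustration only, not the implementation in this repository: the `Example` class, the `build_prompt` function, the prompt wording, and the toy `singer` schema are all hypothetical. See the SynQL paper and the runner directory for the actual method.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One in-context demonstration: a question and its SQL query (hypothetical type)."""
    question: str
    query: str


def build_prompt(schema: str, topic: str, examples: list[Example], n_pairs: int = 3) -> str:
    """Assemble a generation prompt from a schema, a topic, and demonstrations.

    Hypothetical helper: the wording and layout are assumptions, not SynQL's prompt.
    """
    lines = [
        "You write question-query pairs for a text-to-SQL dataset.",
        "",
        "Database schema:",
        schema.strip(),
        "",
        f"Topic: {topic}",
        "",
        "Examples:",
    ]
    # Each demonstration shows the model the expected question/SQL format.
    for ex in examples:
        lines.append(f"Q: {ex.question}")
        lines.append(f"SQL: {ex.query}")
    lines.append("")
    lines.append(f"Write {n_pairs} new pairs about the topic above, using only this schema.")
    return "\n".join(lines)


# Toy schema and demonstration, invented for this sketch.
schema = "CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT, country TEXT, age INTEGER);"
examples = [Example("How many singers are there?", "SELECT COUNT(*) FROM singer")]

prompt = build_prompt(schema, "Singer demographics", examples, n_pairs=2)
print(prompt)
```

Varying the topic while keeping the schema fixed is what would steer the generator toward different kinds of questions over the same database.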

Data

We have used the SynQL method to generate the datasets listed below. These datasets are available for download on the Hugging Face dataset hub. A detailed explanation of the data generation process can be found in the SynQL paper. The Spider dataset can be found here. The KaggleDBQA dataset can be found here.

Dataset	Description	Link
SynQL-Spider-Train	Synthetically generated data based on the Spider training split	Download
SynQL-KaggleDBQA-Train	Synthetically generated data based on the KaggleDBQA training split	Download
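
Each record in these datasets pairs a natural-language question with a SQL query. A simple sanity check when working with such data is to run every query against its database and keep only the pairs that execute. The sketch below does this with Python's built-in `sqlite3` module on a toy in-memory database. The field names `question` and `query`, the `singer` table, and the `query_executes` helper are assumptions made for illustration; consult the dataset cards for the actual column names.

```python
import sqlite3


def query_executes(conn: sqlite3.Connection, query: str) -> bool:
    """Return True if the query runs against the database without an error."""
    try:
        conn.execute(query).fetchall()
        return True
    except sqlite3.Error:
        return False


# Toy in-memory database standing in for a real Spider or KaggleDBQA database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.execute("INSERT INTO singer (name, age) VALUES ('Ada', 31), ('Lin', 45)")

# Hypothetical QQP records; the last one references a table that does not exist.
qqps = [
    {"question": "How many singers are there?", "query": "SELECT COUNT(*) FROM singer"},
    {"question": "What is the average age of singers?", "query": "SELECT AVG(age) FROM singer"},
    {"question": "Which albums were released in 2020?", "query": "SELECT title FROM album WHERE year = 2020"},
]

valid = [pair for pair in qqps if query_executes(conn, pair["query"])]
print(f"{len(valid)} of {len(qqps)} queries executed")  # prints "2 of 3 queries executed"
```

This check only confirms that a query runs; it says nothing about whether the query answers its question. Run it against a copy of the database, since executing arbitrary SQL can modify data.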

Models

We have previously used data generated using the SynQL method to train the models listed below. These models are available for download on the Hugging Face model hub. The T5 training framework available in the picard repository was used to train these models. A detailed explanation of the training configuration and process can be found in the SynQL paper.

Model	Dataset	Description	Link
T5-3B	SynQL-Spider-Train	T5-3B model fine-tuned on SynQL-Spider-Train	Download
T5-3B	SynQL-KaggleDBQA-Train	T5-3B model fine-tuned on SynQL-KaggleDBQA-Train	Download
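
Sequence-to-sequence models such as T5 take a single input string, so the question and the database schema have to be flattened into text before inference. The sketch below shows one common layout: the question, then the database id, then each table with its columns. The `serialize_input` helper and the exact separators are assumptions for illustration; the format the released models expect is defined by the training code in the picard repository, so check it before running inference.

```python
def serialize_input(question: str, db_id: str, tables: dict[str, list[str]]) -> str:
    """Flatten a question and a schema into one string for a seq2seq model.

    Hypothetical helper: the separators are an assumed layout, not a confirmed format.
    """
    parts = [question.strip(), db_id]
    # One segment per table: "table : col1 , col2 , ..."
    for table, columns in tables.items():
        parts.append(f"{table} : {' , '.join(columns)}")
    return " | ".join(parts)


text = serialize_input(
    "How many singers are there?",
    "concert_singer",
    {"singer": ["singer_id", "name", "age"]},
)
print(text)
# prints "How many singers are there? | concert_singer | singer : singer_id , name , age"
```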

Getting Started

Check out the runner directory for examples of how to synthesize QQPs using SynQL. Please open an issue if you have any questions or need help getting started.

Citation

If you use SynQL, or any of the datasets or models provided here, please cite:

@inproceedings{baumgartner-2024-synql,
    title = "SynQL: Synthetic Data Generation for In-Domain, Low-Resource Text-to-SQL Parsing",
    author = "Baumgartner, Denver  and
      Kornuta, Tomasz",
    booktitle = "Third Table Representation Learning Workshop (TRL) at the Neural Information Processing Systems Conference (NeurIPS)",
    year = "2024",
    month = "December",
    url = "https://openreview.net/pdf?id=WrexlGBDCH",
}

For any questions or feedback, please open an issue or contact Denver Baumgartner at (denver[at]semiotic[dot]ai).
