Data | Models | Paper | Citation | Getting Started
SynQL
is a method for synthetically generating Text-to-SQL Question-Query Pairs (QQPs). This project is based on the SynQL paper, which we recommend reading for a detailed explanation of the methodology. A high level overview of the methodology is shown below.
Given that manual data generation is both expensive and labor-intensive, alternative methods involving Synthetic Data Generation (SDG) and data augmentation techniques for the text-to-SQL domain (Yu et al., 2021, Wu et al., 2022, Hu et al., 2023) have become increasingly essential. In this paper, we focus on this problem and propose a new method for generating synthetic text-to-SQL data. Contributions of this paper are as follows:
- We present a systematic comparison of text-to-SQL SDG methods and highlight the need for more diverse data.
- To address this, we propose SynQL, a novel synthetic data generation method for Text-to-SQL that leverages in-context learning and introduces ’Topics’—a new type of contextual information enhancing the diversity of generated QQPs.
- In order to assess the quality of our method, we experiment with KaggleDBQA (Lee et al. 2021), an established low-resource benchmark, and demonstrate that models trained on SynQL-KaggleDBQA exceed the performance of those trained on the original data.
- Additionally, to better understand the properties of SynQL data, we generate a synthetic equivalent of the Spider dataset (Yu et al. 2018) and analyze model performance when trained on both original and synthetic data.
- Finally, we open-source both method and synthetic datasets for further research.
We have used the SynQL method to generate the datasets listed below. These datasets are available for download on the Hugging Face dataset hub. A detailed explanation of the data generation process can be found in the SynQL paper. The Spider
dataset can be found here. The KaggleDBQA
dataset can be found here.
Dataset | Description | Link |
---|---|---|
SynQL-Spider-Train | Synthetically generated data based on the Spider training split | Download |
SynQL-KaggleDBQA-Train | Synthetically generated data based on the KaggleDBQA training split | Download |
We have previously used data generated using the SynQL method to train the models listed below. These models are available for download on the Hugging Face model hub. The T5
training framework available in the picard repository was used to train these models. A detailed explanation of the training configuration and process can be found in the SynQL paper.
Model | Dataset | Description | Link |
---|---|---|---|
T5-3B | SynQL-Spider-Train | T5-3B model finetuned on SynQL-Spider-Train | Download |
T5-3B | SynQL-KaggleDBQA-Train | T5-3B model finetuned on SynQL-Spider-Train | Download |
Checkout the runner directory for examples on how to synthesize QQPs using SynQL. Please open an issue if you have any questions or need help getting started.
If you use SynQL, or any of the datasets or models provided here, you can use the following citation:
@inproceedings{baumgartner-2024-synql,
title = "SynQL: Synthetic Data Generation for In-Domain, Low-Resource Text-to-SQL Parsing",
author = "Baumgartner, Denver and
Kornuta, Tomasz",
booktitle = "Third Table Representation Learning Workshop (TRL) at the Neural Information Processing Systems Conference (NeurIPS)",
year = "2024",
month = "December",
url = "https://openreview.net/pdf?id=WrexlGBDCH",
}
For any questions or feedback, please open an issue or contact Denver Baumgartner at (denver[at]semiotic[dot]ai).