Official implementation of "ClavaDDPM: Multi-relational Data Synthesis with Cluster-guided Diffusion Models" (https://arxiv.org/pdf/2405.17724).
To run a train-sampling pipeline:

```bash
python complex_pipeline.py --config_path configs/movie_lens.json
```
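The config is a plain JSON file. The keys in the sketch below are illustrative placeholders only, not the repo's actual schema; see `configs/movie_lens.json` for the real field names and values.

```json
{
  "general": {
    "data_dir": "complex_data/movie_lens",
    "exp_name": "movie_lens_demo"
  },
  "diffusion": {
    "iterations": 100000,
    "batch_size": 4096
  }
}
```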
To use customized datasets, see `complex_data/california` as an example:

- Save all tables as `.csv` files. All id columns should be named `<column_name>_id`.
- Create a `dataset_meta.json`, in which `tables` should be manually created to specify all foreign key relationships in a multi-table dataset (a hedged sketch follows this list).
- Create a `relation_order` in `dataset_meta.json`, which specifies the topological order of the multi-table dataset. The function `topological_sort` in `preprocess_utils.py` helps create it (an independent sketch is also given below).
- Create a domain file for each table, id columns excluded (an example is given below).
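As rough orientation, a `dataset_meta.json` for a hypothetical two-table dataset (`users` as the parent of `orders` via `user_id`) might look like the sketch below. The exact field layout here is an assumption, not the repo's verified schema; copy the structure from `complex_data/california/dataset_meta.json` rather than from this sketch.

```json
{
  "tables": {
    "users":  { "parents": [],        "children": ["orders"] },
    "orders": { "parents": ["users"], "children": [] }
  },
  "relation_order": [
    [null, "users"],
    ["users", "orders"]
  ]
}
```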
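The `relation_order` must list parent tables before their children. The repo's `topological_sort` in `preprocess_utils.py` handles this; purely as an independent sketch of the same idea (the function name and pair format below are illustrative, mirroring the `dataset_meta.json` sketch above), the standard library's `graphlib` suffices:

```python
from graphlib import TopologicalSorter

def relation_order_from_fks(fk_edges):
    """Derive a parent-before-child table order from (parent, child)
    foreign-key pairs, e.g. [("users", "orders")]."""
    # TopologicalSorter expects each node mapped to its predecessors,
    # so every child table points back at its parent tables.
    deps = {}
    for parent, child in fk_edges:
        deps.setdefault(child, set()).add(parent)
        deps.setdefault(parent, set())
    order = TopologicalSorter(deps).static_order()
    # Re-express the order as (parent, child) pairs; None marks a
    # root table that has no parent.
    pairs = []
    for table in order:
        parents = deps[table]
        if parents:
            pairs.extend((p, table) for p in sorted(parents))
        else:
            pairs.append((None, table))
    return pairs

print(relation_order_from_fks([("users", "orders")]))
# [(None, 'users'), ('users', 'orders')]
```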
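Domain files in TabDDPM-style pipelines commonly map each non-id column to a type and, for discrete columns, a category count; whether ClavaDDPM uses exactly this layout is not guaranteed, so treat the domain files shipped with `complex_data/california` as authoritative. A sketch:

```json
{
  "age":        { "type": "continuous", "size": 1 },
  "occupation": { "type": "discrete",   "size": 21 }
}
```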
Datasets (CCS, Instacart 05, Berka): https://drive.google.com/file/d/1H_IGsL7JuCJYlq_6TOmdIZ4-q-wFafCL/view?usp=sharing
Some trained model checkpoints are provided for the convenience of development: https://drive.google.com/file/d/1dmSdi-5Wc3uad5QPrIu2ihysAIOIUUS0/view?usp=sharing. The archive includes model checkpoints, cluster checkpoints, synthetic data, and training logs.