Official implementation of "ClavaDDPM: Multi-relational Data Synthesis with Cluster-guided Diffusion Models" (https://arxiv.org/pdf/2405.17724).
To run a train-sampling pipeline:

```bash
python complex_pipeline.py --config_path configs/movie_lens.json
```
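The config is a plain JSON file. The keys in the sketch below are illustrative placeholders only, not the repo's actual schema; see `configs/movie_lens.json` for the real field names and values.

```json
{
  "general": {
    "data_dir": "complex_data/movie_lens",
    "exp_name": "movie_lens_demo"
  },
  "diffusion": {
    "iterations": 100000,
    "batch_size": 4096
  }
}
```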
To use customized datasets, see `complex_data/california` as an example:

- Save all tables as `.csv` files. All id columns should be named `<column_name>_id`.
- Create a `dataset_meta.json`, in which `tables` should be manually created to specify all foreign key relationships in a multi-table dataset (a hedged sketch follows this list).
- Create a `relation_order` in `dataset_meta.json`, which specifies the topological order of the multi-table dataset. The function `topological_sort` in `preprocess_utils.py` helps create it (an independent sketch is also given below).
- Create a domain file for each table, id columns excluded (an example is given below).
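As rough orientation, a `dataset_meta.json` for a hypothetical two-table dataset (`users` as the parent of `orders` via `user_id`) might look like the sketch below. The exact field layout here is an assumption, not the repo's verified schema; copy the structure from `complex_data/california/dataset_meta.json` rather than from this sketch.

```json
{
  "tables": {
    "users":  { "parents": [],        "children": ["orders"] },
    "orders": { "parents": ["users"], "children": [] }
  },
  "relation_order": [
    [null, "users"],
    ["users", "orders"]
  ]
}
```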
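The `relation_order` must list parent tables before their children. The repo's `topological_sort` in `preprocess_utils.py` handles this; purely as an independent sketch of the same idea (the function name and pair format below are illustrative, mirroring the `dataset_meta.json` sketch above), the standard library's `graphlib` suffices:

```python
from graphlib import TopologicalSorter

def relation_order_from_fks(fk_edges):
    """Derive a parent-before-child table order from (parent, child)
    foreign-key pairs, e.g. [("users", "orders")]."""
    # TopologicalSorter expects each node mapped to its predecessors,
    # so every child table points back at its parent tables.
    deps = {}
    for parent, child in fk_edges:
        deps.setdefault(child, set()).add(parent)
        deps.setdefault(parent, set())
    order = TopologicalSorter(deps).static_order()
    # Re-express the order as (parent, child) pairs; None marks a
    # root table that has no parent.
    pairs = []
    for table in order:
        parents = deps[table]
        if parents:
            pairs.extend((p, table) for p in sorted(parents))
        else:
            pairs.append((None, table))
    return pairs

print(relation_order_from_fks([("users", "orders")]))
# [(None, 'users'), ('users', 'orders')]
```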
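Domain files in TabDDPM-style pipelines commonly map each non-id column to a type and, for discrete columns, a category count; whether ClavaDDPM uses exactly this layout is not guaranteed, so treat the domain files shipped with `complex_data/california` as authoritative. A sketch:

```json
{
  "age":        { "type": "continuous", "size": 1 },
  "occupation": { "type": "discrete",   "size": 21 }
}
```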
Datasets (CCS, Instacart 05, Berka): https://drive.google.com/file/d/1H_IGsL7JuCJYlq_6TOmdIZ4-q-wFafCL/view?usp=sharing
Some trained model checkpoints are provided for the convenience of development: https://drive.google.com/file/d/1dmSdi-5Wc3uad5QPrIu2ihysAIOIUUS0/view?usp=sharing. The archive includes model checkpoints, cluster checkpoints, synthetic data, and training logs.