Code for the paper On Memorization in Diffusion Models.
- We run all our experiments on A100 GPUs.
- Our environment uses Python 3.8, PyTorch 1.13, and CUDA 11.8.
- Run the following command to install the required Python libraries:
pip install -r requirements.txt
We run our experiments on the CIFAR-10 and ImageNet datasets.
CIFAR-10 can be downloaded and saved to datasets/cifar10 with the following commands:
mkdir datasets
mkdir datasets/cifar10
wget -P datasets/cifar10 https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
Prepare the full CIFAR-10 training dataset with:
python dataset_tool.py --source=datasets/cifar10/cifar-10-python.tar.gz --dest=datasets/cifar10/cifar10-train.zip
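To sanity-check the prepared dataset, the minimal Python sketch below counts the images inside the zip produced by dataset_tool.py. It assumes the archive stores images as PNG files alongside an optional dataset.json with [filename, label] pairs, which is the format used by the EDM codebase; adjust the path if you saved the dataset elsewhere.

```python
import json
import zipfile

# Path to the dataset prepared by dataset_tool.py (see command above).
zip_path = "datasets/cifar10/cifar10-train.zip"

with zipfile.ZipFile(zip_path) as zf:
    names = zf.namelist()
    # Count image files stored in the archive (assumed to be PNGs).
    num_images = sum(name.endswith(".png") for name in names)
    print(f"{num_images} images found in {zip_path}")  # expect 50000 for the CIFAR-10 training set

    # dataset.json (if present) stores labels as [filename, class_index] pairs.
    if "dataset.json" in names:
        labels = json.loads(zf.read("dataset.json"))["labels"]
        print(f"{len(labels)} labels found")
```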
To download ImageNet, please refer to the ImageNet Object Localization Challenge and save it to datasets/imagenet.
First, we compare images generated by the theoretical optimum with those generated by a state-of-the-art diffusion model (EDM). These experiments are run on a single A100 GPU.
We include the implementation of the theoretical optimum in training/optim.py. Use the following command to generate images with the theoretical optimum:
torchrun --standalone --nproc_per_node=1 generate_optim.py --outdir=fid-tmp-optim --seeds=0-49999 --subdirs --network=datasets/cifar10/cifar10-train.zip
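For intuition, under the variance-exploding noising used by EDM (x = y + sigma * n), the denoising objective over a finite training set has a closed-form optimum: the optimal denoiser returns a softmax-weighted average of the training images, weighted by how close the noisy input is to each image at the current noise level. The sketch below illustrates this closed form only; it is a simplified reference and not necessarily identical to the implementation in training/optim.py.

```python
import torch

def optimal_denoiser(x_noisy: torch.Tensor, sigma: float, train_data: torch.Tensor) -> torch.Tensor:
    """Closed-form optimal denoiser for an empirical data distribution.

    x_noisy:    noisy images [B, C, H, W], assumed to be x_0 + sigma * noise.
    sigma:      current noise level.
    train_data: all training images [N, C, H, W].
    Returns the posterior mean E[x_0 | x_noisy], a softmax-weighted average
    of the training images.
    """
    B = x_noisy.shape[0]
    x_flat = x_noisy.reshape(B, -1)                        # [B, D]
    y_flat = train_data.reshape(train_data.shape[0], -1)   # [N, D]

    # Squared distances between each noisy input and every training image.
    sq_dist = torch.cdist(x_flat, y_flat) ** 2             # [B, N]

    # Posterior weights: softmax of -||x - y_i||^2 / (2 sigma^2) over the training set.
    weights = torch.softmax(-sq_dist / (2 * sigma ** 2), dim=1)  # [B, N]

    # The optimal denoised output is the weighted average of training images.
    return (weights @ y_flat).reshape_as(x_noisy)
```

Because every output of this optimum is an average of training images, its samples collapse onto the training set as sigma goes to zero, which is why it serves as the memorization reference point.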
Use the following command to generate images with EDM:
torchrun --standalone --nproc_per_node=1 generate.py --outdir=fid-tmp-edm --seeds=0-49999 --subdirs --network=https://nvlabs-fi-cdn.nvidia.com/edm/pretrained/edm-cifar10-32x32-uncond-vp.pkl
The basic procedure to evaluate the contribution of a factor to memorization in diffusion models is as follows:
Step I: Sample a training dataset of a given size with dataset_utils, which will be introduced later. The sampled dataset will be saved to $data_path.
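As a rough illustration of Step I, the sketch below draws a reproducible random subset of training indices. The function name `subsample_indices` and its arguments are illustrative only; the repository's dataset_utils may expose a different interface for building subsets of cifar10-train.zip.

```python
import numpy as np

def subsample_indices(total_size: int, subset_size: int, seed: int = 0) -> np.ndarray:
    """Pick a reproducible random subset of training indices (illustrative only)."""
    rng = np.random.default_rng(seed)
    return rng.choice(total_size, size=subset_size, replace=False)

# Example: sample a 2k-image subset from the 50k CIFAR-10 training images.
indices = subsample_indices(total_size=50_000, subset_size=2_000, seed=0)
print(indices[:10])
```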
Step II: Train a diffusion model on the training data.
All of our model-training experiments are run on 8 A100 GPUs using DDP. For multi-node training, the basic command is:
torchrun --nproc_per_node 1 \
--nnodes $WORLD_SIZE \
--node_rank $RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT \
train.py --outdir=$savedir --argument=$argument
Alternatively, you can use the following command for single-node DDP training:
torchrun --standalone --nproc_per_node=8 train.py --outdir=$savedir --argument=$argument
We suggest providing a unique $savedir for each experiment. $argument includes all hyper-parameters.
Step III: Evaluate the snapshots of the trained diffusion model and report the highest memorization ratio:
torchrun --standalone --nproc_per_node=$num_gpu mem_ratio.py --expdir=$outdir --knn-ref=$data_path --log=$outdir/mem_traj.log --seeds=0-9999 --subdirs --batch=512
$outdir refers to the folder containing all model snapshots.
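Conceptually, mem_ratio.py measures how close the generated images are to the training set. A common criterion in the memorization literature regards a generated image as memorized when its l2 distance to the nearest training image is much smaller than its distance to the second-nearest one. The sketch below implements that nearest-neighbor ratio test with an illustrative 1/3 threshold; it is meant to convey the idea, not to reproduce mem_ratio.py exactly.

```python
import torch

def memorization_ratio(generated: torch.Tensor, train_data: torch.Tensor,
                       threshold: float = 1 / 3) -> float:
    """Fraction of generated images counted as memorized under a nearest-neighbor ratio test.

    generated:  [G, C, H, W] generated images.
    train_data: [N, C, H, W] training images.
    threshold:  illustrative ratio; an image counts as memorized when
                dist(nearest training image) < threshold * dist(second nearest).
    """
    g = generated.reshape(generated.shape[0], -1)
    t = train_data.reshape(train_data.shape[0], -1)

    # Pairwise l2 distances between generated and training images.
    # For large sets, compute this in chunks to limit memory usage.
    dists = torch.cdist(g, t)                                     # [G, N]
    nearest_two = dists.topk(k=2, dim=1, largest=False).values    # [G, 2], ascending

    memorized = nearest_two[:, 0] < threshold * nearest_two[:, 1]
    return memorized.float().mean().item()
```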
Step IV: Gradually increase the training dataset size.
Step V: Modify the value of the evaluated factor, and then repeat Steps I to IV to observe the effect of this factor on memorization.
We provide all the scripts to reproduce our experimental results in the paper.
- Data distribution $P$: refer to scripts/data_distribution.md. Here we highlight that the data dimension contributes significantly to memorization in diffusion models.
- Model configuration $\mathcal{M}$: refer to scripts/model_config.md. Here we highlight that skip connections on higher resolutions play an important role in memorization.
- Training procedure $\mathcal{T}$: refer to scripts/train_procedure.md.
- Unconditional vs. conditional generation: refer to scripts/conditional.md. Here we highlight that random labels as conditions can trigger memorization in diffusion models. Finally, we highlight that conditional EDM with unique labels as conditions can largely memorize the training data (see the sketch below).
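For the unique-label setting mentioned above, every training image receives its own distinct class index. The sketch below shows one hedged way such labels could be written out in the [filename, class_index] format that the EDM-style dataset.json uses; the actual scripts referenced in scripts/conditional.md may construct and package the labels differently.

```python
import json
import zipfile

src = "datasets/cifar10/cifar10-train.zip"

# Read the image filenames stored in the EDM-style dataset archive.
with zipfile.ZipFile(src) as zf:
    fnames = sorted(n for n in zf.namelist() if n.endswith(".png"))

# Assign every image its own class index (unique labels) instead of its true class.
unique_labels = [[fname, idx] for idx, fname in enumerate(fnames)]

# dataset.json in the EDM format stores labels as [filename, class_index] pairs.
with open("dataset_unique_labels.json", "w") as f:
    json.dump({"labels": unique_labels}, f)
```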
If you find the code useful for your research, please consider citing our paper.
@article{gu2023memorization,
title={On Memorization in Diffusion Models},
author={Xiangming Gu and Chao Du and Tianyu Pang and Chongxuan Li and Min Lin and Ye Wang},
journal={arXiv preprint arXiv:2310.02664},
year={2023}
}
Our code is modified from the official implementation of EDM.