Implementation of Imagen, Google's Text-to-Image Neural Network that beats DALL-E2, in Pytorch. It is the new SOTA for text-to-image synthesis.
Architecturally, it is actually much simpler than DALL-E2. It consists of a cascading DDPM conditioned on text embeddings from a large pretrained T5 model (attention network). It also contains dynamic clipping for improved classifier free guidance, noise level conditioning, and a memory efficient unet design.
It appears neither CLIP nor prior network is needed after all. And so research continues.
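As a rough illustration of the dynamic clipping (dynamic thresholding) mentioned above, the sketch below works on one image's predicted pixel values stored as a flat Python list, so it runs without a GPU or any dependencies. The function name `dynamic_threshold`, the default `percentile` of 0.95, and the nearest-rank percentile are assumptions made for this example; they are not taken from this repository, which operates on batched tensors.

```python
def dynamic_threshold(pixels, percentile=0.95):
    """Clip one image's predicted pixel values to [-s, s], then rescale by s.

    s is the chosen percentile of the absolute pixel values, floored at 1.0 so
    that predictions already inside [-1, 1] are left unchanged.
    """
    magnitudes = sorted(abs(p) for p in pixels)
    # nearest-rank percentile (an assumption; a tensor library would interpolate)
    index = min(len(magnitudes) - 1, int(percentile * len(magnitudes)))
    s = max(magnitudes[index], 1.0)
    # clamp to [-s, s], then divide by s so the result lies in [-1, 1]
    return [max(-s, min(s, p)) / s for p in pixels]
```

With a large guidance scale the predicted image tends to drift outside [-1, 1]; rescaling by a per-image percentile pulls it back into range instead of hard-clipping every saturated pixel to exactly -1 or 1.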
Explainer videos: AI Coffee Break with Letitia | Assembly AI
Please join if you are interested in helping out with the replication with the LAION community
- StabilityAI for the generous sponsorship, as well as my other sponsors out there
- 🤗 Huggingface for their amazing transformers library. The text encoder portion is pretty much taken care of because of them
- Sylvain and Zachary for the Accelerate library, which this repository uses for distributed training
- Jorge Gomes for helping out with the T5 loading code and advice on the correct T5 version
- Katherine Crowson, for her beautiful code, which helped me understand the continuous time version of gaussian diffusion
- Marunine and Netruk44, for reviewing code, sharing experimental results, and helping with debugging
- Marunine for providing a potential solution for a color shifting issue in the memory efficient u-nets. Thanks to Jacob for sharing experimental comparisons between the base and memory-efficient unets
- Marunine for finding numerous bugs, resolving an issue with resize right, and for sharing his experimental configurations and results
- MalumaDev for proposing the use of pixel shuffle upsampler to fix checkerboard artifacts
- Valentin for pointing out insufficient skip connections in the unet, as well as the specific method of attention conditioning in the base-unet in the appendix
- BIGJUN for catching a big bug with continuous time gaussian diffusion noise level conditioning at inference time
- You? It isn't done yet, chip in if you are a researcher or skilled ML engineer
$ pip install imagen-pytorch
import torch
from imagen_pytorch import Unet, Imagen

# unet for imagen

unet1 = Unet(
    dim = 32,
    cond_dim = 512,
    dim_mults = (1, 2, 4, 8),
    num_resnet_blocks = 3,
    layer_attns = (False, True, True, True),
    layer_cross_attns = (False, True, True, True)
)

unet2 = Unet(
    dim = 32,
    cond_dim = 512,
    dim_mults = (1, 2, 4, 8),
    num_resnet_blocks = (2, 4, 8, 8),
    layer_attns = (False, False, False, True),
    layer_cross_attns = (False, False, False, True)
)

# imagen, which contains the unets above (base unet and super resoluting ones)

imagen = Imagen(
    unets = (unet1, unet2),
    image_sizes = (64, 256),
    timesteps = 1000,
    cond_drop_prob = 0.1
).cuda()

# mock images (get a lot of this) and text encodings from large T5

text_embeds = torch.randn(4, 256, 768).cuda()
images = torch.randn(4, 3, 256, 256).cuda()

# feed images into imagen, training each unet in the cascade

for i in (1, 2):
    loss = imagen(images, text_embeds = text_embeds, unet_number = i)
    loss.backward()

# do the above for many many many many steps
# now you can sample an image based on the text embeddings from the cascading ddpm

images = imagen.sample(texts = [
    'a whale breaching from afar',
    'young girl blowing out candles on her birthday cake',
    'fireworks with blue and green sparkles'
], cond_scale = 3.)

images.shape # (3, 3, 256, 256)
For simpler training, you can directly supply text strings instead of precomputing text encodings. (Although for scaling purposes, you will definitely want to precompute the textual embeddings + mask)
The number of textual captions must match the batch size of the images if you go this route.
# mock images and text (get a lot of this)

texts = [
    'a child screaming at finding a worm within a half-eaten apple',
    'lizard running across the desert on two feet',
    'waking up to a psychedelic landscape',
    'seashells sparkling in the shallow waters'
]

images = torch.randn(4, 3, 256, 256).cuda()

# feed images into imagen, training each unet in the cascade

for i in (1, 2):
    loss = imagen(images, texts = texts, unet_number = i)
    loss.backward()
With the ImagenTrainer wrapper class, the exponential moving averages for all of the U-nets in the cascading DDPM will be automatically taken care of when calling update
import torch
from imagen_pytorch import Unet, Imagen, ImagenTrainer

# unet for imagen

unet1 = Unet(
    dim = 32,
    cond_dim = 512,
    dim_mults = (1, 2, 4, 8),
    num_resnet_blocks = 3,
    layer_attns = (False, True, True, True),
)

unet2 = Unet(
    dim = 32,
    cond_dim = 512,
    dim_mults = (1, 2, 4, 8),
    num_resnet_blocks = (2, 4, 8, 8),
    layer_attns = (False, False, False, True),
    layer_cross_attns = (False, False, False, True)
)

# imagen, which contains the unets above (base unet and super resoluting ones)

imagen = Imagen(
    unets = (unet1, unet2),
    text_encoder_name = 't5-large',
    image_sizes = (64, 256),
    timesteps = 1000,
    cond_drop_prob = 0.1
).cuda()

# wrap imagen with the trainer class

trainer = ImagenTrainer(imagen)

# mock images (get a lot of this) and text encodings from large T5

text_embeds = torch.randn(64, 256, 1024).cuda()
images = torch.randn(64, 3, 256, 256).cuda()

# feed images into imagen, training each unet in the cascade

loss = trainer(
    images,
    text_embeds = text_embeds,
    unet_number = 1, # training on unet number 1 in this example, but you will have to also save checkpoints and then reload and continue training on unet number 2
    max_batch_size = 4 # auto divide the batch of 64 up into batch size of 4 and accumulate gradients, so it all fits in memory
)

trainer.update(unet_number = 1)

# do the above for many many many many steps
# now you can sample an image based on the text embeddings from the cascading ddpm

images = trainer.sample(texts = [
    'a puppy looking anxiously at a giant donut on the table',
    'the milky way galaxy in the style of monet'
], cond_scale = 3.)

images.shape # (2, 3, 256, 256)
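For readers unfamiliar with exponential moving averages, here is a minimal sketch of the kind of update that is applied to a copy of the weights after each optimizer step. It uses plain Python lists in place of parameters; the name `ema_update` and the `beta` values are hypothetical, and the trainer's real EMA logic may differ (for example in warmup or update frequency).

```python
def ema_update(ema_params, online_params, beta=0.999):
    """Return the new exponential moving average of a list of parameter values."""
    return [beta * ema + (1.0 - beta) * online
            for ema, online in zip(ema_params, online_params)]

# toy "training": the online weights sit at 1.0 and the EMA copy follows slowly
online = [1.0, 1.0]
ema = [0.0, 0.0]
for _ in range(3):
    ema = ema_update(ema, online, beta=0.5)

print(ema)  # [0.875, 0.875]
```

The averaged copy changes more smoothly than the weights being optimized, which is why diffusion models are usually sampled from the EMA weights.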
You can also train Imagen without text (unconditional image generation) as follows
import torch
from imagen_pytorch import Unet, Imagen, SRUnet256, ImagenTrainer

# unets for unconditional imagen

unet1 = Unet(
    dim = 32,
    dim_mults = (1, 2, 4),
    num_resnet_blocks = 3,
    layer_attns = (False, True, True),
    layer_cross_attns = False,
    use_linear_attn = True
)

unet2 = SRUnet256(
    dim = 32,
    dim_mults = (1, 2, 4),
    num_resnet_blocks = (2, 4, 8),
    layer_attns = (False, False, True),
    layer_cross_attns = False
)

# imagen, which contains the unets above (base unet and super resoluting ones)

imagen = Imagen(
    condition_on_text = False, # this must be set to False for unconditional Imagen
    unets = (unet1, unet2),
    image_sizes = (64, 128),
    timesteps = 1000
)

trainer = ImagenTrainer(imagen).cuda()

# now get a ton of images and feed it through the Imagen trainer

training_images = torch.randn(4, 3, 256, 256).cuda()

# train each unet separately
# in this example, only training on unet number 1

loss = trainer(training_images, unet_number = 1)
trainer.update(unet_number = 1)

# do the above for many many many many steps
# now you can sample images unconditionally from the cascading unet(s)

images = trainer.sample(batch_size = 16) # (16, 3, 128, 128)
At any time you can save and load the trainer and all associated states with the save and load methods. It is recommended that you use these methods instead of manually saving with a state_dict call, as there is some device memory management being done under the hood within the trainer.
ex.
trainer.save('./path/to/checkpoint.pt')
trainer.load('./path/to/checkpoint.pt')
trainer.steps # (2,) step number for each of the unets, in this case 2
You can also rely on the ImagenTrainer to automatically train off DataLoader instances. You simply have to craft your DataLoader to return either images (for the unconditional case), or a tuple of ('images', 'text_embeds') for text-guided generation.
ex. unconditional training
from imagen_pytorch import Unet, Imagen, ImagenTrainer
from imagen_pytorch.data import get_images_dataloader

# unets for unconditional imagen

unet = Unet(
    dim = 32,
    dim_mults = (1, 2, 4, 8),
    num_resnet_blocks = 1,
    layer_attns = (False, False, False, True),
    layer_cross_attns = False
)

# imagen, which contains the unet above

imagen = Imagen(
    condition_on_text = False, # this must be set to False for unconditional Imagen
    unets = unet,
    image_sizes = 128,
    timesteps = 1000
)

trainer = ImagenTrainer(imagen).cuda()

# instantiate your dataloader, which returns the necessary inputs to the DDPM as tuple in the order of images, text embeddings, then text masks. in this case, only images is returned as it is unconditional training

train_dl = get_images_dataloader('/path/to/training/images', batch_size = 16, image_size = 128)
trainer.add_train_dataloader(train_dl)

# working training loop

for i in range(200000):
    loss = trainer.train_step(unet_number = 1, max_batch_size = 4)
    print(f'loss: {loss}')

    if not (i % 100) and trainer.is_main: # is_main makes sure this can run in distributed
        images = trainer.sample(batch_size = 1, return_pil_images = True) # returns List[Image]
        images[0].save(f'./sample-{i // 100}.png')
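The max_batch_size argument used above splits a large batch into smaller chunks and accumulates gradients across them. The sketch below shows one way such splitting can work, using plain Python lists. The helper names `split_batch`, `accumulated_loss`, and `mean_loss` are hypothetical, and weighting each chunk's loss by its share of the batch is an assumption about how the accumulation stays equivalent to a single large batch, not a description of the trainer's internals.

```python
def split_batch(batch, max_batch_size):
    """Yield (fraction_of_batch, chunk) pairs with at most max_batch_size items each."""
    total = len(batch)
    for start in range(0, total, max_batch_size):
        chunk = batch[start:start + max_batch_size]
        yield len(chunk) / total, chunk

def accumulated_loss(batch, max_batch_size, loss_fn):
    """Sum the chunk losses, each weighted by the chunk's share of the full batch."""
    return sum(frac * loss_fn(chunk) for frac, chunk in split_batch(batch, max_batch_size))

def mean_loss(values):
    """Toy loss: the mean of the values in a chunk."""
    return sum(values) / len(values)

batch = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
# matches mean_loss(batch) up to floating point rounding
print(accumulated_loss(batch, 4, mean_loss))
```

Because the last chunk can be smaller than the others, weighting by chunk fraction (rather than averaging the chunk losses equally) is what keeps the accumulated value equal to the full-batch mean.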
Thanks to 🤗 Accelerate, you can do multi GPU training easily with two steps.
First you need to invoke accelerate config in the same directory as your training script (say it is named train.py)
$ accelerate config
Next, instead of calling python train.py as you would for single GPU, you would use the accelerate CLI as so
$ accelerate launch train.py
That's it!
Tero Karras of StyleGAN fame has written a new paper with results that have been corroborated by a number of independent researchers as well as on my own machine. I have decided to create a version of Imagen, the ElucidatedImagen, so that one can use the new elucidated DDPM for text-guided cascading generation.
Simply import ElucidatedImagen, and then instantiate the instance as you did before. The hyperparameters are different than the usual ones for discrete and continuous time gaussian diffusion, and can be individualized for each unet in the cascade.
Ex.
from imagen_pytorch import ElucidatedImagen

# instantiate your unets ...

imagen = ElucidatedImagen(
    unets = (unet1, unet2),
    image_sizes = (64, 128),
    cond_drop_prob = 0.1,
    num_sample_steps = (64, 32), # number of sample steps - 64 for base unet, 32 for upsampler (just an example, have no clue what the optimal values are)
    sigma_min = 0.002,           # min noise level
    sigma_max = (80, 160),       # max noise level, @crowsonkb recommends double the max noise level for upsampler
    sigma_data = 0.5,            # standard deviation of data distribution
    rho = 7,                     # controls the sampling schedule
    P_mean = -1.2,               # mean of log-normal distribution from which noise is drawn for training
    P_std = 1.2,                 # standard deviation of log-normal distribution from which noise is drawn for training
    S_churn = 80,                # parameters for stochastic sampling - depends on dataset, Table 5 in paper
    S_tmin = 0.05,
    S_tmax = 50,
    S_noise = 1.003,
).cuda()

# rest is the same as above
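To give a feel for what sigma_min, sigma_max, rho, P_mean and P_std control, the following dependency-free sketch computes the sampling noise levels and draws a training noise level in the way described in the Karras et al. paper. The function names `karras_sigmas` and `training_sigma` are hypothetical and the code is not taken from this repository; the paper's sampler also appends a final noise level of 0, which is omitted here.

```python
import math
import random

def karras_sigmas(num_sample_steps, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Noise levels from sigma_max down to sigma_min, spaced according to rho."""
    inv_rho = 1.0 / rho
    hi, lo = sigma_max ** inv_rho, sigma_min ** inv_rho
    steps = max(num_sample_steps - 1, 1)  # avoid dividing by zero for a single step
    # interpolate linearly in sigma^(1/rho) space, then map back with the power rho
    return [(hi + (i / steps) * (lo - hi)) ** rho for i in range(num_sample_steps)]

def training_sigma(P_mean=-1.2, P_std=1.2, rng=random):
    """Draw one training noise level: ln(sigma) ~ Normal(P_mean, P_std^2)."""
    return math.exp(rng.gauss(P_mean, P_std))

sigmas = karras_sigmas(64)           # base unet in the example above
sigmas_sr = karras_sigmas(32, sigma_max=160.0)  # upsampler with doubled sigma_max
```

A larger rho concentrates more of the steps at low noise levels, while rho = 1 would space the noise levels uniformly between sigma_max and sigma_min.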
- Why are my generated images not aligning well with the text?
Imagen uses an algorithm called Classifier Free Guidance. When sampling, you apply a scale to the conditioning (text in this case) of greater than 1.0. Researcher Netruk44 has reported 5-10 to be optimal, but anything greater than 10 to break.
trainer.sample(texts = [
    'a cloud in the shape of a roman gladiator'
], cond_scale = 5.) # <-- cond_scale is the conditioning scale, needs to be greater than 1.0 to be better than average
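The effect of cond_scale can be sketched as follows. At each sampling step the model is evaluated once with the text and once with the text dropped, and the two predictions are combined by extrapolating away from the unconditional one. The function name `guided_prediction` and the toy values are assumptions for illustration; plain lists stand in for the network's output tensors.

```python
def guided_prediction(cond_pred, uncond_pred, cond_scale=1.0):
    """Extrapolate from the unconditional prediction toward the text-conditioned one."""
    return [u + (c - u) * cond_scale for c, u in zip(cond_pred, uncond_pred)]

cond = [1.0, 0.5]    # model output given the text
uncond = [0.0, 0.5]  # model output with the text dropped

print(guided_prediction(cond, uncond, cond_scale=1.0))  # [1.0, 0.5]
print(guided_prediction(cond, uncond, cond_scale=3.0))  # [3.0, 0.5]
```

A scale of 1.0 simply returns the conditioned prediction, and larger values exaggerate whatever the text changed. Training with cond_drop_prob greater than 0, as in the examples above, is what lets a single network produce the unconditional prediction as well.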
- Are there any pretrained models yet?
Not at the moment but one will likely be trained and open sourced within the year, if not sooner. If you would like to participate, you can join the community of artificial neural network trainers at Laion (discord link is in the Readme above) and start collaborating.
- use huggingface transformers for T5-small text embeddings
- add dynamic thresholding
- add dynamic thresholding to the DALLE2 and video-diffusion repositories as well
- allow for one to set T5-large (and perhaps small factory method to take in any huggingface transformer)
- add the lowres noise level with the pseudocode in appendix, and figure out what is this sweep they do at inference time
- port over some training code from DALLE2
- need to be able to use a different noise schedule per unet (cosine was used for base, but linear for SR)
- just make one master-configurable unet
- complete resnet block (biggan inspired? but with groupnorm) - complete self attention
- complete conditioning embedding block (and make it completely configurable, whether it be attention, film etc)
- consider using perceiver-resampler from https://github.com/lucidrains/flamingo-pytorch in place of attention pooling
- add attention pooling option, in addition to cross attention and film
- add optional cosine decay schedule with warmup, for each unet, to trainer
- switch to continuous timesteps instead of discretized, as it seems that is what they used for all stages - first figure out the linear noise schedule case from the variational ddpm paper https://openreview.net/forum?id=2LdBqxc1Yv
- figure out log(snr) for alpha cosine noise schedule.
- suppress the transformers warning because only T5encoder is used
- allow setting for using linear attention on layers where full attention cannot be used
- force unets in continuous time case to use non-fouriered conditions (just pass the log(snr) through an MLP with optional layernorms), as that is what i have working locally
- removed learned variance
- add p2 loss weighting for continuous time
- make sure cascading ddpm can be trained without text condition, and make sure both continuous and discrete time gaussian diffusion works
- use primer's depthwise convs on the qkv projections in linear attention (or use token shifting before projections) - also use new dropout proposed by bayesformer, as it seems to work well with linear attention
- explore skip layer excitation in unet decoder
- accelerate integration
- knock out any issues that arose from accelerate
- preencoding of text to memmapped embeddings
- build out CLI tool for training, resuming training, and one-line generation of image
- extend to video generation, using axial time attention as in Ho's video ddpm paper + https://github.com/lucidrains/flexible-diffusion-modeling-videos-pytorch for up to 25 minute video
- add inpainting ability using resampler from repaint paper https://arxiv.org/abs/2201.09865
- consider unet with attention mediating skip connections https://arxiv.org/abs/2109.04335
- if memory efficient unet is defective, consider https://arxiv.org/abs/1906.06148
- be able to create dataloader iterators based on the old epoch style, also configure shuffling etc
- be able to also pass in arguments (instead of requiring forward to be all keyword args on model)
@inproceedings{Saharia2022PhotorealisticTD,
title = {Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding},
author = {Chitwan Saharia and William Chan and Saurabh Saxena and Lala Li and Jay Whang and Emily L. Denton and Seyed Kamyar Seyed Ghasemipour and Burcu Karagol Ayan and Seyedeh Sara Mahdavi and Raphael Gontijo Lopes and Tim Salimans and Jonathan Ho and David Fleet and Mohammad Norouzi},
year = {2022}
}
@article{Alayrac2022Flamingo,
title = {Flamingo: a Visual Language Model for Few-Shot Learning},
author = {Jean-Baptiste Alayrac et al},
year = {2022}
}
@article{Choi2022PerceptionPT,
title = {Perception Prioritized Training of Diffusion Models},
author = {Jooyoung Choi and Jungbeom Lee and Chaehun Shin and Sungwon Kim and Hyunwoo J. Kim and Sung-Hoon Yoon},
journal = {ArXiv},
year = {2022},
volume = {abs/2204.00227}
}
@inproceedings{Sankararaman2022BayesFormerTW,
title = {BayesFormer: Transformer with Uncertainty Estimation},
author = {Karthik Abinav Sankararaman and Sinong Wang and Han Fang},
year = {2022}
}
@article{So2021PrimerSF,
title = {Primer: Searching for Efficient Transformers for Language Modeling},
author = {David R. So and Wojciech Ma{\'n}ke and Hanxiao Liu and Zihang Dai and Noam M. Shazeer and Quoc V. Le},
journal = {ArXiv},
year = {2021},
volume = {abs/2109.08668}
}
@misc{cao2020global,
title = {Global Context Networks},
author = {Yue Cao and Jiarui Xu and Stephen Lin and Fangyun Wei and Han Hu},
year = {2020},
eprint = {2012.13375},
archivePrefix = {arXiv},
primaryClass = {cs.CV}
}
@article{Karras2022ElucidatingTD,
title = {Elucidating the Design Space of Diffusion-Based Generative Models},
author = {Tero Karras and Miika Aittala and Timo Aila and Samuli Laine},
journal = {ArXiv},
year = {2022},
volume = {abs/2206.00364}
}
@inproceedings{NEURIPS2020_4c5bcfec,
author = {Ho, Jonathan and Jain, Ajay and Abbeel, Pieter},
booktitle = {Advances in Neural Information Processing Systems},
editor = {H. Larochelle and M. Ranzato and R. Hadsell and M.F. Balcan and H. Lin},
pages = {6840--6851},
publisher = {Curran Associates, Inc.},
title = {Denoising Diffusion Probabilistic Models},
url = {https://proceedings.neurips.cc/paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf},
volume = {33},
year = {2020}
}