This paper proposes invertible Consistency Distillation (iCD), which enables
- highly efficient and accurate text-guided image editing
- diverse and high-quality image generation
- Installation
- Easy-to-run examples (iCD-SD1.5)
- Easy-to-run examples (iCD-SDXL)
- In-depth generation and editing (iCD-SDXL and iCD-SD1.5)
- iCD training example (iCD-SDXL and iCD-SD1.5)
- Citation
# Clone the repo and enter it
git clone https://github.com/yandex-research/invertible-cd
cd invertible-cd
# Create an environment and install packages
conda create -n icd python=3.10 -y
conda activate icd
pip3 install -r requirements/req.txt
We provide the following checkpoints:
- Guidance distilled diffusion models
These models are saved as .pt files.
- Invertible Consistency Distillation (forward and reverse CD) on top of the guidance distilled models
Model, size | Steps | Time steps |
---|---|---|
iCD-SD1.5, 0.5GB | 4 | Reverse: [259, 519, 779, 999]; Forward: [19, 259, 519, 779] |
iCD-SD1.5, 0.5GB | 4 | Reverse: [249, 499, 699, 999]; Forward: [19, 249, 499, 699] |
iCD-SD1.5, 0.5GB | 3 | Reverse: [339, 699, 999]; Forward: [19, 339, 699] |
iCD-SDXL, 1.4GB | 4 | Reverse: [259, 519, 779, 999]; Forward: [19, 259, 519, 779] |
iCD-SDXL, 1.4GB | 4 | Reverse: [249, 499, 699, 999]; Forward: [19, 249, 499, 699] |
iCD-SDXL, 1.4GB | 3 | Reverse: [339, 699, 999]; Forward: [19, 339, 699] |
These models are saved as .safetensors files.
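In every row of the table, the forward schedule starts at time step 19 and then reuses all reverse boundaries except the last one. The sketch below expresses that relationship and builds the matching file names. Both `forward_from_reverse` and `checkpoint_name` are hypothetical helpers, not part of the repository, and the naming pattern is an assumption inferred from the iCD-SD1.5 file names used in the loading example, so check it against the files you actually download.

```python
def forward_from_reverse(reverse_timesteps, start_timestep=19):
    """Derive the forward schedule: the start boundary, then every
    reverse boundary except the last one."""
    return [start_timestep] + list(reverse_timesteps[:-1])


def checkpoint_name(prefix, direction, timesteps):
    """Build a name like 'iCD-SD15-forward_19_259_519_779.safetensors'.
    The pattern is an assumption inferred from the SD1.5 example."""
    joined = "_".join(str(t) for t in timesteps)
    return f"{prefix}-{direction}_{joined}.safetensors"


reverse = [259, 519, 779, 999]
forward = forward_from_reverse(reverse)
print(forward)
print(checkpoint_name("iCD-SD15", "forward", forward))
print(checkpoint_name("iCD-SD15", "reverse", reverse))
```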
Step 0. Download the models and put them in the checkpoints folder
This example uses iCD-SD1.5 with reverse time steps [259, 519, 779, 999] and forward time steps [19, 259, 519, 779].
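Before loading, it can help to confirm that the downloaded files are where the loading code expects them. This is a minimal sketch using only the standard library. `missing_checkpoints` is a hypothetical helper, and the file names are the ones passed to `load_models` in Step 1.

```python
from pathlib import Path


def missing_checkpoints(root, filenames):
    """Return the expected checkpoint files that are not present under root."""
    root = Path(root)
    return [name for name in filenames if not (root / name).is_file()]


expected = [
    "sd15_cfg_distill.pt",
    "iCD-SD15-forward_19_259_519_779.safetensors",
    "iCD-SD15-reverse_259_519_779_999.safetensors",
]
missing = missing_checkpoints("checkpoints", expected)
if missing:
    print("Missing checkpoints:", ", ".join(missing))
```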
Step 1. Load the models
from utils.loading import load_models
from diffusers import DDPMScheduler
root = 'checkpoints'
ldm_stable, reverse_cons_model, forward_cons_model = load_models(
model_id="runwayml/stable-diffusion-v1-5",
device='cuda',
forward_checkpoint=f'{root}/iCD-SD15-forward_19_259_519_779.safetensors',
reverse_checkpoint=f'{root}/iCD-SD15-reverse_259_519_779_999.safetensors',
r=64,
w_embed_dim=512,
teacher_checkpoint=f'{root}/sd15_cfg_distill.pt',
)
tokenizer = ldm_stable.tokenizer
noise_scheduler = DDPMScheduler.from_pretrained(
"runwayml/stable-diffusion-v1-5", subfolder="scheduler", )
Step 2. Specify the configuration according to the downloaded model
from utils import p2p, generation
NUM_REVERSE_CONS_STEPS = 4
REVERSE_TIMESTEPS = [259, 519, 779, 999]
NUM_FORWARD_CONS_STEPS = 4
FORWARD_TIMESTEPS = [19, 259, 519, 779]
NUM_DDIM_STEPS = 50
solver = generation.Generator(
model=ldm_stable,
noise_scheduler=noise_scheduler,
n_steps=NUM_DDIM_STEPS,
forward_cons_model=forward_cons_model,
forward_timesteps=FORWARD_TIMESTEPS,
reverse_cons_model=reverse_cons_model,
reverse_timesteps=REVERSE_TIMESTEPS,
num_endpoints=NUM_REVERSE_CONS_STEPS,
num_forward_endpoints=NUM_FORWARD_CONS_STEPS,
max_forward_timestep_index=49,
start_timestep=19)
p2p.NUM_DDIM_STEPS = NUM_DDIM_STEPS
p2p.tokenizer = tokenizer
p2p.device = 'cuda'
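When switching to another checkpoint pair from the table above, the step counts and time step lists have to be changed together. The following sketch catches a mismatch early. `check_config` is a hypothetical helper, and the upper bound of 999 assumes the usual 1000 training time steps of the Stable Diffusion scheduler.

```python
def check_config(num_steps, timesteps, max_timestep=999):
    """Raise ValueError if a step count and its time step list disagree."""
    if not timesteps:
        raise ValueError("timesteps must not be empty")
    if num_steps != len(timesteps):
        raise ValueError(f"expected {num_steps} timesteps, got {len(timesteps)}")
    if any(a >= b for a, b in zip(timesteps, timesteps[1:])):
        raise ValueError("timesteps must be strictly increasing")
    if timesteps[0] < 0 or timesteps[-1] > max_timestep:
        raise ValueError(f"timesteps must lie in [0, {max_timestep}]")


# Values for the 3-step checkpoints listed in the table.
check_config(3, [339, 699, 999])  # reverse schedule
check_config(3, [19, 339, 699])   # forward schedule
```

Calling it with `NUM_REVERSE_CONS_STEPS, REVERSE_TIMESTEPS` and `NUM_FORWARD_CONS_STEPS, FORWARD_TIMESTEPS` before constructing the solver would flag a count left over from a previous configuration.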
Step 3. Generate
import torch
prompt = ['a cute owl with a graduation cap']
controller = p2p.AttentionStore()
generator = torch.Generator().manual_seed(150)
tau = 1.0
image, _ = generation.runner(
# Playing params
guidance_scale=19.0,
tau1=tau, # Dynamic guidance if tau < 1.0
tau2=tau,
# Fixed params
is_cons_forward=True,
model=reverse_cons_model,
w_embed_dim=512,
solver=solver,
prompt=prompt,
controller=controller,
generator=generator,
latent=None,
return_type='image')
# Save and display the generated image.
generation.to_pil_images(image).save('test_generation_iCD-SD1.5.jpg')
generation.view_images(image)
Step 3 (image editing). Load and invert a real image
from utils import inversion
image_path = "assets/bird.jpg"
prompt = ["a photo of a bird standing on a branch"]
(image_gt, image_rec), ddim_latent, uncond_embeddings = inversion.invert(
# Playing params
image_path=image_path,
prompt=prompt,
# Fixed params
is_cons_inversion=True,
w_embed_dim=512,
inv_guidance_scale=0.0,
stop_step=50,
solver=solver,
seed=10500)
Step 4. Edit the image
p2p.NUM_DDIM_STEPS = 4
p2p.tokenizer = tokenizer
p2p.device = 'cuda'
prompts = ["a photo of a bird standing on a branch",
"a photo of a lego bird standing on a branch"
]
# Playing params
cross_replace_steps = {'default_': 0.2, }
self_replace_steps = 0.2
blend_word = ((('bird',), ('lego',)))
eq_params = {"words": ("lego",), "values": (3.,)}
controller = p2p.make_controller(prompts,
False, # (is_replacement) True if only one word is changed
cross_replace_steps,
self_replace_steps,
blend_word,
eq_params)
tau = 0.8
image, _ = generation.runner(
# Playing params
guidance_scale=19.0,
tau1=tau, # Dynamic guidance if tau < 1.0
tau2=tau,
# Fixed params
model=reverse_cons_model,
is_cons_forward=True,
w_embed_dim=512,
solver=solver,
prompt=prompts,
controller=controller,
num_inference_steps=50,
generator=None,
latent=ddim_latent,
uncond_embeddings=uncond_embeddings,
return_type='image')
# The left image is the inversion (reconstruction), the right one is the edit.
generation.to_pil_images(image).save('test_editing_iCD-SD1.5.jpg')
generation.view_images(image)
Note:
Zero-shot editing is highly sensitive to hyperparameters. We recommend tuning cross_replace_steps (from 0.0 to 1.0), self_replace_steps (from 0.0 to 1.0), tau (0.7 or 0.8 tends to work best), guidance_scale (up to 19), and the amplification factor (the values in eq_params).
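One way to organize this tuning is a small grid over the parameters named in the note. The sketch below only enumerates the combinations and does not call the repository code. The candidate values are arbitrary picks inside the recommended ranges, and `iter_configs` and the `amplify` key are hypothetical names introduced for illustration.

```python
from itertools import product

# Candidate values: the ranges follow the note above, the grid itself is arbitrary.
grid = {
    "cross_replace_steps": [0.2, 0.4, 0.6],
    "self_replace_steps": [0.2, 0.4, 0.6],
    "tau": [0.7, 0.8],
    "guidance_scale": [7.0, 13.0, 19.0],
    "amplify": [1.0, 3.0],
}


def iter_configs(grid):
    """Yield one dict per combination of the candidate values."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


configs = list(iter_configs(grid))
print(len(configs), "configurations, first:", configs[0])
```

Each dict maps onto the editing step: `cross_replace_steps` becomes the `'default_'` entry, `amplify` becomes the value in `eq_params`, and `tau` is passed as both `tau1` and `tau2`.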
You can also try the analogous easy-to-run examples for the SDXL model, or move on to the in-depth examples.
@article{starodubcev2024invertible,
title={Invertible Consistency Distillation for Text-Guided Image Editing in Around 7 Steps},
author={Starodubcev, Nikita and Khoroshikh, Mikhail and Babenko, Artem and Baranchuk, Dmitry},
journal={arXiv preprint arXiv:2406.14539},
year={2024}
}