SD Pipeline How it Works
Vladimir Mandic edited this page Oct 28, 2024 · 2 revisions
This is probably the best end-to-end semi-technical article:
https://stable-diffusion-art.com/how-stable-diffusion-work/
And a detailed look at the diffusion process: https://towardsdatascience.com/understanding-diffusion-probabilistic-models-dpms-1940329d6048
But here is a short look at the pipeline:
- Encoder / Conditioning
  Converts text (via tokenizer) or image (via vision model) to a semantic map
  (e.g. CLIP text encoder)
- Sampler
  Generates noise, which is the starting point to map to content
  (e.g. k_lms)
- Diffuser
  Creates vector content based on resolved noise + semantic map
  (e.g. the actual Stable Diffusion checkpoint)
- Autoencoder
  Maps between latent and pixel space (actually creates images from vectors)
  (e.g. typically a VAE trained on an image database)
- Denoising
  Gets meaningful images from pixel signatures
  Basically, blends what the autoencoder inserted using information from the diffuser
  (e.g. U-Net)
- Loop and repeat
  From step 3 (Diffuser), with cross-attention to blend results
- Run additional models as needed
  - Upscale (e.g. ESRGAN)
  - Restore Face (e.g. GFPGAN or CodeFormer)
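
To make the control flow of the stages above more concrete, here is a minimal toy sketch in plain Python. It is not Stable Diffusion and uses no real model: every function (`encode_prompt`, `initial_noise`, `predict_noise`, `sampler_step`, `decode`, `upscale`, `generate`) is a hypothetical stand-in invented for illustration. It also assumes the repeat loop runs entirely in latent space and that the autoencoder decodes only once at the end, which is how common implementations are usually organized.

```python
import hashlib
import random

LATENT_SIZE = 8  # toy latent length; real latents are multi-channel 2D tensors


def encode_prompt(prompt: str) -> list[float]:
    """Encoder / conditioning stand-in: hash the text into a vector in [0, 1]."""
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:LATENT_SIZE]]


def initial_noise(seed: int) -> list[float]:
    """Sampler stand-in: seeded Gaussian noise used as the starting latent."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(LATENT_SIZE)]


def predict_noise(latent: list[float], cond: list[float]) -> list[float]:
    """Diffuser / U-Net stand-in: whatever differs from the conditioning is 'noise'."""
    return [x - c for x, c in zip(latent, cond)]


def sampler_step(latent: list[float], noise: list[float], step: int, steps: int) -> list[float]:
    """Remove a growing fraction of the predicted noise; the last step removes all of it."""
    fraction = 1.0 / (steps - step)
    return [x - fraction * n for x, n in zip(latent, noise)]


def decode(latent: list[float]) -> list[int]:
    """Autoencoder stand-in: clamp latents to [0, 1] and scale to 8-bit pixel values."""
    return [round(255 * min(1.0, max(0.0, x))) for x in latent]


def upscale(pixels: list[int], factor: int = 2) -> list[int]:
    """Upscaler stand-in: repeat every pixel (a real setup would run e.g. ESRGAN)."""
    return [p for p in pixels for _ in range(factor)]


def generate(prompt: str, seed: int = 0, steps: int = 20) -> list[int]:
    cond = encode_prompt(prompt)          # encoder / conditioning
    latent = initial_noise(seed)          # sampler: starting noise
    for step in range(steps):             # loop and repeat
        noise = predict_noise(latent, cond)
        latent = sampler_step(latent, noise, step, steps)
    return decode(latent)                 # autoencoder: latent -> pixels


if __name__ == "__main__":
    # Optional extra model after generation
    print(upscale(generate("a photo of a cat", seed=42)))
```

The point of the sketch is only the ordering: conditioning is computed once, noise is the starting point, the denoising loop repeatedly refines the latent using the conditioning, and decoding plus optional extra models run afterwards. In this toy the seed does not change the final result because the last step removes all predicted noise; in a real pipeline the seed strongly affects the image.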