Classifier guided stable diffusion
In addition to classifier-free guidance (CFG), it's possible to use a latent-space classifier on top of Stable Diffusion for more consistent stylization. Here are some examples of classifier-guided generations for three pretrained classifiers (aesthetic photo, digital art, and anime). Training code is provided so you can train your own classifier.
Generated with the same seed for each row, using the SD 1.4 DDIM sampler for 100 steps with CFG 7. The classifier guidance scale goes from 0 to 500 across each row; click on each image for non-cherry-picked results.
Prompt | Stable diffusion | Photo classifier | Art classifier | Anime classifier |
---|---|---|---|---|
anime Elon Musk | | | | |
fantasy castle | | | | |
The classifier is trained for binary classification, with the stylized images labelled as 1 and LAION images labelled as 0. This technique does not require text-image pairs, only two classes of images. It works somewhat better when the prompt is aligned with the classifier rather than working against it.
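The labelling scheme above can be illustrated with a toy example. This is a minimal sketch, not the repository's training code: it assumes each image has already been encoded to a latent, uses short lists of floats as stand-ins for those latents, and swaps the real classifier for a plain logistic regression. The names `bce_loss` and `train_linear_classifier` are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bce_loss(logit, label):
    # Binary cross-entropy computed from a raw logit (stable form).
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def train_linear_classifier(stylized, reference, lr=0.1, epochs=200, seed=0):
    # Stylized latents get label 1, reference (e.g. LAION) latents get label 0.
    data = [(x, 1.0) for x in stylized] + [(x, 0.0) for x in reference]
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = sigmoid(z) - y  # derivative of the BCE loss w.r.t. the logit
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b
```

Only the two sets of images are needed here; no captions or text labels enter the loss. A real classifier would be a neural network trained on actual Stable Diffusion latents, but the labels and the loss would take the same form.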
Why might you want to use a classifier in addition to CFG?
- Less reliance on prompt engineering. You can get stylized images without appending a bunch of artist tags to your prompt.
- Curate your own aesthetic without text labels. You can train your own classifier on a custom dataset to curate a unique aesthetic. Training a classifier is much easier than finetuning SD and can be done on most consumer GPUs (6 GB of VRAM or more).
- Improve composition. SD is trained on the center crop of non-square images. This works for the most part but can lead to poor compositions during inference (you sometimes get images of torsos and feet). We can use a classifier to guide SD away from these generations by putting near-square images in the 1 class and high-aspect-ratio images in the 0 class. This was applied to the art and anime pretrained classifiers, but not to the photo classifier.
- Non-text guidance. CFG is a kind of vector arithmetic that uses the difference between text labels. Some aesthetic qualities are poorly represented in the text dataset and would benefit from a pure-image classifier. In addition to the composition problem, there are images with split views, large margins, and compression artifacts, none of which are labelled as such in the text data.
- Independence from the SD model. The classifier is separate from the SD model, so it can be used with future versions of SD as long as the VAE is the same.
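To make the difference from CFG's text-based vector arithmetic concrete, here is a minimal sketch of one classifier-guidance step. It uses the standard update for DDIM-style samplers, in which the predicted noise is shifted by the gradient of the classifier's log-probability with respect to the noisy latent. Whether this repository applies the gradient in exactly this way is an assumption. The linear classifier, the list-of-floats latents, and the names `log_prob_grad` and `guided_noise` are hypothetical stand-ins; a real implementation would get the gradient from autograd through a neural classifier.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_prob_grad(latent, w, b):
    # Gradient of log sigmoid(w.x + b) with respect to x,
    # which for a linear classifier is (1 - sigmoid(z)) * w.
    z = sum(wi * xi for wi, xi in zip(w, latent)) + b
    coeff = 1.0 - sigmoid(z)
    return [coeff * wi for wi in w]


def guided_noise(eps, latent, w, b, alpha_bar, scale):
    # eps_hat = eps - scale * sqrt(1 - alpha_bar) * grad log p(y=1 | latent)
    grad = log_prob_grad(latent, w, b)
    k = scale * math.sqrt(1.0 - alpha_bar)
    return [e - k * g for e, g in zip(eps, grad)]
```

With `scale=0` the noise prediction is returned unchanged, which corresponds to the unguided end of each row in the table above. Larger scales push the sample further toward the region the classifier labels as class 1.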