
[WIP] Implement multi-cond guidance for Composable Diffusion #1695

Closed
wants to merge 1 commit

Conversation

raefu
Contributor

@raefu raefu commented Oct 5, 2022

BUGS:

  • batch_size > 1 generates incorrect results
  • DDIM and PLMS samplers crash

photo of a cute (dog AND cat PLUS kitten), 4k, HD

As a bonus, add prompt weighting too.

Based on https://arxiv.org/pdf/2206.01714.pdf /
https://energy-based-model.github.io/Compositional-Visual-Generation-with-Composable-Diffusion-Models/

In vanilla Stable Diffusion, generation is guided by two prompts: toward the positive prompt and away from the negative prompt. This change allows an arbitrary number of guidance prompts, which opens up some interesting composition options. See the website above for more concrete examples.

Multi-cond guidance slows generation because it must evaluate guidance for each additional prompt at every step.
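The combination rule from the Composable Diffusion paper can be sketched as follows. This is a minimal NumPy illustration, not the webui code; the function name, tensor shapes, and weights are all hypothetical:

```python
import numpy as np

def combine_denoised(e_uncond, e_conds, weights, cfg_scale=7.5):
    """Combine per-prompt noise predictions (composable-diffusion style).

    Each conditional prediction pulls the result toward its prompt,
    scaled by that prompt's weight and the CFG scale.
    """
    out = e_uncond.copy()
    for e_cond, w in zip(e_conds, weights):
        out += w * cfg_scale * (e_cond - e_uncond)
    return out

# "red:2 AND white": two prompts, weights 2.0 and 1.0 (toy tensors)
e_uncond = np.zeros((4, 64, 64))
e_red = np.ones((4, 64, 64))
e_white = np.full((4, 64, 64), 0.5)
combined = combine_denoised(e_uncond, [e_red, e_white], [2.0, 1.0])
```

With a single prompt and weight 1.0 this reduces to ordinary classifier-free guidance, which is why the extra prompts cost one extra model evaluation each per step.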

New syntax keywords: AND, NOT, PLUS. The keywords are case-sensitive; since CLIP is case-insensitive, write them in lowercase to use those words literally in a prompt.

SYNTAX GUIDE:

Watch the console for debugging output of how each prompt is evaluated.

New case-sensitive keywords: AND, NOT, PLUS. Weights are given as :NUMBER.

"red AND white" guides with a "red" prompt and a "white" prompt.

"red:2 AND white" guides with a "red" prompt 2x stronger than a "white" prompt.

"a photo of a (cat AND dog)" is equivalent to "a photo of a cat AND a photo of a dog" and generates an animal hybrid using the two prompts.

"a person NOT human" guides towards "a person" with "human" as a prompt with -0.5x weight.

"cat PLUS dog" guides with a prompt made by adding the CLIP embeddings of "cat" and "dog" and dividing by 2.
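The PLUS operation described above amounts to averaging the two embeddings. A toy sketch, with made-up 2x2 arrays standing in for real CLIP embeddings (which are [tokens, 768] tensors):

```python
import numpy as np

def plus_embedding(emb_a, emb_b):
    # PLUS: average the two CLIP embeddings into a single prompt cond.
    return (emb_a + emb_b) / 2

cat = np.array([[1.0, 0.0], [0.0, 1.0]])  # toy "cat" embedding
dog = np.array([[0.0, 1.0], [1.0, 0.0]])  # toy "dog" embedding
hybrid = plus_embedding(cat, dog)         # every entry becomes 0.5
```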

You can combine PLUS and AND. "apple PLUS pear AND banana PLUS eggplant" makes an image containing apple/pear hybrids and banana/eggplant hybrids.

Multiple paren groups are supported and combine groups sensibly. "photo of (dog AND cat), cute, 4k, playing with (ball AND yarn)" => "photo of dog, cute, 4k, playing with ball AND photo of cat, cute, 4k, playing with yarn".
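The AND/NOT splitting rules above can be sketched as a small parser. This is a hypothetical simplification, not the PR's actual code: it handles only top-level AND/NOT and :NUMBER weights, not PLUS or paren groups:

```python
import re

def parse_prompt(prompt):
    """Split a prompt on AND/NOT into (subprompt, weight) pairs.

    NOT gives the following subprompt a default weight of -0.5;
    an explicit ":NUMBER" suffix overrides the default of 1.0.
    """
    pairs = []
    default = 1.0
    # re.split with a capture group keeps the AND/NOT delimiters
    for token in re.split(r'\b(AND|NOT)\b', prompt):
        token = token.strip()
        if token == 'AND':
            default = 1.0
        elif token == 'NOT':
            default = -0.5
        elif token:
            m = re.match(r'(.*?):(-?\d+(?:\.\d+)?)$', token)
            if m:
                pairs.append((m.group(1).strip(), float(m.group(2))))
            else:
                pairs.append((token, default))
    return pairs
```

For example, "red:2 AND white" parses to [("red", 2.0), ("white", 1.0)], and "a person NOT human" to [("a person", 1.0), ("human", -0.5)].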

@AUTOMATIC1111
Owner

The choice of using parens when you don't actually support nesting them seems wrong. It also clashes with attention. The sensible composition does not feel sensible to me. Sensible for "photo of (dog AND cat), cute, 4k, playing with (ball AND yarn)" would be to make four conds there with all combinations.
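The "all combinations" expansion suggested here is a cartesian product over the groups. A hypothetical sketch (the helper name and template syntax are invented for illustration):

```python
from itertools import product

def expand_groups(template, groups):
    # Fill each {} slot with every combination of group members.
    return [template.format(*combo) for combo in product(*groups)]

prompts = expand_groups(
    "photo of {}, cute, 4k, playing with {}",
    [["dog", "cat"], ["ball", "yarn"]],
)
# four conds: dog/ball, dog/yarn, cat/ball, cat/yarn
```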

NOT seems redundant when you have weights.

PLUS is just unrelated and I still don't want it.

More than anything, the amount of added code is very very unappealing.

The page you link has just AND, without any parens, and that would be a good start. I feel that if we just support AND plus weights, the amount of code would become multiple times smaller and it would be a lot simpler.

I don't feel right telling you to throw this away after you spent time working on it, but I don't want this complexity added to the repo. The contributing page does say that you should consult with me before PRing big changes. I have plans to add this kind of compositing myself, so if you don't want to rework the code to conform to those requirements, the feature will make it in anyway at some point.

@differentprogramming

differentprogramming commented Oct 5, 2022

The page you link has just AND, without any parens, and that would be a good start.

I think some kind of grouping is needed.
Consider: man with (red shirt AND green hat) painted by van gogh
Compare with: man with red shirt AND green hat painted by van gogh

In the second case, only the green hat is painted by van gogh, not the man or the red shirt. You need grouping because styles and the like apply to the whole picture.

Your way, people would have to type out every combination completely:
man (red shirt AND green pants AND tweed vest) 4k photograph
would have to be:
man with red shirt 4k photograph AND man with green pants 4k photograph AND man with tweed vest 4k photograph

I think he is letting parens that don't have AND or PLUS in them pass through, so that they can still be used for attention. One possible change would be to pick a new grouping pair, like <> instead of ().

Though now that I typed it, the idea of making AND top level and requiring the whole prompt to be duplicated does have the advantage of simplicity.

@moorehousew
Contributor

@AUTOMATIC1111 Some sort of grouping would be sorely wanted, as another user pointed out. Some sort of standard syntax would be nice so that additional features can be freely added without clashing with old ones. S-expressions are trivial to parse, so if you can devise a prompt DSL with S-expressions it'll cost little in terms of complexity and maintainability.

Just a thought.

@differentprogramming

There are people on reddit claiming that AND has just been added.
Has part of this pull been implemented already or are they wrong?

@ArcticEcho

Added just a few hours ago in c26732?

@differentprogramming

Does it have any grouping or delimiting or is it top level only?

@differentprogramming

A problem with the current version is that there is no way to limit a negative weighted item to a single AND branch.

@astrobleem

The DDIM sampler crashes when AND NOT is put in the prompt.

nne998 pushed a commit to fjteam/stable-diffusion-webui that referenced this pull request Sep 26, 2023