[WIP] Implement multi-cond guidance for Composable Diffusion #1695
As a bonus, add prompt weighting too.
Based on https://arxiv.org/pdf/2206.01714.pdf /
https://energy-based-model.github.io/Compositional-Visual-Generation-with-Composable-Diffusion-Models/
In vanilla Stable Diffusion, generation is guided by two prompts: towards the positive prompt, and away from the negative prompt. This change allows you to use an arbitrary number of prompts for guidance, for some interesting composition options. See the website above for more concrete examples.
Multi-cond guidance slows generation because it requires evaluating guidance for each additional prompt at every step.
New syntax keywords: AND NOT PLUS. They are only recognized in uppercase; since CLIP is case insensitive, write them in lowercase to use them as ordinary words in a prompt.
SYNTAX GUIDE:
Watch the console for debugging output of how each prompt is evaluated.
New case-sensitive keywords: AND NOT PLUS. Weights are :NUMBER.
"red AND white" guides with a "red" prompt and a "white" prompt.
"red:2 AND white" guides with a "red" prompt 2x stronger than a "white" prompt.
"a photo of a (cat AND dog)" is equivalent to "a photo of a cat AND a photo of a dog" and generates an animal hybrid using the two prompts.
"a person NOT human" guides towards "a person" with "human" as a prompt with -0.5x weight.
"cat PLUS dog" guides with a prompt made by adding the CLIP embeddings of "cat" and "dog" and dividing by 2.
You can combine PLUS and AND. "apple PLUS pear AND banana PLUS eggplant" makes an image containing apple/pear hybrids and banana/eggplant hybrids.
Multiple paren groups are supported and are combined sensibly. "photo of (dog AND cat), cute, 4k, playing with (ball AND yarn)" => "photo of dog, cute, 4k, playing with ball AND photo of cat, cute, 4k, playing with yarn".
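For anyone skimming the syntax guide, here is a rough Python sketch of how top-level AND / NOT and :NUMBER weights could turn into weighted guidance. It is not the code from this PR. The names `parse_weighted` and `combine_guidance` are made up for illustration, the combination rule follows the general form in the linked paper (unconditional prediction plus a weighted sum of conditional differences), and noise predictions are plain lists of floats instead of tensors. Paren groups and PLUS are left out, and applying -0.5 on top of an explicit weight for NOT is my assumption.

```python
import re


def parse_weighted(prompt):
    """Split a prompt on top-level AND / NOT into (text, weight) pairs.

    Hypothetical helper: only the uppercase keywords are recognized, and a
    trailing ":NUMBER" on a sub-prompt is read as its weight (default 1.0).
    """
    parts = re.split(r"\s+(AND|NOT)\s+", prompt.strip())
    texts = parts[0::2]                  # sub-prompts
    keywords = ["AND"] + parts[1::2]     # keyword that introduces each sub-prompt
    pairs = []
    for keyword, token in zip(keywords, texts):
        match = re.fullmatch(r"(.*):(-?\d+(?:\.\d+)?)", token)
        if match:
            text, weight = match.group(1), float(match.group(2))
        else:
            text, weight = token, 1.0
        if keyword == "NOT":
            weight *= -0.5               # assumption: NOT scales the weight by -0.5
        pairs.append((text.strip(), weight))
    return pairs


def combine_guidance(eps_uncond, eps_conds, weights, scale):
    """Return eps_uncond + scale * sum_i w_i * (eps_cond_i - eps_uncond)."""
    combined = list(eps_uncond)
    for eps_cond, weight in zip(eps_conds, weights):
        for i, (c, u) in enumerate(zip(eps_cond, eps_uncond)):
            combined[i] += scale * weight * (c - u)
    return combined


if __name__ == "__main__":
    print(parse_weighted("red:2 AND white"))
    print(parse_weighted("a person NOT human"))
```

Each extra sub-prompt adds one more conditional prediction to the sum, which is where the per-step slowdown comes from.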
The choice of using parens when you don't actually support nesting them seems wrong. It also clashes with attention. The sensible composition does not feel sensible to me. Sensible for "photo of (dog AND cat), cute, 4k, playing with (ball AND yarn)" would be to make four conds there with all combinations. NOT seems redundant when you have weights. PLUS is just unrelated and I still don't want it. More than anything, the amount of added code is very very unappealing. The page you link has just AND, without any parens, and that would be a good start. I feel that if we just support AND plus weights, the amount of code would become multiple times smaller and it would be a lot simpler. I don't feel right telling you to throw this away after you spent time working on it, but I don't want this complexity added to the repo. The contributing page does say that you should consult with me before PRing big changes. I have plans to add this kind of compositing myself, so if you don't want to rework the code to conform to those requirements, the feature will make it in anyway at some point.
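To make the two expansion rules being compared here concrete, a small Python sketch follows. It is only an illustration, not code from the PR or the repo: `expand_paired` pairs the i-th alternative of every group, as in the PR's example, and `expand_all` builds every combination (four conds for the example prompt). All function names are hypothetical. The sketch handles un-nested groups only, ignores PLUS, and assumes that parens without an AND inside are left alone for attention.

```python
import itertools
import re

# Matches an un-nested paren group that contains the uppercase keyword AND.
GROUP = re.compile(r"\(([^()]*\bAND\b[^()]*)\)")


def split_groups(prompt):
    """Return (literals, options): the text around the groups and each group's alternatives."""
    pieces = GROUP.split(prompt)         # [literal, group, literal, group, ..., literal]
    literals = pieces[0::2]
    options = [[alt.strip() for alt in re.split(r"\s+AND\s+", group)]
               for group in pieces[1::2]]
    return literals, options


def fill(literals, choice):
    """Interleave the literal text with one chosen alternative per group."""
    out = [literals[0]]
    for alt, literal in zip(choice, literals[1:]):
        out.append(alt)
        out.append(literal)
    return "".join(out)


def expand_paired(prompt):
    """i-th alternative of each group goes together (groups of unequal length are truncated)."""
    literals, options = split_groups(prompt)
    if not options:
        return [prompt]
    return [fill(literals, choice) for choice in zip(*options)]


def expand_all(prompt):
    """Every combination of alternatives across groups."""
    literals, options = split_groups(prompt)
    return [fill(literals, choice) for choice in itertools.product(*options)]


if __name__ == "__main__":
    prompt = "photo of (dog AND cat), cute, 4k, playing with (ball AND yarn)"
    print(" AND ".join(expand_paired(prompt)))
    print(" AND ".join(expand_all(prompt)))
```

With two groups of two alternatives each, the paired rule needs two conds per step and the all-combinations rule needs four, so the choice also affects generation time.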
I think some kind of grouping is needed. In the second case only the green hat is painted by van gogh, not the man or the red shirt. You need the grouping because styles and the like apply to the whole picture. Your way, people would have to type out every combination completely: I think he is letting parens that don't have AND or PLUS in them through so that they can be used for attention. One possible change would be to pick a new grouping pair like <> instead of (). Though now that I typed it, the idea of making AND top level and requiring the whole prompt to be duplicated does have the advantage of simplicity.
@AUTOMATIC1111 Some sort of grouping would be sorely wanted, as another user pointed out. Some sort of standard syntax would be nice so that additional features can be freely added without clashing with old ones. S-expressions are trivial to parse, so if you can devise a prompt DSL with S-expressions it'll cost little in terms of complexity and maintainability. Just a thought.
There are people on reddit claiming that AND has just been added. |
Added just a few hours ago in c26732? |
Does it have any grouping or delimiting or is it top level only? |
A problem with the current version is that there is no way to limit a negative weighted item to a single AND branch. |
The DDIM sampler crashes when AND NOT is put in the prompt.
BUGS:
photo of a cute (dog AND cat PLUS kitten), 4k, HD