# MagicMix


Implementation of the paper [MagicMix: Semantic Mixing with Diffusion Models](https://arxiv.org/abs/2210.16056).


The method mixes two different concepts semantically to synthesize a new concept while preserving the spatial layout and geometry of the original image.

The method takes an image that provides the layout semantics and a prompt that provides the content semantics for the mixing process.

There are three parameters for the method:

- `v`: the interpolation constant used in the layout-generation phase. The greater the value of `v`, the greater the influence of the prompt on the layout-generation process.
- `kmax` and `kmin`: these bound the noise-level range for the layout- and content-generation phases. A higher `kmax` discards more information about the layout of the original image, while a higher `kmin` allots more steps to the content-generation phase (see the sketch below).
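
For intuition, here is a minimal sketch of the two-phase procedure, written against the diffusers UNet/scheduler interfaces. The function name, index arithmetic, and exact mixing schedule are illustrative assumptions, not this repository's exact code:

```python
import torch

@torch.no_grad()
def magic_mix_sketch(unet, scheduler, image_latent, prompt_emb,
                     kmin=0.3, kmax=0.6, v=0.5, steps=50):
    # Illustrative MagicMix loop (classifier-free guidance omitted for brevity);
    # not this repo's exact implementation.
    scheduler.set_timesteps(steps)
    ts = scheduler.timesteps                 # descending: high noise -> low noise
    i_start = int((1 - kmax) * steps)        # index of the kmax noise level
    i_mid = int((1 - kmin) * steps)          # index of the kmin noise level
    noise = torch.randn_like(image_latent)

    # The layout phase starts from the original image latent, noised to kmax.
    z = scheduler.add_noise(image_latent, noise, ts[i_start])

    for i in range(i_start, steps):
        t = ts[i]
        eps = unet(z, t, encoder_hidden_states=prompt_emb).sample
        z = scheduler.step(eps, t, z).prev_sample
        if i < i_mid and i + 1 < steps:
            # Layout phase: pull the prompt-denoised latent back toward the
            # (re-noised) original image so its layout and geometry survive.
            z_img = scheduler.add_noise(image_latent, noise, ts[i + 1])
            z = v * z + (1 - v) * z_img
        # Content phase (i >= i_mid): prompt-only denoising, no mixing.
    return z
```

Decoding the returned latent with the VAE then yields the mixed image.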

## Usage

```python
from PIL import Image
from magic_mix import magic_mix

img = Image.open("phone.jpg")
out_img = magic_mix(img, "bed", kmax=0.5)
out_img.save("mix.jpg")
```

Or from the command line:

```bash
python3 magic_mix.py \
    "phone.jpg" \
    "bed" \
    "mix.jpg" \
    --kmin 0.3 \
    --kmax 0.6 \
    --v 0.5 \
    --steps 50 \
    --seed 42 \
    --guidance_scale 7.5
```

Also check out the demo notebook for example usage of the implementation and to reproduce the examples from the paper.

You can also use the community pipeline from the diffusers library (its `mix_factor` argument corresponds to `v` above):

```python
from diffusers import DiffusionPipeline, DDIMScheduler
from PIL import Image

pipe = DiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    custom_pipeline="magic_mix",
    scheduler=DDIMScheduler.from_pretrained(
        "CompVis/stable-diffusion-v1-4", subfolder="scheduler"
    ),
).to("cuda")

img = Image.open("phone.jpg")
mix_img = pipe(
    img,
    prompt="bed",
    kmin=0.3,
    kmax=0.5,
    mix_factor=0.5,
)
mix_img.save("mix.jpg")
```

Some examples reproduced from the paper:

| Input image | Prompt | Output image |
| ----------- | ------ | ------------ |
| telephone | "Bed" | telephone-bed |
| sign | "Family" | sign-family |
| sushi | "ice-cream" | sushi-ice-cream |
| pineapple | "Cake" | pineapple-cake |

## Note

I'm not the author of the paper, and this is not an official implementation.