I've noticed an inconsistency in the handling of weights in the second text encoder of the SDXL model, which seems to align with the issue reported in Stability-AI/generative-models#111. It appears that the diffusers and kohya libraries utilize the weights differently, as discussed in huggingface/diffusers#8238.
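For context, the mismatch comes down to the orientation in which the projection matrix is stored and applied: `transformers`/`diffusers` keep `text_projection` as an `nn.Linear`, which multiplies by the transposed weight, while a checkpoint that stores the raw matrix in the opposite orientation is effectively applied untransposed. A minimal sketch with toy tensors (random data, not the real checkpoint weights):

```python
import torch

torch.manual_seed(0)
hidden = torch.randn(1, 8)   # stand-in for a pooled hidden state (toy size)
W = torch.randn(8, 8)        # stand-in for a square projection matrix

# nn.Linear applies the stored weight transposed: y = x @ W.T
linear_out = hidden @ W.T
# A checkpoint storing the matrix in the opposite orientation
# is effectively applied as: y = x @ W
matmul_out = hidden @ W

# The two conventions agree only when W is symmetric.
print(torch.allclose(linear_out, matmul_out))  # False for a generic W
```

So loading a matrix saved under one convention into code that assumes the other silently transposes the projection.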
To further investigate, I conducted some experiments using the following script:
```python
import torch
from diffusers import StableDiffusionXLPipeline

base_model_path = "stabilityai/stable-diffusion-xl-base-1.0"
device = "cuda"

pipe = StableDiffusionXLPipeline.from_pretrained(
    base_model_path,
    torch_dtype=torch.float16,
    add_watermarker=False
).to(device)
pipe.enable_attention_slicing()
pipe.enable_xformers_memory_efficient_attention()
pipe.enable_model_cpu_offload()

templates = [
    "A picture of a {} in the jungle",
    "A picture of a {} on a cobblestone street",
    "A picture of a {} on top of pink fabric",
    "A picture of a {} on top of a wooden floor",
    "A picture of a {} with a city in the background",
    "A picture of a cube shaped {}",
]
negative_prompt = '(embedding:bad X), (embedding:badhandv4), (worst quality), (low quality), bad proportions, ((blurry)), cropped, jpeg artifacts'

def bench(prefix):
    for subj in ('cat', 'girl'):
        for template in templates:
            prompt = template.format(subj)
            print(prompt)
            images = pipe(
                prompt=prompt,
                negative_prompt=negative_prompt,
                num_images_per_prompt=2,
                num_inference_steps=16,
                guidance_scale=7.5,
                generator=torch.Generator(device=device).manual_seed(0)
            ).images
            for idx, image in enumerate(images):
                img_name = '-'.join(prompt.split(' ')[3:15]).replace('(', '').replace(')', '')
                gen_path = f"debug/{prefix}_{img_name}_{idx}.jpg"
                image.save(gen_path)

bench('orig')

# Transpose the second text encoder's projection weight and rerun the benchmark.
original_weights = pipe.text_encoder_2.text_projection.weight.data.clone()
pipe.text_encoder_2.text_projection.weight.data = pipe.text_encoder_2.text_projection.weight.data.T
weight_diff = torch.norm(original_weights - pipe.text_encoder_2.text_projection.weight.data)
print(f"diff of text_projection: {weight_diff.item()}")
bench('tran')
```
The output confirms a difference in the text projection weights (`diff of text_projection: 32.875`). Surprisingly, the images generated before and after transposing the weights are of the same quality level.
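For what it's worth, the reported number is just the Frobenius norm of W − Wᵀ, i.e. a measure of how far the square projection matrix is from being symmetric. A toy check with a random matrix (not the actual SDXL weights):

```python
import torch

torch.manual_seed(0)
W = torch.randn(4, 4)               # stand-in for a square projection matrix

diff = torch.norm(W - W.T)          # nonzero for a generic (asymmetric) matrix
sym = (W + W.T) / 2                 # symmetric part of W
sym_diff = torch.norm(sym - sym.T)  # exactly zero: transposing a symmetric
                                    # matrix is a no-op

assert diff > 0
assert sym_diff == 0
```

So a nonzero diff only tells us the matrix is not symmetric; the transposed projection really is a different linear map.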
orig_a-cat-in-the-jungle_0.jpg vs tran_a-cat-in-the-jungle_0.jpg:
orig_a-girl-in-the-jungle_0.jpg vs tran_a-girl-in-the-jungle_0.jpg:
Possible explanations could be:

- SDXL is robust enough to tolerate the corruption of pooled text embeddings from the second text encoder.
- The transposition does not significantly alter the semantic meaning of the embeddings.
- The pooled text embeddings may not play a significant role during the denoising process in the UNet.
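One cheap way to probe the second explanation, sketched here with a random stand-in matrix rather than the real `text_projection` weights (for the real check you would compare the pooled embeddings returned by `pipe.encode_prompt` before and after the swap):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 1280                          # projection_dim of the second SDXL text encoder
W = torch.randn(d, d) / d ** 0.5  # random stand-in for text_projection.weight
x = torch.randn(d)                # stand-in pooled hidden state

orig = x @ W.T  # what the nn.Linear projection computes
tran = x @ W    # the same input after the weight is transposed

cos = F.cosine_similarity(orig, tran, dim=0)
print(f"cosine similarity: {cos.item():.3f}")  # near zero for a random W
```

For a generic matrix at this dimensionality, Wx and Wᵀx point in nearly unrelated directions, which would favor the first or third explanation (robustness to the pooled embedding) over the second (semantic preservation).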
I am curious if anyone else has observed similar behavior or has insights into the robustness of SDXL's text encoding process.