I've noticed an inconsistency in the handling of weights in the second text encoder of the SDXL model, which seems to align with the issue reported in Stability-AI/generative-models#111. It appears that the diffusers and kohya libraries utilize the weights differently, as discussed in huggingface/diffusers#8238.
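For context, the mismatch comes down to the orientation in which the projection matrix is stored and applied: `transformers`/`diffusers` keep `text_projection` as an `nn.Linear`, which multiplies by the transposed weight, while a checkpoint that stores the raw matrix in the opposite orientation is effectively applied untransposed. A minimal sketch with toy tensors (random data, not the real checkpoint weights):

```python
import torch

torch.manual_seed(0)
hidden = torch.randn(1, 8)   # stand-in for a pooled hidden state (toy size)
W = torch.randn(8, 8)        # stand-in for a square projection matrix

# nn.Linear applies the stored weight transposed: y = x @ W.T
linear_out = hidden @ W.T
# A checkpoint storing the matrix in the opposite orientation
# is effectively applied as: y = x @ W
matmul_out = hidden @ W

# The two conventions agree only when W is symmetric.
print(torch.allclose(linear_out, matmul_out))  # False for a generic W
```

So loading a matrix saved under one convention into code that assumes the other silently transposes the projection.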
To further investigate, I conducted some experiments using the following script:
```python
import torch
from diffusers import StableDiffusionXLPipeline

base_model_path = "stabilityai/stable-diffusion-xl-base-1.0"
device = "cuda"

pipe = StableDiffusionXLPipeline.from_pretrained(
    base_model_path,
    torch_dtype=torch.float16,
    add_watermarker=False
).to(device)
pipe.enable_attention_slicing()
pipe.enable_xformers_memory_efficient_attention()
pipe.enable_model_cpu_offload()

templates = [
    "A picture of a {} in the jungle",
    "A picture of a {} on a cobblestone street",
    "A picture of a {} on top of pink fabric",
    "A picture of a {} on top of a wooden floor",
    "A picture of a {} with a city in the background",
    "A picture of a cube shaped {}",
]
negative_prompt = '(embedding:bad X), (embedding:badhandv4), (worst quality), (low quality), bad proportions, ((blurry)), cropped, jpeg artifacts'

def bench(prefix):
    for subj in ('cat', 'girl'):
        for template in templates:
            prompt = template.format(subj)
            print(prompt)
            images = pipe(
                prompt=prompt,
                negative_prompt=negative_prompt,
                num_images_per_prompt=2,
                num_inference_steps=16,
                guidance_scale=7.5,
                generator=torch.Generator(device=device).manual_seed(0)
            ).images
            for idx, image in enumerate(images):
                img_name = '-'.join(prompt.split(' ')[3:15]).replace('(', '').replace(')', '')
                gen_path = f"debug/{prefix}_{img_name}_{idx}.jpg"
                image.save(gen_path)

bench('orig')

# Transpose the second text encoder's projection weight and rerun the benchmark.
original_weights = pipe.text_encoder_2.text_projection.weight.data.clone()
pipe.text_encoder_2.text_projection.weight.data = pipe.text_encoder_2.text_projection.weight.data.T
weight_diff = torch.norm(original_weights - pipe.text_encoder_2.text_projection.weight.data)
print(f"diff of text_projection: {weight_diff.item()}")
bench('tran')
```
The output confirms a difference in the text projection weights (`diff of text_projection: 32.875`). Surprisingly, the images generated before and after transposing the weights are of the same quality level.
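For what it's worth, the reported number is just the Frobenius norm of W − Wᵀ, i.e. a measure of how far the square projection matrix is from being symmetric. A toy check with a random matrix (not the actual SDXL weights):

```python
import torch

torch.manual_seed(0)
W = torch.randn(4, 4)               # stand-in for a square projection matrix

diff = torch.norm(W - W.T)          # nonzero for a generic (asymmetric) matrix
sym = (W + W.T) / 2                 # symmetric part of W
sym_diff = torch.norm(sym - sym.T)  # exactly zero: transposing a symmetric
                                    # matrix is a no-op

assert diff > 0
assert sym_diff == 0
```

So a nonzero diff only tells us the matrix is not symmetric; the transposed projection really is a different linear map.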
orig_a-cat-in-the-jungle_0.jpg vs tran_a-cat-in-the-jungle_0.jpg:
orig_a-girl-in-the-jungle_0.jpg vs tran_a-girl-in-the-jungle_0.jpg:
Possible explanations could be:

- SDXL is robust enough to tolerate the corruption of pooled text embeddings from the second text encoder.
- The transposition does not significantly alter the semantic meaning of the embeddings.
- The pooled text embeddings may not play a significant role during the denoising process in the UNet.
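One cheap way to probe the second explanation, sketched here with a random stand-in matrix rather than the real `text_projection` weights (for the real check you would compare the pooled embeddings returned by `pipe.encode_prompt` before and after the swap):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 1280                          # projection_dim of the second SDXL text encoder
W = torch.randn(d, d) / d ** 0.5  # random stand-in for text_projection.weight
x = torch.randn(d)                # stand-in pooled hidden state

orig = x @ W.T  # what the nn.Linear projection computes
tran = x @ W    # the same input after the weight is transposed

cos = F.cosine_similarity(orig, tran, dim=0)
print(f"cosine similarity: {cos.item():.3f}")  # near zero for a random W
```

For a generic matrix at this dimensionality, Wx and Wᵀx point in nearly unrelated directions, which would favor the first or third explanation (robustness to the pooled embedding) over the second (semantic preservation).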
I am curious if anyone else has observed similar behavior or has insights into the robustness of SDXL's text encoding process.