High Batch Size with SD3 Dreambooth Destabilizes Training #8621
-
This seems more like a technical discussion to me, and it's very training-specific, so I am going to transfer this to "Discussions". Cc'ing @bghira @AmericanPresidentJimmyCarter to check if they have some suggestions here.
-
A quick suggestion would be to experiment with different LRs and LR schedulers that scale with the batch size.
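For illustration, here is a rough sketch of what that could look like; the linear-scaling heuristic, the base LR, the scheduler choice, and the placeholder model are assumptions for the example, not values recommended in this thread:

```python
# Hedged sketch: tie the LR to the batch size and add a warmup scheduler.
# Everything here (base LR, scaling rule, scheduler, tiny model) is illustrative.
import torch
from diffusers.optimization import get_scheduler

batch_size = 40
max_train_steps = 10_000
base_lr = 1e-7                               # per-sample base LR (assumption)
learning_rate = base_lr * batch_size         # linear scaling with batch size

model = torch.nn.Linear(8, 8)                # placeholder for the trainable params
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

lr_scheduler = get_scheduler(
    "cosine",                                # e.g. also try "constant_with_warmup"
    optimizer=optimizer,
    num_warmup_steps=500,
    num_training_steps=max_train_steps,
)

# In the training loop: optimizer.step(); lr_scheduler.step(); optimizer.zero_grad()
```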
-
Gotcha. Does SD3 so far seem to need a different LR than previous versions did? I've been running with roughly 1e-7 * batch_size before this; I'll give some other settings a try later. Attaching my training script for context; excuse the state it's in after a week of bugfixing with it.
-
Yeah, it needs a much lower LR. For what it's worth, 40 isn't considered a very high batch size for diffusion transformers; they benefit from it being as high as you can push it.
-
All noted, I'll try batch size 80 with an LR of 5e-7 and see if that works.
-
Training seems much more stable after further tests, so I'm closing this. Thank you all for the advice.
-
Hey,
-
Describe the bug
I have been trying to train a slightly modified IP-Adapter architecture for SD3 over the past few days and wrote the training script by copying the up-to-date weighting, noise, and loss code from the train_dreambooth_sd3.py script. While training, nothing I could do would get the model to train: after 2,000-3,000 steps at batch size 40 and LR 5e-6, the output would just turn to mush.
Now, after dropping to a batch size of 4 and an LR of 8e-7, the problem appears to have gone away completely: no hints of degradation 40,000 steps in.
The only other possible explanation is that around that time I also removed a torch.autocast block around the model forward pass that shouldn't have been there, given I was also using Accelerate. I don't think that was the source of the issue, though, as it was present in previous, perfectly functional runs using a very similar script. I'll do an extra test to check at some point over the next few days when I have a chance.
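For reference, a minimal sketch of the autocast point above: if Accelerate is already configured for mixed precision (e.g. via `accelerate config` or `--mixed_precision fp16`), an extra torch.autocast block around the forward pass is redundant and worth removing. The model, optimizer, and tensors below are placeholders, not code from my actual script.

```python
# Minimal sketch, assuming mixed precision is configured through Accelerate
# itself; the tiny model and tensors are stand-ins for the real training setup.
import torch
from accelerate import Accelerator

accelerator = Accelerator()                        # mixed precision configured externally
model = torch.nn.Linear(8, 8)                      # stand-in for the SD3 transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-7)
model, optimizer = accelerator.prepare(model, optimizer)

x = torch.randn(4, 8, device=accelerator.device)

# What was removed (a manually nested autocast around the forward pass):
# with torch.autocast(device_type=accelerator.device.type, dtype=torch.float16):
#     pred = model(x)

# Letting the Accelerate-prepared model handle precision instead:
pred = model(x)
loss = pred.float().mean()
accelerator.backward(loss)
optimizer.step()
optimizer.zero_grad()
```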
I have been using the newly modified weighting pushed a day or two ago and logit-normal weighting (the same seemed to happen with the original sigma_sqrt weighting).
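For context, here is a self-contained sketch of the logit-normal sampling and loss-weighting pattern I copied over, using the helpers from diffusers.training_utils. The tiny placeholder model, tensor shapes, and the simplification of treating the sampled density directly as sigma are assumptions for the example, not my actual script.

```python
# Sketch of logit-normal timestep sampling plus SD3-style loss weighting,
# assuming the diffusers.training_utils helpers; shapes and model are placeholders.
import torch
from diffusers.training_utils import (
    compute_density_for_timestep_sampling,
    compute_loss_weighting_for_sd3,
)

batch_size = 4
latents = torch.randn(batch_size, 16, 32, 32)               # stand-in for VAE latents
noise = torch.randn_like(latents)
model = torch.nn.Conv2d(16, 16, kernel_size=3, padding=1)   # stand-in for the transformer

# Sample timesteps from a logit-normal density (the "logit_normal" scheme).
u = compute_density_for_timestep_sampling(
    weighting_scheme="logit_normal",
    batch_size=batch_size,
    logit_mean=0.0,
    logit_std=1.0,
)
sigmas = u.view(-1, 1, 1, 1)                                 # simplified: use u as sigma in [0, 1]

# Rectified-flow style interpolation and target.
noisy_model_input = (1.0 - sigmas) * latents + sigmas * noise
target = noise - latents

model_pred = model(noisy_model_input)

# Per-sample loss weighting ("logit_normal" falls back to uniform weights here;
# "sigma_sqrt" would instead return sigmas**-2.0).
weighting = compute_loss_weighting_for_sd3(weighting_scheme="logit_normal", sigmas=sigmas)
loss = (weighting * (model_pred - target) ** 2).mean()
loss.backward()
```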
Reproduction
Run SD3 training using logit-normal weighting, presumably with a LoRA or similar, with a batch size of at least 40 and an LR of around 6e-6.
Logs
System Info
4x A100 RunPod server
Who can help?
@sayakpaul