LearningRate-Free Learning Algorithm #181
Comments
@BootsofLagrangian Would you be able to fork the repo and commit your changes so it would be easier for plebs like me to follow your changes?
Sorry, I'm not familiar with GitHub, and it takes a long time to make a fork of the repo. Also, these changes include hard-coded parts; does it matter if there's something like that?
When you fork, the code becomes your own, and you can hard-code changes into your own copy. But that's ok. Maybe I'll do it, and @ you if I have any problems. (Also, unless you're on mobile, forking is fast and easy: click fork, click done, and boom.)
I made a fork. I just changed train_network.py and requirements.txt.
Cool stuff.
This should be added as an official feature to the project. Like it!
@BootsofLagrangian I see that both TE LR and UNet LR are no longer specified. Do you know if DAdaptation sets both to be the same? And if it does, do you know whether it is possible to set them to different values? For LoRA the finding used to be that setting TE to a smaller LR than UNet was better. Not sure how this is handling each one.
@bmaltais wouldn't you have to run it twice, with lr=1.0 for UNet and <1 for TE? Since in essence you have two different training problems going on at once. From the source repo: ... And maybe DAdaptation is most suited for UNet, since underfitting the text encoder is often desirable.
@BootsofLagrangian you're awesome! Can't wait to play with it!
Yes, using the lr arguments makes TE LR and UNet LR different. @AI-Casanova's comment is also right. I'm not sure, but using the get_scheduler_fix function in train_network.py properly is the way to apply the LRs differently, or you can do it directly.
Link to the Python module for reference: https://pypi.org/project/dadaptation/
I intuitively knew that there must be a way of adjusting the learning rate in a context-dependent manner, but knew I was far too uninformed to come up with one. This is definitely cool stuff.
Quick comparison results from DAdaptAdam with TE: 0.5 and UNet: 1.0 versus DAdaptAdam-1-1 (loss: 0.125, dlr: 4.02e-05). I think the winner is clear: the TE LR needs to be half of the UNet LR... but there might be more optimal settings. The optimizer config for both was the same; I will redo the same test but with a different optimizer config.
@bmaltais how did you implement the split learning rate? Or did you run it twice?
@AI-Casanova I did it with:

```python
lr_scheduler = optim.lr_scheduler.LambdaLR(optimizer=optimizer, lr_lambda=[lambda epoch: 0.5, lambda epoch: 1], last_epoch=-1, verbose=False)
```
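For what it's worth, the two lambdas appear to map positionally onto the optimizer's parameter groups; assuming sd-scripts' prepare_optimizer_params puts the text encoder group first and the UNet group second, the 0.5 scales the TE rate and the 1 scales the UNet rate.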
@bmaltais awesome! I should have pulled on that thread, but my self-taught lr for all things Python and ML is already through the roof. 😅
Here is an interesting finding: for DAdaptSGD, having the TE and UNet lambdas both at 1 is better than 0.5, 1... I wonder if having a weaker UNet with DAdaptSGD might be even better, like DAdaptSGD-1-0.5. Also, I have not been able to get anything out of DAdaptAdaGrad yet.
Good question... I don't really know. But DAdaptAdam-0.5-1 appears to produce the best likeness of all the methods... so I might stick with that for now.
Published 1st model made with this new technique: https://civitai.com/models/8337/kim-wilde-1980s-pop-star
I'm experiencing what I think is a way-overtrained TE, even at 0.5. All styling goes out the window before my UNet catches up. I have to figure out how to log what the learning rates are independently.
So @BootsofLagrangian was outputting the TE learning rate to the progress bar and logs, so what I thought was a suspiciously high UNet lr was actually an insanely high TE lr. Dropped my scale to 0.25/0.5 and trying again.
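A sketch of how that per-group logging could look (a stand-in module; it assumes, as the snippets in this thread do, that dadaptation stores d in each param group):

```python
import torch
import dadaptation

net = torch.nn.Linear(4, 4)  # stand-in for the trainable LoRA parameters
optimizer = dadaptation.DAdaptAdam(net.parameters(), lr=1.0)

logs = {}  # sd-scripts keeps a dict like this for the progress bar / TensorBoard
for i, group in enumerate(optimizer.param_groups):
    # d is dadaptation's estimated distance term; d * lr is the effective rate,
    # logged per group so the TE and UNet rates can be told apart.
    logs[f"lr/d*lr/group{i}"] = group["d"] * group["lr"]
```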
Unfortunately, it's starting to look to me like I've replaced one grid search with another, with the scaling factor in the place of lr.
@AI-Casanova, you might need another learning rate scheduler. My fork only uses LambdaLR (identity or scalar scaling). This is a problem because it doesn't use the get_scheduler_fix function in sd-scripts. Usually, Transformer models use a warmup LR scheduler. According to the dadaptation repo, applying whatever LR scheduler you used before also works fine.
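A minimal sketch of that combination, assuming diffusers' get_scheduler (which sd-scripts wraps) and illustrative step counts:

```python
import torch
import dadaptation
from diffusers.optimization import get_scheduler

net = torch.nn.Linear(8, 8)  # stand-in for the LoRA parameters
optimizer = dadaptation.DAdaptAdam(net.parameters(), lr=1.0, decouple=True, weight_decay=1.0)

# The scheduler multiplies the base lr (kept at 1.0), so the warmup ramps the
# effective d*lr from zero up to its full value before the cosine cycles begin.
lr_scheduler = get_scheduler(
    "cosine_with_restarts",
    optimizer=optimizer,
    num_warmup_steps=100,     # illustrative: ~5-10% of total steps
    num_training_steps=2000,  # illustrative total
)
```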
@BootsofLagrangian basically what I was seeing was very good likenesses being made, but they were so inflexible. I think I might have hit the sweet spot at 0.125/0.25, though. It still adjusts to my datasets, and is in a similar range as before. Now I'm gonna add a few other ideas to this fork.
@BootsofLagrangian With network_dim=128 and network_alpha=1, the data was destroyed after about 50 steps were executed.
D-Adaptation uses the inverse of the subgradient of the model. If you want more equations, the details are in the dadaptation paper. A LoRA model is the product of two low-rank (rank r) matrices, B and A. In the LoRA paper, alpha and rank enter as external multiplicative terms on the model: the LoRA branch is multiplied by alpha and divided by rank. So the alpha/rank ratio acts directly and sensitively on the subgradient. In the destroyed case, alpha=1 and rank=128, so the alpha/rank ratio is 1/128. This makes the subgradient smaller. Now, return to D-Adaptation: a small subgradient makes the learning rate higher, and a high learning rate blows the model up. Therefore, it is highly recommended to set alpha and rank to the same value, especially when using a big(?) rank. Thank you for the comment and experiments! :)
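A minimal sketch of that scaling (illustrative shapes and names, not sd-scripts' actual classes):

```python
import torch

rank, alpha = 128, 1  # the "destroyed" case: alpha/rank = 1/128
d_in, d_out = 768, 768

A = torch.randn(rank, d_in) * 0.01  # down-projection
B = torch.zeros(d_out, rank)        # up-projection (zero-initialized in LoRA)

def lora_delta(x: torch.Tensor) -> torch.Tensor:
    # The whole LoRA branch is scaled by alpha / rank, so its output and its
    # gradients shrink by that factor; D-Adaptation then compensates for the
    # small (sub)gradients with a much larger estimated learning rate d.
    return (alpha / rank) * (B @ (A @ x))
```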
Understood. Thanks for your reply!
Tried out @BootsofLagrangian's fork; it works really well IMO. Green is D-Adaptation and orange is a 1e-4 learning rate with 5e-5 for the text encoder. Also added regularization images for anime to the green lines. Showing below after 2000 steps.
Note: some posts related to learning:
The usage of my fork has changed. Now, passing the --use_dadaptation_optimzer arg activates dadaptation. Learning rate, UNet LR, and TE LR are still available as args, but they are not used as ordinary LRs: a one-digit float is the proper value for LR, UNet LR, and TE LR, e.g. 1.0, 1.0, 0.5.
The D-Adaptation optimizer is finally implemented. Thank you to @BootsofLagrangian for the PR, and thank you all for the great research!
So glad I can ignore setting LR now, thanks!! @BootsofLagrangian @kohya-ss so does this mean that if I set this option, I don't need to set an LR value / it ignores it?
Glad you found where the option was. It is indeed nice not to have to specify the LR.
Really appreciate your efforts; however, I seem to have had more success with the original post than the current one. With the original one, the results were great and I'm still tweaking it. Do the above posts imply that the LR increases when applying a dampening factor (like network alpha), rather than decreasing as it normally does? With the new one, the LR seems really low and can't produce results resembling the input images, even using 5 times the steps I normally use. But I may have mistweaked it; have you had success?
Interesting. I had a feeling the new D-Adaptation was different from the one in the branch... Some day I will see if there is a way to enable the original method vs. using the old branch for the task.
I have also been having issues with the d-adaptation implementation. I originally used it in PR form and it was working well, but I tried it recently in various testing I was doing and I couldn't get it to not explode and cause loss=nan. Learning also seems very slow, even though it has a decent dlr (lower average magnitude/average strength). Tried upping the unet_lr and text_encoder_lr to 2, 1.15, 1.25, or lowering to 0.75 (which I know isn't a multiplier), but still had poor results. Tried optimizer_args of "decouple=True" and/or weight_decay from 0.2, 0.1, 0.01, 1e-4, 1e-5, 1e-6, and nothing seemed to help it improve. I will try some tests on the code from BootsofLagrangian's branch to compare. I can also compare the code and see if something stands out to me.
@rockerBOO The current version only uses one learning rate for UNet and TextEncoder. I think this is the correct method, following the reference. So it is recommended to use a lower unet learning rate (e.g. 0.5) and the optimizer args "decouple=True" and "weight_decay=1.0". Because D-Adaptation uses boundedness, it will push dlr as high as it can (it seems to use the maximal learning rate). Anyway, try "weight_decay=1.0" in optimizer_args and a lower unet/TE lr coefficient (e.g. 0.5).
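As a concrete sketch of that recommendation (a stand-in module, not sd-scripts' code; the kwargs are the ones used earlier in this thread):

```python
import torch
import dadaptation

net = torch.nn.Linear(8, 8)  # stand-in for the trainable LoRA parameters

optimizer = dadaptation.DAdaptAdam(
    net.parameters(),
    lr=0.5,            # the lower unet/TE coefficient suggested above
    decouple=True,     # AdamW-style decoupled weight decay
    weight_decay=1.0,  # the suggested value to try
)
```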
@BootsofLagrangian thanks for taking a look! Settings:
Tried weight decay 0.5 and 1.5 as well. The last light blue run is without min_snr_gamma but has the same problem: once it starts going up, it starts producing noise, goes up fast, and never recovers. Using the same dataset and settings (changing the learning rate, and low or no weight decay) with AdamW produces good results, so it's within the reported settings being changed. Edit: also noting I'm running
First, rank (dimension) and alpha should be the same value with D-Adaptation. The α/r ratio has a direct impact on the learning rate and the weights (model): d*lr will increase when α/r decreases. So controlling the α and r values is an important and sensitive thing. Second, I don't have any experiments with min_snr_gamma, but I think min_snr_gamma accelerates training, and D-Adaptation does too; combining the two methods, the model explodes at an earlier step. (And there is some math to understand the deeper assumptions of D-Adaptation: it supposes the model is a kind of Lipschitz function, but the SD model isn't. Therefore, mathematically, D-Adaptation does not guarantee that the automatically chosen lr leads the model to convergence. So D-Adaptation combined with another speed-up method makes the model blow up.) Third, the lr scheduler may be the problem. Most Transformer models (including Stable Diffusion) use a learning rate scheduler with warmup or restarts. It helps the model update with small weight changes (ΔW) and reach the global minimum. You might consider using an lr scheduler with warmup and restarts (I recommend lr_scheduler=cosine_with_restarts and lr_warmup=[5~10% of total steps]).
Thanks for these suggestions @BootsofLagrangian. I am still working through the different permutations and having varying results; I'm trying to isolate the specific parameters that have a larger impact, and will try to assess and report back.

```
network_dim=16
network_alpha=16 # match dim
unet_lr=0.5 # the highest value of these will be the learning rate
text_encoder_lr=0.5 # the highest value of these will be the learning rate
optimizer_type="DAdaptAdam"
optimizer_args=["decouple=True", "weight_decay=1.0"] # weight decay may not be necessary; can help with overfitting, so play with different values and look it up for more info
lr_scheduler="cosine_with_restarts"
lr_warmup_steps=350 # 5-10% of total steps
```

In my initial findings, 0.5 LR, the matching rank, no min_snr_gamma (mostly to remove a variable), and using warmup with cosine_with_restarts seemed to work a lot better. But it's not consistently better with these options, and I'm trying other options as well, so I haven't pinned anything down. I would say a "warmup" is ideal with d-adaptation in my experimentation, as it tempers the dynamic learning rate. Too long or too short a warmup can drastically affect the dynamic learning rate, in my limited experience (needs more testing). The cycling learning rate also seems to help temper things, or lets the dynamic learning rate expand somewhat.
Noob here... |
After some epochs, reducing the learning rate is a useful method and not a placebo effect. Most learning rate schedulers do exactly that.
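For instance, in plain PyTorch (just to illustrate the idea, not this repo's code):

```python
import torch

net = torch.nn.Linear(4, 4)
opt = torch.optim.AdamW(net.parameters(), lr=1e-4)

# "Downsize the learning rate after some epochs", automated:
# halve the lr every 10 epochs.
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=10, gamma=0.5)
```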
I only asked BootsofLagrangian because of the language used in kohya when the training begins. It confused me with its wordage... that, and once I read up on the schedulers, it all seemed like the schedulers already do what I was trying to mimic manually. Thank you for letting me know kohya is not overruling the settings (right?). For example, when I set 1, .5, 1... the text learning rate is .5, but the wordage makes it sound like all settings were changed to .5... I'll have to copy it next time, but I'm sure you know the text I'm talking about, something about using only the "first" setting. Anyway, thank you!
Hi all - I was wondering how you are specifying different LRs for UNet and the text encoder? I get:

```
RuntimeError: Setting different lr values in different parameter groups is only supported for values of 0
```

Was this something that was changed in a recent update?
@phasiclabs I do believe that the newest version of dadaptation can only be set to 0/1 for each of TE and UNet. It was the original implementation that allowed for a scalar.
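A sketch of what that constraint looks like in practice (dummy modules; the behavior is inferred from the error message quoted above):

```python
import torch
import dadaptation

te = torch.nn.Linear(4, 4)    # stand-in for the text encoder LoRA params
unet = torch.nn.Linear(4, 4)  # stand-in for the UNet LoRA params

# Newer dadaptation versions appear to accept only one shared non-zero lr
# across parameter groups; per the error above, 0 (freezing a group) is the
# only other supported value.
optimizer = dadaptation.DAdaptAdam([
    {"params": te.parameters(), "lr": 1.0},
    {"params": unet.parameters(), "lr": 1.0},
    # {"params": ..., "lr": 0.5},  # would raise the RuntimeError above
])
```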
Ah, ok thanks for the info - just found this post too: #274
Have you solved this issue? I'm running into it too.
@wuliebucha I think he fixed that issue by fixing alpha. The D-Adaptation optimizer is very sensitive to the ratio of alpha and rank. You need to set alpha to the same value as rank. If alpha is lower than rank, the model can easily blow up.
Hi, how about D-Adaptation?
This is the kind of algorithm where the end user doesn't need to set a specific learning rate.
In short, D-Adaptation uses boundedness to find a proper learning rate.
So it might be useful to anyone who finds it hard to choose hyperparameters.
Before I wrote this issue, I implemented a D-Adaptation optimizer (Adam) for LoRA. It works!
A little code is needed for the implementation, but since I don't know all of the sd-scripts code, there is some hard coding.
The only requirements for D-Adaptation are torch>=1.5.1 and pip install dadaptation.
Here is the code.
In train_network.py, add

```python
import torch.optim as optim  # used for a raw learning rate scheduler
import dadaptation
```

and I hard-coded the optimizer, changing
```python
optimizer = optimizer_class(trainable_params, lr=args.learning_rate)
```

to

```python
optimizer = dadaptation.DAdaptAdam(trainable_params, lr=1.0, decouple=True, weight_decay=1.0)
```
Setting decouple=True means the optimizer is AdamW, not Adam, and weight_decay adds an L2 penalty. The other arguments are (probably) not meant for end users.
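For intuition, decoupled decay applies the penalty straight to the weights rather than folding it into the gradient that feeds Adam's moment estimates; roughly, as a toy illustration (not the library's actual code):

```python
import torch

p = torch.randn(3)            # a parameter
adam_update = torch.randn(3)  # stand-in for the Adam step direction
lr_eff, wd = 1e-3, 1.0        # effective step size (here d*lr) and weight_decay

p = p - lr_eff * adam_update  # the usual Adam step
p = p - lr_eff * wd * p       # decoupled decay, applied to the weights directly
```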
And trainable_params doesn't need a specific learning rate, so replace

```python
trainable_params = network.prepare_optimizer_params(args.text_encoder_lr, args.unet_lr)
```

with

```python
trainable_params = network.prepare_optimizer_params(None, None)
```
In sd-scripts, lr_scheduler is the return value of the get_scheduler_fix function. But, I don't know why, using get_scheduler_fix interferes with D-Adaptation, so I overrode lr_scheduler with LambdaLR. Sorry for the hard coding again :)

```python
lr_scheduler = optim.lr_scheduler.LambdaLR(optimizer=optimizer, lr_lambda=[lambda epoch: 1, lambda epoch: 1], last_epoch=-1, verbose=False)
```
For monitoring the d*lr value,

```python
logs['lr/d*lr'] = optimizer.param_groups[0]['d'] * optimizer.param_groups[0]['lr']
```

might be needed. All things done.
This is the d*lr vs. step graph from when I used D-Adaptation.
I trained a LoRA using D-Adaptation; the result is here.
Thank you!