
Error when training LyCORIS/LoHA: --multires_noise_discount=0.2']' returned non-zero exit status 1 #1032

Closed
tandpastatester opened this issue Jun 21, 2023 · 4 comments


tandpastatester commented Jun 21, 2023

I followed this YouTube tutorial and used the same configuration.

I'm using the "SDiffusion Dreambooth ControlNet Deforum Kohya" template on RunPod to run kohya_ss. I was previously able to train LoRAs with this setup without hitting this error.

RunPod log:

```
  0%|          | 0/40 [00:00<?, ?it/s]
100%|██████████| 40/40 [00:00<00:00, 2931.80it/s]
/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/accelerator.py:258: FutureWarning: `logging_dir` is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use `project_dir` instead.
  warnings.warn(
/workspace/kohya_ss/venv/lib/python3.10/site-packages/safetensors/torch.py:98: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  with safe_open(filename, framework="pt", device=device) as f:
0it [00:00, ?it/s]

Traceback (most recent call last):
  /workspace/kohya_ss/train_network.py:864 in <module>
      861     args = parser.parse_args()
      862     args = train_util.read_config_from_file(args, parser)
      863
    ❱ 864     train(args)
      865

  /workspace/kohya_ss/train_network.py:214 in train
      211         network, _ = network_module.create_network_from_weights(1, arg
      212     else:
      213         # LyCORIS will work with this...
    ❱ 214         network = network_module.create_network(
      215             1.0, args.network_dim, args.network_alpha, vae, text_encod
      216         )
      217     if network is None:

  /workspace/kohya_ss/venv/lib/python3.10/site-packages/lycoris/kohya.py:23 in create_network
       20         network_dim = 4  # default
       21     conv_dim = int(kwargs.get('conv_dim', network_dim))
       22     conv_alpha = float(kwargs.get('conv_alpha', network_alpha))
    ❱  23     dropout = float(kwargs.get('dropout', 0.))
       24     algo = kwargs.get('algo', 'lora')
       25     use_cp = (not kwargs.get('disable_conv_cp', True)
       26               or kwargs.get('use_conv_cp', False))
TypeError: float() argument must be a string or a real number, not 'NoneType'

Traceback (most recent call last):
  /workspace/kohya_ss/venv/bin/accelerate:8 in <module>
      5   from accelerate.commands.accelerate_cli import main
      6   if __name__ == '__main__':
      7       sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
    ❱ 8       sys.exit(main())

  /workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py:45 in main
      42         exit(1)
      43
      44     # Run
    ❱ 45     args.func(args)

  /workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py:918 in launch_command
      915     elif defaults is not None and defaults.compute_environment == Comp
      916         sagemaker_launcher(defaults, args)
      917     else:
    ❱ 918         simple_launcher(args)

  /workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py:580 in simple_launcher
      577     process.wait()
      578     if process.returncode != 0:
      579         if not args.quiet:
    ❱ 580             raise subprocess.CalledProcessError(returncode=process.ret
      581         else:
      582             sys.exit(1)
CalledProcessError: Command '['/workspace/kohya_ss/venv/bin/python', 'train_network.py',
'--enable_bucket',
'--pretrained_model_name_or_path=/workspace/stable-diffusion-webui/models/Stable-diffusion/Realistic_Vision_V2.0-fp16-no-ema.safetensors',
'--train_data_dir=/workspace/partyparrot-v4/images_g4/', '--resolution=512,512',
'--output_dir=/workspace/partyparrot-v4/model',
'--logging_dir=/workspace/partyparrot-v4/log', '--network_alpha=16',
'--training_comment=trigger words: partyparrot', '--save_model_as=safetensors',
'--network_module=lycoris.kohya', '--network_args', 'conv_dim=8',
'conv_alpha=4', 'algo=loha', '--text_encoder_lr=0.0001', '--unet_lr=0.0001',
'--network_dim=32', '--gradient_accumulation_steps=10',
'--output_name=partyparrot-v4', '--lr_scheduler_num_cycles=4',
'--learning_rate=0.0001', '--lr_scheduler=cosine', '--train_batch_size=4',
'--max_train_steps=120', '--save_every_n_epochs=1', '--mixed_precision=bf16',
'--save_precision=fp16', '--seed=1234', '--caption_extension=.txt',
'--cache_latents', '--cache_latents_to_disk', '--optimizer_type=AdamW',
'--max_data_loader_n_workers=0', '--bucket_reso_steps=1', '--min_snr_gamma=10',
'--xformers', '--bucket_no_upscale', '--multires_noise_iterations=8',
'--multires_noise_discount=0.2']' returned non-zero exit status 1.
```
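For what it's worth, the immediate cause is visible in the lycoris/kohya.py frame of the traceback: `dict.get()` only substitutes its default when the key is *absent*, so an explicit `dropout=None` in the network kwargs reaches `float()` unchanged. The sketch below is hypothetical: the `create_network_*` names are mine, and the assumption that the caller forwards `dropout=None` is inferred from the traceback (the `--network_args` in this run never set `dropout`):

```python
def create_network_sketch(**kwargs):
    # Reduction of the failing line (lycoris/kohya.py:23).
    # dict.get() only falls back to its default when the key is MISSING;
    # an explicit None value is returned as-is, so float(None) raises.
    return float(kwargs.get("dropout", 0.0))


def create_network_fixed(**kwargs):
    # Defensive variant: treat an explicit None like a missing key.
    value = kwargs.get("dropout")
    return float(value) if value is not None else 0.0


try:
    create_network_sketch(dropout=None)
except TypeError as e:
    print(e)  # float() argument must be a string or a real number, not 'NoneType'

print(create_network_fixed(dropout=None))  # 0.0
```

If this reading is right, supplying an explicit numeric `dropout=0` through the extra network arguments (where the GUI allows it) should also sidestep the crash.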

kohya_ss.log:
23:55:02-826545 INFO Start training LoRA LyCORIS/LoHa ...
23:55:02-828821 INFO Folder 30_partyparrot-v4: 40 images found
23:55:02-830436 INFO Folder 30_partyparrot-v4: 1200 steps
23:55:02-832026 INFO Error: '.ipynb_checkpoints' does not contain an
underscore, skipping...
23:55:02-833831 INFO Total steps: 1200
23:55:02-835335 INFO Train batch size: 4
23:55:02-836837 INFO Gradient accumulation steps: 10.0
23:55:02-838378 INFO Epoch: 4
23:55:02-839839 INFO Regulatization factor: 1
23:55:02-841369 INFO max_train_steps (1200 / 4 / 10.0 * 4 * 1) = 120
23:55:02-843308 INFO stop_text_encoder_training = 0
23:55:02-844894 INFO lr_warmup_steps = 0
23:55:02-846482 INFO accelerate launch --num_cpu_threads_per_process=2
"train_network.py" --enable_bucket
--pretrained_model_name_or_path="/workspace/stable-diff
usion-webui/models/Stable-diffusion/Realistic_Vision_V2
.0-fp16-no-ema.safetensors"
--train_data_dir="/workspace/partyparrot-v4/images_g4/"
--resolution=512,512
--output_dir="/workspace/partyparrot-v4/model"
--logging_dir="/workspace/partyparrot-v4/log"
--network_alpha="16" --training_comment="trigger words:
partyparrot" --save_model_as=safetensors
--network_module=lycoris.kohya --network_args
"conv_dim=8" "conv_alpha=4" "algo=loha"
--text_encoder_lr=0.0001 --unet_lr=0.0001
--network_dim=32 --gradient_accumulation_steps=10
--output_name="partyparrot-v4" --lr_scheduler_num_cycles="4"
--learning_rate="0.0001" --lr_scheduler="cosine"
--train_batch_size="4" --max_train_steps="120"
--save_every_n_epochs="1" --mixed_precision="bf16"
--save_precision="fp16" --seed="1234"
--caption_extension=".txt" --cache_latents
--cache_latents_to_disk --optimizer_type="AdamW"
--max_data_loader_n_workers="0" --bucket_reso_steps=1
--min_snr_gamma=10 --xformers --bucket_no_upscale
--multires_noise_iterations="8"
--multires_noise_discount="0.2"
[23:55:08] WARNING The following values were not passed to launch.py:890
'accelerate launch' and had defaults used
instead:
'--num_processes' was set to a value
of '1'
'--num_machines' was set to a value of
'1'
'--mixed_precision' was set to a value
of ''no''
''--dynamo_backend'' was set to a value
of ''no''
To avoid this warning pass in values for each
of the problematic parameters or run
'accelerate config'.
prepare tokenizer
Using DreamBooth method.
ignore directory without repeats / 繰り返し回数のないディレクトリを無視します: .ipynb_checkpoints
prepare images.
found directory /workspace/partyparrot-v4/images_g4/30_partyparrot-v4 contains 40 image files
No caption file found for 40 images. Training will continue without captions for these images. If class token exists, it will be used. / 40枚の画像にキャプションファイルが見つかりませんでした。これらの画像についてはキャプションなしで学習を続行します。class tokenが存在する場合はそれを使います。
/workspace/partyparrot-v4/images_g4/30_partyparrot-v4/group-1-image-1.jpg
/workspace/partyparrot-v4/images_g4/30_partyparrot-v4/group-1-image-2.jpg
/workspace/partyparrot-v4/images_g4/30_partyparrot-v4/group-1-image-3.jpg
/workspace/partyparrot-v4/images_g4/30_partyparrot-v4/group-1-image-4.jpg
/workspace/partyparrot-v4/images_g4/30_partyparrot-v4/group-10-image-1.jpg
/workspace/partyparrot-v4/images_g4/30_partyparrot-v4/group-10-image-2.jpg... and 35 more
1200 train images with repeating.
0 reg images.
no regularization images / 正則化画像が見つかりませんでした
[Dataset 0]
batch_size: 4
resolution: (512, 512)
enable_bucket: True
min_bucket_reso: 256
max_bucket_reso: 1024
bucket_reso_steps: 1
bucket_no_upscale: True

[Subset 0 of Dataset 0]
image_dir: "/workspace/partyparrot-v4/images_g4/30_partyparrot-v4"
image_count: 40
num_repeats: 30
shuffle_caption: False
keep_tokens: 0
caption_dropout_rate: 0.0
caption_dropout_every_n_epoches: 0
caption_tag_dropout_rate: 0.0
color_aug: False
flip_aug: False
face_crop_aug_range: None
random_crop: False
token_warmup_min: 1,
token_warmup_step: 0,
is_reg: False
class_tokens: partyparrot-v4
caption_extension: .txt

[Dataset 0]
loading image sizes.
make buckets
min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is set, because bucket reso is defined by image size automatically / bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計算されるため、min_bucket_resoとmax_bucket_resoは無視されます
number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む)
bucket 0: resolution (437, 599), count: 120
bucket 1: resolution (443, 591), count: 360
bucket 2: resolution (462, 568), count: 120
bucket 3: resolution (512, 512), count: 120
bucket 4: resolution (533, 492), count: 120
bucket 5: resolution (591, 443), count: 360
mean ar error (without repeats): 0.00043492747370370565
preparing accelerator
Using accelerator 0.15.0 or above.
loading model for process 0/1
load StableDiffusion checkpoint: /workspace/stable-diffusion-webui/models/Stable-diffusion/Realistic_Vision_V2.0-fp16-no-ema.safetensors
loading u-net:
loading vae:
loading text encoder:
CrossAttention.forward has been replaced to enable xformers.
import network module: lycoris.kohya
[Dataset 0]
caching latents.
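The step arithmetic the GUI logs above checks out; here is a quick sketch using the values from this run (the variable names are mine, not the GUI's):

```python
# Values taken from the kohya_ss log above.
images, repeats = 40, 30
total_steps = images * repeats            # 1200 (40 images x 30 repeats)
train_batch_size = 4
gradient_accumulation_steps = 10
epochs = 4
regularization_factor = 1

# Matches the logged "max_train_steps (1200 / 4 / 10.0 * 4 * 1) = 120".
max_train_steps = int(
    total_steps / train_batch_size / gradient_accumulation_steps
    * epochs * regularization_factor
)
print(max_train_steps)  # 120
```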

tandpastatester changed the title from "Error when training LyCORIS/LoHA: --multires_noise_discount=0.8']' returned non-zero exit status 1" to "Error when training LyCORIS/LoHA: --multires_noise_discount=0.2']' returned non-zero exit status 1" on Jun 21, 2023.
bmaltais (Owner) commented:

Are you using Python 3.10.9?


tandpastatester commented Jun 22, 2023

> Are you using Python 3.10.9?

According to JupyterLab:

root@5d369ee5783b:/workspace/kohya_ss# python --version
Python 3.10.6

The docs say "Make sure to use a version of python >= 3.10.6 and < 3.11.0". Does it need 3.10.9 exactly? I'm not sure whether I can upgrade Python inside a RunPod pod.

https://github.com/bmaltais/kohya_ss/blob/master/README.md#linux-and-macos
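The README's range can be checked mechanically; a minimal sketch (the function name is mine; the bounds are taken from the README line quoted above):

```python
import sys

# Supported range per the kohya_ss README: >= 3.10.6 and < 3.11.0.
def python_version_supported(version=None):
    v = tuple(version or sys.version_info[:3])
    return (3, 10, 6) <= v[:3] < (3, 11, 0)

print(python_version_supported((3, 10, 6)))  # True  (the RunPod image above)
print(python_version_supported((3, 10, 9)))  # True
print(python_version_supported((3, 11, 0)))  # False
```

By that check, the pod's Python 3.10.6 is inside the documented range, so the version alone should not be the problem.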

bmaltais (Owner) commented:

Yeah... probably something specific to the non-Windows environment. I only test on Windows, as it is my main system.

Kohya, the author, only supports Windows, and the error appears to be in his train_network.py code. You could try opening an issue directly on his sd-scripts repo and see if he is willing to look into it...

tandpastatester (Author) commented:

> Yeah... probably something specific to the non-Windows environment. I only test on Windows, as it is my main system.
>
> Kohya, the author, only supports Windows, and the error appears to be in his train_network.py code. You could try opening an issue directly on his sd-scripts repo and see if he is willing to look into it...

Thanks for your feedback. Instead of using the RunPod template, I tried launching a clean pod and installing kohya_ss manually. I ended up with the same issue, but this way I had a terminal shell open with the kohya_ss CLI, which gives more info/feedback than the log; see below.

It seems like it's just an OOM issue? I'm running in a pod with 24 GB of VRAM (RTX 3090), using the exact same settings as in your video. What GPU are you using? Or could it be the size of the training images?

15:35:01-648887 INFO     Loading config...                                                                                                                                                                      
15:35:22-490939 INFO     Start training Dreambooth...                                                                                                                                                           
15:35:22-492653 INFO     Valid image folder names found in: /workspace/partyparrot-v4/img/                                                                                                                           
15:35:22-493936 INFO     Folder 30_partyparrot-v4 bird : steps 1200                                                                                                                                                 
15:35:22-494983 INFO     max_train_steps = 120                                                                                                                                                                  
15:35:22-495953 INFO     stop_text_encoder_training = 0                                                                                                                                                         
15:35:22-496937 INFO     lr_warmup_steps = 0                                                                                                                                                                    
15:35:22-498026 INFO     accelerate launch --num_cpu_threads_per_process=2 "train_db.py" --enable_bucket                                                                                                        
                         --pretrained_model_name_or_path="/workspace/stable-diffusion-webui/models/Stable-diffusion/Realistic_Vision_V2.0-fp16-no-ema.safetensors" --train_data_dir="/workspace/partyparrot-v4/img/" 
                         --resolution="512,512" --output_dir="/workspace/partyparrot-v4/model" --logging_dir="/workspace/partyparrot-v4/log" --save_model_as=safetensors --output_name="partyparrot-v4"                        
                         --max_data_loader_n_workers="0" --gradient_accumulation_steps=10 --learning_rate="0.0001" --lr_scheduler="cosine" --train_batch_size="4" --max_train_steps="120"                       
                         --save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="fp16" --seed="1234" --caption_extension=".txt" --cache_latents --cache_latents_to_disk --optimizer_type="AdamW"   
                         --max_data_loader_n_workers="0" --bucket_reso_steps=1 --min_snr_gamma=10 --xformers --bucket_no_upscale --multires_noise_iterations="8" --multires_noise_discount="0.2"                
2023-06-22 15:35:23.490915: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-06-22 15:35:23.606543: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-06-22 15:35:24.026100: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/lib/x86_64-linux-gnu
2023-06-22 15:35:24.026174: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/lib/x86_64-linux-gnu
2023-06-22 15:35:24.026183: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2023-06-22 15:35:25.807544: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-06-22 15:35:25.922280: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-06-22 15:35:26.284790: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/lib/x86_64-linux-gnu
2023-06-22 15:35:26.284844: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/lib/x86_64-linux-gnu
2023-06-22 15:35:26.284853: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
prepare tokenizer
prepare images.
found directory /workspace/partyparrot-v4/img/30_partyparrot-v4 bird contains 40 image files
No caption file found for 40 images. Training will continue without captions for these images. If class token exists, it will be used. / 40枚の画像にキャプションファイルが見つかりませんでした。これらの画像についてはキャプションなしで学習を続行します。class tokenが存在する場合はそれを使います。
/workspace/partyparrot-v4/img/30_partyparrot-v4 bird/group-1-image-1.jpg
/workspace/partyparrot-v4/img/30_partyparrot-v4 bird/group-1-image-2.jpg
/workspace/partyparrot-v4/img/30_partyparrot-v4 bird/group-1-image-3.jpg
/workspace/partyparrot-v4/img/30_partyparrot-v4 bird/group-1-image-4.jpg
/workspace/partyparrot-v4/img/30_partyparrot-v4 bird/group-10-image-1.jpg
/workspace/partyparrot-v4/img/30_partyparrot-v4 bird/group-10-image-2.jpg... and 35 more
1200 train images with repeating.
0 reg images.
no regularization images / 正則化画像が見つかりませんでした
[Dataset 0]
  batch_size: 4
  resolution: (512, 512)
  enable_bucket: True
  min_bucket_reso: 256
  max_bucket_reso: 1024
  bucket_reso_steps: 1
  bucket_no_upscale: True

  [Subset 0 of Dataset 0]
    image_dir: "/workspace/partyparrot-v4/img/30_partyparrot-v4 bird"
    image_count: 40
    num_repeats: 30
    shuffle_caption: False
    keep_tokens: 0
    caption_dropout_rate: 0.0
    caption_dropout_every_n_epoches: 0
    caption_tag_dropout_rate: 0.0
    color_aug: False
    flip_aug: False
    face_crop_aug_range: None
    random_crop: False
    token_warmup_min: 1,
    token_warmup_step: 0,
    is_reg: False
    class_tokens: partyparrot-v4 bird
    caption_extension: .txt


[Dataset 0]
loading image sizes.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:00<00:00, 7702.68it/s]
make buckets
min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is set, because bucket reso is defined by image size automatically / bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計算されるため、min_bucket_resoとmax_bucket_resoは無視されます
number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む)
bucket 0: resolution (437, 599), count: 120
bucket 1: resolution (443, 591), count: 360
bucket 2: resolution (462, 568), count: 120
bucket 3: resolution (512, 512), count: 120
bucket 4: resolution (533, 492), count: 120
bucket 5: resolution (591, 443), count: 360
mean ar error (without repeats): 0.00043492747370370565
prepare accelerator
gradient_accumulation_steps is 10. accelerate does not support gradient_accumulation_steps when training multiple models (U-Net and Text Encoder), so something might be wrong
gradient_accumulation_stepsが10に設定されています。accelerateは複数モデル(U-NetおよびText Encoder)の学習時にgradient_accumulation_stepsをサポートしていないため結果は未知数です
Using accelerator 0.15.0 or above.
loading model for process 0/1
load StableDiffusion checkpoint: /workspace/stable-diffusion-webui/models/Stable-diffusion/Realistic_Vision_V2.0-fp16-no-ema.safetensors
/workspace/kohya_ss/venv/lib/python3.10/site-packages/safetensors/torch.py:98: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  with safe_open(filename, framework="pt", device=device) as f:
loading u-net: <All keys matched successfully>
loading vae: <All keys matched successfully>
Downloading pytorch_model.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.71G/1.71G [00:14<00:00, 114MB/s]
loading text encoder: <All keys matched successfully>
CrossAttention.forward has been replaced to enable xformers.
[Dataset 0]
caching latents.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:03<00:00, 12.98it/s]
prepare optimizer, data loader etc.
use AdamW optimizer | {}
running training / 学習開始
  num train images * repeats / 学習画像の数×繰り返し回数: 1200
  num reg images / 正則化画像の数: 0
  num batches per epoch / 1epochのバッチ数: 300
  num epochs / epoch数: 4
  batch size per device / バッチサイズ: 4
  total train batch size (with parallel & distributed & accumulation) / 総バッチサイズ(並列学習、勾配合計含む): 40
  gradient ccumulation steps / 勾配を合計するステップ数 = 10
  total optimization steps / 学習ステップ数: 120
steps:   0%|                                                                                                                                                                            | 0/120 [00:00<?, ?it/s]
epoch 1/4
steps:   0%|                                                                                                                                                               | 0/120 [00:03<?, ?it/s, loss=0.0596]╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /workspace/kohya_ss/train_db.py:486 in <module>                                                  │
│                                                                                                  │
│   483 │   args = parser.parse_args()                                                             │
│   484 │   args = train_util.read_config_from_file(args, parser)                                  │
│   485 │                                                                                          │
│ ❱ 486 │   train(args)                                                                            │
│   487                                                                                            │
│                                                                                                  │
│ /workspace/kohya_ss/train_db.py:350 in train                                                     │
│                                                                                                  │
│   347 │   │   │   │   │   │   params_to_clip = unet.parameters()                                 │
│   348 │   │   │   │   │   accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)        │
│   349 │   │   │   │                                                                              │
│ ❱ 350 │   │   │   │   optimizer.step()                                                           │
│   351 │   │   │   │   lr_scheduler.step()                                                        │
│   352 │   │   │   │   optimizer.zero_grad(set_to_none=True)                                      │
│   353                                                                                            │
│                                                                                                  │
│ /workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/optimizer.py:134 in step        │
│                                                                                                  │
│   131 │   │   │   │   xm.optimizer_step(self.optimizer, optimizer_args=optimizer_args)           │
│   132 │   │   │   elif self.scaler is not None:                                                  │
│   133 │   │   │   │   scale_before = self.scaler.get_scale()                                     │
│ ❱ 134 │   │   │   │   self.scaler.step(self.optimizer, closure)                                  │
│   135 │   │   │   │   self.scaler.update()                                                       │
│   136 │   │   │   │   scale_after = self.scaler.get_scale()                                      │
│   137 │   │   │   │   # If we reduced the loss scale, it means the optimizer step was skipped    │
│                                                                                                  │
│ /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py:374 in step  │
│                                                                                                  │
│   371 │   │                                                                                      │
│   372 │   │   assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were rec   │
│   373 │   │                                                                                      │
│ ❱ 374 │   │   retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)         │
│   375 │   │                                                                                      │
│   376 │   │   optimizer_state["stage"] = OptState.STEPPED                                        │
│   377                                                                                            │
│                                                                                                  │
│ /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py:290 in       │
│ _maybe_opt_step                                                                                  │
│                                                                                                  │
│   287 │   def _maybe_opt_step(self, optimizer, optimizer_state, *args, **kwargs):                │
│   288 │   │   retval = None                                                                      │
│   289 │   │   if not sum(v.item() for v in optimizer_state["found_inf_per_device"].values()):    │
│ ❱ 290 │   │   │   retval = optimizer.step(*args, **kwargs)                                       │
│   291 │   │   return retval                                                                      │
│   292 │                                                                                          │
│   293 │   def step(self, optimizer, *args, **kwargs):                                            │
│                                                                                                  │
│ /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/optim/lr_scheduler.py:69 in wrapper  │
│                                                                                                  │
│     66 │   │   │   │   instance = instance_ref()                                                 │
│     67 │   │   │   │   instance._step_count += 1                                                 │
│     68 │   │   │   │   wrapped = func.__get__(instance, cls)                                     │
│ ❱   69 │   │   │   │   return wrapped(*args, **kwargs)                                           │
│     70 │   │   │                                                                                 │
│     71 │   │   │   # Note that the returned function here is no longer a bound method,           │
│     72 │   │   │   # so attributes like `__func__` and `__self__` no longer exist.               │
│                                                                                                  │
│ /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/optim/optimizer.py:280 in wrapper    │
│                                                                                                  │
│   277 │   │   │   │   │   │   │   raise RuntimeError(f"{func} must return None or a tuple of (   │
│   278 │   │   │   │   │   │   │   │   │   │   │      f"but got {result}.")                       │
│   279 │   │   │   │                                                                              │
│ ❱ 280 │   │   │   │   out = func(*args, **kwargs)                                                │
│   281 │   │   │   │   self._optimizer_step_code()                                                │
│   282 │   │   │   │                                                                              │
│   283 │   │   │   │   # call optimizer step post hooks                                           │
│                                                                                                  │
│ /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/optim/optimizer.py:33 in _use_grad   │
│                                                                                                  │
│    30 │   │   prev_grad = torch.is_grad_enabled()                                                │
│    31 │   │   try:                                                                               │
│    32 │   │   │   torch.set_grad_enabled(self.defaults['differentiable'])                        │
│ ❱  33 │   │   │   ret = func(self, *args, **kwargs)                                              │
│    34 │   │   finally:                                                                           │
│    35 │   │   │   torch.set_grad_enabled(prev_grad)                                              │
│    36 │   │   return ret                                                                         │
│                                                                                                  │
│ /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/optim/adamw.py:171 in step           │
│                                                                                                  │
│   168 │   │   │   │   state_steps,                                                               │
│   169 │   │   │   )                                                                              │
│   170 │   │   │                                                                                  │
│ ❱ 171 │   │   │   adamw(                                                                         │
│   172 │   │   │   │   params_with_grad,                                                          │
│   173 │   │   │   │   grads,                                                                     │
│   174 │   │   │   │   exp_avgs,                                                                  │
│                                                                                                  │
│ /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/optim/adamw.py:321 in adamw          │
│                                                                                                  │
│   318 │   else:                                                                                  │
│   319 │   │   func = _single_tensor_adamw                                                        │
│   320 │                                                                                          │
│ ❱ 321 │   func(                                                                                  │
│   322 │   │   params,                                                                            │
│   323 │   │   grads,                                                                             │
│   324 │   │   exp_avgs,                                                                          │
│                                                                                                  │
│ /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/optim/adamw.py:566 in                │
│ _multi_tensor_adamw                                                                              │
│                                                                                                  │
│   563 │   │   │   else:                                                                          │
│   564 │   │   │   │   exp_avg_sq_sqrt = torch._foreach_sqrt(device_exp_avg_sqs)                  │
│   565 │   │   │   │   torch._foreach_div_(exp_avg_sq_sqrt, bias_correction2_sqrt)                │
│ ❱ 566 │   │   │   │   denom = torch._foreach_add(exp_avg_sq_sqrt, eps)                           │
│   567 │   │   │                                                                                  │
│   568 │   │   │   torch._foreach_addcdiv_(device_params, device_exp_avgs, denom, step_size)      │
│   569                                                                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
OutOfMemoryError: CUDA out of memory. Tried to allocate 58.00 MiB (GPU 0; 23.69 GiB total capacity; 19.32 GiB already allocated; 4.06 MiB free; 19.62 GiB reserved in total by PyTorch) If reserved memory is >>
allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
steps:   0%|                                                                                                                                                               | 0/120 [00:04<?, ?it/s, loss=0.0596]
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /workspace/kohya_ss/venv/bin/accelerate:8 in <module>                                            │
│                                                                                                  │
│   5 from accelerate.commands.accelerate_cli import main                                          │
│   6 if __name__ == '__main__':                                                                   │
│   7 │   sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])                         │
│ ❱ 8 │   sys.exit(main())                                                                         │
│   9                                                                                              │
│                                                                                                  │
│ /workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py:45   │
│ in main                                                                                          │
│                                                                                                  │
│   42 │   │   exit(1)                                                                             │
│   43 │                                                                                           │
│   44 │   # Run                                                                                   │
│ ❱ 45 │   args.func(args)                                                                         │
│   46                                                                                             │
│   47                                                                                             │
│   48 if __name__ == "__main__":                                                                  │
│                                                                                                  │
│ /workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py:1104 in      │
│ launch_command                                                                                   │
│                                                                                                  │
│   1101 │   elif defaults is not None and defaults.compute_environment == ComputeEnvironment.AMA  │
│   1102 │   │   sagemaker_launcher(defaults, args)                                                │
│   1103 │   else:                                                                                 │
│ ❱ 1104 │   │   simple_launcher(args)                                                             │
│   1105                                                                                           │
│   1106                                                                                           │
│   1107 def main():                                                                               │
│                                                                                                  │
│ /workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py:567 in       │
│ simple_launcher                                                                                  │
│                                                                                                  │
│    564 │   process = subprocess.Popen(cmd, env=current_env)                                      │
│    565 │   process.wait()                                                                        │
│    566 │   if process.returncode != 0:                                                           │
│ ❱  567 │   │   raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)       │
│    568                                                                                           │
│    569                                                                                           │
│    570 def multi_gpu_launcher(args):                                                             │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
CalledProcessError: Command '['/workspace/kohya_ss/venv/bin/python', 'train_db.py', '--enable_bucket', 
'--pretrained_model_name_or_path=/workspace/stable-diffusion-webui/models/Stable-diffusion/Realistic_Vision_V2.0-fp16-no-ema.safetensors', '--train_data_dir=/workspace/partyparrot-v4/img/', '--resolution=512,512',
'--output_dir=/workspace/partyparrot-v4/model', '--logging_dir=/workspace/partyparrot-v4/log', '--save_model_as=safetensors', '--output_name=partyparrot-v4', '--max_data_loader_n_workers=0', 
'--gradient_accumulation_steps=10', '--learning_rate=0.0001', '--lr_scheduler=cosine', '--train_batch_size=4', '--max_train_steps=120', '--save_every_n_epochs=1', '--mixed_precision=bf16', 
'--save_precision=fp16', '--seed=1234', '--caption_extension=.txt', '--cache_latents', '--cache_latents_to_disk', '--optimizer_type=AdamW', '--max_data_loader_n_workers=0', '--bucket_reso_steps=1', 
'--min_snr_gamma=10', '--xformers', '--bucket_no_upscale', '--multires_noise_iterations=8', '--multires_noise_discount=0.2']' returned non-zero exit status 1.

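The `OutOfMemoryError` above is the real failure; the `CalledProcessError` is just accelerate reporting that the child training process exited non-zero. The PyTorch message itself suggests setting `max_split_size_mb` to reduce allocator fragmentation. A minimal sketch of how that could be applied before re-launching (the 128 MiB value and the reduced-batch flags are assumptions to tune, not values from this thread):

```shell
# Cap the CUDA caching allocator's split size to fight fragmentation.
# 128 MiB is an assumed starting point; tune for your GPU.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

# Then re-run the same accelerate command, optionally trading batch size
# for more gradient accumulation to keep the effective batch the same, e.g.:
#   accelerate launch train_db.py ... --train_batch_size=1 --gradient_accumulation_steps=40
echo "$PYTORCH_CUDA_ALLOC_CONF"
```

If that is not enough, lowering `--train_batch_size` or switching to a memory-lighter optimizer such as `--optimizer_type=AdamW8bit` (supported by kohya's sd-scripts, requires bitsandbytes) are common next steps; LoHA/LyCORIS networks also use noticeably more VRAM than plain LoRA at the same rank.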
bmaltais pushed a commit that referenced this issue Jan 8, 2024
Added cli argument for wandb session name