Increase speed #22
Comments
Going to check when I have time, thanks.
OK, removing bfloat16 from Flux model support really gave a 2x speedup; it now defaults to fp16.
Oh cool, I wondered about that in my [Reddit Post](https://www.reddit.com/r/FluxAI/comments/1eztuch/flux_on_amd_gpus_rdna3_wzluda/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button)! I ran the update and did some testing, and confirmed the float type is no longer bfloat16. The speed is the same though - about 2 seconds per it (7900 XTX, 32 GB RAM, Windows 10, Radeon 24.8.1 driver). What sampler/scheduler are you seeing the speed increase with?
euler, simple
Euler - Simple: try using --force-fp32 or --force-fp16, and if there is no improvement, then --use-split-cross-attention.
I got TunableOp working by putting this in start.bat:
The .csv file is created, but the process can't write to it. Either it does not support Windows/ZLUDA and needs to be tricked into thinking it is running on ROCm, or, as someone said in the comments, main.py needs to be run directly, which for me gives an error.
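(For context, TunableOp is normally driven by environment variables that have to be in place before torch is imported, which is what additions to start.bat amount to. A minimal Python sketch of that setup; the variable names are PyTorch's documented TunableOp settings, and the filename is illustrative, not necessarily what was used in the start.bat above:)

```python
import os

# TunableOp reads these variables at import time, so set them before importing torch.
os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"          # turn TunableOp on
os.environ["PYTORCH_TUNABLEOP_TUNING"] = "1"           # tune ops that have no entry yet
os.environ["PYTORCH_TUNABLEOP_FILENAME"] = "tunableop_results.csv"  # illustrative name

import torch  # noqa: E402  - imported only after the variables are in place
```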
I can run ComfyUI without putting zluda in front of python ... in the batch file, and it works that way. But let's try this: update Comfy. There is a new batch file which enables this TunableOp. Try with SD 1.5, for example, for 3 consecutive runs, then "exit" with CTRL-C in the cmd window, and you can see it is writing to the csv file, although I haven't tried it with zluda in front, to be honest. In the end it is working, BUT when I tried it with SD 1.5, SDXL and also Flux, there wasn't much difference. Maybe it needs more testing. Maybe I am / we are already doing other speed-up tricks that do what this intends to do.

EDIT: Nope, it doesn't work with zluda in front... You can try adding the zluda folder to the Windows PATH. Type "env" while the start menu is open; it will show you a shortcut to "Edit the system environment variables". Click "Environment Variables", look for "Path" in the bottom section, and add the zluda folder to the system path (the one inside the ComfyUI-Zluda folder will probably work), for example: "D:\ComfyUI-Zluda\zluda". Restart the system to be sure. Then remove the part ".\zluda\zluda.exe -- " and just start with
Added the zluda path. It runs with both:
and
But I get a similar error as when running it from start.bat, and tunableop_results0.csv remains empty:
Error with .\zluda\zluda.exe -- %PYTHON% main.py %COMMANDLINE_ARGS%
Error with %PYTHON% main.py %COMMANDLINE_ARGS%
Are these files specific to the GPU or the model, or what? Maybe we can exchange them, at least with the same models? I have an RX 6600.
Seems to be specific to each run, similar to the zluda.db file.
I have an RX 6700, but the Validator,PT_VERSION and Validator,ROCM_VERSION lines should be the same or similar from when comfyui-zluda was installed. Could you open the tunableop_results0 and copy-paste the
I'll try to edit the .csv file manually and run the .bat again.
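(If it helps with the comparison above, a small helper like this prints just the Validator header rows from the results file so they can be shared without posting the whole thing; the filename matches the one mentioned earlier:)

```python
import csv

def print_validators(path="tunableop_results0.csv"):
    # Print only the "Validator,..." rows (PT_VERSION, ROCM_VERSION, etc.)
    # so they can be compared between machines.
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if row and row[0] == "Validator":
                print(",".join(row))

if __name__ == "__main__":
    print_validators()
```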
No problem sharing the whole file. Only the PT version is there, but it is working this way though.
Early results, haven't tested much yet! After the most recent update, generation time went from ~2 seconds/it to ~45 seconds/it when using the t5XXL_FP16 CLIP. Using the FP8 CLIP, I'm seeing about 8 seconds/it. Odd. Still using the same torch.float16 data type. Let me tinker a bit more and I'll see if I can get some more conclusive info.
It works - my 16 s/it changed to 8 s/it.
After the last update everything started working much slower. I collected some information; some changes in the main branch caused a slowdown. My settings:
checkout aeab6d1
normal start
tunableop start
after checkout 51af244
Having huge increases in execution time too!! I backed up the last two logs to see if I could cobble together anything helpful. Thankfully I had one from early in the morning that was working. (I'm in Central time, US (UTC-5), FYI for timestamp reference.) In my case, PyTorch cross attention was used by default in the newer release. Previously, it was using sub-quadratic optimization for cross attention. I added the command
Has anyone managed to launch SUPIR?
Thanks, it did pick up the config from the .csv. I managed in the end to generate the .csv too by running the cmd as admin. Unfortunately I did not notice more than a 1 s/it improvement, which could be random too. I'll keep it on for now since I noticed the VRAM consumed decreases a bit with it.
I get an increase in generation time whenever the main branch is updated by @comfyanonymous. Usually restarting the PC fixes it; it might have something to do with zluda.db or the pagefile.
Have you tried using --force-fp32 or a GGUF UNet? https://huggingface.co/city96/FLUX.1-dev-gguf/tree/main
Main Comfy keeps changing by the hour. I add the changes, test with just one generation, then apply them, so if there is some problem it is mostly caused by main; if it is a huge one it is usually fixed very quickly. Regarding ZLUDA and small changes, I try to keep track of them. Usually restarting after an update, especially one which changes one of the main py files, works best. I myself only have an RX 6600 with 8 GB VRAM and 16 GB system RAM, so when I say there is an X% speed change or Y is giving OOM problems, that could also be because of my system. With models such as FLUX we are already well into dangerous open-sea territory :) At least with GPUs similar to mine.

There are two ways I can suggest that might improve speed and/or memory:

1-) Using q4 GGUF versions of both schnell and dev works great, also for CLIP, using dual CLIP: 1st clip_l, 2nd t5xxl_fp8_e4m3fn.safetensors. There are also GGUF versions of the t5 clips, but those didn't have much impact on memory or speed, at least on my PC.

2-) Just found out about this model, https://civitai.com/models/645943?modelVersionId=722828, which is somehow faster than the combo in the first part. There are also other model variants there which, when used with the dual-CLIP combo from part 1, seem to be a bit better than the standard models or GGUF models.

For reference, before the fp16 change, with my setup I was getting around 35-40 seconds/it with both schnell and dev. After that, speed increased twofold to 20 sec/it. Now the model I have shown in the second part somehow does better and gives me around 16 sec/it. It is almost as fast as I was getting one year ago with SDXL on the same system. (There was only DirectML back then.)
Can you share your workflow? I get 8-9 it/s with GGUF and 16 it/s with this model on my 6800 XT.
If those values were reversed, i.e. sec/it, that seems about the right speed you should be getting compared to an 8 GB 6600, IMO. I'm using standard workflows. The only thing I am doing differently is using --novram as a cmdline toggle; this is usually a bit better for me regarding OOM. Edit: https://pastebin.com/tzqCDSHZ
I tried and it worked. Thanks a lot.
Feature Idea
Found this comment by @Exploder98 suggesting removing bfloat16, which increased my speed by 50%: modifying
supported_inference_dtypes = [torch.bfloat16, torch.float16, torch.float32]
to
supported_inference_dtypes = [torch.float16, torch.float32]
in https://github.com/comfyanonymous/ComfyUI/blob/7df42b9a2364bae6822fbd9e9fa10cea2e319ba3/comfy/supported_models.py#L645
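As an alternative to editing the file, the same change could in principle be applied at startup. A minimal sketch, assuming each config in comfy.supported_models exposes the supported_inference_dtypes attribute shown above and that the module keeps a models list (both are assumptions about ComfyUI internals, which change often):

```python
import torch
import comfy.supported_models as sm  # ComfyUI must be on the import path

# Drop bfloat16 from every model config that lists it, so fp16/fp32 are preferred.
# Attribute and list names mirror the linked file, accessed defensively.
for cls in getattr(sm, "models", []):
    dtypes = getattr(cls, "supported_inference_dtypes", None)
    if dtypes and torch.bfloat16 in dtypes:
        cls.supported_inference_dtypes = [d for d in dtypes if d is not torch.bfloat16]
```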
Additionally, running optimization through PyTorch TunableOp could be tried. It did not work for me, but others confirmed it worked; maybe a script could be created for it.
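Such a script could, in principle, drive tuning through the torch.cuda.tunable API shipped in recent PyTorch builds (treat the exact calls as an assumption about your installed version) and write the results file explicitly, which would also sidestep the empty-CSV problem reported above:

```python
import torch
import torch.cuda.tunable as tunable

# Enable TunableOp and tuning for this process, run a few representative
# fp16 matmuls so something actually gets tuned, then write the CSV explicitly.
tunable.enable(True)
tunable.tuning_enable(True)

a = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
b = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
for _ in range(3):
    (a @ b).sum().item()  # .item() forces execution on the GPU

tunable.write_file("tunableop_results0.csv")  # filename matches the one discussed above
```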
Existing Solutions
No response
Other
city96/ComfyUI-GGUF#48 (comment)