Stable Diffusion PR optimizes VRAM, generate 576x1280 images with 6 GB VRAM #364
Check out my other CompVis PR: CompVis/stable-diffusion#177 |
Thanks for the tip! I'll check them both out. |
Would be cool to get this implemented! ❤️ |
Don't I know it! |
By the looks of it, without all the whitespace changes we get...
attached in diff -u format for patching |
Oh thank you very much for that! I actually just did the same thing with @neonsecret 's attention optimization and it works amazingly. Without any change to execution speed, my test prompt now uses 3.60G of VRAM. Previously it was using 4.42G. Now I'm looking to see if image quality is affected. Any reason to prefer basujindal's attention.py optimization? |
nothing to add to this conversation, except to say that i'm excited for this lol ;) |
I've merged @neonsecret 's optimizations into the development branch "refactoring-simplet2i" and would welcome people testing it and sending feedback. This branch probably still has major bugs in it, but I refactored the code to make it much easier to add optimizations and new features (particularly inpainting, which I'd hoped to have done by today). |
Hmmm... on my barely coping 8G M1 it's not so hot; the image is different and it took twice as long. But it's an old clone, let me try it on a fresher one |
Darn. I'd hoped that there was such a thing as a free lunch. I'm on an atypical system with 32G of VRAM, so maybe my results aren't representative. I did timing and peak VRAM usage, and then looked at two images generated with the same seed and they were indistinguishable to the eye. Let me know what you find out. Are you on an Apple? I didn't know there were clones. The M1 MPS support in this fork is really new, and I wouldn't be surprised if it needs additional tweaking to get it to work properly with the optimization. |
Sorry, I meant it's an old local clone of your repo; I didn't want to make changes in my local clone of the current one since that works quite nicely :-). But yes, I'm not surprised MPS is breaking things, and PyTorch is pretty buggy too; I raised a few MPS-related issues over there that Stable Diffusion hits. |
Okay, on the main branch the images are the same, but it is really slow, even compared to my normal times... 10/10 [06:49<00:00, 40.94s/it]. I'll do some more digging. @magnusviri any chance you can check this out on a bigger M1? |
Ok, this is @neonsecret 's PR, which I just tested and merged into the refactor branch. I'm seeing a 20% reduction in memory footprint, but unfortunately not the 35% reduction reported in the Reddit post. Presumably this is due to the earlier optimizations in basujindal's branch. I haven't really wanted to use those opts because the code is complex and I hear it has a performance hit. Advice? |
Last I looked at basujindal's there were loads of assumptions about using CUDA, and a big chunk of the memory saving seemed to come from forcing half precision. That was a week ago, things might have changed |
I've got half precision on already as the default. I think what I'm missing is basujindal's optimization of splitting the task into several chunks and loading them into the GPU sequentially. |
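For readers following along, a rough sketch of that sequential-loading idea (not basujindal's actual code; the `run_stages_sequentially` helper and its arguments are made up for illustration) might look like:

```python
import torch

def run_stages_sequentially(stages, x, device="cuda"):
    """Hypothetical illustration: keep only one chunk of the model on the GPU
    at a time, trading transfer time for a lower peak VRAM footprint."""
    for stage in stages:            # each stage is an nn.Module chunk of the model
        stage.to(device)            # load this chunk onto the GPU
        with torch.no_grad():
            x = stage(x.to(device))
        stage.to("cpu")             # move it back off the GPU before the next chunk
        torch.cuda.empty_cache()    # release the cached blocks it was using
    return x
```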
Frankly, I'm happy with the 20% savings for now. |
Seems the speed loss is coming from the twin calls to softmax.
If I change it to use a single softmax instead, I get all my speed back (I assume more memory usage, though I need better diagnostic tools than Activity Monitor). I can do 640x512 images now, so there do appear to be some memory savings even when reverting that change. It would be interesting to see what happens on a larger box, and whether it is worth wrapping the two variations in an "if mps" statement. EDIT: seems I can also now do 384x320 without it using swap. |
No measurable slowdown at all on CUDA. Maybe we make the twin softmax conditional on not MPS? |
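As a concrete illustration of that suggestion, here is a minimal sketch (not the actual attention.py code from either PR; the function name is made up) of a device-conditional softmax:

```python
import torch

def softmax_maybe_split(sim: torch.Tensor) -> torch.Tensor:
    """Softmax over the attention scores: split into two in-place halves on
    CUDA to lower peak memory, but fall back to a single softmax on MPS,
    where the split was reported to be slower."""
    if sim.device.type == "mps":
        return sim.softmax(dim=-1)
    half = sim.shape[0] // 2
    sim[:half] = sim[:half].softmax(dim=-1)   # overwrite scores with probabilities
    sim[half:] = sim[half:].softmax(dim=-1)
    return sim
```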
Test run: 751283a (main) vs. 89a7622 (refactoring-simplet2i) (EDIT: corrected hash). XOR of both images gave a pitch-black result -> seems to be no difference; maybe this helps. Inference even seems to be a little faster in this test run. |
CUDA platform? What hardware? |
Windows 10, NVIDIA GeForce RTX 2060 SUPER 8GB, CUDA. If there's info missing I'll edit it in. |
I made a new local PR with just the attention.py changes. Here are some of my test results after extensive testing - RTX 3080 8GB card. Base Repo:
Updated attention.py:
For a 512x768 image, the updated repo consumes 5.94GB of memory. That's approximately an 18% memory improvement. I saw no difference in performance or inference time when using a single or twin softmax; on CUDA, the difference seems to be negligible if there is any. tl;dr -- just the attention.py changes give roughly an 18% memory saving with no performance hit. |
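For anyone who wants to reproduce peak-VRAM figures like the ones quoted in this thread, one standard way on CUDA (plain PyTorch calls, not code from this PR; where you run the generation is up to you) is:

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run one image generation here ...
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak VRAM allocated: {peak_gb:.2f} GB")
```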
I think it's a side effect of the unified architecture; it looks okay at 256x256 when all of the Python 'image' fits in memory, but as soon as swapping kicks in I get half speed compared to the original code or single softmax. |
I just tried regenerating an image done from the current development branch and the new refactor branch and they are totally different... Not sure if it is the memory saving feature doing it or something else. Is there a switch to activate/deactivate the memory saving? Would be nice to isolate whether the diff is related to that or something else. Running on an RTX 3060. EDIT: OK... for some reason the picture I got when running the command the 1st time is different from the result when running the prompt logged in the prompt log file... Strange... but using the log file prompt on both the dev and the refactoring branches does indeed produce the same result with much less VRAM usage... I will try to reproduce the variation... This might be an issue with the variation code base not producing consistent results on the 1st run vs reruns from logs. EDIT 2: I tracked down the issue with the different outputs... it was a PEBKAC... I pasted the file name and directory info in front of the prompt in the log... this is why it resulted in a different output... so all good, it was my error. So as far as I can see the memory optimisation has no side effect on the time or quality of image generation. 6.4G on the dev branch vs 4.84G on the refactoring branch... so a 34% memory usage reduction and exactly the same run time. |
I did not do extensive testing to compare generation times, but so far I have gotten the exact same results visually when comparing to images I generated yesterday on main with the same prompt/seeds. And I can crank the resolution up from a max of 576x576 to 640x704, using 6.27G on my RTX 2070. Last time I tried basujindal's, I could manage 704x768 but it was very slow. However, if they implemented this PR on top of their original optimization and it used even less memory than it does now, I can imagine doing even higher resolutions. Very impressive. |
The basujindal fork is very slow. I would take a 20% memory improvement with no speed hit over a 35% improvement that is much slower, any day. |
On my 3060 with 12GB VRAM I am seeing a 34% memory improvement... so this is pretty great. That is when generating 512x704 images. |
My mistake... I believed what was written before checking... updated. |
I think it's possible to incorporate the best of both methods. I've been working on adapting Doggettx's dynamic threshold. Will update the branch as soon as I can |
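For context, the general shape of a dynamic threshold is to size the attention slices from the VRAM that is actually free at run time. A rough, hypothetical sketch (not Doggettx's actual code; the function name, overhead factor, and doubling strategy are assumptions):

```python
import torch

def choose_slice_count(q, k, element_size=2, overhead=3.5):
    """Pick how many slices to split the q @ k^T attention matrix into so that
    one slice (plus softmax temporaries) fits in currently free VRAM."""
    free_bytes, _ = torch.cuda.mem_get_info()
    full_bytes = q.shape[0] * q.shape[1] * k.shape[1] * element_size * overhead
    slices = 1
    while full_bytes / slices > free_bytes and slices < q.shape[1]:
        slices *= 2                 # halve the slice size until it fits
    return slices
```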
I'm porting @Doggettx optimizations to M1 and it looks promising :) Development branch
Doggettx-optimizations branch (M1 workarounds)
X means failed test. |
Great! I want to make a release soon and the choice of optimization(s) is a key milestone. |
After reviewing the back and forth on this thread, I do have one clarifying question: I've very much appreciated the set of optimizations and additional functionality provided by |
So overall, the 'doggettx-optimizations' branch seems to be superior in every regard. |
I don't have a 4 GB GPU to test that, but my guess would be that the 'doggettx-optimizations' branch works fine on 4 GB GPUs. |
I thought it might too, but with a Quadro P1000 (4GB), I get a CUDA OOM using |
All the memory optimizations have been directed at reducing memory requirements during image generation. Loading the model itself, which happens during initialization, requires 4.2 GB on its own. So 4GB cards are currently not supported by this fork. Sorry! |
Right, of course. Thanks! |
I should add that the basujindal fork does reduce the memory used during model loading and allows you to run SD on 4 GB cards. |
Have tested out these branches and just want to say the work being done here is amazing, both for people with less VRAM and those like myself who can squeeze out a larger res. Though I notice larger res is not always better since the model is still effectively 512x512; there's definitely a point where you hit tradeoffs. Just wondered if these changes would/could pass through into the upstitching methods like txt2imghd, allowing for larger-sized chunks to be processed during upscaling and detailing? |
Take a look at what I explained here, that way you can generate at high resolutions without tradeoffs: #364 (comment) |
Just piping in here to second what @tildebyte and @blessedcoolant said. 99% of this repository is other people's code, and I and my collaborators are just trying to pull together the best innovations that are out there to create a stable base and a good user experience. All contributions are gratefully accepted, but are subject to review and testing. |
No, I don't think they will. All the optimizations we have been working on affect VRAM usage during image inference and generation. The loading of the model that takes place during initialization takes more than 4 GB (I think it's 4.2, so tantalizingly close!) and will cause an OOM error before you get to the inference prompt on 4 GB cards. The basujindal optimizations reduce memory requirements at load time, but unfortunately slow down performance noticeably and so haven't been incorporated here. |
Personally I don't like the memory->speed tradeoff in optimizedSD. I'm hoping that stable-diffusion-v1.5 will have reduced memory requirements and will run on 4 GB cards out of the box. |
That makes a lot of sense, and I think all the testing above bears out that in any situation where the model loads it’s probably the right move. The only potential thing that might be worth considering (even with a potential sub-4GB later release) is that many of the potential users with 4GB cards may be precisely the people who can’t fully free the full 4GB if it’s their laptop or machine’s sole GPU. A colleague with a small-ish XPS laptop can get the optimizedSD fork to run, but only with full precision and all the optimizations active and only after closing out all other open apps. In an odd way, that kind of user would be especially appreciative of the other UX improvements here — batch queuing a set of commands to run, logging all the output in a clean out file for easier reference given that even low DDIM runs still can take some time, etc. I guess after this next round of optimization settles here, it might still be worth considering the possibility of including a flag that splits the model for the dream workflow — with all the serious performance caveats documented in a big way — as a singular final fallback for the machines that have no other choice. Separately — thank you so much to all for all of your work. Watching this develop so rapidly and collaboratively the last few weeks has been absolutely fascinating! |
Yes, I think we'll get the inference optimization squared away and then can look into the model loading optimizations as a user flag. Thanks for the kind words! This is a fun project to work on. |
@Any-Winter-4079, how's your progress on the M1 port? |
Check #431 (comment) |
Thank you everyone for the wonderful work you did benchmarking and debugging the various optimizations. I have chosen the @Doggettx optimizations, with fixes contributed by @Any-Winter-4079 to run correctly on Macintosh M1 hardware. These optimizations have gone into the development branch and will be in the soon-forthcoming (I hope) 1.14 release. |
I made some more minor improvements when running in auto_cast or half mode, which seems to have made it run a lot faster. Could use some testing though, since I don't know what it does on lower-VRAM cards. If someone is willing to try, I've put it in a separate branch at https://github.com/Doggettx/stable-diffusion/tree/autocast-improvements |
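For anyone unfamiliar with what auto_cast mode refers to here, a toy PyTorch illustration (not code from that branch; the linear layer just stands in for a model block) of running inference under autocast so matmul-heavy ops execute in float16:

```python
import torch
import torch.nn as nn

layer = nn.Linear(512, 512).cuda()           # stand-in for a model block
x = torch.randn(8, 512, device="cuda")
with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    y = layer(x)                             # runs in half precision
print(y.dtype)                               # torch.float16
```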
Just a heads up. Mac users who have been testing on the release candidate (which contains the previous set of @Doggettx optimizations) are reporting a 2-3x decrease in speed on M1 hardware. This needs to be fixed before we announce a release, unfortunately. |
Seen on HN, might be interesting to pull into this repo? (The PR looks a bit dirty with a lot of extra changes though.)
basujindal/stable-diffusion#103