High memory usage on Linux Systems #316
I think I can rule out the kernel and AMD driver, since it happened with a lot of different versions, and even in a WSL environment with the paravirtualized Windows driver. I'm not a developer, so as soon as things get into the Python/PyTorch realm I'm out of my comfort zone, and the best I can do is try random stuff and see what happens (most of the time nothing, because I broke something I didn't understand in the first place).
The current implementation does use a substantial amount of memory, and usage that grows for a while before leveling off is expected. However, if the general trend of memory usage grows arbitrarily over time and never levels off, or if the floor memory usage continually grows, that's very likely a bug. If I had to guess, the root of the problem for ROCm might be here:
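For context, here is a minimal sketch of the kind of cache-release call being discussed, using only standard PyTorch APIs (this is not the actual ComfyUI code that was linked, just an illustration):

```python
import gc
import torch

def release_cached_vram() -> None:
    # Illustrative only: drop unreachable Python tensors, then ask the CUDA/HIP
    # caching allocator to return its cached blocks to the driver.
    gc.collect()
    if torch.cuda.is_available():  # also True on ROCm builds (HIP backend)
        torch.cuda.empty_cache()
        torch.cuda.ipc_collect()
```

On CUDA this normally just shrinks the allocator's cache; the "makes things worse for ROCm" remark below suggests the same calls may behave differently under ROCm.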
The cryptic comment here by that developer suggests I'm probably abusing this function, though the reason is left unclear, as it only notes that it "makes things worse for ROCm". You could modify the file directly in the site-packages dir of your venv to change that setting and see if the behavior improves.

I would just like to take a moment to emphasize that the worker's use case of ComfyUI (via horde-engine, formerly hordelib) is not officially supported by the comfy team in any way. Specifically, ComfyUI makes assumptions about the state of the machine based on sunny-day memory/VRAM conditions and does not anticipate any other high-memory-usage applications. In the worker use case, however, we spawn N ComfyUI instances (i.e., N high-memory-usage applications), which shatters many of those built-in expectations. It's therefore very much a guessing game for me as well to support the huge number of permutations of system configurations our users have, while still trying to understand the massive codebase that ComfyUI is, which is constantly changing and often makes changes that are fundamentally contrary to the worker use case. That is of course not their fault, but it is quite difficult hitting moving targets.

Historically, we were only ever able to support CUDA cards, and for that and other reasons that I am sure are becoming obvious to you, I will readily admit that support for ROCm is lacking. The truth of the matter is that AMD support has relied entirely on volunteers running under conditions of their choosing, sending me logs, and me attempting to optimize based on that alone. I have had very little hands-on time with an AMD/ROCm card to nail down these issues, and only a few willing volunteers.

If you are willing or able, we could have a more interactive conversation in the official AI-Horde discord server, found here: https://discord.gg/r7yE6Nu8. This conversation would be appropriate for the #local-workers channel, where you should feel free to ping me (@tazlin on discord).

In any event, I can see you've clearly put sincere thought and energy into your recent flurry of activity, and I appreciate the work you've put in so far. Feel free to continue here if needed, or reach out to me on discord as mentioned above.
And just as an aside, I would encourage you to ensure you have a reasonable amount of swap configured on your system, as it has been shown to defray some of the memory-related issues at times. I do suspect that it wouldn't be a perfect silver bullet, but if you had little or none configured, I would at least try adding some.
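If you want a quick way to check what is configured before and after changing it, something like the following works (a sketch; it assumes psutil is installed, e.g. via `pip install psutil`):

```python
import psutil

# Report configured swap and current usage in GiB.
swap = psutil.swap_memory()
print(f"swap total: {swap.total / 2**30:.1f} GiB, "
      f"used: {swap.used / 2**30:.1f} GiB ({swap.percent:.0f}%)")
```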
I have a 1:1 ratio of memory to swap, and I've seen 15+ GB of it being used (with some actual disk activity, so it's not just sitting there).
I tested around a bit, nothing new so far. But knowing where the load/unload is happening already helps a bit in narrowing down what I'm searching for. Something interesting I found was that the process apparently thinks it has VAST amounts of virtual system memory available (or the units on that field are completely different from the VRAM one):
I'll hop on the Discord later; I just wanted to open this issue here first (as a reference), since working with those long-form texts over there is a bit tedious (my handle on Discord is @Momi_V).
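One quick way to rule out a simple units mismatch would be to print both figures from inside the worker process in the same units (a sketch; it assumes psutil is available and that torch.cuda.mem_get_info() works on the ROCm build in use):

```python
import psutil
import torch

# System RAM as psutil sees it, in GiB.
vm = psutil.virtual_memory()
print(f"RAM  total: {vm.total / 2**30:6.1f} GiB  available: {vm.available / 2**30:6.1f} GiB")

# VRAM as PyTorch sees it, in GiB.
if torch.cuda.is_available():
    free_b, total_b = torch.cuda.mem_get_info()
    print(f"VRAM total: {total_b / 2**30:6.1f} GiB  free: {free_b / 2**30:6.1f} GiB")
```

If the "virtual system memory" figure reported by the process is wildly larger than vm.total, that field is probably counting address space (virtual size) rather than physical RAM.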
I've done some more testing:
Here are my findings from attempting to debug the suspected memory leak on CUDA/docker. I attached a shell to the running container and grouped the process memory mappings into three totals:

```js
// Bucket the memory mappings into shared objects, the heap, and everything else.
// (The size/rss fields of each bucket were then summed into the totals shown below.)
const soTotals = mappings
  .filter((m) => m.pathname.includes(".so"));
const heapTotals = mappings
  .filter((m) => m.pathname === "[heap]");
const otherTotals = mappings
  .filter((m) => !m.pathname.includes(".so") && m.pathname !== "[heap]");
```

The profile immediately after starting:

```json
{
  "size": 4468936,
  "rss": 565080,
  "pathname": "TOTAL_SHARED_OBJECTS"
},
{
  "size": 8150504,
  "rss": 8081528,
  "pathname": "TOTAL_HEAP"
},
{
  "size": 51619560,
  "rss": 17401036,
  "pathname": "TOTAL_OTHER"
}
```

The profile after an hour:

```json
{
  "size": 4468936,
  "rss": 69784,
  "pathname": "TOTAL_SHARED_OBJECTS"
},
{
  "size": 8157168,
  "rss": 5703320,
  "pathname": "TOTAL_HEAP"
},
{
  "size": 78309296,
  "rss": 30606032,
  "pathname": "TOTAL_OTHER"
}
```

From that you can see that the "growing" category of mmap objects is neither the shared objects nor the heap: TOTAL_OTHER grew substantially in both size and rss, while the other two buckets stayed flat or shrank.
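For anyone who wants to reproduce this measurement without the container tooling above, here is a rough Python equivalent that reads /proc/<pid>/smaps directly and groups mappings into the same three buckets (a sketch; it assumes the size/rss values above came from the kernel's per-mapping Size/Rss fields, which are reported in kB):

```python
import re
from collections import defaultdict

# A mapping header line looks like:
# 7f2c4a600000-7f2c4a800000 r--p 00000000 103:02 1234567  /usr/lib/libfoo.so
HEADER = re.compile(r"^[0-9a-f]+-[0-9a-f]+ \S+ \S+ \S+ \S+\s*(.*)$")

def profile(pid: int) -> dict:
    """Sum Size/Rss (kB) per bucket: shared objects, [heap], everything else."""
    totals = defaultdict(lambda: {"size_kb": 0, "rss_kb": 0})
    bucket = None
    with open(f"/proc/{pid}/smaps") as smaps:
        for line in smaps:
            header = HEADER.match(line)
            if header:
                path = header.group(1)
                if ".so" in path:
                    bucket = "TOTAL_SHARED_OBJECTS"
                elif path == "[heap]":
                    bucket = "TOTAL_HEAP"
                else:
                    bucket = "TOTAL_OTHER"  # anonymous mmaps, stacks, etc.
            elif line.startswith("Size:"):
                totals[bucket]["size_kb"] += int(line.split()[1])
            elif line.startswith("Rss:"):
                totals[bucket]["rss_kb"] += int(line.split()[1])
    return dict(totals)

if __name__ == "__main__":
    import sys
    print(profile(int(sys.argv[1])))
```

Running it against the worker/ComfyUI process before and after an hour of jobs should show which bucket is growing.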
I'm experiencing the same thing running the worker in Google Colab. A few months ago, I was able to offer several models and the worker would swap between them without issue. Now it crashes almost every time it tries swapping models.
Somewhere within the ROCm stack, a library, ComfyUI, or the reGen worker there is a severe memory leak.
It seems to be triggered by loading and unloading models, not the actual compute.
When multiple models are offered, a few more GB of system RAM are used after (almost) every change in which of them are active (in VRAM) or preloaded (in RAM).
I first noticed this after no-vram was merged (v8.1.2...v9.0.2). There the behavior changed from a relatively static ~17 GB per queue/thread (also quite a lot, increasing over time before leveling off) to gradually hogging more and more RAM over time (as much as 40 GB!! for just one thread).
If a worker thread got killed and restarted, its usage was reset, but the worker wasn't always able to recover.
VRAM usage, on the other hand, had gotten a lot better, going from 15-20+ GB (even on SDXL and SD1.5) to 5-15 GB (depending on the model), with the only 20 GB cases being FLUX1 (which would be somewhat expected).
Depending on loaded models, job mix, etc. the worker (even with 64GB of RAM) becomes unusable after 15-45 min.
Loading only 1 model seems to fix this (or at least help with it).
The impact of LoRA and ControlNet is still unclear, but just disabling them doesn't magically fix things.
A clarification on expected behavior (when and how memory is supposed to be used) would be helpful.
Is the worker supposed to keep the active model in system RAM, even though it has already been transferred to VRAM?
Is the memory usage of a thread supposed to go down when switching from a heavier model (SDXL/FLUX) to a lighter one (SD1.5)?
I'll keep doing more testing, and will also do some comparisons to the ComfyUI behavior once I have time, but that might take a while (I'm not familiar with Comfy yet, let alone with how to debug it).
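If it helps with that testing, one simple way to correlate RAM growth with model swaps is to log a process's RSS over time and cross-reference the timestamps with the load/unload messages in the worker log (a sketch; it assumes psutil is installed and that you pass the PID of the process to watch):

```python
import sys
import time
import psutil

# Poll the resident set size of one process (e.g. a worker/ComfyUI process)
# every 30 seconds; compare the timestamps against model load/unload events.
proc = psutil.Process(int(sys.argv[1]))
while True:
    rss_gib = proc.memory_info().rss / 2**30
    print(f"{time.strftime('%H:%M:%S')} rss={rss_gib:.2f} GiB", flush=True)
    time.sleep(30)
```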
System: