High memory usage on Linux Systems #316

Open
HPPinata opened this issue Oct 6, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@HPPinata
Contributor

HPPinata commented Oct 6, 2024

Somewhere within the ROCm stack, a library, ComfyUI, or the reGen worker there is a severe memory leak.

It seems to be triggered by loading and unloading models, not by the actual compute.
When multiple models are offered, (almost) every swap of which models are active (in VRAM) or preloaded (in RAM) uses up a few more GB of system RAM.

I first noticed this after no-vram was merged (v8.1.2...v9.0.2).
There the behavior changed from a relatively static ~17GB per queue/thread (also quite a lot, increasing over time before leveling off) to gradually hogging more and more RAM over time (as much as 40GB!! for just one thread).
If a worker thread got killed and restarted, its usage was reset, but the worker wasn't always able to recover.

VRAM usage, on the other hand, had gotten a lot better, going from 15-20+ GB (even on SDXL and SD1.5) to 5-15 GB (depending on model), with the only 20 GB cases being FLUX.1 (which would be somewhat expected).

Depending on loaded models, job mix, etc., the worker (even with 64GB of RAM) becomes unusable after 15-45 min.
Offering only one model seems to fix (or at least help with) this.
The impact of LoRA and ControlNet is still unclear, but just disabling them doesn't magically fix things.

A clarification on expected behavior (when and how is memory supposed to be used) would be helpful.
Is the worker supposed to keep the active model in system RAM, even though it has already been transferred to VRAM?
Is the memory usage of a thread supposed to go down when switching from a heavier model (SDXL/FLUX) to a lighter one (SD1.5)?

I'll keep doing more testing, and will also do some comparisons to the ComfyUI behavior once I have time, but that might take a while (I'm not familiar with ComfyUI yet, let alone with how to debug it).

System:

  • Platform: B550/5600X
  • Memory: 64GB DDR4
  • GPU: Radeon RX7900XTX (24GB VRAM)
  • Drive: 960GB PCIe 3.0x4 SSD
  • OS: OpenSUSE Tumbleweed
    • Kernels: 6.10, 6.11, 6.6.x (LTS), 6.6.x (LTS with amdgpu-dkms)
    • Container: docker 26.1.x-ce, rocm-terminal 6.0.2 - 6.2.1, Ubuntu 22.04 - 24.04 with ROCm 6.1.3-6.2.1
    • Torch: 2.3.1+rocm6.0, 2.3.1+rocm6.1, 2.4.1+rocm6.0, 2.4.1+rocm6.1
    • Affected reGen versions: everything past 9.0.2
@HPPinata
Contributor Author

HPPinata commented Oct 6, 2024

I think I can rule out the kernel and AMD driver, since it happened with a lot of different versions, and even in a WSL environment with the paravirtualized Windows driver.
The behavior also seems consistent over different ROCm versions, though in that regard I still have a few combinations to examine.
I also took a look at open process file handles. There was some accumulation of deleted /tmp entries over time, but only a couple MB worth, nowhere near the tens of GB unaccounted for.
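For reference, that check amounted to walking /proc/<pid>/fd and summing up the still-open but deleted entries; a rough sketch of the idea (the pid below is just a placeholder):

import os

def deleted_fd_bytes(pid: int) -> int:
    """Total size of files a process still holds open but which are already deleted."""
    total = 0
    fd_dir = f"/proc/{pid}/fd"
    for fd in os.listdir(fd_dir):
        path = os.path.join(fd_dir, fd)
        try:
            if "(deleted)" in os.readlink(path):
                total += os.stat(path).st_size  # stat follows the link to the deleted file
        except OSError:
            continue  # fd was closed (or permission denied) while iterating
    return total

print(deleted_fd_bytes(1234) / 1024**2, "MiB held by deleted files")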

I'm not a developer, so as soon as things get into the Python/PyTorch realm I'm out of my comfort zone and the best I can do is try random stuff and see what happens (most of the time nothing, because I broke something I didn't understand in the first place).
Ideas on how to debug things more effectively are very welcome.

@tazlin
Member

tazlin commented Oct 7, 2024

A clarification on expected behavior (when and how is memory supposed to be used) would be helpful.
Is the worker supposed to keep the active model in system RAM, even though it has already been transferred to VRAM?
Is the memory usage of a thread supposed to go down when switching from a heavier model (SDXL/FLUX) to a lighter one (SD1.5)?

The current implementation, as of the raw-png branch, can be expected to use up all available system RAM for some period of time. I generally encourage worker operators to run only the worker and no other memory/GPU-intensive applications beyond a browser. The matter of what counts as "excessive memory usage" is up for debate, I suppose, but changing that behavior at this point would constitute a feature request out of scope for this bug report.

However, if the general trend of memory usage grows arbitrarily over time and never levels off, or if the floor of memory usage continually grows, that's very likely a bug. If I had to guess, the root of the problem for ROCm might be here:
https://github.com/Haidra-Org/hordelib/blob/main/hordelib/comfy_horde.py#L433:L436

_comfy_soft_empty_cache is an aliased call to the following:
https://github.com/comfyanonymous/ComfyUI/blob/e5ecdfdd2dd980262086b0df17cfde0b1d505dbc/comfy/model_management.py#L1077

The cryptic comment here by that developer suggests I'm probably abusing this function, though the reason is left unclear; it only notes that it "makes things worse for ROCm". You could modify the file directly in the site-packages dir of your venv to set this to false manually in horde-engine and see if that changes anything for you. In the short-to-medium term, especially if you are able to validate that this changes the dynamics of the problem, I could have the worker send an appropriate value when it is started with a ROCm card. However, if I had to guess, the problem is more complicated than a simple flag flip.
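If you'd rather not hand-edit site-packages, a monkeypatch along these lines should have a similar effect. It's an untested sketch: it assumes the alias really does live at hordelib.comfy_horde._comfy_soft_empty_cache (per the link above) and that ComfyUI's soft_empty_cache accepts a force keyword as in the linked revision; exact names may differ between versions.

# Untested sketch: wrap the aliased cache-clear so it is never forced on ROCm.
# Apply this before the worker starts generating.
import hordelib.comfy_horde as comfy_horde

_original_soft_empty_cache = comfy_horde._comfy_soft_empty_cache

def _soft_empty_cache_unforced(*args, **kwargs):
    # Drop whatever force value the caller passed; per the ComfyUI comment,
    # forcing torch.cuda.empty_cache() "makes things worse" on ROCm.
    return _original_soft_empty_cache(force=False)

comfy_horde._comfy_soft_empty_cache = _soft_empty_cache_unforced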

I would just like to take a moment to emphasize that the worker's use of ComfyUI (via horde-engine, formerly hordelib) is not officially supported by the comfy team in any way. Specifically, ComfyUI makes assumptions about the state of the machine based on sunny-day memory/VRAM conditions and does not anticipate other high-memory-usage applications. In the worker use case, however, we spawn N ComfyUI instances (i.e., N high-memory-usage applications), which shatters many of these built-in expectations. It's therefore very much a guessing game for me as well to support the huge number of permutations of system configurations our users have, while still trying to understand the massive codebase that ComfyUI is, which is constantly changing and often makes changes that are fundamentally contrary to the worker use case. This is of course not their fault, but it is quite difficult hitting moving targets.

Historically, we were only ever able to support CUDA cards, and so for that and other reasons that I'm sure are becoming obvious to you, I will readily admit that support for ROCm is lacking.

The truth of the matter is that AMD support has relied entirely on volunteers running under conditions of their choosing, sending me logs, and me attempting to optimize based on that alone. I have had very little hands-on time with an AMD/ROCm card to nail down these issues, and only a few willing volunteers. If you are willing and able, we could have a more interactive conversation in the official AI-Horde Discord server, found here: https://discord.gg/r7yE6Nu8. This conversation would be appropriate for the #local-workers channel, where you should feel free to ping me (@tazlin on Discord).

In any event, I can see you've clearly put sincere thought and energy into your recent flurry of activity, and I appreciate the work you've put in so far. Feel free to continue here if needed or to reach out to me on Discord as mentioned above.

tazlin added the bug label Oct 7, 2024
@tazlin
Member

tazlin commented Oct 7, 2024

And just as an aside, I would encourage you to ensure you have a reasonable amount of swap configured on your system, as it has been shown to defray some of the memory-related issues at times. I do suspect that it wouldn't be a perfect silver bullet, but if you had little or none configured, I would at least try adding some.

@HPPinata
Contributor Author

HPPinata commented Oct 7, 2024

And just as an aside, I would encourage you to ensure you have a reasonable amount of swap configured on your system, as it has been shown to defray some of the memory-related issues at times. I do suspect that it wouldn't be a perfect silver bullet, but if you had little or none configured, I would at least try adding some.

I have a 1:1 ratio of memory to swap, and I've seen 15+GB of it being used (with some actual disk activity, so it's not just sitting there).

The cryptic comment here by that developer suggests I'm probably abusing this function, though the reason is left unclear; it only notes that it "makes things worse for ROCm". You could modify the file directly in the site-packages dir of your venv to set this to false manually in horde-engine and see if that changes anything for you. In the short-to-medium term, especially if you are able to validate that this changes the dynamics of the problem, I could have the worker send an appropriate value when it is started with a ROCm card. However, if I had to guess, the problem is more complicated than a simple flag flip.

I tested around a bit; nothing new so far. But knowing where the load/unload happens already helps a bit in narrowing down what I'm searching for.

Something interesting I found is that the process apparently thinks it has VAST amounts of virtual system memory available (or the units on that field are completely different from the VRAM ones):

2024-10-07 10:30:14.041 | DEBUG    | horde_worker_regen.process_management.process_manager:on_memory_report:380 - Process 1 memory report: ram: 27614076928 vram: 13665 total vram: 24560
2024-10-07 10:30:14.042 | DEBUG    | horde_worker_regen.process_management.process_manager:on_memory_report:380 - Process 1 memory report: ram: 27614076928 vram: 13665 total vram: 24560
2024-10-07 10:30:14.146 | DEBUG    | horde_worker_regen.process_management.process_manager:on_memory_report:380 - Process 0 memory report: ram: 1274597376 vram: None total vram: None
2024-10-07 10:30:18.081 | DEBUG    | horde_worker_regen.process_management.process_manager:on_memory_report:380 - Process 2 memory report: ram: 26040344576 vram: 6297 total vram: 24560
2024-10-07 10:30:18.082 | DEBUG    | horde_worker_regen.process_management.process_manager:on_memory_report:380 - Process 2 memory report: ram: 26040344576 vram: 6297 total vram: 24560
2024-10-07 10:30:18.201 | DEBUG    | horde_worker_regen.process_management.process_manager:on_memory_report:380 - Process 0 memory report: ram: 1278730240 vram: None total vram: None
2024-10-07 10:30:18.201 | DEBUG    | horde_worker_regen.process_management.process_manager:on_memory_report:380 - Process 2 memory report: ram: 26040213504 vram: 6331 total vram: 24560
2024-10-07 10:30:18.202 | DEBUG    | horde_worker_regen.process_management.process_manager:on_memory_report:380 - Process 2 memory report: ram: 26040213504 vram: 6331 total vram: 24560
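For scale, converting those numbers (assuming the ram field is plain bytes and the vram fields are MiB, which also lines up with btop):

ram_bytes = 27_614_076_928
print(f"{ram_bytes / 1024**3:.1f} GiB")  # ~25.7 GiB resident for process 1
vram_mib = 13_665
print(f"{vram_mib / 1024:.1f} GiB")      # ~13.3 GiB of the ~24 GiB card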

I'll hop on the Discord later; I just wanted to open this issue here first (as a reference), since working with those long-form texts over there is a bit tedious (my handle on Discord is @Momi_V).

@HPPinata
Contributor Author

HPPinata commented Oct 17, 2024

I've done some more testing:
Running just one thread (queue_size: 0), memory usage appears to stabilize between 40GB and 45GB.
AMD GO FAST (flash_attn) appears to have some interaction; without it, usage is closer to 20GB.
I'll report back with more data.
The CPU memory values appear to be reported in bytes instead of MB, but are otherwise consistent with btop.

HPPinata changed the title from "Memory leak with AMD cards" to "Memory issues with AMD cards" Oct 17, 2024
@CIB
Contributor

CIB commented Nov 29, 2024

Here are my findings from attempting to debug the suspected memory leak on CUDA/Docker by inspecting the process's memory mappings.

I attached a shell to the running container using docker exec and dumped the process's memory mappings near the start of the run, and then again one hour later. I wrote a small script to parse and make sense of the results, categorizing the mappings as follows:

// mappings: entries parsed from the memory map dump, each with size, rss, and pathname
const soTotals = mappings
  .filter((m) => m.pathname.includes(".so"));
const heapTotals = mappings
  .filter((m) => m.pathname === "[heap]");
const otherTotals = mappings
  .filter((m) => !m.pathname.includes(".so") && m.pathname !== "[heap]");

The profile immediately after starting:

  {
    "size": 4468936,
    "rss": 565080,
    "pathname": "TOTAL_SHARED_OBJECTS"
  },
  {
    "size": 8150504,
    "rss": 8081528,
    "pathname": "TOTAL_HEAP"
  },
  {
    "size": 51619560,
    "rss": 17401036,
    "pathname": "TOTAL_OTHER"
  }

The profile after an hour:

  {
    "size": 4468936,
    "rss": 69784,
    "pathname": "TOTAL_SHARED_OBJECTS"
  },
  {
    "size": 8157168,
    "rss": 5703320,
    "pathname": "TOTAL_HEAP"
  },
  {
    "size": 78309296,
    "rss": 30606032,
    "pathname": "TOTAL_OTHER"
  }

From that you can see that the "growing" category of mappings is neither .so files nor the heap. It's also notable that the reserved size (size) is so much larger than the resident size (rss). Inspecting the output further, it becomes clear that all of these "other" mappings have no associated path, so it's impossible to tell from the mapping data alone why they were allocated with mmap.

memory_summary_481.json
memory_summary_481.later3.json
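One way to keep an eye on those pathless regions over time would be to periodically total /proc/<pid>/smaps by pathname; a rough, untested sketch (the pid is a placeholder):

from collections import defaultdict

def smaps_totals(pid):
    """Group /proc/<pid>/smaps entries by pathname ("[anon]" if none) and total Size/Rss in kB."""
    totals = defaultdict(lambda: {"size_kb": 0, "rss_kb": 0})
    current = "[anon]"
    with open(f"/proc/{pid}/smaps") as f:
        for line in f:
            parts = line.split()
            if "-" in parts[0] and ":" not in parts[0]:
                # Mapping header line: "addr-addr perms offset dev inode [pathname]"
                current = parts[5] if len(parts) > 5 else "[anon]"
            elif parts[0] == "Size:":
                totals[current]["size_kb"] += int(parts[1])
            elif parts[0] == "Rss:":
                totals[current]["rss_kb"] += int(parts[1])
    return totals

# Top 5 categories by resident size
for path, t in sorted(smaps_totals(1234).items(), key=lambda kv: -kv[1]["rss_kb"])[:5]:
    print(path, t)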

HPPinata changed the title from "Memory issues with AMD cards" to "High memory usage on Linux Systems" Nov 29, 2024
@pcouy

pcouy commented Dec 18, 2024

I'm experiencing the same thing running the worker in Google Colab. A few months ago, I was able to offer several models and the worker would swap between them without issue. Now it crashes almost every time it tries to swap models.
