Memory leak using TraceEnum_ELBO #3068

Closed
gioelelm opened this issue Apr 14, 2022 · 9 comments · Fixed by #3131
gioelelm commented Apr 14, 2022

I noticed a major memory leak when training with SVI using TraceEnum_ELBO.
I initially noticed this in a custom model we are developing, but it appears to be a more general bug.

For example, it even affects the Pyro tutorials' GMM example here, where memory usage rapidly grows from a couple of hundred MB to many GB.

I ran this on a MacBook Pro 2019 running macOS 10.15. To reproduce the issue, it is enough to run the linked notebook.

I tried commenting out the following lines and adding a garbage-collector call. That reduces the memory accumulation by about an order of magnitude, but it does not solve the problem completely, which becomes particularly severe for large datasets.

# Register hooks to monitor gradient norms.
# gradient_norms = defaultdict(list)
# for name, value in pyro.get_param_store().named_parameters():
#     value.register_hook(lambda g, name=name: gradient_norms[name].append(g.norm().item()))

import gc

losses = []
for i in range(200000):
    loss = svi.step(data)
    # losses.append(loss)
    gc.collect()  # forcing a collection each step shrinks the leak but does not remove it

(from this forum post)
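For reference, here is a minimal sketch of how the growth can be watched directly, assuming psutil is installed and that svi and data are set up as in the GMM tutorial (this snippet is not part of the notebook itself):

import gc

import psutil

process = psutil.Process()  # handle to the current Python process
losses = []
for i in range(200000):
    loss = svi.step(data)
    gc.collect()  # the workaround above; drop this line to see the full leak
    if i % 1000 == 0:
        rss_mb = process.memory_info().rss / 1e6  # resident set size in MB
        print(f"step {i}  loss {loss:.1f}  rss {rss_mb:.0f} MB")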

fritzo added the bug label Apr 14, 2022
fritzo pushed a commit that referenced this issue Apr 14, 2022
fritzo (Member) commented Apr 14, 2022

@gioelelm could you check to see if #3069 fixes your issue?

gioelelm (Author) commented
Thank you for the quick attempt, but no, it does not fix the problem: neither the leak without gc.collect() nor the residual leak when garbage collecting.

fritzo (Member) commented Apr 14, 2022

Thanks for checking @gioelelm. I might have time in the next few weeks to dive deeper. If you have time, here are some strategies I can recommend (what I'd try):

  • get an idea of which tensors are leaking using this trick (see the sketch after this list)
  • try to determine which objects might be holding references to the leaking tensors using something like
    import pickle

    elbo = TraceEnum_ELBO()
    optim = ClippedAdam(...)
    svi = SVI(model, guide, optim, elbo)
    for step in range(steps):
        svi.step()
        print("svi", len(pickle.dumps(svi)))
        print("elbo", len(pickle.dumps(elbo)))
        print("optim", len(pickle.dumps(optim)))
        print("param_store", len(pickle.dumps(pyro.get_param_store())))
  • see if this is a recent PyTorch bug by trying inference with different torch versions, say 1.11, 1.10, 1.9, 1.8. I'm pretty sure the GMM tutorial should still work with older PyTorch versions.
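As a sketch of the kind of tensor accounting the first bullet refers to (not necessarily the linked trick itself), one can walk the objects tracked by the garbage collector and count live torch tensors by shape after each step; whatever shape keeps growing points at the leak. This assumes svi and data from the report above.

import gc
from collections import Counter

import torch

def live_tensor_summary(top=10):
    # Count all live torch.Tensor objects tracked by the garbage collector, keyed by shape.
    counts = Counter()
    for obj in gc.get_objects():
        if torch.is_tensor(obj):
            counts[tuple(obj.shape)] += 1
    return counts.most_common(top)

for step in range(100):
    svi.step(data)
    if step % 10 == 0:
        print(step, live_tensor_summary())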

gioelelm (Author) commented
Ok thanks!

I will try the first two.

Regarding the last point: since the code runs successfully anyway (provided the machine has enough memory), don't you think the bug could have gone unnoticed? Or do you have some reason to exclude that? I am thinking of the fact that one would have had to profile the program's memory usage to notice there was a problem at all.

fritzo (Member) commented Apr 14, 2022

Don't you think that the bug could have gone unnoticed?

It could have, but TraceEnum_ELBO is pretty heavily used, and we've done a lot of memory profiling in the past. After working with Pyro and PyTorch for a few years, my posterior is 40% on a recent PyTorch regression, 40% on an edge case memory leak in Pyro that has never been noticed, and 20% on a recently introduced weird interaction between Pyro and PyTorch, so 60% chance this could be narrowed down by searching through PyTorch versions.

ordabayevy (Member) commented Apr 14, 2022

I have noticed a major GPU memory leak as well after switching from PyTorch 1.10 to 1.11. I wasn't able to debug it and decided to stick with PyTorch 1.10.0 (and Pyro 1.8.0) for now.

Edit: CUDA 11.6, Arch Linux
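A quick way to watch the same thing on the GPU, as a sketch assuming the model and the svi/data pair above live on a CUDA device, is to log the allocator's counters every few steps:

import torch

for step in range(1000):
    svi.step(data)
    if step % 100 == 0:
        allocated_mb = torch.cuda.memory_allocated() / 1e6  # memory currently held by live tensors
        reserved_mb = torch.cuda.memory_reserved() / 1e6    # memory held by the caching allocator
        print(f"step {step}: allocated {allocated_mb:.0f} MB, reserved {reserved_mb:.0f} MB")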

fritzo (Member) commented Apr 14, 2022

Hmm, maybe we should relax the PyTorch requirements and cut a release so that Pyro 1.8.2 works with PyTorch 1.10. We'd need to do the same with Funsor. I think I was a little too eager in dropping PyTorch 1.10 support, especially given that Colab still uses 1.10.

qinqian commented Apr 27, 2022

I have noticed a GPU memory leak too with Pyro 1.8.1+06911dc and PyTorch 1.11.0. Downgrading to Pyro 1.6.0 and PyTorch 1.8.0 works normally.
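When comparing runs across downgrades like this, it can help to record exactly which builds were active; a small sketch using attributes both libraries expose:

import pyro
import torch

print("pyro", pyro.__version__)    # e.g. 1.6.0 after the downgrade described above
print("torch", torch.__version__)  # e.g. 1.8.0
print("cuda", torch.version.cuda)  # CUDA toolkit version the torch build was compiled against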

OlaRonning (Member) commented Apr 28, 2022

Downgrading, as @qinqian suggests, also resolves #3014.
