Memory leak using TraceEnum_ELBO #3068

Closed
gioelelm opened this issue Apr 14, 2022 · 9 comments · Fixed by #3131
gioelelm commented Apr 14, 2022

I noticed a major memory leak when training with SVI using TraceEnum_ELBO.
I initially noticed this in a custom model we are developing, but it appears to be a more general bug.

For example, it even affects the Pyro tutorials' GMM example here, where memory usage rapidly grows from a couple of hundred MB to many GB.

I ran this on a MacBook Pro 2019 running macOS 10.15. To reproduce the issue, it is enough to run the linked notebook.

I tried commenting out the following lines and adding a garbage-collector call. That reduces the memory accumulation by about an order of magnitude, but it does not solve the problem completely, which becomes particularly severe for large datasets.

# Register hooks to monitor gradient norms.
# gradient_norms = defaultdict(list)
# for name, value in pyro.get_param_store().named_parameters():
#     value.register_hook(lambda g, name=name: gradient_norms[name].append(g.norm().item()))

import gc

losses = []
for i in range(200000):
    loss = svi.step(data)
    # losses.append(loss)
    gc.collect()  # forcing a collection each step shrinks the leak but does not remove it

(from this forum post)
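For reference, here is a minimal sketch of how the growth can be watched directly, assuming psutil is installed and that svi and data are set up as in the GMM tutorial (this snippet is not part of the notebook itself):

import gc

import psutil

process = psutil.Process()  # handle to the current Python process
losses = []
for i in range(200000):
    loss = svi.step(data)
    gc.collect()  # the workaround above; drop this line to see the full leak
    if i % 1000 == 0:
        rss_mb = process.memory_info().rss / 1e6  # resident set size in MB
        print(f"step {i}  loss {loss:.1f}  rss {rss_mb:.0f} MB")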

fritzo added the bug label Apr 14, 2022
fritzo pushed a commit that referenced this issue Apr 14, 2022
fritzo (Member) commented Apr 14, 2022

@gioelelm could you check to see if #3069 fixes your issue?

gioelelm (Author) commented
Thank you for the quick attempt, but no, it does not fix the problem: neither the leak without gc.collect() nor the residual leak when garbage collecting.

fritzo (Member) commented Apr 14, 2022

Thanks for checking @gioelelm. I might have time in the next few weeks to dive deeper. If you have time, here are some strategies I can recommend (what I'd try):

  • get an idea of which tensors are leaking using this trick (see the sketch after this list)
  • try to determine which objects might be holding references to the leaking tensors using something like
    import pickle

    elbo = TraceEnum_ELBO()
    optim = ClippedAdam(...)
    svi = SVI(model, guide, optim, elbo)
    for step in range(steps):
        svi.step()
        print("svi", len(pickle.dumps(svi)))
        print("elbo", len(pickle.dumps(elbo)))
        print("optim", len(pickle.dumps(optim)))
        print("param_store", len(pickle.dumps(pyro.get_param_store())))
  • see if this is a recent PyTorch bug by trying inference with different torch versions, say 1.11, 1.10, 1.9, 1.8. I'm pretty sure the GMM tutorial should still work with older PyTorch versions.
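As a sketch of the kind of tensor accounting the first bullet refers to (not necessarily the linked trick itself), one can walk the objects tracked by the garbage collector and count live torch tensors by shape after each step; whatever shape keeps growing points at the leak. This assumes svi and data from the report above.

import gc
from collections import Counter

import torch

def live_tensor_summary(top=10):
    # Count all live torch.Tensor objects tracked by the garbage collector, keyed by shape.
    counts = Counter()
    for obj in gc.get_objects():
        if torch.is_tensor(obj):
            counts[tuple(obj.shape)] += 1
    return counts.most_common(top)

for step in range(100):
    svi.step(data)
    if step % 10 == 0:
        print(step, live_tensor_summary())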

gioelelm (Author) commented
Ok thanks!

I will try the first two.

Regarding the last point: since the code runs successfully anyway (provided the machine has enough memory), don't you think the bug could have gone unnoticed? Or do you have some reason to exclude that? I am thinking of the fact that one would have had to profile the program's memory usage to notice there was a problem at all.

fritzo (Member) commented Apr 14, 2022

Don't you think that the bug could have gone unnoticed?

It could have, but TraceEnum_ELBO is pretty heavily used, and we've done a lot of memory profiling in the past. After working with Pyro and PyTorch for a few years, my posterior is 40% on a recent PyTorch regression, 40% on an edge case memory leak in Pyro that has never been noticed, and 20% on a recently introduced weird interaction between Pyro and PyTorch, so 60% chance this could be narrowed down by searching through PyTorch versions.

ordabayevy (Member) commented Apr 14, 2022

I have noticed a major GPU memory leak as well after switching from PyTorch 1.10 to 1.11. I wasn't able to debug it and decided to stick with PyTorch 1.10.0 (and Pyro 1.8.0) for now.

Edit: CUDA 11.6, Arch Linux
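A quick way to watch the same thing on the GPU, as a sketch assuming the model and the svi/data pair above live on a CUDA device, is to log the allocator's counters every few steps:

import torch

for step in range(1000):
    svi.step(data)
    if step % 100 == 0:
        allocated_mb = torch.cuda.memory_allocated() / 1e6  # memory currently held by live tensors
        reserved_mb = torch.cuda.memory_reserved() / 1e6    # memory held by the caching allocator
        print(f"step {step}: allocated {allocated_mb:.0f} MB, reserved {reserved_mb:.0f} MB")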

fritzo (Member) commented Apr 14, 2022

Hmm, maybe we should relax the PyTorch requirements and cut a release so that Pyro 1.8.2 works with PyTorch 1.10. We'd need to do the same with Funsor. I think I was a little too eager in dropping PyTorch 1.10 support, especially given that Colab still uses 1.10.

qinqian commented Apr 27, 2022

I have noticed a GPU memory leak too with Pyro 1.8.1+06911dc and PyTorch 1.11.0. Downgrading to Pyro 1.6.0 and PyTorch 1.8.0 works normally.
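When comparing runs across downgrades like this, it can help to record exactly which builds were active; a small sketch using attributes both libraries expose:

import pyro
import torch

print("pyro", pyro.__version__)    # e.g. 1.6.0 after the downgrade described above
print("torch", torch.__version__)  # e.g. 1.8.0
print("cuda", torch.version.cuda)  # CUDA toolkit version the torch build was compiled against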

OlaRonning (Member) commented Apr 28, 2022

Downgrading, as @qinqian suggests, also resolves #3014.
