
Memory usage #186

Closed · hmcezar opened this issue Jul 11, 2022 · 13 comments · Fixed by #194
Assignees: hmcezar
Labels: optimization (Make code go brrrr)

hmcezar commented Jul 11, 2022

When running simulations for a lot of steps, it looks like there are memory leaks that make the memory usage increase over time.

We should use tracemalloc and other tools to see how we can fix this problem.
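
For example, a rough sketch of the kind of tracemalloc check we could wrap around the MD loop (the snapshot interval, n_steps, and run_md_step are placeholders, not actual HyMD names):

import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames per allocation traceback

previous = tracemalloc.take_snapshot()
for step in range(n_steps):          # n_steps: placeholder for the configured step count
    run_md_step()                    # placeholder for one HyMD integration step
    if step % 100 == 0:              # arbitrary snapshot interval
        current = tracemalloc.take_snapshot()
        # print the lines whose allocations grew the most since the last snapshot
        for stat in current.compare_to(previous, "lineno")[:10]:
            print(stat)
        previous = current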

@hmcezar hmcezar added the optimization Make code go brrrr label Jul 11, 2022
@hmcezar hmcezar self-assigned this Jul 11, 2022
hmcezar commented Jul 20, 2022

Just adding some more information: here is what happens when I use HyMD on Saga (running with 10 MPI processes; plot created using memory-profiler):
[plot: mprof]
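
(For reference, a curve like this can also be sampled directly from Python with memory-profiler; just a sketch, with run_simulation standing in for the HyMD entry point:)

from memory_profiler import memory_usage

def run_simulation():
    ...  # placeholder: call HyMD's main loop here

# Sample the resident memory of the process once per second while it runs
samples = memory_usage((run_simulation, (), {}), interval=1.0, timestamps=True)
for rss_mib, t in samples:
    print(t, rss_mib)  # timestamp (epoch seconds) and memory in MiB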

These are the packages that I load:

Currently Loaded Modules:
  1) StdEnv                        (S)   9) libpciaccess/0.16-GCCcore-11.2.0 (H)  17) ncurses/6.2-GCCcore-11.2.0     (H)
  2) GCCcore/11.2.0                     10) hwloc/2.5.0-GCCcore-11.2.0       (H)  18) libreadline/8.1-GCCcore-11.2.0 (H)
  3) zlib/1.2.11-GCCcore-11.2.0    (H)  11) hpcx/2.9                              19) Tcl/8.6.11-GCCcore-11.2.0      (H)
  4) binutils/2.37-GCCcore-11.2.0  (H)  12) OpenMPI/4.1.1-GCC-11.2.0              20) SQLite/3.36-GCCcore-11.2.0     (H)
  5) GCC/11.2.0                         13) gompi/2021b                           21) GMP/6.2.1-GCCcore-11.2.0       (H)
  6) numactl/2.0.14-GCCcore-11.2.0 (H)  14) Szip/2.1.1-GCCcore-11.2.0        (H)  22) libffi/3.4.2-GCCcore-11.2.0    (H)
  7) XZ/5.2.5-GCCcore-11.2.0       (H)  15) HDF5/1.12.1-gompi-2021b               23) OpenSSL/1.1                    (H)
  8) libxml2/2.9.10-GCCcore-11.2.0 (H)  16) bzip2/1.0.8-GCCcore-11.2.0       (H)  24) Python/3.9.6-GCCcore-11.2.0

And these are the packages installed in my Python environment:

attrs==21.4.0
Cython==0.29.30
h5py==3.7.0
-e git+https://github.com/hmcezar/HyMD.git@a6d1b1f8ba719150cb0929e812402599c66d6514#egg=hymd
iniconfig==1.1.1
memory-profiler==0.60.0
mpi4py==3.1.3
mpmath==1.2.1
mpsort==0.1.17
networkx==2.8.4
numpy==1.23.0
packaging==21.3
pfft-python==0.1.21
pluggy==1.0.0
plumed==2.9.0.dev0
pmesh==0.1.56
psutil==5.9.1
py==1.11.0
pyparsing==3.0.9
pytest==7.1.2
pytest-mpi==0.6
sympy==1.10.1
tomli==2.0.1

I'm not sure if this is the problem, but some people have reported leaks with HDF5 1.12.1, and a workaround is given here.

On my laptop, the leak is not as brutal. But I'm using different packages and Python versions (even though HDF5 is the same version), and I also ran shorter simulations with fewer threads. Still, there seems to be a small leak somewhere. tracemalloc shows a lot of memory being deallocated and allocated in each MD step, so debugging this will be a bit trickier than I expected.

Lun4m commented Jul 20, 2022

Nice graph! 😅
Maybe try loading HDF5 1.21.0 if it's available on Saga, so we can actually make sure it's an HDF5 issue?

hmcezar commented Jul 20, 2022

> Nice graph! 😅 Maybe try loading HDF5 1.21.0 if it's available on Saga, so we can actually make sure it's an HDF5 issue?

Unfortunately, it's not available... but I'll check if there's an EasyBuild config for it :)
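
(By the way, a quick sanity check of which HDF5 version h5py is actually linked against can be done like this:)

import h5py

print(h5py.version.hdf5_version)  # version of the HDF5 library h5py was built against
print(h5py.version.info)          # full version summary (h5py, HDF5, Python, NumPy, ...)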

hmcezar commented Jul 20, 2022

Well, after a short test, it looks like HDF5 is not to blame. The plot below uses HDF5 1.12.0, and the trend is the same.

[plot: mprof2]

Lun4m commented Jul 21, 2022

Maybe this software could be useful?

hmcezar commented Jul 21, 2022

One thing that is strange to me is how different the memory usage is, for the same input (and the same number of MPI processes), on Saga:
[plot: mprof_saga]

And on my laptop:
[plot: mprof_local]

There are a few differences concerning the MPI implementation (OpenMPI on Saga, MPICH on my laptop), Python versions, etc., but the difference is still huge.
However, it is still possible to notice a small leak even on my laptop.

hmcezar commented Jul 21, 2022

This is the profiling of the main loop for about 2000 steps running on my machine using Blackfire:

https://blackfire.io/profiles/4b4e2cd8-6985-49a3-97d5-d6925326f515/graph

The memory profiling seems to point to pmesh.domain.Layout._exchange, so maybe the problem is there?

hmcezar commented Jul 22, 2022

Some more information: if I run HyMD on Saga with 4 MPI processes, I get what I got before:
[plot: mprof_parallel]

However, if I run without mpirun, there is no leak:
[plot: mprof_serial]

If I run with mpirun -n 1, it also doesn't leak:
[plot: mprof_serial_mpirun1]

Lun4m commented Jul 22, 2022

Which makes sense, since from the Blackfire report it seemed the problem was in an mpi4py Alltoallv call.
I was looking at the source code of pmesh.domain.Layout._exchange, and the only place where that happens is here:

dt = MPI.BYTE.Create_contiguous(itemsize)
dt.Commit()
dtype = numpy.dtype((data.dtype, data.shape[1:]))
recvbuffer = numpy.empty(self.recvlength, dtype=dtype, order='C')
self.comm.Barrier()

# now fire
rt = self.comm.Alltoallv((buffer, (self.sendcounts, self.sendoffsets), dt), 
                    (recvbuffer, (self.recvcounts, self.recvoffsets), dt))
dt.Free()
self.comm.Barrier()

Maybe you don't need to assign the variable rt and can just call comm.Alltoallv? Could it be that simple? 😅
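
If not, a standalone reproducer might help rule pmesh in or out; something along these lines (buffer sizes and iteration count are arbitrary, this is only a sketch):

import numpy as np
import psutil
from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

n_per_rank = 10000
sendbuf = np.arange(n_per_rank * size, dtype=np.float64)
recvbuf = np.empty_like(sendbuf)
counts = np.full(size, n_per_rank, dtype=int)
displs = np.arange(size, dtype=int) * n_per_rank

proc = psutil.Process()
for step in range(5000):
    # same collective pattern as pmesh's _exchange, but with a fixed datatype
    comm.Alltoallv((sendbuf, (counts, displs), MPI.DOUBLE),
                   (recvbuf, (counts, displs), MPI.DOUBLE))
    if rank == 0 and step % 500 == 0:
        # RSS should stay roughly flat if the MPI library itself does not leak
        print(step, proc.memory_info().rss // 1024**2, "MiB")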

hmcezar commented Jul 25, 2022

That was the first thing I tried when I saw that call, but unfortunately removing the rt assignment didn't help 😢.

I'm trying a different OpenMPI version (I'm having to recompile some stuff) to see if that fixes the problem. There were some leak-related fixes from OpenMPI 4.1.2 onwards.
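
(Side note: to confirm which MPI library mpi4py actually picks up at runtime, a small check like this should do:)

from mpi4py import MPI

print(MPI.Get_version())          # MPI standard version, e.g. (3, 1)
print(MPI.Get_library_version())  # vendor string; for Open MPI it includes the release number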

hmcezar commented Jul 25, 2022

More evidence points to this OpenMPI version as the culprit of the leak. Running memray (thanks @Lun4m!) with the flags for tracking every allocation (and after a bugfix), we see that the memory usage is attributed to the mpi4py import (the last Python call), which goes down to a PMPI_Init_thread call from the 4.1.1 module.
[memray flamegraph screenshot]

The full flamegraph is the HTML file inside this archive:
memray-flamegraph-hymd.21000.zip
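
For reference, memray can also be attached from Python instead of the CLI; a sketch (main_loop is a placeholder, and the two options below are what I mean by tracking native frames and every allocation):

import memray

# the resulting hymd.bin can be turned into a flamegraph with `memray flamegraph hymd.bin`
with memray.Tracker("hymd.bin",
                    native_traces=True,             # resolve native frames such as PMPI_Init_thread
                    trace_python_allocators=True):  # also record pymalloc allocations
    main_loop()  # placeholder for the HyMD run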

I opened a ticket with the Sigma2 people requesting a newer toolchain with a more recent OpenMPI.
Hopefully, with the new toolchain, this leak will be gone.

hmcezar commented Jul 27, 2022

Good news: the problem apparently was, indeed, with OpenMPI.
[plot: mprof_fixed]

With the 2022a toolchain on Saga (which uses OpenMPI 4.1.4), HDF5 1.13.1, and Python 3.10.4, the problem seems to be solved (at least the biggest leak; there might be others).

These are the loaded modules:

  1) StdEnv                        (S)   8) libxml2/2.9.13-GCCcore-11.3.0         15) PMIx/4.1.2-GCCcore-11.3.0   22) ncurses/6.3-GCCcore-11.3.0
  2) GCCcore/11.3.0                      9) libpciaccess/0.16-GCCcore-11.3.0      16) UCC/1.0.0-GCCcore-11.3.0    23) libreadline/8.1.2-GCCcore-11.3.0
  3) zlib/1.2.12-GCCcore-11.3.0         10) hwloc/2.7.1-GCCcore-11.3.0            17) OpenMPI/4.1.4-GCC-11.3.0    24) Tcl/8.6.12-GCCcore-11.3.0
  4) binutils/2.38-GCCcore-11.3.0       11) OpenSSL/1.1                      (H)  18) gompi/2022a                 25) SQLite/3.38.3-GCCcore-11.3.0
  5) GCC/11.3.0                         12) libevent/2.1.12-GCCcore-11.3.0        19) Szip/2.1.1-GCCcore-11.3.0   26) GMP/6.2.1-GCCcore-11.3.0
  6) numactl/2.0.14-GCCcore-11.3.0      13) UCX/1.12.1-GCCcore-11.3.0             20) HDF5/1.13.1-gompi-2022a     27) libffi/3.4.2-GCCcore-11.3.0
  7) XZ/5.2.5-GCCcore-11.3.0            14) libfabric/1.15.1-GCCcore-11.3.0       21) bzip2/1.0.8-GCCcore-11.3.0  28) Python/3.10.4-GCCcore-11.3.0

And these are the packages in my Python environment:

Cython==0.29.30
h5py==3.7.0
-e git+https://github.com/hmcezar/HyMD.git@621c95abbc8f57f46f04fea12550111b7bda5c09#egg=hymd
memory-profiler==0.60.0
mpi4py==3.1.3
mpmath==1.2.1
mpsort==0.1.17
networkx==2.8.5
numpy==1.23.1
pfft-python==0.1.21
pmesh==0.1.56
psutil==5.9.1
sympy==1.10.1
tomli==2.0.1

@pablogsal

👋 Hi @hmcezar,

I am one of the authors of memray. We are collecting success stories here. If you have a minute, do you mind leaving a short message on how memray helped with this issue? Knowing how we managed to help will help us track trends and target areas for improvement, prioritize new features and development and identify potential bugs or areas of confusion.

Thanks a lot for your consideration and for helping us improve the profiler :)
