Memory leak while walking and diffing #943
The complete test case is the following:
I get the source code of the modified files for every commit.
@jdavid do you think you will have time to look into this at some point? I'm just wondering if we will be able to rely on this library in the future or not.
I started to look but didn't have time to finish. Note, however, that libgit2 has a cache, so it's normal that memory usage increases in that code. The question is whether running the same code several times keeps increasing memory, or whether it plateaus. Maybe you can test that?
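The cache-vs-leak question above can be tested empirically: run the same walk several times and check whether memory plateaus after the first pass (cache warm-up) or keeps climbing (leak). A minimal, standard-library-only sketch (Unix-only, since it uses `resource`); the `workload` function here is a stand-in for the walk-and-diff loop, not real pygit2 code:

```python
import resource

def rss_mb():
    # Peak resident set size so far; ru_maxrss is KB on Linux, bytes on macOS.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

def memory_trend(workload, runs=3):
    """Run `workload` repeatedly, recording peak RSS after each pass.

    A cache shows a big jump on the first pass and little growth after;
    a leak keeps growing on every pass.
    """
    readings = []
    for _ in range(runs):
        workload()
        readings.append(rss_mb())
    return readings

# Stand-in workload: refilling the same keys each pass plateaus like a cache.
cache = {}
def workload():
    for i in range(50_000):
        cache[i] = i

if __name__ == '__main__':
    readings = memory_trend(workload)
    print(f'peak RSS after each pass: {readings}')
```

If the readings stop growing after the first pass, the usage is cache warm-up rather than a leak.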
Hi @jdavid, thanks for the response!
So, I ran the code posted before on the entire hadoop repo. Is there a way to "clear" the cache maybe? PS: I am running pygit2==0.28.2, because v1.0.0 is giving me an error on macOS.
Could this be related to saltstack/salt#50313?
Mmmm yes indeed, it seems so. They are also facing issues with memory usage using pygit2. Thanks for pointing it out! I will follow that thread 😄
Adding to @ishepard: I used a few utilities to see whether this is a Python-layer issue or a C-layer issue, and it appears to be a C-layer issue. I tried with 3 different repos: hadoop, libleak, and a bare `git init` repo. There are a bunch of other smaller leaks that look like Python-layer persistent lists or dicts, but those amount to < 30 MB while the process balloons to 800 MB. I used memleak for the analysis, and it appears there's a tiny leak that only becomes visible on larger projects. A bunch of stack traces showing the memory leak:
If this is also helpful, this appears to be another call stack with a possible leak:
This is the modified replication script:

```python
from pygit2 import Repository
import psutil
import os

def mem_leak():
    #repo = Repository('libleak/')
    repo = Repository('hadoop/')
    #repo = Repository('empty_dir/')
    proc = psutil.Process(os.getpid())
    for commit in repo.walk(repo.head.target):
        if len(commit.parents) == 1:
            diff = repo.diff(str(commit.parents[0].id), str(commit.id))
            for p in diff:
                if str(p.delta.status_char()) == 'D':
                    blob = commit.parents[0].tree[p.delta.old_file.path].id
                    d = repo[blob].data.decode('utf-8', 'ignore')
                else:
                    blob = commit.tree[p.delta.new_file.path].id
                    d = repo[blob].data.decode('utf-8', 'ignore')

while True:
    try:
        mem_leak()
    except:
        pass
```

When I run with valgrind, here are the locations that matter: line 308 in 2ebfeb8 and line 442 in d76de97.
It looks like there are ways to clear the libgit2 cache, but I'm already past what I can figure out here. A couple of things I'd try if I can figure out how to compile and run locally:
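Short of clearing the cache through the API, one workaround consistent with this thread (the test script below frees memory with `del repo` followed by `gc.collect()`) is to process commits in batches and recreate the repository handle between batches. This is a generic, hedged sketch: `make_handle`, `process_batch`, and `cursor` are hypothetical names; with pygit2, `make_handle` would be something like `lambda: pygit2.Repository(path)` and the cursor would track the last processed commit id.

```python
import gc

def run_with_fresh_handles(make_handle, process_batch, n_batches):
    """Process work in batches, recreating the handle between batches.

    Dropping the only reference to the handle and collecting lets any
    cache tied to it (e.g. libgit2's object cache behind a pygit2
    Repository) be released, bounding peak memory.
    """
    cursor = None
    for _ in range(n_batches):
        handle = make_handle()
        cursor = process_batch(handle, cursor)  # returns the resume point
        del handle
        gc.collect()
    return cursor

if __name__ == '__main__':
    # Toy demonstration: the "handle" is a throwaway list, the cursor a count.
    result = run_with_fresh_handles(
        make_handle=lambda: list(range(1000)),
        process_batch=lambda handle, cur: (cur or 0) + len(handle),
        n_batches=3,
    )
    print(result)  # 3000
```

Whether this actually releases libgit2's cache depends on the bindings freeing the underlying repository when the Python object is collected, which this sketch assumes rather than demonstrates.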
Thanks @wimo7083 for the detailed report. Maybe someone can try with older versions of libgit2/pygit2, to see whether this is a regression?
I've started reviewing with the help of valgrind, and made a first commit. It will take time to fix this issue, though. You can take a look at commit f0724c5; I've added some notes on using valgrind, see https://github.com/libgit2/pygit2/blob/master/docs/development.rst#running-valgrind
I've backported the fix to the 1.0.x branch. Please verify. Also, the code could be faster; I've opened an issue for that, see #969 (a bit of work, but not difficult, if someone wants to give it a try).
Hi @jdavid, thanks for looking into this! The first column is the number of commits, the second is memory. In version 1.0.2, after 23K commits we end up consuming around 880MB, while with version 1.0.x we end up with 760MB. This is consistent across all my runs. As you can see, there seems to be an improvement: 120MB less memory consumed. Pretty good! Do you think we can improve something else as well, or is this the best we can achieve?
@ishepard One thing is the memory leak. I get this output with the master branch (see my test script at the bottom):

So the memory is stable now; the leak is fixed. Now, I think most of the memory used is the libgit2 cache, but I've not analysed it.

My test script, derived from the test script posted above:

```python
import gc, os, psutil, time
import pygit2

proc = psutil.Process(os.getpid())
repo = pygit2.Repository('hadoop/')

def mem_leak():
    for commit in repo.walk(repo.head.target):
        parents = commit.parents
        if len(parents) == 1:
            parent = parents[0]
            diff = repo.diff(str(parent.id), str(commit.id))
            for p in diff:
                delta = p.delta
                status_char = str(delta.status_char())
                if status_char == 'D':
                    f = delta.old_file
                    obj = parent
                else:
                    f = delta.new_file
                    obj = commit
                path = f.path
                tree = obj.tree
                blob = tree[path].id
                repo[blob].data.decode('utf-8', 'ignore')

def mem_data():
    data = int(proc.memory_info().data / (1024 * 1024))
    return f'{data} MB'

def test_memory(n):
    print(f'Start : {mem_data()}')
    for i in range(n):
        t0 = time.time()
        mem_leak()
        t = int(time.time() - t0)
        gc.collect()
        print(f'Loop {i}: {mem_data()} {t}s')
    global repo
    del repo
    gc.collect()
    print(f'End   : {mem_data()}')

if __name__ == '__main__':
    test_memory(4)
```
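Earlier in the thread the remaining usage was attributed to the C layer, since Python-layer objects accounted for under 30 MB of an 800 MB process. That split can be drawn with the standard library alone: `tracemalloc` only sees allocations made through Python's allocator, so any process growth beyond its numbers must live in C code (here, libgit2). A hedged sketch, with a dummy workload standing in for the walk-and-diff loop:

```python
import tracemalloc

def python_layer_growth_mb(workload):
    """Measure how much memory Python objects retain across `workload`.

    tracemalloc tracks only Python-level allocations; if process RSS
    grows far beyond this number, the extra memory is held by C code.
    """
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    workload()
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return (after - before) / (1024 * 1024)

# Dummy workload retaining roughly 1 MB of Python objects.
retained = []
def workload():
    retained.extend(bytes(1024) for _ in range(1024))

if __name__ == '__main__':
    print(f'Python-layer growth: {python_layer_growth_mb(workload):.1f} MB')
```

Comparing this figure with the `psutil` reading from the scripts above indicates how much of the growth is attributable to each layer.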
Thank you for the assistance here. |
Just released 1.0.3 (wheel upload in progress).
There seems to be a memory leak while iterating through commits and performing diffs.
There is a sample test case in ishepard/pydriller#54 (comment).
This might be related to (or a duplicate of) #625.