Memory leak while walking and diffing #943

Closed
marco-c opened this issue Sep 18, 2019 · 15 comments

Comments

@marco-c

marco-c commented Sep 18, 2019

There seems to be a memory leak while iterating through commits and performing diff.
There is a sample test case in ishepard/pydriller#54 (comment).

This might be related to (or a duplicate of) #625.

@ishepard

The complete test case is the following:

import os

import psutil
from pygit2 import Repository

repo = Repository('hadoop/')
proc = psutil.Process(os.getpid())
for commit in repo.walk(repo.head.target):
    # Skip root and merge commits; diff only against a single parent.
    if len(commit.parents) == 1:
        diff = repo.diff(str(commit.parents[0].id), str(commit.id))
        for p in diff:
            if str(p.delta.status_char()) == 'D':
                # Deleted file: its content only exists in the parent tree.
                blob = commit.parents[0].tree[p.delta.old_file.path].id
                d = repo[blob].data.decode('utf-8', 'ignore')
            else:
                blob = commit.tree[p.delta.new_file.path].id
                d = repo[blob].data.decode('utf-8', 'ignore')
            # Resident memory of this process, in MB.
            print(proc.memory_info()[0] / (2 ** 20))

The script reads the source code of the modified files for every commit.
Memory usage goes up a lot, and quickly.

@marco-c
Author

marco-c commented Dec 6, 2019

@jdavid do you think you will have time to look into this at some point? I'm just wondering if we will be able to rely on this library in the future or not.

@jdavid
Member

jdavid commented Dec 6, 2019

I started to look but didn't have time to finish.

Note, however, that libgit2 has a cache, so it's normal for memory usage to increase in that code. The question is whether memory keeps increasing when the same code is run several times. Maybe you can test that?
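The distinction between cache warm-up and a real leak can be sketched as follows. This is a minimal illustration, not code from the thread: `sample_passes`, `looks_like_leak`, and the 5 MB tolerance are hypothetical names and values chosen for the example, and the memory reader is passed in as a callable so the sketch does not depend on any particular measurement library.

```python
from typing import Callable, List


def sample_passes(workload: Callable[[], None],
                  read_memory_mb: Callable[[], float],
                  passes: int = 4) -> List[float]:
    """Run the workload several times, recording memory after each pass."""
    samples = []
    for _ in range(passes):
        workload()
        samples.append(read_memory_mb())
    return samples


def looks_like_leak(samples: List[float], tolerance_mb: float = 5.0) -> bool:
    """Return True if memory keeps growing after the first pass.

    The first pass is expected to grow while caches fill, so the last
    sample is compared with the first sample, not with the starting value.
    """
    if len(samples) < 2:
        raise ValueError("need at least two passes to compare")
    return samples[-1] - samples[0] > tolerance_mb
```

With psutil, the reader could be something like `lambda: proc.memory_info().rss / 2 ** 20`, and the workload would be the walk-and-diff loop from the test case. A flat series after the first pass points to a cache; a steadily rising one points to a leak.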

@ishepard

ishepard commented Dec 9, 2019

Hi @jdavid, thanks for the response!
I will have a look at what happens when we run the code multiple times.
I will also try to run the tool on an entire repository, instead of just 6 months, and see what happens (maybe it stops growing after a while?).

@ishepard

ishepard commented Dec 9, 2019

So, I ran the code posted before on the entire hadoop repo.
The resulting memory consumption is here.
As you can see, at the end we almost hit 1GB. It seems quite a lot :) especially because we start from 60MB.
I know that hadoop is a big repo, with 23K commits and a lot of files, though I think it's good to investigate this memory consumption.

Is there a way to "clear" the cache maybe?
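One possible direction, offered as a sketch and not as a confirmed fix: pygit2 exposes some of libgit2's cache options through `pygit2.settings`, including `cache_max_size()` and `enable_caching()`. Assuming the installed pygit2 version provides them, the cache can be capped or switched off, which is not the same as clearing it. The helper names below are hypothetical, and the settings object is passed in as a parameter so the helpers can be exercised without a repository.

```python
def cap_libgit2_cache(settings, max_bytes: int) -> None:
    """Ask libgit2 to cache at most roughly `max_bytes` of object data.

    `settings` is expected to be `pygit2.settings` (an assumption about
    the installed pygit2 version).
    """
    if max_bytes < 0:
        raise ValueError("max_bytes must be non-negative")
    settings.cache_max_size(max_bytes)


def disable_libgit2_cache(settings) -> None:
    """Turn libgit2 object caching off; expect slower repeated lookups."""
    settings.enable_caching(False)


# Intended usage, assuming pygit2 is installed:
#   import pygit2
#   cap_libgit2_cache(pygit2.settings, 64 * 1024 * 1024)
```

Comparing memory curves with and without the cap would show how much of the growth is cache and how much is something else.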

PS: I am running pygit2==0.28.2, because v1.0.0 gives me an error on macOS.

@Deshke

Deshke commented Jan 20, 2020

Could this be related to saltstack/salt#50313?

@ishepard

Mmmm yes indeed, it seems so. They are also facing issues with memory usage using pygit2. Thanks for pointing it out! I will follow that thread 😄

@apex-omontgomery

Adding to @ishepard

I used a few utilities to see whether this is a Python-layer issue or a C-layer issue; it appears to be a C-layer issue. I tried with 3 different repos: hadoop, libleak, and a bare git init repo.

There are a number of other smaller leaks that look like Python-layer persistent lists or dicts, but those amount to < 30MB when the process balloons to 800MB.
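One way to make the Python-layer versus C-layer split concrete is the standard library's `tracemalloc`, which only sees allocations made through Python's own allocators. The sketch below is an illustration, not the method used in this thread, and `python_heap_growth` is a hypothetical helper name. If it reports a few megabytes while the process grows by hundreds, the remainder was allocated outside Python, for example inside libgit2.

```python
import tracemalloc
from typing import Callable


def python_heap_growth(workload: Callable[[], object]) -> int:
    """Return the bytes of Python-level allocations that survive the workload."""
    tracemalloc.start()
    try:
        before, _ = tracemalloc.get_traced_memory()
        result = workload()  # keep the return value alive while measuring
        after, _ = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return after - before
```

Memory obtained with `malloc` or `calloc` directly in a C extension is invisible to this measurement, which is why a tool working at the allocator level is still needed to locate the leak itself.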

I used memleak for the analysis, and it appears that the leak is tiny until you get to larger projects.

A bunch of stack traces showing memory leak
34 bytes
memleak
callstack[8329] expires. count=1 size=168/168 alloc=13529 free=13495
    ./libleak.so(calloc+0x2a) [0x7f2b33ac050a]
    /home/wmontgomery/.local/lib/python3.5/site-packages/.libs_pygit2/libgit2-25903d06.so.0.28.4(+0x15c889) [0x7f2b31c3e889]
    /home/wmontgomery/.local/lib/python3.5/site-packages/.libs_pygit2/libgit2-25903d06.so.0.28.4(+0xcef3f) [0x7f2b31bb0f3f]
    /home/wmontgomery/.local/lib/python3.5/site-packages/.libs_pygit2/libgit2-25903d06.so.0.28.4(+0x176d47) [0x7f2b31c58d47]
    /home/wmontgomery/.local/lib/python3.5/site-packages/.libs_pygit2/libgit2-25903d06.so.0.28.4(+0x11aa5a) [0x7f2b31bfca5a]
    /home/wmontgomery/.local/lib/python3.5/site-packages/.libs_pygit2/libgit2-25903d06.so.0.28.4(+0x11afb2) [0x7f2b31bfcfb2]
    /home/wmontgomery/.local/lib/python3.5/site-packages/.libs_pygit2/libgit2-25903d06.so.0.28.4(+0x175e98) [0x7f2b31c57e98]
    /home/wmontgomery/.local/lib/python3.5/site-packages/.libs_pygit2/libgit2-25903d06.so.0.28.4(+0x10a103) [0x7f2b31bec103]
    /home/wmontgomery/.local/lib/python3.5/site-packages/.libs_pygit2/libgit2-25903d06.so.0.28.4(+0x1356cf) [0x7f2b31c176cf]
    /home/wmontgomery/.local/lib/python3.5/site-packages/_pygit2.cpython-35m-x86_64-linux-gnu.so(diff_get_patch_byindex+0x1d) [0x7f2b321f1bdd]
    python3(PyEval_EvalFrameEx+0xaae) [0x53102e]
    python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
    python3() [0x539a13]
    python3(PyEval_EvalCode+0x1f) [0x53a6cf]
    python3() [0x6292c2]
    python3(PyRun_FileExFlags+0x9a) [0x62b76a]
    python3(PyRun_SimpleFileExFlags+0x1bc) [0x62bf5c]
    python3(Py_Main+0x456) [0x63d506]
    python3(main+0xe1) [0x4cfd11]
    /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f2b334f7830]
    python3(_start+0x29) [0x5d36e9]
bare repo
26 bytes
callstack[8288] expires. count=1 size=88/88 alloc=2645 free=2609
    ./libleak.so(calloc+0x2a) [0x7fe89018a50a]
    /home/wmontgomery/.local/lib/python3.5/site-packages/.libs_pygit2/libgit2-25903d06.so.0.28.4(+0x15c889) [0x7fe88e308889]
    /home/wmontgomery/.local/lib/python3.5/site-packages/.libs_pygit2/libgit2-25903d06.so.0.28.4(+0x1770e2) [0x7fe88e3230e2]
    /home/wmontgomery/.local/lib/python3.5/site-packages/.libs_pygit2/libgit2-25903d06.so.0.28.4(+0x11a5b9) [0x7fe88e2c65b9]
    /home/wmontgomery/.local/lib/python3.5/site-packages/.libs_pygit2/libgit2-25903d06.so.0.28.4(+0x11afb2) [0x7fe88e2c6fb2]
    /home/wmontgomery/.local/lib/python3.5/site-packages/.libs_pygit2/libgit2-25903d06.so.0.28.4(+0x175e98) [0x7fe88e321e98]
    /home/wmontgomery/.local/lib/python3.5/site-packages/.libs_pygit2/libgit2-25903d06.so.0.28.4(+0x10a103) [0x7fe88e2b6103]
    /home/wmontgomery/.local/lib/python3.5/site-packages/.libs_pygit2/libgit2-25903d06.so.0.28.4(+0x1356cf) [0x7fe88e2e16cf]
    /home/wmontgomery/.local/lib/python3.5/site-packages/_pygit2.cpython-35m-x86_64-linux-gnu.so(diff_get_patch_byindex+0x1d) [0x7fe88e8bbbdd]
    python3(PyEval_EvalFrameEx+0xaae) [0x53102e]
    python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
    python3() [0x539a13]
    python3(PyEval_EvalCode+0x1f) [0x53a6cf]
    python3() [0x6292c2]
    python3(PyRun_FileExFlags+0x9a) [0x62b76a]
    python3(PyRun_SimpleFileExFlags+0x1bc) [0x62bf5c]
    python3(Py_Main+0x456) [0x63d506]
    python3(main+0xe1) [0x4cfd11]
    /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7fe88fbc1830]
    python3(_start+0x29) [0x5d36e9]
hadoop
1663 bytes
callstack[8434] expires. count=201 size=168/33768 alloc=5461 free=3793
    ./libleak.so(calloc+0x2a) [0x7f82207a050a]
    /home/wmontgomery/.local/lib/python3.5/site-packages/.libs_pygit2/libgit2-25903d06.so.0.28.4(+0x15c889) [0x7f821e91e889]
    /home/wmontgomery/.local/lib/python3.5/site-packages/.libs_pygit2/libgit2-25903d06.so.0.28.4(+0xcef3f) [0x7f821e890f3f]
    /home/wmontgomery/.local/lib/python3.5/site-packages/.libs_pygit2/libgit2-25903d06.so.0.28.4(+0x176d47) [0x7f821e938d47]
    /home/wmontgomery/.local/lib/python3.5/site-packages/.libs_pygit2/libgit2-25903d06.so.0.28.4(+0x11a3fb) [0x7f821e8dc3fb]
    /home/wmontgomery/.local/lib/python3.5/site-packages/.libs_pygit2/libgit2-25903d06.so.0.28.4(+0xe86da) [0x7f821e8aa6da]
    /home/wmontgomery/.local/lib/python3.5/site-packages/.libs_pygit2/libgit2-25903d06.so.0.28.4(+0x11a97f) [0x7f821e8dc97f]
    /home/wmontgomery/.local/lib/python3.5/site-packages/.libs_pygit2/libgit2-25903d06.so.0.28.4(+0x11afb2) [0x7f821e8dcfb2]
    /home/wmontgomery/.local/lib/python3.5/site-packages/.libs_pygit2/libgit2-25903d06.so.0.28.4(+0x175e98) [0x7f821e937e98]
    /home/wmontgomery/.local/lib/python3.5/site-packages/.libs_pygit2/libgit2-25903d06.so.0.28.4(+0x10a103) [0x7f821e8cc103]
    /home/wmontgomery/.local/lib/python3.5/site-packages/.libs_pygit2/libgit2-25903d06.so.0.28.4(+0x1356cf) [0x7f821e8f76cf]
    /home/wmontgomery/.local/lib/python3.5/site-packages/_pygit2.cpython-35m-x86_64-linux-gnu.so(diff_get_patch_byindex+0x1d) [0x7f821eed1bdd]
    python3(PyEval_EvalFrameEx+0xaae) [0x53102e]
    python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
    python3() [0x539a13]
    python3(PyEval_EvalCode+0x1f) [0x53a6cf]
    python3() [0x6292c2]
    python3(PyRun_FileExFlags+0x9a) [0x62b76a]
    python3(PyRun_SimpleFileExFlags+0x1bc) [0x62bf5c]
    python3(Py_Main+0x456) [0x63d506]
    python3(main+0xe1) [0x4cfd11]
    /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f82201d7830]
    python3(_start+0x29) [0x5d36e9]

In case it is also helpful, here is another call stack that has a possible leak:

hadoop
callstack[8280] expires. count=1 size=80/80 alloc=10601 free=3562
    ./libleak.so(calloc+0x2a) [0x7f82207a050a]
    /home/wmontgomery/.local/lib/python3.5/site-packages/.libs_pygit2/libgit2-25903d06.so.0.28.4(+0x15c889) [0x7f821e91e889]
    /home/wmontgomery/.local/lib/python3.5/site-packages/.libs_pygit2/libgit2-25903d06.so.0.28.4(+0x13a535) [0x7f821e8fc535]
    /home/wmontgomery/.local/lib/python3.5/site-packages/.libs_pygit2/libgit2-25903d06.so.0.28.4(git_object_lookup_prefix+0xf7) [0x7f821e8fc6f7]
    /home/wmontgomery/.local/lib/python3.5/site-packages/.libs_pygit2/libgit2-25903d06.so.0.28.4(+0xe469c) [0x7f821e8a669c]
    /home/wmontgomery/.local/lib/python3.5/site-packages/.libs_pygit2/libgit2-25903d06.so.0.28.4(+0xff69e) [0x7f821e8c169e]
    /home/wmontgomery/.local/lib/python3.5/site-packages/.libs_pygit2/libgit2-25903d06.so.0.28.4(+0x100cd4) [0x7f821e8c2cd4]
    /home/wmontgomery/.local/lib/python3.5/site-packages/.libs_pygit2/libgit2-25903d06.so.0.28.4(git_diff_tree_to_tree+0x17a) [0x7f821e8c3c5a]
    /home/wmontgomery/.local/lib/python3.5/site-packages/_pygit2.cpython-35m-x86_64-linux-gnu.so(Tree_diff_to_tree+0x13d) [0x7f821eeda33d]
    python3(PyCFunction_Call+0x77) [0x4e1307]
    python3(PyEval_EvalFrameEx+0x6b80) [0x537100]
    python3() [0x539a13]
    python3(PyEval_EvalFrameEx+0x5122) [0x5356a2]
    python3(PyEval_EvalFrameEx+0x4b64) [0x5350e4]
    python3() [0x539a13]
    python3(PyEval_EvalCode+0x1f) [0x53a6cf]
    python3() [0x6292c2]
    python3(PyRun_FileExFlags+0x9a) [0x62b76a]
    python3(PyRun_SimpleFileExFlags+0x1bc) [0x62bf5c]
    python3(Py_Main+0x456) [0x63d506]
    python3(main+0xe1) [0x4cfd11]
    /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f82201d7830]
    python3(_start+0x29) [0x5d36e9]

This is the modified reproduction script:

from pygit2 import Repository
import psutil
import os


def mem_leak():
    #repo = Repository('libleak/')
    repo = Repository('hadoop/')
    #repo = Repository('empty_dir/')
    proc = psutil.Process(os.getpid())
    for commit in repo.walk(repo.head.target):
        if len(commit.parents) == 1:
            diff = repo.diff(str(commit.parents[0].id), str(commit.id))
            for p in diff:
                if str(p.delta.status_char()) == 'D':
                    blob = commit.parents[0].tree[p.delta.old_file.path].id
                    d = repo[blob].data.decode('utf-8', 'ignore')
                else:
                    blob = commit.tree[p.delta.new_file.path].id
                    d = repo[blob].data.decode('utf-8', 'ignore')


while True:
    try:
        mem_leak()
    except Exception:
        # Ignore errors from a single pass, but let Ctrl-C stop the loop.
        pass

When I run with valgrind

Here are the locations that matter:

err = git_diff_tree_to_index(&diff, self->repo->repo, self->tree, index, &opts);

err = git_patch_from_diff(&patch, diff, idx);

It looks like there are ways to clear the libgit2 cache, but I'm already past what I can figure out here. A couple of things I'd try if I can figure out how to compile and run locally:

  1. Check whether a libgit2 free call is needed when the libgit2 functions return an error.
  2. Check whether these functions in libgit2 give us some information, or whether clearing the cache helps here.
  3. Try running it with a valgrind-enabled Python install.

@jdavid
Member

jdavid commented Jan 23, 2020

Thanks @wimo7083 for the detailed report.

Maybe someone can try with older versions of libgit2/pygit2, to see whether this is a regression?

@jdavid
Member

jdavid commented Jan 24, 2020

I've started reviewing with the help of valgrind, and made the first commit. It will take time to fix this issue though. You can take a look at commit f0724c5; I've added some notes on using valgrind, see https://github.com/libgit2/pygit2/blob/master/docs/development.rst#running-valgrind

@jdavid jdavid closed this as completed in 4bb5893 Jan 27, 2020
jdavid added a commit that referenced this issue Jan 27, 2020
@jdavid
Member

jdavid commented Jan 27, 2020

I've backported the fix to the 1.0.x branch. Please verify.

Also, the code could be faster. I've opened an issue for that, see #969 (a bit of work, but not difficult, if someone wants to give it a try).

@ishepard

ishepard commented Jan 28, 2020

Hi @jdavid, thanks for looking into this!
So, I've run the same code that I posted on this issue on pygit2 1.0.2 and on pygit2 from the 1.0.x branch, and these are the results (I did 4 runs per version, always obtaining the same results):

First column is number of commits, second is memory.
pygit2 1.0.2: https://pastebin.com/hy6ctNnh
pygit2 1.0.x: https://pastebin.com/yjx17EUi

In version 1.0.2, after 23K commits we end up consuming around 880MB, while with version 1.0.x we end up with 760MB. This is consistent for all my runs.

As you can see, there seems to be an improvement! We have 120MB less memory consumed. Pretty good!

Do you think we can improve something else as well? Or is this the best we can achieve?

@jdavid
Member

jdavid commented Jan 29, 2020

@ishepard One thing is the memory leak. I get this output, with the master branch (see my test script at the bottom):

Start : 10 MB
Loop 0: 521 MB 493s
Loop 1: 527 MB 454s
Loop 2: 527 MB 453s
Loop 3: 527 MB 470s
End   : 511 MB

So the memory is stable now, the leak is fixed. Now, I think most of the memory used is the libgit2 cache, but I've not analysed it.

My test script, derived from the test script posted above:

import gc, os, psutil, time
import pygit2

proc = psutil.Process(os.getpid())
repo = pygit2.Repository('hadoop/')

def mem_leak():
    for commit in repo.walk(repo.head.target):
        parents = commit.parents
        if len(parents) == 1:
            parent = parents[0]
            diff = repo.diff(str(parent.id), str(commit.id))
            for p in diff:
                delta = p.delta 
                status_char = str(delta.status_char())
                if status_char == 'D':
                    f = delta.old_file
                    obj = parent
                else:
                    f = delta.new_file
                    obj = commit
                path = f.path
                tree = obj.tree
                blob = tree[path].id
                repo[blob].data.decode('utf-8', 'ignore')


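def mem_data():
    # `data` is a platform-specific field of memory_info() (present on Linux,
    # absent on macOS and Windows); `rss` is the portable alternative.
    data = int(proc.memory_info().data / (1024 * 1024))
    return f'{data} MB'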


def test_memory(n):
    print(f'Start : {mem_data()}')
    for i in range(n):
        t0 = time.time()
        mem_leak()
        t = int(time.time() - t0)
        gc.collect()
        print(f'Loop {i}: {mem_data()} {t}s')

    global repo
    del repo
    gc.collect()
    print(f'End   : {mem_data()}')


if __name__ == '__main__':
    test_memory(4)

@apex-omontgomery

Thank you for the assistance here.

@jdavid
Member

jdavid commented Jan 31, 2020

Just released 1.0.3 (wheel upload in progress).
