Memory leak while walking and diffing #943
The complete test case is the following:
I get the source code of the modified files for every commit.
@jdavid do you think you will have time to look into this at some point? I'm just wondering if we will be able to rely on this library in the future or not.
I started to look but didn't have time to finish. Note, however, that libgit2 has a cache, so it's normal that memory usage increases in that code. The question is whether running the same code several times keeps increasing memory, or whether it plateaus. Maybe you can test that?
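The cache-vs-leak question above can be tested empirically: run the same walk several times and check whether memory plateaus after the first pass (cache warm-up) or keeps climbing (leak). A minimal, standard-library-only sketch (Unix-only, since it uses `resource`); the `workload` function here is a stand-in for the walk-and-diff loop, not real pygit2 code:

```python
import resource

def rss_mb():
    # Peak resident set size so far; ru_maxrss is KB on Linux, bytes on macOS.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

def memory_trend(workload, runs=3):
    """Run `workload` repeatedly, recording peak RSS after each pass.

    A cache shows a big jump on the first pass and little growth after;
    a leak keeps growing on every pass.
    """
    readings = []
    for _ in range(runs):
        workload()
        readings.append(rss_mb())
    return readings

# Stand-in workload: refilling the same keys each pass plateaus like a cache.
cache = {}
def workload():
    for i in range(50_000):
        cache[i] = i

if __name__ == '__main__':
    readings = memory_trend(workload)
    print(f'peak RSS after each pass: {readings}')
```

If the readings stop growing after the first pass, the usage is cache warm-up rather than a leak.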
Hi @jdavid, thanks for the response!
So, I ran the code posted before on the entire hadoop repo. Is there a way to "clear" the cache maybe? PS: I am running pygit2==0.28.2, because v1.0.0 is giving me an error on macOS.
Could this be related to saltstack/salt#50313?
Mmmm yes indeed, it seems so. They are also facing issues with memory usage using pygit2. Thanks for pointing it out! I will follow that thread 😄
Adding to @ishepard: I used a few utilities to see whether this is a Python-layer issue or a C-layer issue, and it appears to be a C-layer issue. I tried with 3 different repos: hadoop, libleak, and a bare `git init` repo. There are a bunch of other smaller leaks that look like Python-layer persistent lists or dicts, but those amount to < 30 MB while the process balloons to 800 MB. I used memleak for the analysis, and it appears there's a tiny leak that only becomes visible on larger projects. A bunch of stack traces showing the memory leak:
If this is also helpful, this appears to be another call stack with a possible leak:
This is the modified replication script:

```python
from pygit2 import Repository
import psutil
import os

def mem_leak():
    #repo = Repository('libleak/')
    repo = Repository('hadoop/')
    #repo = Repository('empty_dir/')
    proc = psutil.Process(os.getpid())
    for commit in repo.walk(repo.head.target):
        if len(commit.parents) == 1:
            diff = repo.diff(str(commit.parents[0].id), str(commit.id))
            for p in diff:
                if str(p.delta.status_char()) == 'D':
                    blob = commit.parents[0].tree[p.delta.old_file.path].id
                    d = repo[blob].data.decode('utf-8', 'ignore')
                else:
                    blob = commit.tree[p.delta.new_file.path].id
                    d = repo[blob].data.decode('utf-8', 'ignore')

while True:
    try:
        mem_leak()
    except:
        pass
```

When I run with valgrind, here are the locations that matter: line 308 in 2ebfeb8 and line 442 in d76de97.
It looks like there are ways to clear the libgit2 cache, but I'm already past what I can figure out here. A couple of things I'd try if I can figure out how to compile and run locally:
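Short of clearing the cache through the API, one workaround consistent with this thread (the test script below frees memory with `del repo` followed by `gc.collect()`) is to process commits in batches and recreate the repository handle between batches. This is a generic, hedged sketch: `make_handle`, `process_batch`, and `cursor` are hypothetical names; with pygit2, `make_handle` would be something like `lambda: pygit2.Repository(path)` and the cursor would track the last processed commit id.

```python
import gc

def run_with_fresh_handles(make_handle, process_batch, n_batches):
    """Process work in batches, recreating the handle between batches.

    Dropping the only reference to the handle and collecting lets any
    cache tied to it (e.g. libgit2's object cache behind a pygit2
    Repository) be released, bounding peak memory.
    """
    cursor = None
    for _ in range(n_batches):
        handle = make_handle()
        cursor = process_batch(handle, cursor)  # returns the resume point
        del handle
        gc.collect()
    return cursor

if __name__ == '__main__':
    # Toy demonstration: the "handle" is a throwaway list, the cursor a count.
    result = run_with_fresh_handles(
        make_handle=lambda: list(range(1000)),
        process_batch=lambda handle, cur: (cur or 0) + len(handle),
        n_batches=3,
    )
    print(result)  # 3000
```

Whether this actually releases libgit2's cache depends on the bindings freeing the underlying repository when the Python object is collected, which this sketch assumes rather than demonstrates.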
Thanks @wimo7083 for the detailed report. Maybe someone can try with older versions of libgit2/pygit2, to see whether this is a regression?
I've started reviewing with the help of valgrind, and made a first commit. It will take time to fix this issue, though. You can take a look at commit f0724c5; I've added some notes on using valgrind, see https://github.com/libgit2/pygit2/blob/master/docs/development.rst#running-valgrind
I've backported the fix to the 1.0.x branch. Please verify. Also, the code could be faster; I've opened an issue for that, see #969 (a bit of work, but not difficult, if someone wants to give it a try).
Hi @jdavid, thanks for looking into this! The first column is the number of commits, the second is memory. In version 1.0.2, after 23K commits we end up consuming around 880MB, while with version 1.0.x we end up with 760MB. This is consistent across all my runs. As you can see, there seems to be an improvement: 120MB less memory consumed. Pretty good! Do you think we can improve something else as well, or is this the best we can achieve?
@ishepard One thing is the memory leak. I get this output with the master branch (see my test script at the bottom):

So the memory is stable now; the leak is fixed. Now, I think most of the memory used is the libgit2 cache, but I've not analysed it.

My test script, derived from the test script posted above:

```python
import gc, os, psutil, time
import pygit2

proc = psutil.Process(os.getpid())
repo = pygit2.Repository('hadoop/')

def mem_leak():
    for commit in repo.walk(repo.head.target):
        parents = commit.parents
        if len(parents) == 1:
            parent = parents[0]
            diff = repo.diff(str(parent.id), str(commit.id))
            for p in diff:
                delta = p.delta
                status_char = str(delta.status_char())
                if status_char == 'D':
                    f = delta.old_file
                    obj = parent
                else:
                    f = delta.new_file
                    obj = commit
                path = f.path
                tree = obj.tree
                blob = tree[path].id
                repo[blob].data.decode('utf-8', 'ignore')

def mem_data():
    data = int(proc.memory_info().data / (1024 * 1024))
    return f'{data} MB'

def test_memory(n):
    print(f'Start : {mem_data()}')
    for i in range(n):
        t0 = time.time()
        mem_leak()
        t = int(time.time() - t0)
        gc.collect()
        print(f'Loop {i}: {mem_data()} {t}s')
    global repo
    del repo
    gc.collect()
    print(f'End   : {mem_data()}')

if __name__ == '__main__':
    test_memory(4)
```
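Earlier in the thread the remaining usage was attributed to the C layer, since Python-layer objects accounted for under 30 MB of an 800 MB process. That split can be drawn with the standard library alone: `tracemalloc` only sees allocations made through Python's allocator, so any process growth beyond its numbers must live in C code (here, libgit2). A hedged sketch, with a dummy workload standing in for the walk-and-diff loop:

```python
import tracemalloc

def python_layer_growth_mb(workload):
    """Measure how much memory Python objects retain across `workload`.

    tracemalloc tracks only Python-level allocations; if process RSS
    grows far beyond this number, the extra memory is held by C code.
    """
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    workload()
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return (after - before) / (1024 * 1024)

# Dummy workload retaining roughly 1 MB of Python objects.
retained = []
def workload():
    retained.extend(bytes(1024) for _ in range(1024))

if __name__ == '__main__':
    print(f'Python-layer growth: {python_layer_growth_mb(workload):.1f} MB')
```

Comparing this figure with the `psutil` reading from the scripts above indicates how much of the growth is attributable to each layer.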
Thank you for the assistance here. |
Just released 1.0.3 (wheel upload in progress).
There seems to be a memory leak while iterating through commits and performing diffs.
There is a sample test case in ishepard/pydriller#54 (comment).
This might be related to (or a duplicate of) #625.