Git, GitDB, and GitCmdObjectDB give different results when counting objects. #765
@ali1234 Could you tell me the repository you are looking at? I would like to run https://github.com/Byron/git-count on it, assuming it's as fast as |
This is my .git/config:
|
Also this is my python object counter. It takes
|
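A minimal sketch of such a counter, based on the description in the issue body (list every object with git rev-list --objects --all, resolve each one, count by type). This is not the exact script that was attached; it uses Repo.rev_parse where the issue mentions name_to_object, and the odbt comment is an assumption about how the two backends would be compared:

```python
from collections import Counter
import git

def count_objects(path):
    # passing odbt=git.GitDB instead of the default GitCmdObjectDB would route
    # object lookups through the pure-Python backend for comparison
    repo = git.Repo(path)
    counts = Counter()
    for line in repo.git.rev_list('--objects', '--all').splitlines():
        sha = line.split()[0]          # lines are "<sha> [<path>]"
        obj = repo.rev_parse(sha)      # resolved through the configured object database
        counts[obj.type] += 1
    return counts

if __name__ == '__main__':
    print(count_objects('.'))          # e.g. Counter({'blob': ..., 'tree': ..., 'commit': ...})
```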
@ali1234 The compilation issue might stem from an old rustc version. Rustc 1.26 is the one I use, and I manage it with Rustup. |
Here is the result on my system (OSX) - run right after clone
|
And here is the result of invoking the script you provided on my system:
My conclusion is that the default object database implementation is correct, and that the GitDB implementation should probably be deprecated permanently. Also we might see that |
I get different numbers from git-count and my script. git-count:
and for my script:
|
It's odd that the numbers don't match, as this was consistent for me. |
6m32.922s for the release build. |
If |
Same git version here, on Ubuntu 18.04. |
I realized that the only way to fix GitPython is not to use it, or at least stick with the proven cgit implementation. Here is the first result:
Please note that the comparison is probably unfair, as the program just does a lookup by offset, not by SHA1, which is pretty much as efficient as it gets. I am also declaring this issue as 'wont-fix' as the only 'working' git database is the |
I can't understand how to compile that, but I got different results from git-count and git itself, and I don't even know which one of them is correct. |
You can compile it with …. I think it would be good if the problem you are trying to solve were stated as well; that might make it easier to determine which algorithm is preferable. Generally I would think it's best to build the count on top of an algorithm you trust, and the one using |
I think in the past week I have spent more time explaining to people what I am doing than writing code. :) The problem: given a list of remote repositories and an arbitrary tree which does not exist in any of them, calculate the list of parent commits that will produce the merge commit with the smallest possible diff when committing the tree with commit-tree. The algorithm:
The list of remote repositories: #765 (comment) Now you are probably going to ask "why not just do X?" (everyone else did). The answer is speed. This algorithm takes around 8 hours and 10GB of RAM to run. There are 54,000 blobs in the input tree and 900,000 commits in the upstream repos. One algorithm that has been suggested is:
This is not practical, because walking the tree of an arbitrary kernel commit takes approximately 2 seconds. That means this algorithm will take approximately 3 weeks to complete. Another method which has been suggested:
This is slightly better but still much worse than my approach. So now that you hopefully fully understand what I am doing and why, let's look at the actual problem: In step 1b I wrote "for each blob and commit in the repo..." and in 1c I wrote "for each commit in the repo..." For speed, I do this with a single loop which just looks at every object in the repo. And the problem I have is that the various tools do not agree on the number of objects in the repo. |
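As a point of comparison for that single loop (a sketch of mine, not from the thread): git itself can stream every object in the object database in one pass via cat-file. Note that --batch-all-objects enumerates everything in the object database, including unreachable objects, while rev-list --objects --all only sees what is reachable from refs, which is one way the counts can legitimately disagree:

```python
import subprocess
from collections import Counter

def count_all_objects(repo_path):
    # `--batch-check --batch-all-objects` prints "<sha> <type> <size>" for every
    # object in the ODB, reachable or not (Git 2.6+).
    proc = subprocess.Popen(
        ["git", "-C", repo_path, "cat-file", "--batch-check", "--batch-all-objects"],
        stdout=subprocess.PIPE, text=True)
    counts = Counter()
    for line in proc.stdout:
        sha, objtype, size = line.split()
        counts[objtype] += 1
    proc.wait()
    return counts

print(count_all_objects("."))
```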
Thanks so much for sharing, @ali1234! Also, this problem hits a weak spot of mine, as for some reason I really like to make things suitably fast.

Looking at the algorithm, there is the part that builds the lookup table (…). If I were to implement grit far enough to generate the lookup table, do you think that would already help you? For that I think what I should do first is to write a version of the program that uses …. What do you think?

Regarding your question, which I understand as "do I actually get to see all commits": I think it all depends on whether this git command does what you need. The |
Yes, exactly. It's for when OEMs release tarballs with no history. I like that phrase, "historically-suitable", it captures what I am trying to do very well.
Yes. If the tree (or a superset of it) exists in the remotes, the algorithm should output one commit.
I already do this using pickle. Building the backrefs takes about 12 minutes with 8 "threads". Single thread version takes about 70 minutes. Loading it back in from disk takes about 90 seconds. It is 2GB on disk and 6GB in RAM. :)
If you can make a way to fetch the list of every commit containing a given blob by its SHA1, taking less than 1 second per lookup, and I can call it from Python, then I am interested. If you want to rewrite the whole thing in Rust then go ahead - but I have no knowledge of Rust programming at all. Regarding …: I don't think I want orphaned objects, because presumably they do not have a useful history, and the whole point is to reconstruct history... |
PS thanks for taking an interest. Most people were just like "well, good luck with that" after I spent 15 minutes explaining the problem to them :) I also met one other person trying to solve the same problem, with a different OEM tarball, but they took a totally different approach: https://framagit.org/GNUtoo/reconstruct-git-history |
Perfect - this is a great starting point - it can create the lookup table in memory, especially since it is not taking too long given the data you kindly provided. My plan is to
The alternative program you linked to (I just had a very brief look) seems not suitable for huge repositories, but I may be wrong. |
For |
Intermediate results are in, and it becomes clear that I will have to look at how the LUT in python is really working. Right now even after processing about 5000 commits of linux-stable, the program is at a whopping 7GB real memory. |
Are you building a map of commit -> list of objects it contains? That won't work... the result will be 900,000 lists of ~54,000 objects. You need a minimum of 3 bytes to uniquely identify all the binshas, so the result will be at an absolute minimum 150GB of data, not including data structure overhead.

What I do is invert the graph... so where in git a tree has a list of subtrees and blobs, in backrefs every tree and blob has a list of the trees it appears in, and top-level trees also have the binsha of the commit(s) they belong to. This is still much larger than the forward version that git stores, and trying to flatten it makes it grow exponentially.

I still had to use some tricks to make it fit in memory. If tree A is in tree B, then A does not store the binsha of B in its list. Instead it stores a direct reference to the list representing B, which is another list of lists... and so on. The only data that gets stored is the binshas of commits; everything else is lists of lists. This saved about 16GB of RAM, although curiously it had almost no effect on the on-disk size of the pickled data. Another trick: every binsha seen by the program is stored in a dict. Whenever it sees a new one, it checks the dict and returns a reference to the existing string instead of allocating new memory for the same value. This saved 2GB. The dict is thrown away once the backrefs table has been built.

This seems quite slow too... after 24 minutes it has indexed 10,000 commits (using 18GB), so the total expected run time would be about 1.5 days (using 1.6TB of RAM). I think this is because you are walking the commits, and then walking the tree of each commit. This is very slow, as you are looking at the same trees over and over. But it is possible (actually very likely) that I do not understand this Rust code at all. :)
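For readers following along, a toy dict-of-sets version of that inverted graph (my own simplification using GitPython, not the actual backrefs code; it skips the list-of-lists and binsha-interning tricks described above, so it will not reach the same memory numbers):

```python
from collections import defaultdict
import git

def build_backrefs(repo_path):
    repo = git.Repo(repo_path)
    parents = defaultdict(set)   # child binsha -> binshas of trees/commits it appears in
    walked = set()               # tree binshas whose children are already recorded
    # '--all' is passed straight through to `git rev-list`
    for commit in repo.iter_commits('--all'):
        root = commit.tree
        parents[root.binsha].add(commit.binsha)   # root trees also know their commit(s)
        stack = [root]
        while stack:
            tree = stack.pop()
            if tree.binsha in walked:
                continue          # identical trees are shared between commits
            walked.add(tree.binsha)
            for child in tree.trees + tree.blobs:
                parents[child.binsha].add(tree.binsha)
                if child.type == 'tree':
                    stack.append(child)
    return parents
```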
Thanks for your help! I couldn't sleep, so I got up and fixed the memory issue by throwing in a bitvec. Just pushed a version which will probably not have that memory issue! From what I see, for the 150k blobs on master, it should use just about 3GB of RAM (it now scales by amount of blobs only). |
If you are storing a bitvec/bitarray/bitmap of length |
It's great to hear you got some additional speedups - not that it wasn't a challenge already :D!
Memory consumption was 13.4GB after less than half of the expected blobs, which wouldn't fit! |
Couldn't let it go - now there is a rather crude mapping of vertices (trees, commits, or blobs) to their parents, all identified via OIDs.
Max observed memory consumption was 8GB, so it fits already. |
The current version should be usable for you - it does all the work in 6 minutes and has incredibly quick lookups afterwards. Memory usage, now slightly compacted, is a mere 8GB (6.2GB real). |
Ok, that's it :)! The current version implements a better compaction step which brings the overall memory consumption down to 5GB with a peak of the prior 8GB, even though it also increases the time it takes until ready by 3m40s. The next step I want to take is to multi-thread both compaction as well as lookup table generation. Thanks to Rust, this will be easy to do and guarantees that I won't run into race conditions. Please let me know what you think :D. |
The last features... now you
I am very curious how this affects your benchmark, and if it makes sense to walk the extra mile and implement the code which creates the most fitting merge-commit, too. |
I've been implementing speed-ups and multiprocessing code all day. My Python now builds the LUT in 6 minutes (including 90 seconds to write the cache) vs 8 minutes for your Rust. However, I just realised that actually all we need to do is:
This takes two minutes to run and is naturally multithreaded by nature of being a shell pipeline. It is just a matter of parsing the firehose quickly, because it contains exactly the minimum amount of data you can pull from the repo and nothing else. No need to implement multithreading inside the program. If written directly to disk it is 13GB. |
Wow! That's great to hear! Who said that a little competition hurt anyone ;)! And just because I couldn't let you claim that 6min of multi-core python beat my 8 minutes of single-core Rust, please do try again with the latest version :) - it implements multi-threading for the LUT generation (without workstealing, yet, which I leave for another day).
The above yielded the following:
4 minutes :)! |
I just let it run on a CoreI7 laptop with 4 cores, and it goes 2min with compaction (and 7GB of RAM) and 1:30min without compaction (but 12GB of RAM). |
54000 look ups completed in 42 minutes.
|
Thanks again! I never got to the point where I could actually test the lookup speed! 0.05 seconds per lookup sounds a bit slow - I wonder if inefficient writing to stdout could be the problem. |
It's actually very fast compared to my python. Of course more speed is always good. :) |
:D! It's always nice to hear that! |
For me it ran in 57min (without compaction) and |
Not sure what you did but it is much slower now - like 100x slower. (Head currently at 55d9886). |
Oh, darn, a regression - no wonder, given the way I currently test performance. Unfortunately I couldn't find the given hash. |
Ok, it's fixed. Problem was that the commit output buffer was not cleared in one particular case, thus piling up commits and increasing the time it took to output data. |
The definitive version has landed: https://github.com/Byron/git-reconstruct . Please note that it drops the …. By now I have the problem that when I leave this project in an unsatisfying state, I have serious trouble thinking about anything else (or doing the work I am paid for :D). It's great to work under pressure sometimes.
Improvements can be made with unchecked array access and multithreading. |
Here are the final results. Multithreading for graph generation was removed later on due to memory constraints, and generating the bitmaps is what took most of the time for me. Do you think that's a decent time?
PS: I found the numbers quite impressive - after all it is seeing 7.5 billion commits in the process. |
That's a very impressive time. Today I found an optimization. I don't know if you are already doing this. If a node has only one parent edge, then you can replace it in all children with its parent at no cost. This Python code is called once for each node in the whole graph:
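(The snippet itself was not preserved in this copy of the thread; what follows is a hedged sketch of the idea using plain dict-of-sets maps, not the actual gitxref data structures.)

```python
def collapse_single_parent(node, parents, children_of):
    """parents: vertex -> set of parent vertices (the backrefs direction).
    children_of: vertex -> set of vertices that currently list it as a parent."""
    if len(parents.get(node, ())) != 1:
        return
    (only_parent,) = parents[node]
    for child in list(children_of.get(node, ())):
        # Replace `node` with its sole parent in each child's parent list.
        # Reachability is unchanged: everything reachable through `node`
        # is reachable through `only_parent`, and no new edges are added.
        parents[child].discard(node)
        parents[child].add(only_parent)
        children_of[only_parent].add(child)
    children_of[node] = set()
```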
It eliminates about 350 million edges, reducing memory use and halving the lookup time. Total run time for the python version is now at about 3 hours. |
I am happy to hear that! Right now I don't really have a comparison and rely solely on your judgement! Great work there with the optimization - 3h is impressive too! I have tried to implement a multi-pass topology optimization as you suggested, but... there seems to be a problem with the output and so far I only see 3 million edges removed for the first pass.
What you see as well is caching, and it can now dump its graph to disk in an lz4 compressed binary format. It's ... fast, I didn't really see it happen. The resulting file weighs only 800MB on disk. Will keep you posted, there is some time left today to fix this. |
Turns out I was able (and allowed) to forget calling …. Judging by these times, it seems that either my way of simplifying the topology of the graph isn't really working, or the changes don't make that much of a difference in my case. Maybe the code tells you something (please note that my data structure is not a hash-based tree). Memory consumption by now goes up to 17GB (!! half way !!) for me, as I have to trade memory for performance - using hashmaps for commit-to-bits lookups is too expensive. Right now it seems you are outputting human-readable text. Could you show me an example? Or maybe tell me what you would be going for? |
The output format is not particularly important. All it needs to do is print the set of commits that should be used in the merge commit. I made the commit manually. For the datastructure stuff, remember the repository is a directed acyclic graph, not a tree. Sources are commits, sinks are blobs, everything else is a tree. Backrefs is the reversed graph, so blobs become sources and commits become sinks. I use dict/hash to find nodes by SHA1 but they are not internal to the graph. Each vertex is simply a list of lists, except for commits which are just a binsha. The dict containing tree vertices is discarded at the end, as only the blob vertices need to be found by binsha. While writing this post it occurred to me that, in the language of DAGs, the algorithm would be described like this:
There is probably a much better algorithm for computing the transitive closure than walking from every source vertex. |
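One standard improvement on a DAG (an assumption about what "better" could look like, not what either tool actually does): visit vertices in topological order of the reversed graph and union the already-computed sets of their parents, so every edge is combined exactly once instead of re-walking from each source. The dict-of-sets form below would of course run into exactly the memory issues discussed above; in practice the sets would be the bitmaps mentioned later.

```python
# Hedged sketch (Python 3.9+ for graphlib). `parents` maps each vertex to the
# vertices it points at in the reversed (backrefs) graph; `commit_shas` is the
# set of sink vertices (commits).
from graphlib import TopologicalSorter

def reachable_commits(parents, commit_shas):
    reach = {}
    # static_order() yields every vertex after all of its parents, so each
    # parent's result is already available when we need it
    for v in TopologicalSorter(parents).static_order():
        found = set()
        for p in parents.get(v, ()):
            if p in commit_shas:
                found.add(p)
            else:
                found |= reach[p]
        reach[v] = found
    return reach   # reach[blob] = every commit whose tree contains that blob
```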
Our data-structures are actually similar! Even though yours seems to store a little less - I for example keep everything I see during traversal, and thus can lookup everything by sha. The reverse is also true, and the SHA can be obtained for every index. That might be an opportunity to safe a few hundred megabytes, but won't speed up the operation.
Even though the graph lookup algorithm itself might be suboptimal, I have a feeling that any improvement will likely cost more memory than there is. But I am open to be corrected on that one :D! As to be more of a feature-guide for |
So amazing that fearless concurrency really is a thing in Rust - I just quickly put in concurrency for lookups, which are the most expensive part for me, and here is the timing:
12min 30s! And with slight thread overallocation it even goes down to 12min :). |
6m25s with cached graph, numpy, and 1 thread:
Building the cache takes 10 minutes, also single threaded. The external git pipe line is doing most of the work. I am not entirely convinced it is still producing valid results but assuming it is, using this algorithm in rust should allow you to get below 1 minute pretty easily I think. |
Assuming correctness, your excursion to graph theory soooo paid off! I am absolutely blown away 🤩! Also, I won't be able to sleep anymore until I have beaten that time 😉. Great show of how optimizations usually go: by far the most powerful tool is the choice of algorithm, everything else comes afterwards. I see it is now using 14.3GB of memory, is that up or down for you? Admittedly, for that speedup I am happily paying a few gig more 😅.
On Fri 15. Jun 2018 at 14:11, Alistair Buxton wrote:
6m25s with cached graph, numpy, and 1 thread:
Topological sort: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 53741/53741 [00:46<00:00, 1152.03 sources/s]
Making bitmaps: 100%|██████████████████████████████████████████████████████████████████████████████████████| 1775132/1775132 [04:29<00:00, 6586.07 vertices/s]
Command being timed: "python3 -m gitxref /home/al/gemini/kernel2/upstream/ /home/al/gemini/kernel2/kernel-3.18/"
User time (seconds): 380.67
System time (seconds): 4.70
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 6:25.43
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 14130952
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 3610107
Voluntary context switches: 199
Involuntary context switches: 2021
Swaps: 0
File system inputs: 0
File system outputs: 8
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
|
Memory use is way up because intermediate bitmaps are cached in the new algorithm. backrefs.py is dead, replaced by graph.py. It works like this:
Another side effect of this algorithm is you can partition the search. You could in theory do it with a step size of 1 and that would be the same as the old way. Or you can split it in half. This reduces the memory required. The code handles this, and theoretically the partitions can be done concurrently if you adapt the way temporary bitmaps are generated. In order to do it all in one go it is necessary to reduce the graph. This is the part I am not sure is correct. Without reduction I have 350 million edges, and generating the bitmaps all at once would need 32GB of RAM. With reduction it fits in about 8GB. |
I've reimplemented the full output and it looks like the results are reasonable. One problem is that there are usually several commits that match the same number of blobs and the ordering is not stable. Numpy even sped up the last part of the algorithm a lot - it is much faster at bitwise operations on long arrays than the bitarray library. |
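To illustrate the kind of operation where numpy helps here (a micro-sketch of mine, not gitxref code): membership bitmaps packed eight commits to a byte can be OR-ed and popcounted as whole arrays in a few vectorized passes, which is where the reported speed-up over the bitarray-based version comes from.

```python
import numpy as np

n_commits = 900_000                        # roughly the size mentioned earlier in the thread
packed_len = (n_commits + 7) // 8
# two packed membership bitmaps: bit i set => commit i contains the blob
a = np.random.randint(0, 256, size=packed_len, dtype=np.uint8)
b = np.random.randint(0, 256, size=packed_len, dtype=np.uint8)

union = np.bitwise_or(a, b)                # single vectorized pass over the arrays
per_commit = np.unpackbits(union)[:n_commits]   # back to one 0/1 flag per commit
matching = int(per_commit.sum())           # popcount: commits containing either blob
```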
git rev-list --objects --all, fetch each object with name_to_object, and count each type. Result:
So:
Git thinks there are 6,990,030 objects in the database.
GitDB thinks there are 8,268,978.
GitCmdObjectDB thinks there are 3,839.
Git thinks there are 4,178,263 trees in the database.
Both GitDB and GitCmdObjectDB think there are 568,851.