Rewrite Git history to prune large 'old files'? #38

Closed
pwaller opened this issue Nov 20, 2018 · 18 comments
@pwaller
Member

pwaller commented Nov 20, 2018

Update summary, 23/11/2018: This repository currently requires a download of ~10 MiB, which isn't ideal considering the source is only a few hundred kilobytes. @mewmew and I propose to shrink it to ~800 KiB, to give a faster "Go install" experience for anyone using the repository.

The reason for the blowup is that there were some large test cases (including sqlite) measuring in the tens of MiB, and various other parsing-related files were also quite large. Those have now moved into other repositories in the llir organization, so there is no need to download them if you just want to import llir.


Original issue text.

I just saw @mewmew's comment in ec48d54 but thought it would be easier to have a separate issue for discussion - the commit itself is very long so if I commented on the commit the discussion would be way down at the bottom!

First, can I clarify the question - are you asking how to remove lots of old large assets from the history of the repository?

If that is the question, the answer is yes, you can do it, but it requires rewriting history, so anyone who has cloned the repository needs to know about it, otherwise they might get into a mess. At least, that's the best I know. See GitHub's guidance on the issue.
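
To illustrate what "knowing about it" could mean for someone with an existing clone, here is a minimal sketch, assuming plain Git. The helper name `resync_after_rewrite` is made up for this example, and it deliberately throws away local work on the branch, so it is only appropriate for clones without local changes.

```shell
# Hypothetical helper for an existing clone after upstream history was rewritten:
# fetch the rewritten branch and move the local branch onto it.
# WARNING: this discards local commits and uncommitted changes on that branch.
resync_after_rewrite() {
    remote=$1   # e.g. origin
    branch=$2   # e.g. master
    git fetch --quiet "$remote" &&
        git checkout --quiet "$branch" &&
        git reset --quiet --hard "$remote/$branch"
}

# Example, run inside the clone:
#   resync_after_rewrite origin master
```

Anyone with local commits would instead need to rebase them onto the rewritten branch, which is the "mess" referred to above.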

@mewmew
Member

mewmew commented Nov 20, 2018

Thanks for creating the issue, I agree, it's easier to keep the discussion here.

> First, can I clarify the question - are you asking how to remove lots of old large assets from the history of the repository?

Exactly!

On the other hand, I did a git clone just now to check the size of the repo, and it wasn't as bad as I had thought. Perhaps we don't need to do this after all.

[u@x1 ~]$ time git clone https://github.com/llir/llvm
Cloning into 'llvm'...
remote: Enumerating objects: 20, done.
remote: Counting objects: 100% (20/20), done.
remote: Compressing objects: 100% (14/14), done.
remote: Total 10760 (delta 10), reused 15 (delta 6), pack-reused 10740
Receiving objects: 100% (10760/10760), 9.03 MiB | 1.47 MiB/s, done.
Resolving deltas: 100% (6437/6437), done.

real	0m7.715s
user	0m1.810s
sys	0m0.521s

[u@x1 ~]$ du -hs llvm
11M	llvm

I think we could prune the repo down to 1 MB or so instead of 11 MB, but the question is if it's worth it, given that it requires a force push.

However, should we decide to do this, then having it ready before the v0.3.0 release seems like a perfect time.

@mewmew mewmew added this to the v0.3 milestone Nov 20, 2018
@pwaller
Member Author

pwaller commented Nov 21, 2018

I like the idea in principle, but I think this repository has enough followers that it may cause harm to drop the old content.

In practice, the repository is quite small even at 11MiB.

If you decide to do it, you could keep a copy of the old repository around at llir/llvm-legacy, analogous to https://github.com/go-gl-legacy/gl. That way, if anyone does need the old content (e.g., they were depending on a specific git hash), it still exists for the purpose of figuring out what the git hash is in the new repository.

Maybe it's possible to keep the old history around in a separate git ref which doesn't get cloned by default. But in that case I guess the content would be harder to discover.
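
A rough sketch of the separate-ref idea, assuming plain Git and a made-up namespace `refs/legacy/` (both the namespace and the helper `park_legacy_ref` are illustrative, not something the project adopted). A default clone fetches branches and tags only, so history reachable solely from another namespace is not downloaded unless someone asks for it:

```shell
# Hypothetical helper: record the current tip of a branch under refs/legacy/,
# a namespace that a default `git clone` does not fetch.
park_legacy_ref() {
    branch=$1   # e.g. master
    git update-ref "refs/legacy/$branch" "$(git rev-parse --verify "$branch")"
}

# Publishing and later retrieving it would look like this (remote name assumed):
#   git push origin refs/legacy/master
#   git fetch origin refs/legacy/master:refs/legacy/master
```

The discoverability concern still applies: nothing in a normal clone hints that the extra ref exists.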

@mewmew
Member

mewmew commented Nov 21, 2018

> I like the idea in principle, but I think this repository has enough followers that it may cause harm to drop the old content.
>
> In practice, the repository is quite small even at 11MiB.

Agreed. Had the repo been at 100 MB, then we probably would have done it, but at this size it does not seem worth the potential harm to users. (The idea with shrinking the repo was of course to make it easier for users to make the initial download, especially those who happen to be on a slow Internet connection, as may be common in parts of Asia, etc).

So, for now, I'm fine with keeping it as it is, and just being careful when adding large content in the future. Closing this issue for now. We can always refer back and re-open at a later point.

@mewmew mewmew closed this as completed Nov 21, 2018
@mewmew
Member

mewmew commented Nov 21, 2018

> Maybe it's possible to keep the old history around in a separate git ref which doesn't get cloned by default. But in that case I guess the content would be harder to discover.

Also, if Go ever does shallow Git clone, this issue would be resolved I think. (upstream issue golang/go#13078)
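
For reference, this is what a shallow clone looks like with plain Git, independent of what the Go tooling does. It is only a sketch: `shallow_clone` is a hypothetical wrapper, and how much it saves depends on how much of the download is history rather than current files.

```shell
# Hypothetical wrapper: fetch only the most recent commit of the default branch.
shallow_clone() {
    url=$1
    dir=$2
    git clone --quiet --depth 1 "$url" "$dir"
}

# Example (needs network access):
#   shallow_clone https://github.com/llir/llvm llvm
```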

@pwaller
Member Author

pwaller commented Nov 22, 2018

I just learned that Go did this to their repository recently, the discussion in there and how they went about it is pretty interesting:

golang/go#28899

I think it probably doesn't change anything with respect to what we might do to this repository.

@mewmew
Member

mewmew commented Nov 22, 2018

> I just learned that Go did this to their repository recently, the discussion in there and how they went about it is pretty interesting:

Thanks for the link! It was an interesting read to see how they resolved it.

> I think it probably doesn't change anything with respect to what we might do to this repository.

Most likely not. If we end up pruning, I'd suggest we use BFG, as recommended on the GitHub page you linked. Also, if we do this, it should probably happen in the next few weeks, as the intention is to release v0.3.0 some time in early December.

I'm still a bit on the fence. I don't think we need the rewrite. However, should we ever do one, now is basically the perfect time, as we move from v0.2 to v0.3, since users will have to make manual changes to get the latest release anyway (updating to the latest API, etc.).

@mewmew
Member

mewmew commented Nov 22, 2018

Until we decide for sure, I'll re-open the issue. This may also help get input from other users of the repo who may be affected. I'll also rename the title to include a mention of the Git history rewrite.

@mewmew mewmew reopened this Nov 22, 2018
@mewmew mewmew changed the title from "Removal of 'old files'" to "Rewrite Git history to prune large 'old files'?" Nov 22, 2018
@pwaller
Member Author

pwaller commented Nov 23, 2018

Some large paths:

 git rev-list --objects --all | git cat-file --batch-check='%(objectsize:disk) %(objectname) %(objecttype) %(rest)' | grep ' blob ' | awk '{print $4" "$1}' | awk '{
    arr[$1]+=$2
   }
   END {
     for (key in arr) printf("%s\t%s\n", arr[key], key)
   }' | sort -nr | awk '{print $2"\t"$1}' | column -t -s$'\t' | head
old/asm/internal/testdata/sqlite/sqlite3.ll                                 3404085
old/asm/internal/testdata/sqlite/sqlite3.c                                  1726782
asm/internal/parser/actiontable.go                                          472010
old/asm/internal/parser/actiontable.go                                      246534
asm/internal/parser/gototable.go                                            149186
old/asm/internal/parser/gototable.go                                        67735
asm/testdata/DebugInfo/COFF/big-type.ll                                     61648
asm/internal/ll.bnf                                                         59883
asm/ll/ll.tm                                                                59610
asm/testdata/c4.ll                                                          55555

This graph shows how much space will be saved, assuming you eliminate large file paths:

[image: graph of the space saved as large file paths are eliminated]
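
The one-liner at the top of this comment is dense, so here is its aggregation step on its own, as a sketch. It assumes input lines of the form `<size> <sha> <type> <path>`, which is what the `git cat-file --batch-check` format above emits. The function name `sum_blob_sizes` is made up for illustration, the output puts the size first rather than the path, and, like the original, it assumes paths without spaces.

```shell
# Sum on-disk blob sizes per path and print the largest paths first.
# Input (stdin): "<size> <sha> <type> <path>" lines; non-blob lines are ignored.
sum_blob_sizes() {
    awk '$3 == "blob" { total[$4] += $1 }
         END { for (path in total) printf "%d\t%s\n", total[path], path }' |
        sort -nr
}

# Intended use inside a repository:
#   git rev-list --objects --all |
#     git cat-file --batch-check='%(objectsize:disk) %(objectname) %(objecttype) %(rest)' |
#     sum_blob_sizes | head
```

Summing per path is what makes a file that changed many times (such as a generated parser table) show up with its total cost across history, not just the size of one revision.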

@sbinet

sbinet commented Nov 23, 2018

@pwaller why matplotlib? gonum/plot is so much better :P

@pwaller
Member Author

pwaller commented Nov 23, 2018

kill_ids.csv

@mewmew
Member

mewmew commented Nov 23, 2018

The current intention is to clone llir/llvm into llir/llvm-legacy, to preserve the complete history. Then, to start clean, we will keep any file currently in HEAD, and its entire history at that path. Since we need to do a force push anyway, this seems to be the time to really get the size of the repo down.

If anyone currently using the repo has some input or feedback, feel welcome to contribute your thoughts.

@pwaller
Member Author

pwaller commented Nov 23, 2018

@mewmew and I propose to run the following:

$ du --apparent-size -sch .git
9.5M	.git
9.5M	total

# Kill objects at and before v0.2.1
git rev-list --objects v0.2.1 | awk '{print $1}' > killset.txt

# Kill unwanted objects - testdata, textmapper and other experimental code.
git rev-list --objects --all | git cat-file --batch-check='%(objectname) %(rest)' | egrep '(/testdata/| l/|\.tm$)' | awk '{print $1}' >> killset.txt

java -jar ~/Downloads/bfg-1.13.0.jar -bi killset.txt 

git repack -a && git reflog expire --expire=now --all && git gc --prune=now --aggressive

$ du --apparent-size -sch .git
800K	.git
800K	total
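
To make the second step more concrete, here is a sketch of what its pattern selects, pulled out into a function (`select_unwanted` is a hypothetical name, and `grep -E` stands in for `egrep`). The input is `<sha> <path>` lines as produced by the `git cat-file --batch-check` call above. Because the alternative ` l/` starts with a space, it matches paths that begin with a top-level `l/` directory, not every directory whose name ends in "l".

```shell
# Print the object IDs whose path contains /testdata/, starts with the
# top-level l/ directory, or ends in .tm. Input (stdin): "<sha> <path>" lines.
select_unwanted() {
    grep -E '(/testdata/| l/|\.tm$)' | awk '{ print $1 }'
}

# Intended use inside a repository:
#   git rev-list --objects --all |
#     git cat-file --batch-check='%(objectname) %(rest)' |
#     select_unwanted >> killset.txt
```

Running the filter without the final redirect first is a cheap way to review which objects would be deleted before handing the list to BFG.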

@pwaller
Member Author

pwaller commented Nov 23, 2018

See https://github.com/llir/llvm-clean for the new repository. The intent is to force push the HEAD of that repository into llir/llvm at some point (or to redo the above commands against this repository assuming development continues here for now).

@pwaller
Member Author

pwaller commented Nov 24, 2018

https://github.com/reedkotler/scala-llc doesn't seem to contain any Go code?

@mewmew
Member

mewmew commented Nov 25, 2018

> https://github.com/reedkotler/scala-llc doesn't seem to contain any Go code?

Oh, the code match was from the BNF https://github.com/reedkotler/scala-llc/blob/ff3578b14171a5332e1c7f972c0c40b32f7a9e4c/ll.bnf#L187

<< import (
   "github.com/llir/llvm/asm/internal/ast"
   "github.com/llir/llvm/asm/internal/astx"
) >>

We can remove it from the list.

@mewmew
Member

mewmew commented Nov 30, 2018

I'd like to trim the llir/llvm repo size today, using the approach outlined by @pwaller in #38 (comment). Essentially, the earlier we do this the better, so we can keep Git history intact going forward.

@mewmew
Member

mewmew commented Nov 30, 2018

On the 30th of November we pruned the repository using BFG to reduce its initial download size. The following commands were run at the old revision d3f412d.

$ du --apparent-size -sch .git
9.6M	.git
9.6M	total

# Kill objects at and before v0.2.1
git rev-list --objects 7a17b32c1767cfeb5287d164e92865adb98985c8 | awk '{print $1}' > killset.txt

# Kill unwanted objects - testdata, textmapper and other experimental code.
git rev-list --objects --all | git cat-file --batch-check='%(objectname) %(rest)' | egrep '(/testdata/| l/|\.tm$)' | awk '{print $1}' >> killset.txt

bfg -bi killset.txt 

git repack -a && git reflog expire --expire=now --all && git gc --prune=now --aggressive

$ du --apparent-size -sch .git
934K	.git
934K	total
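
As a quick check of the numbers above, a sketch that computes the relative saving. `percent_saved` is an illustrative helper, and it assumes `du -h` reports binary units, so 9.6M is taken as 9.6 × 1024 KiB.

```shell
# Percentage saved when going from `before` to `after` (same unit for both).
percent_saved() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

# 9.6 MiB is 9830.4 KiB, compared against 934 KiB after pruning.
percent_saved 9830.4 934   # prints 90.5
```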

@mewmew mewmew closed this as completed Nov 30, 2018

3 participants