repo size #538

Closed
sbfnk opened this issue Feb 5, 2024 · 12 comments
@sbfnk
Contributor

sbfnk commented Feb 5, 2024

The repo has grown fairly large (~1 GB), but the files currently in the repo are only 11 MB in size. It might be nice, particularly for those on low-bandwidth connections or paying by volume, to look at reducing the size without losing any relevant development history.

Using git filter-repo --analyze reveals a few potential easy gains:

> cat filter-repo/analysis/directories-deleted-sizes.txt 
=== Deleted directories by reverse size ===
Format: unpacked size, packed size, date deleted, directory name
     6810551    3891751 2020-07-22 docs
     6341588    3835465 2020-07-22 docs/reference
     4357982    3715367 2020-07-22 docs/reference/figures
     3724958    1913831 2022-12-19 deps/bootstrap-5.1.3
     3118580    1835087 2022-12-19 deps/bootstrap-5.1.3/fonts
    23619299     747946 2023-02-03 src
     9092536     729112 2023-01-17 inst/pkg-structure
       53863       8984 2023-02-02 .devcontainer
       35864       5688 2023-02-02 .devcontainer/library-scripts
       26185       1104 2020-07-22 docs/news
           0         90 2022-10-15 tests/testthat/test-data
> head -n 10 filter-repo/analysis/path-deleted-sizes.txt 
=== Deleted paths by reverse accumulated size ===
Format: unpacked size, packed size, date deleted, path name(s)
    65775542   64539999 2020-11-30 synthetic.rds
     5907877    5236611 2020-11-08 man/figures/unnamed-chunk-11-1.png
     5207701    4641809 2021-06-03 reference/figures/unnamed-chunk-11-1.png
     3976705    3530160 2020-11-08 man/figures/unnamed-chunk-12-1.png
     3776311    3362607 2020-07-22 docs/reference/figures/unnamed-chunk-11-1.png
     3552356    3153533 2021-06-03 reference/figures/unnamed-chunk-12-1.png
     3143380    3080722 2023-10-03 data/example_regional_epinow.rda
     1619219    1504604 2021-06-03 reference/epinow-5.png

At the very least this suggests to me that all the directories above, as well as all png files in man/figures (which, if I understand correctly, aren't used anywhere) could be purged. A line to exclude png files in man/figures could also be added to .gitignore. This could be followed by a deeper investigation of blob sizes for existing files.
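As a starting point for the deeper investigation of blob sizes mentioned above, here is a minimal sketch using only core git commands. The function name `largest_blobs` is made up for illustration. It reports uncompressed sizes, so the numbers will not match the packed sizes in the filter-repo report.

```shell
#!/bin/sh
# List the N largest blobs reachable from any ref, largest first.
# Output: "<size in bytes> <path>". Sizes are uncompressed object sizes.
# (Hypothetical helper, not part of git or git filter-repo.)
largest_blobs() {
  n="${1:-10}"
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk '/^blob / { print substr($0, 6) }' |
    sort -k1,1nr |
    head -n "$n"
}
```

Running `largest_blobs 20` inside a clone lists candidates regardless of whether the file still exists in the working tree. For the .gitignore suggestion, a pattern such as `man/figures/*.png` would presumably do, keeping in mind that .gitignore only affects files that are not already tracked.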

@seabbs
Contributor

seabbs commented Feb 6, 2024

Yes, definitely agree, certainly for the main culprits (docs, deps, and src). Agree we could remove prior figures from the old README as well.

@sbfnk
Contributor Author

sbfnk commented Feb 7, 2024

Running

> git filter-repo \
  --path src/ \
  --path deps/ \
  --path dev/ \
  --path reference/ \
  --path synthetic.rds \
  --path data/example_regional_epinow.rda \
  --path data/example_estimate_infections.rda \
  --path-regex 'man/figures/unnamed-chunk-[0-9]+-1\.png' \
  --path-regex 'inst/dev/figs/.*scores\.png' \
  --invert-paths

reduces the size of the repo from 1.1GB to 34MB. Any objections to going ahead with it? I could create a backup fork in my personal account first.
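For anyone wanting to check a before/after figure like this on their own clone, one possible way is to sum the loose and packed object sizes that `git count-objects -v` reports (both in KiB). This is only a sketch; the helper name `repo_size_kib` is an assumption, and the thread does not say how the sizes above were measured.

```shell
#!/bin/sh
# Print the size of the object database of the current repository in KiB,
# i.e. loose objects ("size") plus packfiles ("size-pack").
# (Hypothetical helper for illustration.)
repo_size_kib() {
  git count-objects -v |
    awk '$1 == "size:" || $1 == "size-pack:" { total += $2 }
         END { print total + 0 }'
}
```

Calling `repo_size_kib` before and after the rewrite (ideally in fresh clones, so that unreachable objects are not counted) gives comparable numbers.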

Given that this would require a force push, anyone who has the repo checked out locally will have to do a git reset at some point. I don't think there's a way around this; the alternative is to keep things as they are. On balance I think it's worth it, but if anyone disagrees please leave a comment.
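The reset mentioned here could look roughly like the following sketch. The function name, the remote name `origin` and the branch name `main` are assumptions. Note that it throws away local commits and uncommitted changes on that branch, and that the old objects stay in the local object store until they are garbage collected.

```shell
#!/bin/sh
# Make a local branch match the rewritten branch on the remote.
# WARNING: discards local commits and uncommitted changes on that branch.
# (Hypothetical helper; remote and branch names are assumptions.)
sync_to_rewritten_remote() {
  remote="${1:-origin}"
  branch="${2:-main}"
  git fetch -q --prune "$remote" &&
    git checkout -q "$branch" &&
    git reset -q --hard "$remote/$branch"
}
```

Usage would be `sync_to_rewritten_remote origin main`, run once per local branch that also exists on the remote.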

@jamesmbaazam
Contributor

I'm not sure of the cons, so I'd say go ahead. It's good that you're keeping a backup just in case.

@Bisaloo
Member

Bisaloo commented Feb 7, 2024

I agree this is necessary but highlighting some important caveats we discovered with @ntorresd when going through the same process with serofoi:

  • This is going to automatically close all open pull requests because they won't have any shared commits with main (at least for a brief moment in time). They can be opened as new PRs once you have force pushed all branches, but ongoing conversations may be interrupted / you will have to start a new thread.
  • All contributors will have to not just reset but delete their entire clone and reclone from scratch.
  • You may not see the effects of the operation immediately, as GitHub repacks repos on a schedule.
  • It is important to keep somewhere (e.g., this issue) a map of the old <-> new commit hashes so that past references to commits can still be resolved in the future.
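To illustrate the last point, here is a small sketch for resolving an old commit hash through such a map. It assumes the layout that git filter-repo writes to `.git/filter-repo/commit-map`: one header line, then one `old new` pair of hashes per line. The function name `new_hash_for` is made up for illustration.

```shell
#!/bin/sh
# Print the new hash for an old (possibly abbreviated) commit hash.
# Usage: new_hash_for OLD_HASH [COMMIT_MAP_FILE]
# Assumes a header line followed by "old new" pairs, one per line.
# (Hypothetical helper, not part of git filter-repo.)
new_hash_for() {
  old="$1"
  map="${2:-.git/filter-repo/commit-map}"
  # index($1, old) == 1 is true when the first column starts with "old".
  awk -v old="$old" 'NR > 1 && index($1, old) == 1 { print $2 }' "$map"
}
```

If the map is posted in an issue, saving it to a file and passing that file as the second argument works the same way. An ambiguous abbreviation prints every match.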

@ntorresd, did I forget anything?

@ntorresd

ntorresd commented Feb 7, 2024

I would only add that you will not see the effects of the clean-up until the clean versions of the git tags have been pushed. When we did this with @Bisaloo for serofoi, we didn't see the change reflected in fresh copies of the repository until we ran git push origin v0.0.9 -f from my local cleaned copy.

@Bisaloo
Member

Bisaloo commented Feb 7, 2024

Thanks, I had forgotten about the tags.

I wonder about the impact of all of this on renv.lock lockfiles since they store a hash of the source 🤔

@sbfnk
Contributor Author

sbfnk commented Feb 8, 2024

Thanks all for the helpful comments. To confirm I will:

  • create a fork of this repo and rename to EpiNow2-backup
  • run the git filter-repo command as above
  • post the contents of filter-repo/commit-map
  • git push --tags --force
  • reopen all PRs

which should address all the points raised above, unless I've forgotten something.
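The push step in this plan might be sketched as below, pushing branches and tags in two separate commands. The function name and the URL argument are placeholders. The remote is re-added because git filter-repo removes the `origin` remote after rewriting.

```shell
#!/bin/sh
# Overwrite all branches and tags on the remote with the rewritten history.
# Usage: force_push_rewritten REMOTE_URL
# (Hypothetical helper; REMOTE_URL is a placeholder for the real repo URL.)
force_push_rewritten() {
  url="$1"
  # git filter-repo removes "origin", so add it back if it is missing.
  git remote get-url origin >/dev/null 2>&1 || git remote add origin "$url"
  git push --force --all origin &&
    git push --force --tags origin
}
```

Making the backup fork before running this matters, since the force push cannot be undone from the remote side.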

@Bisaloo
Member

Bisaloo commented Feb 8, 2024

Yes, this seems right.

To be 100% clear because a previous version of my message wasn't: from what we've seen in serofoi, I don't think you'll be able to reopen closed PRs. You will have to create new ones. No issues from a git point of view, but conversation will be spread across two PRs.

@sbfnk
Contributor Author

sbfnk commented Feb 8, 2024

Ah ok, probably worth waiting for the currently open ones to be merged then.

@seabbs seabbs mentioned this issue Feb 20, 2024
@sbfnk sbfnk added this to the CRAN v1.5 release milestone Apr 30, 2024
@sbfnk sbfnk self-assigned this Apr 30, 2024
@sbfnk
Contributor Author

sbfnk commented Apr 30, 2024

To do before 1.5 release

@sbfnk
Contributor Author

sbfnk commented May 3, 2024

I've done the steps outlined above and the force push succeeded. Old refs are still there and PRs are still open though, so I'm not sure if I'm missing a step or if it's a matter of waiting for repacking. (See next comment.)

@sbfnk
Contributor Author

sbfnk commented May 3, 2024

Upon closer inspection, the vast majority of the repo content was in the gh-pages branch, so I've done a big squash there, which has reduced the size to manageable levels (1.1 GB -> 100 MB).
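The comment does not say how the squash was done. One way to collapse a branch into a single commit with the same contents is sketched below, assuming a made-up helper name `squash_branch`.

```shell
#!/bin/sh
# Replace the history of a branch with one parentless commit that has
# exactly the same file contents as the current branch tip.
# (Hypothetical helper; not necessarily the method used in this thread.)
squash_branch() {
  branch="${1:-gh-pages}"
  tree=$(git rev-parse "$branch^{tree}") || return 1
  new=$(git commit-tree "$tree" -m "Squash $branch history") || return 1
  git update-ref "refs/heads/$branch" "$new"
}
```

After `squash_branch gh-pages`, a `git push --force origin gh-pages` would publish the single commit. The old commits only stop taking up space once nothing references them and the repository has been repacked.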
