-
Notifications
You must be signed in to change notification settings - Fork 240
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a pre-commit hook to reject large files #2699
Add a pre-commit hook to reject large files #2699
Conversation
Signed-off-by: Gera Shegalov <gera@apache.org>
Developer's use case: $ fallocate -l 51M tools/src/test/resources/bigfile.txt
$ git add tools/src/test/resources/bigfile.txt
$ git commit -m test
Reject files over 50M....................................................Failed
- hook id: check-added-large-files
- exit code: 1
tools/src/test/resources/bigfile.txt (52224 KB) exceeds 51200 KB. Even if developers don't enable pre-commit checks and accidentally push PRs with large files $ pre-commit run check-added-large-files --all-files
Reject files over 50M....................................................Failed
- hook id: check-added-large-files
- exit code: 1
tools/src/test/resources/bigfile.txt (52224 KB) exceeds 51200 KB. |
This is a nice change, my only question is about the
Does it mean if I have a file that I didn't run |
I really dislike the 50 MiB limit. We should not be checking anything in that big. I would much rather see a 1.5 MiB limit, and allow the files we already have in place to stay. |
@revans2 I agree with you. The number comes from the ticket and the current state of the discussion. My main point is to demonstrate how to reuse the same check on the dev machines and CI. We can converge on the size limit here. The default in this hook is 500KiB. 1.5MiB sounds also reasonable to me. |
@abellina no this is only for the use CI use-case where the hook is run outside git with --all-files flag. If you run just However, if we converge on @revans2 's proposal to lower the limit to 1.5MiB for new files without checking existing files I need to make sure that we touch only some safe suffix of the git log: e,g ccd6018 breaks: $ pre-commit run check-added-large-files --all-files --from-ref ccd6018ff2ee194313389d375eaedcf3388e9226 --to-ref HEAD
Check for file over 1.5MiB...............................................Failed
- hook id: check-added-large-files
- exit code: 1
tools/src/test/resources/spark-events-qualification/join_missing_sql_end (1583 KB) exceeds 1500 KB.
tools/src/test/resources/spark-events-profiling/rapids_join_eventlog2 (2167 KB) exceeds 1500 KB. but starting with 2f977dc $ pre-commit run check-added-large-files --all-files --from-ref 2f977dc7367958a6aec0f0bed279e31f683dbe80 --to-ref HEAD
Check for file over 1.5MiB...............................................Passed is fine. |
sounds good, I'd argue for a limit that allows our current repo to pass CI without any extra sha restrictions (something like 2.5MB). |
@abellina unfortunately, 2.5MiB won't do it per @revans2's comment #2674 (comment) $ git ls-files | xargs ls -alhS | head -3
-rw-rw-r-- 1 gshegalov gshegalov 44M Jun 9 11:00 tools/src/test/resources/spark-events-qualification/nds_q86_fail_test
-rw-rw-r-- 1 gshegalov gshegalov 44M Jun 7 17:09 tools/src/test/resources/spark-events-qualification/nds_q86_test
-rw-rw-r-- 1 gshegalov gshegalov 4.6M Jun 7 17:09 tools/src/test/resources/spark-events-qualification/dsAndDf_eventlog |
What files are that big?
It is all of the files that were just added in as a part of the tools tests. Everything before that was small, and I would argue we should not have allowed the tools files in. We should have flagged it early on and someone should have looked at it and thought do we really need a 44M MiB file checked into the repo. |
those files can be compressed as well or we can decide to move them out |
That does not fix the problem. Once they are in the repo unless we do surgery to the repo, like we just did, they will be downloaded every time, because the full history is downloaded for each clone. The point of this patch is not to fix what was already done. The point of this is to stop it from ever happening again in the future and making the problem worse |
Never mind (re 2.5MB). I was assuming the output from an earlier comment was all inclusive. |
If the CI installs
we should be good. Developers should install pre-commit too which will add automatic copyright updates as a bonus |
build |
Prevents large files as in #2644.
Contributes to #2420, #2674
Signed-off-by: Gera Shegalov gera@apache.org