Store an MD5 hash of uploaded/indexed file and check before prepdocs #835

Merged
merged 6 commits into from
Oct 22, 2023
Changes from 5 commits
4 changes: 3 additions & 1 deletion .gitignore
@@ -144,4 +144,6 @@ cython_debug/
# NPM
npm-debug.log*
node_modules
static/
static/

data/*.md5
22 changes: 22 additions & 0 deletions scripts/prepdocs.py
@@ -1,6 +1,7 @@
import argparse
import base64
import glob
import hashlib
import html
import io
import os
@@ -515,6 +516,27 @@ def read_files(
                read_files(filename + "/*", use_vectors, vectors_batch_support)
                continue
            try:
                # if filename ends in .md5 skip
                if filename.endswith(".md5"):
                    print("Skipping md5 hash index.")
tonybaloney marked this conversation as resolved.
                    continue

                # if there is a file called .md5 in this directory, see if its updated
                stored_hash = None
                with open(filename, "rb") as file:
                    existing_hash = hashlib.md5(file.read()).hexdigest()

Review comment: This should be called something like new_hash, since here we are calculating the hash of a new file.

                if os.path.exists(filename + ".md5"):
                    with open(filename + ".md5", encoding="utf-8") as md5_f:
                        stored_hash = md5_f.read()

                if stored_hash and stored_hash.strip() == existing_hash.strip():
                    print(f"Skipping {filename}, no changes detected.")
                    continue
                else:
                    # Write the hash
                    with open(filename + ".md5", "w", encoding="utf-8") as md5_f:
                        md5_f.write(existing_hash)

                if not args.skipblobs:
                    upload_blobs(filename)
                page_map = get_document_text(filename)
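
The check added in this hunk can be read in isolation as a small "sidecar hash" helper. The following is a minimal sketch of that idea, not the code merged in this PR: the names `file_md5`, `needs_processing`, and `chunk_size` are hypothetical, the file is hashed in chunks rather than with a single `file.read()`, and the freshly computed value is called `new_hash` as the review comment suggests.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        # iter() with a sentinel stops when read() returns b"" at end of file.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_processing(path: Path) -> bool:
    """Return True if the file is new or changed since the last recorded hash.

    Side effect: when returning True, the current hash is written to a
    sidecar file named '<original name>.md5' next to the input file.
    """
    sidecar = path.with_name(path.name + ".md5")
    new_hash = file_md5(path)

    stored_hash = None
    if sidecar.exists():
        stored_hash = sidecar.read_text(encoding="utf-8").strip()

    if stored_hash == new_hash:
        return False  # unchanged since the last run

    sidecar.write_text(new_hash, encoding="utf-8")
    return True
```

A caller looping over files would first skip any name ending in `.md5`, then `continue` whenever `needs_processing` returns `False`. As in the diff above, the sketch records the hash before the file is uploaded or indexed, so if a later step fails, the next run would still treat the file as unchanged; writing the sidecar only after processing succeeds is one way to avoid that.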