-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Store an MD5 hash of uploaded/indexed file and check before prepdocs #835
Store an MD5 hash of uploaded/indexed file and check before prepdocs #835
Conversation
…nd time and they haven't changed
Yay! Also adding @mattmsft as a reviewer as we discussed the issue together about avoiding repetitive indexing, and also because Matt has prepdocs-related changes planned in the future. Looks like you were able to make this change fairly unobtrusively. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks good to me, we can add tests at a later point
commit 3bf2d19 Author: Pamela Fox <pamelafox@microsoft.com> Date: Thu Nov 2 09:10:15 2023 -0700 Fix list file (Azure-Samples#897) commit b3c55b0 Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Wed Nov 1 06:42:42 2023 -0700 Bump pypdf from 3.16.3 to 3.17.0 in /scripts (Azure-Samples#890) Bumps [pypdf](https://github.com/py-pdf/pypdf) from 3.16.3 to 3.17.0. - [Release notes](https://github.com/py-pdf/pypdf/releases) - [Changelog](https://github.com/py-pdf/pypdf/blob/main/CHANGELOG.md) - [Commits](py-pdf/pypdf@3.16.3...3.17.0) --- updated-dependencies: - dependency-name: pypdf dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Pamela Fox <pamelafox@microsoft.com> commit cfbb90b Author: Pamela Fox <pamelafox@microsoft.com> Date: Tue Oct 31 21:37:55 2023 -0700 Add more readmes/guides (Azure-Samples#889) * Add more readmes/guides * Add image * Diagram added commit 4479a2c Author: Pamela Fox <pamelafox@microsoft.com> Date: Mon Oct 30 21:38:26 2023 -0700 Handle errors better especialyl for streaming (Azure-Samples#884) commit 0d54f84 Author: Pamela Fox <pamelafox@microsoft.com> Date: Mon Oct 30 17:58:09 2023 -0700 Add exclude files (Azure-Samples#876) commit 3647826 Author: Roderic Bos <github@rooc.nl> Date: Mon Oct 30 17:35:50 2023 +0100 When using the option --storagekey for the prepdocs script the key might (Azure-Samples#866) contain `==` base64 padding at the end. This will fail to succesfully login because the script just removes the `=` signs during the split action. Copied the version from the app/start.ps1 which is better suited here. Co-authored-by: Pamela Fox <pamelafox@microsoft.com> commit 5daa934 Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon Oct 30 09:17:55 2023 -0700 Bump the github-actions group with 1 update (Azure-Samples#880) Bumps the github-actions group with 1 update: [actions/setup-node](https://github.com/actions/setup-node). - [Release notes](https://github.com/actions/setup-node/releases) - [Commits](actions/setup-node@v3...v4) --- updated-dependencies: - dependency-name: actions/setup-node dependency-type: direct:production update-type: version-update:semver-major dependency-group: github-actions ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> commit bfb3ee5 Author: Pamela Fox <pamelafox@microsoft.com> Date: Fri Oct 27 11:04:37 2023 -0700 Render entire PDFs instead of single pages (Azure-Samples#840) * Adding anchors * Show whole file * Show whole file * Page number support * More experiments with whole file * Revert unintentional changes * Add tests * Remove random num * Add retry_total=0 to avoid unnecessary network requests * Add comment to explain retry_total * Bring back deleted file * Blob manager refactor after merge * Update coverage amount * Make mypy happy with explicit check of path * Add debug for 3.9 * Skip in 3.9 since its silly * Reduce fail under percentage due to 3.9 commit c989048 Author: Pamela Fox <pamelafox@microsoft.com> Date: Thu Oct 26 20:01:18 2023 -0700 add screenshot (Azure-Samples#870) commit a64a12e Author: Matt <57731498+mattmsft@users.noreply.github.com> Date: Thu Oct 26 12:49:02 2023 -0700 Refactor prepdocs (Azure-Samples#862) * setting up types * setting up more types... * working on it... * prepdocs refactor * typing fixes; updating tests * more fixes; updating tests * fixing retry for embeddings * fixing adls gen2 list * more test fixes * fixes from manual runs * more fixes * more fixes * type fixes * more type fixes and test fixes * break into modules * openai embedding fix * novectors fix * fix id generation * doc strings * feedback from pr * rename feedback * trying to get imports to work * update test workflow with pamela's suggestion * fix ci again * delete __init__ * mypy configuration --------- Co-authored-by: Matt Gotteiner <magottei@microsoft.com> commit 94be632 Author: MaciejLitwiniec <MaciejLitwiniec@users.noreply.github.com> Date: Thu Oct 26 15:23:00 2023 +0200 Updated FAQ so that it reflect PR 835 (Azure-Samples#868) * Updated FAQ so that it reflect PR 835 * Update README.md * Update README.md --------- Co-authored-by: Pamela Fox <pamela.fox@gmail.com> commit d7bbf9f Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Wed Oct 25 14:23:05 2023 -0700 Bump werkzeug from 3.0.0 to 3.0.1 in /app/backend (Azure-Samples#863) Bumps [werkzeug](https://github.com/pallets/werkzeug) from 3.0.0 to 3.0.1. - [Release notes](https://github.com/pallets/werkzeug/releases) - [Changelog](https://github.com/pallets/werkzeug/blob/main/CHANGES.rst) - [Commits](pallets/werkzeug@3.0.0...3.0.1) --- updated-dependencies: - dependency-name: werkzeug dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Pamela Fox <pamelafox@microsoft.com> commit d02aa14 Author: Pamela Fox <pamelafox@microsoft.com> Date: Wed Oct 25 13:48:45 2023 -0700 Message builder improvements (Azure-Samples#852) commit 8f55988 Author: Pamela Fox <pamelafox@microsoft.com> Date: Tue Oct 24 19:53:07 2023 -0700 Reorder tags to optimize for sample browser (Azure-Samples#853) commit 16a61bf Author: Pamela Fox <pamelafox@microsoft.com> Date: Mon Oct 23 17:03:59 2023 -0700 Improve follow-up questions and pipe into context (Azure-Samples#832) * Add follow-up questions and parsing * Test breaking the e2e * Actually run tests * Fix runner * Add conditional * Fix the test * chat approach commit ca01af9 Author: Anthony Shaw <anthony.p.shaw@gmail.com> Date: Mon Oct 23 09:27:32 2023 +1100 Store an MD5 hash of uploaded/indexed file and check before prepdocs (Azure-Samples#835) * Hash the uploaded files locally and skip them if you provision a second time and they haven't changed * Overwrite the hash when it changes * Remove open mode parameter * fix f-string * reformat changes * Update prepdocs.py
…zure-Samples#835) * Hash the uploaded files locally and skip them if you provision a second time and they haven't changed * Overwrite the hash when it changes * Remove open mode parameter * fix f-string * reformat changes * Update prepdocs.py
# if there is a file called .md5 in this directory, see if its updated | ||
stored_hash = None | ||
with open(filename, "rb") as file: | ||
existing_hash = hashlib.md5(file.read()).hexdigest() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be called something like : new_hash. Here we are calculating a hash of a new file.
Purpose
Running prep docs on every provision is really time consuming, especially if the docs haven't changed.
This PR writes an
.md5
file with an MD5 hash of the files indata/
when they get uploaded, then every time the prep docs script is re-run that file is checked against the current hash and the file is skipped if it hasn't changed.I've added those md5 files to gitignore so they don't get committed and they also won't be uploaded.
Does this introduce a breaking change?
Pull Request Type
What kind of change does this Pull Request introduce?
How to Test
What to Check
Verify that the following are valid
Other Information