Fixed settings to unstructured to make largest chunks the proper size #1029

Merged
jamesrichards4 merged 1 commit into main from bugfix/ingest-chunk-sizes
Sep 10, 2024

Conversation

jamesrichards4
Contributor

Context

Our 'largest' chunk resolution had no chunking applied, so its chunks were very small. This PR returns them to the proper size.

Changes proposed in this pull request

Use a single configurable document loader for all chunk resolutions. This uses chunk_by_title and sets the min and max chunk sizes to ensure we get the configured sizes.
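
As a rough illustration of the "single configurable loader" idea described above, the sketch below keeps one settings object per chunk resolution and turns it into keyword arguments for `chunk_by_title`. This is not the code from this PR: `ChunkResolutionSettings`, its field names, and the numeric sizes are hypothetical placeholders, and the keyword names `combine_text_under_n_chars` and `max_characters` are assumed to match the installed version of `unstructured`, so check them before relying on this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkResolutionSettings:
    """Hypothetical per-resolution settings; names are illustrative only."""

    name: str
    min_chunk_size: int  # small sections are combined until they reach this size
    max_chunk_size: int  # hard upper bound on characters per chunk

    def __post_init__(self) -> None:
        # Guard against a configuration that can never be satisfied.
        if not 0 <= self.min_chunk_size <= self.max_chunk_size:
            raise ValueError("min_chunk_size must be between 0 and max_chunk_size")

    def chunk_by_title_kwargs(self) -> dict[str, int]:
        # Assumed mapping onto chunk_by_title's keyword arguments.
        return {
            "combine_text_under_n_chars": self.min_chunk_size,
            "max_characters": self.max_chunk_size,
        }


# Placeholder sizes, not the values configured in this repository.
NORMAL = ChunkResolutionSettings("normal", min_chunk_size=300, max_chunk_size=1_000)
LARGEST = ChunkResolutionSettings("largest", min_chunk_size=3_000, max_chunk_size=30_000)
```

With `unstructured` installed, the same loader code could then serve every resolution by calling something like `chunk_by_title(elements, **LARGEST.chunk_by_title_kwargs())`, so only the settings differ between resolutions.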

Guidance to review

We probably want to add a new env var for largest_min_chunk_size at some point, and improvements to the actual numbers are coming. This PR just addresses the regression, plus a refactor that makes sense with the new config.

Things to check

  • I have added any new ENV vars in all deployed environments
  • I have tested any code added or changed
  • I have run integration tests

…. Refactored the loaders, now that they're so similar, to make the whole setup simpler
Collaborator

@gecBurton gecBurton left a comment

LGTM

@gecBurton gecBurton added this to the 0.6.2 milestone Sep 10, 2024
@jamesrichards4 jamesrichards4 merged commit 1ca1e95 into main Sep 10, 2024
6 of 7 checks passed
@wpfl-dbt wpfl-dbt mentioned this pull request Sep 12, 2024
3 tasks
@gecBurton gecBurton deleted the bugfix/ingest-chunk-sizes branch October 29, 2024 15:49