Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: MarkdownChunker, retain subsection headers #323

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

alexchao
Copy link

@alexchao alexchao commented Nov 8, 2024

Update MarkdownChunker to retain level 2 and level 3 Markdown headers in the chunk content for better retrieval. Previous logic was only grabbing the top-level header.

Closes #322


The LangChain MarkdownHeaderTextSplitter produces Document objects that carry around the containing header content as metadata values. There are possibly a number of ways to use this header metadata to improve the retrieval of relevant chunks, but I'm not changing the overall approach; this change just ensures that level 2 and level 3 headers are retained when this header text is added to the chunk content.

⚠️ As others have noted: I tried to follow the contribution guidelines, but the dev branch seems to pull in some other commits, so this pull request is directed to main. Also, I see there are no automated tests for the project yet. I'm happy to write tests, but didn't want to add test code for which there's no automated pipeline, so I'll skip for now.

I've tested my change manually:

doc = Document(content="...")
await MarkdownChunker().chunk(config={}, documents=[doc])
print([chunk.content for chunk in doc.chunks])

... and through the Verba UI on a local deployment by importing the Markdown doc mentioned in the issue.

Update MarkdownChunker to retain level 2 and level 3 Markdown headers in
the chunk content for better retrieval.

Closes weaviate#322
@weaviate-git-bot
Copy link

To avoid any confusion in the future about your contribution to Weaviate, we work with a Contributor License Agreement. If you agree, you can simply add a comment to this PR that you agree with the CLA so that we can merge.

beep boop - the Weaviate bot 👋🤖

PS:
Are you already a member of the Weaviate Slack channel?

@alexchao
Copy link
Author

alexchao commented Nov 9, 2024

Agree with the CLA.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MarkdownChunker drops subsection headers
2 participants