fix: MarkdownChunker, retain subsection headers #323

alexchao · 2024-11-08T21:25:52Z

Update MarkdownChunker to retain level 2 and level 3 Markdown headers in the chunk content for better retrieval. The previous logic kept only the top-level header.

Closes #322

The LangChain MarkdownHeaderTextSplitter produces Document objects that carry their containing headers as metadata values. There may be several ways to use this header metadata to improve retrieval of relevant chunks, but I'm not changing the overall approach; this change just ensures that level 2 and level 3 headers are retained when the header text is added to the chunk content.
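To make the idea concrete, here is a minimal sketch of prepending header metadata to a chunk. It is not the code in this PR: `prepend_headers` and `HEADER_LEVELS` are hypothetical names, and the metadata keys `"Header 1"`, `"Header 2"`, and `"Header 3"` are assumed labels (with MarkdownHeaderTextSplitter the caller chooses these labels via `headers_to_split_on`).

```python
from typing import Mapping

# Assumed metadata keys paired with the Markdown prefix for each header level.
# The key names are illustrative; they depend on how the splitter is configured.
HEADER_LEVELS = (("Header 1", "#"), ("Header 2", "##"), ("Header 3", "###"))


def prepend_headers(content: str, metadata: Mapping[str, str]) -> str:
    """Rebuild the header lines recorded in `metadata` and place them before `content`.

    Hypothetical helper for illustration only. Levels missing from
    `metadata` are skipped, so a chunk under only a level 1 header
    gets a single header line.
    """
    header_lines = [
        f"{prefix} {metadata[key]}"
        for key, prefix in HEADER_LEVELS
        if key in metadata
    ]
    return "\n".join(header_lines + [content])


# Example: a chunk nested under three header levels.
chunk_text = prepend_headers(
    "Run the installer and follow the prompts.",
    {"Header 1": "Guide", "Header 2": "Setup", "Header 3": "Windows"},
)
print(chunk_text)
```

With all three levels in the chunk text, a query that mentions a subsection title can match the chunk even when the body text never repeats that title.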

⚠️ As others have noted: I tried to follow the contribution guidelines, but the dev branch seems to pull in some other commits, so this pull request is directed to main. Also, I see there are no automated tests for the project yet. I'm happy to write tests, but didn't want to add test code for which there's no automated pipeline, so I'll skip for now.

I've tested my change manually:

doc = Document(content="...")
await MarkdownChunker().chunk(config={}, documents=[doc])
print([chunk.content for chunk in doc.chunks])

... and through the Verba UI on a local deployment by importing the Markdown doc mentioned in the issue.


weaviate-git-bot · 2024-11-09T10:01:46Z

To avoid any confusion in the future about your contribution to Weaviate, we work with a Contributor License Agreement. If you agree, you can simply add a comment to this PR that you agree with the CLA so that we can merge.

beep boop - the Weaviate bot 👋🤖

PS:
Are you already a member of the Weaviate Slack channel?

alexchao · 2024-11-09T18:48:06Z

Agree with the CLA.

fix: MarkdownChunker, retain subsection headers

8f33d65

Update MarkdownChunker to retain level 2 and level 3 Markdown headers in the chunk content for better retrieval. Closes weaviate#322
