v0.13.0 #171
benbrandt
announced in
Announcements
v0.13.0
#171
Replies: 0 comments
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
What's New / Breaking Changes
Unicode Segmentation is now only used as a fallback. This prioritizes the semantic levels of each splitter, and only uses Unicode grapheme/word/sentence segmentation when none of the semantic levels can be split at the desired capacity.
In most cases, this won't change the behavior of the splitter, and will likely mean that speed will improve because it is able to skip several semantic levels at the start, acting as a bisect or binary search, and only go back to the lower levels if it can't fit.
However, for the
MarkdownSplitter
at very small sizes (i.e., less than 16 tokens), this may produce different output, becuase prior to this change, the splitter may have used Unicode sentence segmentation instead of the Markdown semantic levels, due to an optimization in the level selection. Now, the splitter will prioritize the parsed Markdown levels before it falls back to Unicode segmentation, which preserves better structure at small sizes.So, it is likely in most cases, this is a non-breaking update. However, if you were using extremely small chunk sizes for Markdown, the behavior is different, and I wanted to inidicate that with a major version bump
Full Changelog: v0.12.3...v0.13.0
This discussion was created from the release v0.13.0.
Beta Was this translation helpful? Give feedback.
All reactions