v0.14.0 #242
benbrandt announced in Announcements
Replies: 1 comment 1 reply
Hello Ben, sorry if my question sounds absurd, but can I use "jina-embeddings-v2-base-de" with your TextSplitter? Will it work? Best,
What's New
Performance fixes for large documents. The worst-case performance for certain documents was abysmal, causing splitting to run seemingly forever. This release makes sure that, even in the worst case, the splitter no longer binary searches over the entire document, as it did before. That search is prohibitively expensive, especially for the tokenizer implementations, and the search space should now always have a safe upper bound.
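To make the idea of a bounded search space concrete, here is a minimal sketch in Rust. It is not the library's implementation: it uses galloping (exponential) search, one common way to cap a binary search, and the names `size` and `largest_fitting_end` are hypothetical. The sketch assumes candidate split points are given as sorted byte offsets and that the measured size never shrinks as the prefix grows.

```rust
/// Hypothetical sizer: counts characters. A tokenizer-based sizer would be
/// far more expensive per call, which is why the probe count matters.
fn size(chunk: &str) -> usize {
    chunk.chars().count()
}

/// Returns the largest offset in `ends` whose prefix `&text[..end]` fits
/// within `capacity`, or `None` if even the first candidate is too large.
/// `ends` must be sorted ascending and lie on char boundaries.
fn largest_fitting_end(text: &str, ends: &[usize], capacity: usize) -> Option<usize> {
    let fits = |i: usize| size(&text[..ends[i]]) <= capacity;

    // Phase 1: gallop forward (probe 0, 1, 3, 7, ...) until a prefix no
    // longer fits. This caps the search window near the answer instead of
    // spanning the whole document.
    let mut lo = 0; // every index below `lo` is known to fit
    let mut hi = ends.len(); // every index at or above `hi` is known not to fit
    let mut probe = 0;
    let mut step = 1;
    while probe < ends.len() {
        if fits(probe) {
            lo = probe + 1;
            probe += step;
            step *= 2;
        } else {
            hi = probe;
            break;
        }
    }

    // Phase 2: ordinary binary search, restricted to the window [lo, hi).
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if fits(mid) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }

    // `lo` is now the number of candidates that fit.
    lo.checked_sub(1).map(|i| ends[i])
}
```

With the text `"aa bb cc dd ee"`, word-end offsets `[2, 5, 8, 11, 14]`, and a capacity of 6 characters, the first too-large prefix is found at index 3, so the binary search only inspects the window between indices 2 and 3 and returns `Some(5)`.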
For the "happy path", this new approach also led to big speed gains in the `CodeSplitter` (50%+ speed increase in some cases), marginal regressions in the `MarkdownSplitter`, and not much difference in the `TextSplitter`. But overall, the performance should be more consistent across documents, since it wasn't uncommon for a document with certain formatting to hit the worst-case scenario previously.
Breaking Changes
`MarkdownSplitter` at very small sizes, and any splitter using `RustTokenizers` because of its offset behavior.
Rust
`ChunkSize` has been removed. This was a holdover from a previous internal optimization, which turned out to not be very accurate anyway.
Implementing a `ChunkSizer` is now much easier, as you only need to return the size of the chunk as a `usize`. Tokenization implementations often had to do extra work to calculate the size as well, which is no longer necessary.
Before
After
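As a rough illustration of what "only return a `usize`" could look like for a custom sizer, here is a minimal sketch. The trait is redefined locally so the example compiles on its own. Its name mirrors the library's `ChunkSizer`, but the method name `size`, its exact signature, and the `WordCount` type are assumptions, so check the crate documentation before relying on them.

```rust
/// Local stand-in for the library's trait (assumed shape, not copied from
/// the crate): a sizer reports the size of a chunk as a plain `usize`.
trait ChunkSizer {
    fn size(&self, chunk: &str) -> usize;
}

/// Hypothetical sizer that measures chunks in whitespace-separated words.
struct WordCount;

impl ChunkSizer for WordCount {
    fn size(&self, chunk: &str) -> usize {
        // No capacity bookkeeping or offset tracking: just the count.
        chunk.split_whitespace().count()
    }
}
```

Under these assumptions, `WordCount.size("one two  three")` evaluates to 3, and the sizer needs no knowledge of the configured chunk capacity.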
Full Changelog: v0.13.3...v0.14.0
This discussion was created from the release v0.14.0.