v0.14.0 #242

benbrandt (Maintainer) · 2024-06-21T20:54:34Z

What's New

Performance fixes for large documents. The worst-case performance for certain documents was abysmal, and some documents effectively never finished splitting. This release ensures that, even in the worst case, the splitter no longer binary searches over the entire document, as it did before. Searching the whole document is prohibitively expensive, especially for the tokenizer implementations, so the search space now always has a safe upper bound.

For the "happy path", this new approach also led to big speed gains in the CodeSplitter (a 50%+ speed increase in some cases), marginal regressions in the MarkdownSplitter, and not much difference in the TextSplitter. Overall, performance should be more consistent across documents, since previously it wasn't uncommon for a document with certain formatting to hit the worst-case scenario.
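To illustrate the idea of bounding the search space, here is a minimal sketch of an expanding-window search. It is not the crate's actual implementation: the function name `largest_fitting` and the `fits` predicate are hypothetical, and the sketch assumes `fits` is monotone (true for small values, false once the candidate is too large).

```rust
/// Finds the largest `n` in `0..=len` for which `fits(n)` is true.
///
/// Assumption: `fits` is monotone, i.e. once it returns false for some `n`
/// it returns false for every larger `n`. `fits(0)` is never called and is
/// treated as true.
///
/// Instead of binary searching over `0..=len` right away, the window is
/// doubled until it contains the boundary. The number of `fits` calls then
/// depends on where the boundary is, not on how large `len` is.
fn largest_fitting(len: usize, mut fits: impl FnMut(usize) -> bool) -> usize {
    // Phase 1: grow the window by doubling `high` until it stops fitting
    // or runs past the end of the input.
    let mut low = 0; // largest value known to fit
    let mut high = 1; // next candidate to probe
    while high < len && fits(high) {
        low = high;
        high = high.saturating_mul(2);
    }
    let mut high = high.min(len);

    // Phase 2: ordinary binary search, restricted to the window [low, high].
    while low < high {
        // Take the upper midpoint so the loop always makes progress.
        let mid = low + (high - low + 1) / 2;
        if fits(mid) {
            low = mid;
        } else {
            high = mid - 1;
        }
    }
    low
}
```

With an expensive predicate, such as one that tokenizes the candidate text, the expanding window keeps the number of evaluations small whenever the boundary is near the start, even if `len` is very large.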

Breaking Changes

Chunk output may be slightly different because of the changes to the search optimizations. The previous optimization occasionally caused the splitter to stop too soon. In most cases, you will likely see no difference. The effect was most pronounced in the MarkdownSplitter at very small chunk sizes, and in any splitter using RustTokenizers because of its offset behavior.

Rust

ChunkSize has been removed. It was a holdover from a previous internal optimization, which turned out not to be very accurate anyway.
This makes implementing a custom ChunkSizer much easier, since you now only need to return the size of the chunk as a usize. Previously, tokenizer implementations often had to do extra work to calculate the size as well, which is no longer necessary.

Before

pub trait ChunkSizer {
    // Required method
    fn chunk_size(&self, chunk: &str, capacity: &ChunkCapacity) -> ChunkSize;
}

After

pub trait ChunkSizer {
    // Required method
    fn size(&self, chunk: &str) -> usize;
}
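
As an illustration of how small a custom sizer can be with the new signature, here is a minimal sketch. The trait is redeclared locally, copied from the "After" snippet, so that the example compiles on its own; in a real project it would be imported from the crate instead. The `WordCount` type is hypothetical and not part of the library.

```rust
// Local copy of the trait from the "After" snippet, so this sketch is
// self-contained. In real code, import the trait from the crate instead.
pub trait ChunkSizer {
    // Required method
    fn size(&self, chunk: &str) -> usize;
}

/// Hypothetical sizer that measures a chunk by its number of
/// whitespace-separated words.
pub struct WordCount;

impl ChunkSizer for WordCount {
    fn size(&self, chunk: &str) -> usize {
        // `split_whitespace` skips runs of whitespace, so repeated spaces
        // and leading or trailing whitespace do not produce empty "words".
        chunk.split_whitespace().count()
    }
}
```

The sizer only reports a single number for the chunk it is given; it no longer needs to know anything about the configured capacity.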

Optimization for SemanticSplitRange searching by @benbrandt in #219
Performance Optimization: Expanding binary search window by @benbrandt in #231

Full Changelog: v0.13.3...v0.14.0

This discussion was created from the release v0.14.0.

SyedAgha · 2024-07-03T18:16:31Z

Hello Ben,

Sorry if my question sounds absurd, but can I use "jina-embeddings-v2-base-de" with your TextSplitter? Will it work?

Best,
Agha

benbrandt Jul 5, 2024
Maintainer Author

Hi @SyedAgha, I think so. Your best bet is to load it via the HuggingFace tokenizers library, probably by passing the model name to from_pretrained in that library to instantiate the tokenizer.

Let me know if that works!
