Fix small chunk behavior (#84)
* Make fewer tests fail without features

* Fix for regression not being fixed for tiny chunk sizes

For very small chunk sizes (e.g. 5 tokens), the chunk size behavior wasn't completely restored to its pre-v0.5.0 state. Sizes of 10 or higher seemed to be unaffected, but smaller sizes had a higher chance of hitting this occasional bug. Although it is an edge case, the behavior is now fixed.

* Exclude tokenizers from packaging

* Readable tokenizer file
benbrandt authored Jan 20, 2024
1 parent d28f8c0 commit fb21920
Showing 6 changed files with 255,859 additions and 3 deletions.
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,11 @@
# Changelog

## v0.6.1

### Fixes

- Fix error in section filtering that didn't fix the chunk behavior regression from v0.5.0 in very tiny chunk capacities. For most commonly used chunk sizes, this shouldn't have been an issue.

## v0.6.0

### Breaking Changes
4 changes: 2 additions & 2 deletions Cargo.toml
@@ -1,14 +1,14 @@
[package]
name = "text-splitter"
-version = "0.6.0"
+version = "0.6.1"
authors = ["Ben Brandt <benjamin.j.brandt@gmail.com>"]
edition = "2021"
description = "Split text into semantic chunks, up to a desired chunk size. Supports calculating length by characters and tokens (when used with large language models)."
repository = "https://github.com/benbrandt/text-splitter"
license = "MIT"
keywords = ["text", "split", "tokenizer", "nlp", "ai"]
categories = ["text-processing"]
-exclude = ["/tests/snapshots/**", "/tests/inputs/**", "/bindings/**"]
+exclude = ["/tests/snapshots/**", "/tests/inputs/**", "/bindings/**", "/tests/tokenizers/**"]
rust-version = "1.65.0"

[package.metadata.docs.rs]
2 changes: 1 addition & 1 deletion src/lib.rs
@@ -787,7 +787,7 @@ where
// likely a meaningful breakpoint we want to preserve. We already know that the next highest doesn't fit anyway,
// so we should be safe to break once we reach it.
.take_while_inclusive(move |(offset, _)| {
-max_encoded_offset.map_or(true, |max| offset < &max)
+max_encoded_offset.map_or(true, |max| offset <= &max)
})
.filter(|(_, str)| !str.is_empty()),
)
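
The one-character change in this hunk is easier to see in isolation. The following is a minimal sketch, not the crate's actual code: `take_while_inclusive` here is a hand-rolled stand-in for the iterator adapter of the same name used in the diff, and `offsets_kept`, the `inclusive_bound` flag, and the sample offsets are all hypothetical. It assumes only what the hunk shows: items are taken while the predicate holds, plus the first item for which it fails.

```rust
/// Hand-rolled stand-in for an inclusive take-while: keeps items while
/// `pred` holds, plus the first item for which it fails.
fn take_while_inclusive<T: Copy>(items: &[T], pred: impl Fn(&T) -> bool) -> Vec<T> {
    let mut out = Vec::new();
    for item in items {
        out.push(*item);
        if !pred(item) {
            break;
        }
    }
    out
}

/// Hypothetical helper: returns the offsets that survive the filter.
/// `inclusive_bound = false` mimics the old `<` comparison,
/// `inclusive_bound = true` mimics the new `<=` comparison.
fn offsets_kept(
    sections: &[(usize, &str)],
    max_encoded_offset: Option<usize>,
    inclusive_bound: bool,
) -> Vec<usize> {
    take_while_inclusive(sections, |(offset, _)| {
        max_encoded_offset.map_or(true, |max| {
            if inclusive_bound {
                *offset <= max
            } else {
                *offset < max
            }
        })
    })
    .into_iter()
    .map(|(offset, _)| offset)
    .collect()
}
```

With made-up sections at offsets 0, 2, 7, and 13 and a maximum offset of 7, the `<` form stops at the section sitting exactly on the maximum (offsets 0, 2, 7), while the `<=` form also yields the next section past it (offsets 0, 2, 7, 13). When no maximum is known, both forms keep everything.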
13 changes: 13 additions & 0 deletions tests/text_splitter.rs
@@ -128,3 +128,16 @@ fn random_chunk_range() {
}
}
}

#[cfg(feature = "tokenizers")]
#[test]
fn huggingface_small_chunk_behavior() {
    let tokenizer =
        tokenizers::Tokenizer::from_file("./tests/tokenizers/huggingface.json").unwrap();
    let splitter = TextSplitter::new(tokenizer);

    let text = "notokenexistsforthisword";
    let chunks = splitter.chunks(text, 5).collect::<Vec<_>>();

    assert_eq!(chunks, ["notokenexistsforth", "isword"]);
}
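
The `#[cfg(feature = "tokenizers")]` attribute on this test is what lets the suite build when the optional feature is turned off: the gated item is simply left out of compilation. A minimal sketch of the same pattern follows, assuming a hypothetical `backend` function that is not part of the crate.

```rust
/// Compiled only when the crate is built with `--features tokenizers`.
/// `backend` is a hypothetical name used purely for illustration.
#[cfg(feature = "tokenizers")]
fn backend() -> &'static str {
    "tokenizers"
}

/// Fallback compiled when the feature is disabled, so callers still build.
#[cfg(not(feature = "tokenizers"))]
fn backend() -> &'static str {
    "characters only"
}
```

Exactly one of the two definitions exists in any given build, so code that calls `backend()` compiles either way. Tests and imports gated the same way disappear from builds that lack the feature instead of failing them.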
8 changes: 8 additions & 0 deletions tests/text_splitter_snapshots.rs
@@ -2,7 +2,9 @@ use std::fs;

use once_cell::sync::Lazy;
use text_splitter::{Characters, ChunkSizer, TextSplitter};
#[cfg(feature = "tiktoken-rs")]
use tiktoken_rs::{cl100k_base, CoreBPE};
#[cfg(feature = "tokenizers")]
use tokenizers::Tokenizer;

#[test]
@@ -75,9 +77,11 @@ fn characters_range_trim() {
});
}

#[cfg(feature = "tokenizers")]
static HUGGINGFACE_TOKENIZER: Lazy<Tokenizer> =
Lazy::new(|| Tokenizer::from_pretrained("bert-base-cased", None).unwrap());

#[cfg(feature = "tokenizers")]
#[test]
fn huggingface_default() {
insta::glob!("inputs/text/*.txt", |path| {
@@ -99,6 +103,7 @@ fn huggingface_default() {
});
}

#[cfg(feature = "tokenizers")]
#[test]
fn huggingface_trim() {
insta::glob!("inputs/text/*.txt", |path| {
@@ -119,8 +124,10 @@ fn huggingface_trim() {
});
}

#[cfg(feature = "tiktoken-rs")]
static TIKTOKEN_TOKENIZER: Lazy<CoreBPE> = Lazy::new(|| cl100k_base().unwrap());

#[cfg(feature = "tiktoken-rs")]
#[test]
fn tiktoken_default() {
insta::glob!("inputs/text/*.txt", |path| {
@@ -142,6 +149,7 @@ fn tiktoken_default() {
});
}

#[cfg(feature = "tiktoken-rs")]
#[test]
fn tiktoken_trim() {
insta::glob!("inputs/text/*.txt", |path| {