Python: Update to text-splitter 0.4.2 (#31)
benbrandt authored Jul 2, 2023
1 parent 2f5f718 commit fed9dde
Showing 4 changed files with 68 additions and 37 deletions.
6 changes: 6 additions & 0 deletions bindings/python/CHANGELOG.md
@@ -1,5 +1,11 @@
# Changelog

## v0.2.2

### What's New

- Update to v0.4.2 of `text-splitter` to support `tiktoken-rs@0.5.0`

## v0.2.1

### What's New
65 changes: 31 additions & 34 deletions bindings/python/Cargo.lock

Some generated files are not rendered by default.

6 changes: 3 additions & 3 deletions bindings/python/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "semantic-text-splitter"
-version = "0.2.1"
+version = "0.2.2"
 authors = ["Ben Brandt <benjamin.j.brandt@gmail.com>"]
 edition = "2021"
 description = "Split text into semantic chunks, up to a desired chunk size. Supports calculating length by characters and tokens (when used with large language models)."
@@ -15,8 +15,8 @@ crate-type = ["cdylib"]
 
 [dependencies]
 pyo3 = { version = "0.19.0", features = ["abi3-py37"] }
-text-splitter = { version = "0.4.1", features = ["tiktoken-rs", "tokenizers"] }
-tiktoken-rs = "0.4.2"
+text-splitter = { version = "0.4.2", features = ["tiktoken-rs", "tokenizers"] }
+tiktoken-rs = "0.5.0"
 tokenizers = { version = "0.13.3", default_features = false, features = [
     "onig",
 ] }
28 changes: 28 additions & 0 deletions bindings/python/README.md
@@ -21,6 +21,34 @@ splitter = CharacterTextSplitter(trim_chunks=False)
chunks = splitter.chunks("your document text", max_characters)
```

### With Huggingface Tokenizer

```python
from semantic_text_splitter import HuggingFaceTextSplitter
from tokenizers import Tokenizer

# Maximum number of tokens in a chunk
max_tokens = 1000
# Optionally, you can also have the splitter not trim whitespace for you
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
splitter = HuggingFaceTextSplitter(tokenizer, trim_chunks=False)

chunks = splitter.chunks("your document text", max_tokens)
```

### With Tiktoken Tokenizer

```python
from semantic_text_splitter import TiktokenTextSplitter

# Maximum number of tokens in a chunk
max_tokens = 1000
# Optionally, you can also have the splitter not trim whitespace for you
splitter = TiktokenTextSplitter("gpt-3.5-turbo", trim_chunks=False)

chunks = splitter.chunks("your document text", max_tokens)
```

### Using a Range for Chunk Capacity

You also have the option of specifying your chunk capacity as a range.
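
A minimal sketch of what that looks like with the character splitter, assuming the capacity argument of `chunks` also accepts a `(min, max)` tuple (the exact shape of the range argument in v0.2.2 is an assumption here, not confirmed by this diff):

```python
from semantic_text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter()

# Hypothetical range capacity: each chunk will be at least 200
# characters long and never more than 1000.
chunks = splitter.chunks("your document text", (200, 1000))
```

In the underlying `text-splitter` crate, a range capacity means a chunk is returned once its length falls within the range, so chunks land between the minimum and maximum rather than being packed strictly to the maximum.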
