Releases · benbrandt/text-splitter

What's New

New HuggingFaceTextSplitter, which allows for using Hugging Face's tokenizers package to count chunks by tokens with a tokenizer of your choice.

from semantic_text_splitter import HuggingFaceTextSplitter
from tokenizers import Tokenizer

# Maximum number of tokens in a chunk
max_characters = 1000
# Optionally can also have the splitter not trim whitespace for you
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
splitter = HuggingFaceTextSplitter(tokenizer, trim_chunks=False)

chunks = splitter.chunks("your document text", max_characters)

Breaking Changes

trim_chunks now defaults to True instead of False. For most use cases, this is the desired behavior, especially with chunk ranges.

Full Changelog: python-v0.1.4...python-v0.2.0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

What's Changed

Contributors

What's Changed

Contributors

What's Changed

Contributors

What's New

Breaking Changes

What's Changed

Contributors

What's Changed

Contributors

Releases: benbrandt/text-splitter

v0.4.2

What's Changed

Contributors

Python v0.2.2

What's Changed

Contributors

Python v0.2.1 - OpenAI Tiktoken Support

What's Changed

Contributors

Python: v0.2.0 - Hugging Face Tokenizer support

What's New

Breaking Changes

v0.4.1 - Remove unneeded `tokenizers` features

What's Changed

Contributors

Python: v0.1.4 - Fifth time is the charm?

Python: v0.1.3 - New package name

Python: v0.1.2 - Fix bad release

Python: v0.1.1 - Fix bad release

Python: v0.1.0 - Initial Python Binding Release

What's Changed

Contributors