bytepiece

Implementation of Su's bytepiece.

Bytepiece is a new tokenize method, which uses UTF-8 Byte as unigram to process text. It needs little preprocessing, more pure and language independent.

Bindings

Rust
Python

Quick Example using Python

from rs_bytepiece import Tokenizer

tokenizer = Tokenizer()
output = tokenizer.encode("今天天气不错")
print(output)
# [40496, 45268, 39432]

Name		Name	Last commit message	Last commit date
Latest commit History 34 Commits
bindings/python		bindings/python
bytepiece_rs		bytepiece_rs
experiment		experiment
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

bytepiece

Bindings

Quick Example using Python

About

Releases

Packages

Languages

hscspring/bytepiece-rs

Folders and files

Latest commit

History

Repository files navigation

bytepiece

Bindings

Quick Example using Python

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages