Implementation of Su's bytepiece.
Bytepiece is a new tokenization method that uses UTF-8 bytes as the basic units of a unigram model for processing text. It needs little preprocessing, which makes it purer and more language-independent.
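To make the byte-level view concrete, here is a minimal sketch in plain Python. It does not use rs_bytepiece; the helper name `utf8_byte_units` is hypothetical and only illustrates the assumption that the raw UTF-8 bytes of a string are the base units the tokenizer starts from, whatever the language of the text.

```python
def utf8_byte_units(text: str) -> list[int]:
    """Return the UTF-8 byte values of `text` (hypothetical helper, not part of rs_bytepiece)."""
    return list(text.encode("utf-8"))


text = "今天天气不错"
units = utf8_byte_units(text)

# Each of these CJK characters occupies three bytes in UTF-8.
print(len(text), "characters ->", len(units), "bytes")

# The byte sequence is lossless: decoding it recovers the original string.
print(bytes(units).decode("utf-8") == text)
```

A tokenizer built on these units then groups runs of bytes into larger pieces, which is why the same sentence can come out as only a handful of token IDs.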
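from rs_bytepiece import Tokenizer

# Create a tokenizer with its default settings.
tokenizer = Tokenizer()

# Encode a Chinese sentence ("The weather is nice today") into token IDs.
output = tokenizer.encode("今天天气不错")
print(output)
# [40496, 45268, 39432]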