adding tokenizer configuration (#19)
### Description
This PR makes the tokenizer configurable (#5): the tokenizer backend and,
where applicable, the model can now be selected per call.

### Todo
- [x] Allow selection of a Hugging Face tokenizer; the model is downloaded
from the Hub.
- [x] Jieba tokenizer (Chinese)
- [x] tiktoken
- [x] tiniestsegmenter (Japanese) [optional]
- [x] Allow switching between HF tokenizers
- [x] Add tests for tokenizing

### Notes
~~1. Currently, if a HuggingFace tokenizer is initialized, the tokenizer
cannot be changed. e.g. Doing `SELECT tokenize('i have an apple', 'hf',
'sentence-transformers/LaBSE');` after `SELECT tokenize('i have an
apple', 'hf', 'google-t5/t5-base');` would just use the LaBSE tokenizer
as it has already been initialized. Need a better way to handle this.~~

~~2. There is no official Rust crate for tiktoken.~~ Used
[tiktoken-rs](https://github.com/zurawiki/tiktoken-rs).
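
One way to address the limitation in note 1 is to cache one loaded tokenizer per model name instead of keeping a single global instance. The sketch below is an illustration under assumptions, not necessarily what this PR implements: `TokenizerCache`, `get_or_load`, and the generic `T` (a stand-in for the real tokenizer type) are hypothetical names.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical per-model cache: at most one loaded tokenizer per model name.
/// `T` stands in for whatever tokenizer type the extension actually uses.
pub struct TokenizerCache<T> {
    loaded: Mutex<HashMap<String, Arc<T>>>,
}

impl<T> TokenizerCache<T> {
    pub fn new() -> Self {
        Self {
            loaded: Mutex::new(HashMap::new()),
        }
    }

    /// Return the tokenizer for `model`, calling `load` only the first time
    /// that model name is requested. Later calls reuse the cached instance.
    pub fn get_or_load<E>(
        &self,
        model: &str,
        load: impl FnOnce(&str) -> Result<T, E>,
    ) -> Result<Arc<T>, E> {
        let mut loaded = self.loaded.lock().unwrap();
        if let Some(tok) = loaded.get(model) {
            return Ok(Arc::clone(tok));
        }
        // A failed load is not cached, so it can be retried later.
        let tok = Arc::new(load(model)?);
        loaded.insert(model.to_string(), Arc::clone(&tok));
        Ok(tok)
    }
}
```

Because the cache is keyed by model name, requesting `google-t5/t5-base` and then `sentence-transformers/LaBSE` yields two distinct tokenizers instead of reusing whichever was initialized first.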

### Usage
```SQL
SELECT tokenize('i have an apple', 'hf', 'google-bert/bert-base-uncased');
SELECT tokenize('i have an apple', 'hf', 'google-t5/t5-base');
SELECT tokenize('i have an apple', 'tiktoken', 'gpt2');
SELECT tokenize('i have an apple', 'tiniestsegmenter', '');
SELECT tokenize('i have an apple', 'jieba', '');
SELECT tokenize('i have an apple', 'ws', '');
```
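
The calls above take a tokenizer name and a model argument. A minimal sketch of how the two arguments could be mapped to a backend follows. The enum, the function names, the reading of `ws` as whitespace splitting, and the rule that `hf` and `tiktoken` reject an empty model argument are all assumptions made for illustration, not the extension's actual code.

```rust
/// Hypothetical representation of the tokenizer selected by the second and
/// third arguments of `tokenize(text, name, model)`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TokenizerKind {
    HuggingFace { model: String },
    Tiktoken { model: String },
    TiniestSegmenter,
    Jieba,
    Whitespace,
}

/// Map the tokenizer name and model argument to a `TokenizerKind`.
/// Assumption: `hf` and `tiktoken` need a non-empty model argument, while
/// the other tokenizers ignore it.
pub fn parse_tokenizer(name: &str, model: &str) -> Result<TokenizerKind, String> {
    match name {
        "hf" if !model.is_empty() => Ok(TokenizerKind::HuggingFace {
            model: model.to_string(),
        }),
        "tiktoken" if !model.is_empty() => Ok(TokenizerKind::Tiktoken {
            model: model.to_string(),
        }),
        "hf" | "tiktoken" => Err(format!("tokenizer '{}' requires a model argument", name)),
        "tiniestsegmenter" => Ok(TokenizerKind::TiniestSegmenter),
        "jieba" => Ok(TokenizerKind::Jieba),
        "ws" => Ok(TokenizerKind::Whitespace),
        other => Err(format!("unknown tokenizer '{}'", other)),
    }
}

/// Whitespace tokenizer: split on runs of Unicode whitespace.
/// Assumption: this is what the `ws` option does.
pub fn tokenize_ws(text: &str) -> Vec<String> {
    text.split_whitespace().map(str::to_string).collect()
}
```

Parsing into an enum up front keeps validation of the name and model argument in one place, so each backend only ever sees arguments it can use.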

---------

Signed-off-by: jwnz <tkjones93@gmail.com>
jwnz authored Oct 8, 2024
1 parent 0c3b777 commit 9a7fcae
Showing 5 changed files with 1,158 additions and 135 deletions.