adding tokenizer configuration (#19) · tensorchord/pg_bestmatch.rs@9a7fcae


adding tokenizer configuration (#19)

### Description
This PR makes the tokenizer configurable, so the tokenizer backend and
model can be selected per call (#5).

### Todo
- [x] Allow selecting a Hugging Face tokenizer. The model is downloaded
from the hub.
- [x] Jieba tokenizer (Chinese)
- [x] tiktoken
- [x] tiniestsegmenter (Japanese) [optional]
- [x] Allow switching between HF tokenizers
- [x] Add tests for tokenization

### Notes
~~1. Currently, if a HuggingFace tokenizer is initialized, the tokenizer
cannot be changed. e.g. Doing `SELECT tokenize('i have an apple', 'hf',
'sentence-transformers/LaBSE');` after `SELECT tokenize('i have an
apple', 'hf', 'google-t5/t5-base');` would just use the LaBSE tokenizer
as it has already been initialized. Need a better way to handle this.~~

~~2. There is no official rust crate for tiktoken.~~ Used
[tiktoken-rs](https://github.com/zurawiki/tiktoken-rs)
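
One possible way to handle the problem in note 1 is to cache loaded tokenizers by model name instead of keeping a single global instance. The sketch below is illustrative only and is not taken from this commit: `LoadedTokenizer`, `load_tokenizer`, and `get_tokenizer` are hypothetical names, and the loader is a stub standing in for the real hub download.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

/// Hypothetical stand-in for a loaded Hugging Face tokenizer.
#[derive(Debug)]
pub struct LoadedTokenizer {
    pub model: String,
}

/// Hypothetical loader stub; a real one would fetch the tokenizer from the hub.
fn load_tokenizer(model: &str) -> LoadedTokenizer {
    LoadedTokenizer { model: model.to_string() }
}

/// Process-wide cache keyed by model name, so requesting a second model
/// does not silently reuse whichever tokenizer was initialized first.
static CACHE: OnceLock<Mutex<HashMap<String, Arc<LoadedTokenizer>>>> = OnceLock::new();

/// Returns the cached tokenizer for `model`, loading it on first use.
pub fn get_tokenizer(model: &str) -> Arc<LoadedTokenizer> {
    let cache = CACHE.get_or_init(|| Mutex::new(HashMap::new()));
    let mut map = cache.lock().unwrap();
    map.entry(model.to_string())
        .or_insert_with(|| Arc::new(load_tokenizer(model)))
        .clone()
}
```

With this shape, each distinct model name is loaded once and later calls with the same name share the same instance.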

### Usage
```SQL
SELECT tokenize('i have an apple', 'hf', 'google-bert/bert-base-uncased');
SELECT tokenize('i have an apple', 'hf', 'google-t5/t5-base');
SELECT tokenize('i have an apple', 'tiktoken', 'gpt2');
SELECT tokenize('i have an apple', 'tiniestsegmenter', '');
SELECT tokenize('i have an apple', 'jieba', '');
SELECT tokenize('i have an apple', 'ws', '');
```
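
The calls above take a tokenizer name and an optional model argument. A minimal Rust sketch of how such arguments could be mapped to a tokenizer kind follows. It is an assumption-laden illustration, not the code in this commit: `TokenizerKind`, `parse_kind`, and `tokenize_ws` are hypothetical names, `'ws'` is assumed to mean whitespace splitting, and the requirement that `'hf'` and `'tiktoken'` receive a non-empty model is also an assumption.

```rust
/// Tokenizer kinds named in the usage examples (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum TokenizerKind {
    HuggingFace(String), // model id on the hub
    Tiktoken(String),    // model or encoding name
    TiniestSegmenter,
    Jieba,
    Whitespace,
}

/// Maps the second and third arguments of `tokenize` to a tokenizer kind.
pub fn parse_kind(name: &str, model: &str) -> Result<TokenizerKind, String> {
    match name {
        // Assumed: these two backends need a model argument.
        "hf" | "tiktoken" if model.is_empty() => {
            Err(format!("tokenizer '{name}' requires a model name"))
        }
        "hf" => Ok(TokenizerKind::HuggingFace(model.to_string())),
        "tiktoken" => Ok(TokenizerKind::Tiktoken(model.to_string())),
        "tiniestsegmenter" => Ok(TokenizerKind::TiniestSegmenter),
        "jieba" => Ok(TokenizerKind::Jieba),
        "ws" => Ok(TokenizerKind::Whitespace),
        other => Err(format!("unknown tokenizer: {other}")),
    }
}

/// Whitespace tokenizer: splits on runs of Unicode whitespace.
pub fn tokenize_ws(text: &str) -> Vec<String> {
    text.split_whitespace().map(str::to_string).collect()
}
```

Returning a `Result` keeps an unknown tokenizer name or a missing model as an explicit error rather than a silent fallback.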

---------

Signed-off-by: jwnz <tkjones93@gmail.com>


jwnz authored Oct 8, 2024

1 parent 0c3b777 commit 9a7fcae
