-
PR DS Repo
-
add dict model from icbsv2
NEXT TASKS
-
support fallback
-
fix namespacing -- don't need encoders to do it
-
PyArrow-based processing
-
Support Vaex
-
Caching to avoid repeat embeddings
-
add support for weights in new embedder
-
Add Vaex streaming disk-to-disk support
-
support SGPT https://github.com/Muennighoff/sgpt
-
test new embedder more rigorously; separate text tests from embedding tests
-
support fasttext
-
support Polars (maybe just through PyArrow)
-
Add WordPiece style tokenization: https://stackoverflow.com/questions/55382596/how-is-wordpiece-tokenization-helpful-to-effectively-deal-with-rare-words-proble/55416944#55416944 (also in BERTTokenizer)
-
Support making the whole pipeline into an object to put in other models
-
make own fast se lib
-
native fasttext support: https://huggingface.co/blog/fasttext