Add a way to count tokens without encoding the whole text; improve performance; fix bugs #38

niieani · 2023-04-15T20:33:02Z

- Adds a highly performant `isWithinTokenLimit` to count tokens without encoding the whole text
- Improves overall performance by removing transitive arrays
- Includes precomputed `bpeRanks`
- Adds type-checking
- Fixes a few minor bugs (thanks to type-checking)
- Adds generator versions of both the decoder and the encoder
- Adds prettier and prettifies the codebase

To test, see fork published as gpt-tokenizer.
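The idea of checking a token limit without encoding the whole text can be sketched with a lazy generator that the caller stops consuming once the limit is exceeded. This is only an illustration, not the code from this PR: `toyEncodeGenerator` and `toyIsWithinTokenLimit` are hypothetical names, whitespace splitting stands in for real BPE encoding, and the `number | false` return shape is an assumption.

```typescript
// Hypothetical stand-in for a real BPE encoder: lazily yields one "token"
// per whitespace-separated word, so callers can stop consuming early.
function* toyEncodeGenerator(text: string): Generator<string, void, undefined> {
  const wordPattern = /\S+/g;
  let match: RegExpExecArray | null;
  while ((match = wordPattern.exec(text)) !== null) {
    yield match[0];
  }
}

// Returns the token count if the text fits within `limit`, or false as soon
// as the limit is exceeded. The rest of the text is never tokenized.
// (The return shape is an assumption made for this sketch.)
function toyIsWithinTokenLimit(text: string, limit: number): number | false {
  let count = 0;
  for (const _token of toyEncodeGenerator(text)) {
    count += 1;
    if (count > limit) {
      return false; // early exit: the generator is abandoned here
    }
  }
  return count;
}
```

With a very long input and a small limit, the loop returns after `limit + 1` tokens instead of producing the full token array first, which is where the saving would come from.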


niieani added 2 commits April 15, 2023 13:31

chore: add prettier

405e35c

niieani changed the title ~~feat: add a way to count tokens without encoding the whole text~~ Add a way to count tokens without encoding the whole text; improve performance; fix bugs Apr 15, 2023

niieani mentioned this pull request Apr 15, 2023

UTF-8 issue when decode back #37

Open

niieani added 2 commits April 15, 2023 16:58

feat: add decodeAsyncGenerator and support any iterable input

380f229

refactor: simplify decoding logic

9d800de

niieani force-pushed the bb/count-tokens branch from d32f3e8 to 9d800de April 16, 2023 00:33

fix: remove global cache memory leak

22aed68

Requires passing a cache as an additional argument if you want the cache to be shared. Fixes #35.
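One way to picture the change from a global cache to an optional, caller-supplied one is the sketch below. It is an assumption-laden illustration rather than this commit's code: `toyBpe`, `toyEncode`, and the `Map<string, string>` cache type are hypothetical, and the "merge" step is a placeholder.

```typescript
// Hypothetical per-token step, memoized in whatever cache the caller passes.
function toyBpe(token: string, cache: Map<string, string>): string {
  const hit = cache.get(token);
  if (hit !== undefined) {
    return hit;
  }
  const result = token.split("").join(" "); // placeholder for real merge work
  cache.set(token, result);
  return result;
}

// Without a caller-supplied cache, a fresh Map lives only for this call and
// can be garbage-collected afterwards, so nothing accumulates globally.
function toyEncode(text: string, cache?: Map<string, string>): string[] {
  const activeCache = cache ?? new Map<string, string>();
  return text
    .split(/\s+/)
    .filter((word) => word.length > 0)
    .map((word) => toyBpe(word, activeCache));
}

// Opting in to sharing: the caller owns the cache and decides its lifetime.
const shared = new Map<string, string>();
toyEncode("hello hello world", shared);
```

Because the caller holds the only reference to `shared`, it can be cleared or dropped at any time, which avoids the unbounded growth described in #35.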

This was referenced Apr 16, 2023

The cache in bpe() may occupy a large amount of memory after long-time running. #35

Open

Some Updates RFC #20

Open
