This is an implementation of the Tiktoken tokenizer, the BPE (byte pair encoding) tokenizer used by OpenAI's models. It's a partial Dart port of OpenAI's original tiktoken library, but with a much nicer API.
Although there are other tokenizers available on pub.dev, as of November 2024, none of them support the GPT-4o and o1 model families. This package was created to fill that gap.
The supported models are:
- GPT-4
- GPT-4o
- GPT-4o-mini
- o1
- o1-mini
- o1-preview
Also important: this is a Dart-only package (it does not require any platform channels to work), and the tokenization is done synchronously.
Splitting text strings into tokens is useful because GPT models see text in the form of tokens. Knowing how many tokens are in a text string can tell you:
- Whether the text is too long for a model to process.
- How much an OpenAI API call will cost (usage is priced by token).
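For example, a cost estimate is just the token count times a per-token price. Here is a minimal sketch, assuming the package is imported; the price used below is hypothetical, so check OpenAI's current pricing page for real numbers:

```dart
void main() {
  // Create a tokenizer for the model you plan to call.
  final tiktoken = Tiktoken(OpenAiModel.gpt_4o);

  // Hypothetical input price, in USD per 1 million tokens (an assumption,
  // not a real quote — look up the current price for your model).
  const pricePerMillionTokens = 2.50;

  final prompt = 'How many tokens does this prompt use?';
  final tokens = tiktoken.count(prompt);
  final estimatedCost = tokens * pricePerMillionTokens / 1000000;

  print('Tokens: $tokens, estimated input cost: \$$estimatedCost');
}
```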
To see it in action, run the example app:
// Create a Tiktoken instance for the model you want to use.
var tiktoken = Tiktoken(OpenAiModel.gpt_4);
// Encode a text string into tokens.
var encoded = tiktoken.encode("hello world");
// Decode a token string back into text.
var decoded = tiktoken.decode(encoded);
// Count the number of tokens in a text string.
int numberOfTokens = tiktoken.count("hello world");
Alternatively, you can use the static helper functions getEncoder and getEncoderForModel to get a TiktokenEncoder instance first:
// By encoding type:
var encoder = Tiktoken.getEncoder(TiktokenEncodingType.o200k_base);
// Or by model:
var encoder = Tiktoken.getEncoderForModel(OpenAiModel.gpt_4o);
The TiktokenEncoder instance gives you more fine-grained control over the encoding process, as you now have access to more advanced methods:
Uint32List encode(
String text, {
SpecialTokensSet allowedSpecial = const SpecialTokensSet.empty(),
SpecialTokensSet disallowedSpecial = const SpecialTokensSet.all(),
});
Uint32List encodeOrdinary(String text);
(List<int>, Set<List<int>>) encodeWithUnstable(
String text, {
SpecialTokensSet allowedSpecial = const SpecialTokensSet.empty(),
SpecialTokensSet disallowedSpecial = const SpecialTokensSet.all(),
});
int encodeSingleToken(List<int> bytes);
Uint8List decodeBytes(List<int> tokens);
String decode(List<int> tokens, {bool allowMalformed = true});
Uint8List decodeSingleTokenBytes(int token);
List<Uint8List> decodeTokenBytes(List<int> tokens);
int? get eotToken;
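As an illustration of these methods, here is a sketch that round-trips a string through individual tokens, assuming the signatures listed above:

```dart
import 'dart:typed_data';

void main() {
  // Get a fine-grained encoder for a specific model.
  final encoder = Tiktoken.getEncoderForModel(OpenAiModel.gpt_4o);

  // Encode the text, ignoring special tokens entirely.
  final Uint32List tokens = encoder.encodeOrdinary('hello world');

  // Inspect the raw bytes behind each individual token.
  for (final token in tokens) {
    final Uint8List bytes = encoder.decodeSingleTokenBytes(token);
    print('$token -> $bytes');
  }

  // Decode the whole token list back into the original text.
  print(encoder.decode(tokens));
}
```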
I've added many tests to make sure this Dart implementation is correct, but you can also compare the output of this package with that of the reference implementation yourself, by visiting the online Tiktokenizer.
What's the relationship between words and tokens? Every language has a different word-to-token ratio. Here are a few general rules:
- For English: 1 word is about 1.3 tokens
- For Spanish and French: 1 word is about 2 tokens
- Each punctuation mark (like ,:;?!) counts as 1 token; special characters (like ∝√∅°¬) range from 1 to 3 tokens; and emojis (like 😁🙂🤩) range from 2 to 3 tokens.
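For a back-of-the-envelope estimate, you could turn those ratios into a helper. This is just a sketch of the heuristic, not part of the package API; use the actual count method when you need the exact number:

```dart
// Rough token estimates from word counts, using the ratios above.
// These are approximations only — real counts depend on the text.
int estimateTokensForEnglish(int wordCount) => (wordCount * 1.3).ceil();

int estimateTokensForSpanishOrFrench(int wordCount) => wordCount * 2;
```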
This package also provides a word counter. Here is how you can use it:
var wordCounter = WordCounter();
// Prints 0
print(wordCounter.count(''));
// Prints 1
print(wordCounter.count('hello'));
// Prints 2
print(wordCounter.count('hello world!'));
Counting words is complex because each language has its own rules for what constitutes a word. For this reason, the provided word counter is only an approximation and will give reasonable results only for languages written in the Latin alphabet.
This package's code was mostly adapted from the langchain_tiktoken package (https://pub.dev/packages/langchain_tiktoken) by publisher dragonx.cloud. I've added more encodings, added tests, and made the API more user-friendly.
By Marcelo Glasberg
glasberg.dev
github.com/marcglasberg
linkedin.com/in/marcglasberg/
twitter.com/glasbergmarcelo
stackoverflow.com/users/3411681/marcg
medium.com/@marcglasberg
My article in the official Flutter documentation:
The Flutter packages I've authored:
- async_redux
- provider_for_redux
- i18n_extension
- align_positioned
- network_to_file_image
- image_pixels
- matrix4_transform
- back_button_interceptor
- indexed_list_view
- animated_size_and_fade
- assorted_layout_widgets
- weak_map
- themed
- bdd_framework
- tiktoken_tokenizer_gpt4o_o1
My Medium Articles: