Add Dolma Counter script #85

Merged
merged 2 commits into from
Jun 12, 2024

blester125
Collaborator

This PR adds a new dolma processor that can be used to count the number of (whitespace delimited) tokens in a data source.

You can point it at a directory and it will find all the .jsonl.gz files in it or its subdirectories.

It can be used from the root dir via python -m licensed_pile.count or from anywhere with count-tokens-dolma.

One weird aspect is that it uses prefixes like giga or tera "tokens" instead of a billion, a trillion, etc.
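For readers who want a feel for what the counter computes, here is a minimal standalone sketch. It is not the code added in this PR and does not use dolma's processor API; it assumes each JSON line stores its document under a `"text"` key, and the function names `count_tokens_in_shard` and `count_tokens` are hypothetical.

```python
import gzip
import json
from pathlib import Path


def count_tokens_in_shard(path):
    """Count whitespace-delimited tokens in one .jsonl.gz shard."""
    total = 0
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            # Assumption: each record stores its document under "text".
            total += len(record.get("text", "").split())
    return total


def count_tokens(source):
    """Count tokens in a single shard, or in every shard under a directory."""
    source = Path(source)
    if source.is_file():
        return count_tokens_in_shard(source)
    # rglob searches the directory and all of its subdirectories.
    return sum(count_tokens_in_shard(p) for p in sorted(source.rglob("*.jsonl.gz")))
```

Calling `count_tokens("some/data/dir")` (a placeholder path) would walk the directory tree; passing a path to one shard counts only that file.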

@blester125 blester125 requested a review from craffel June 10, 2024 22:57
@baberabb
Contributor

One weird aspect is that it uses prefixes like giga or tera "tokens" instead of a billion, a trillion, etc.

That's just what tqdm defaults to if you use unit_scale=True, iirc.

@blester125
Collaborator Author

You're right, it's not really configurable in dolma tho https://github.com/allenai/dolma/blob/64886d9db15bd99acea9e28740ae20a510875dfb/python/dolma/core/parallel.py#L268

I think it's fine to leave it as is lol, plus the unit_divisor defaults to 1000 https://github.com/tqdm/tqdm?tab=readme-ov-file#parameters so it'll give us the right values 🤷‍♀️
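To illustrate why a divisor of 1000 gives the right values, here is a small standalone sketch of SI-prefix scaling. It is not tqdm's code, and `si_format` is a hypothetical helper; the point is only that with a divisor of 1000, "G" lines up with a billion and "T" with a trillion, while a divisor of 1024 would not.

```python
def si_format(n, divisor=1000):
    """Scale n by divisor and attach an SI prefix (hypothetical helper)."""
    for prefix in ["", "k", "M", "G", "T"]:
        if abs(n) < divisor:
            return f"{n:.2f}{prefix}"
        n /= divisor
    return f"{n:.2f}P"


print(si_format(1_500_000_000))        # 1.50G, i.e. 1.5 billion
print(si_format(2_000_000_000_000))    # 2.00T, i.e. 2 trillion
print(si_format(2**30, divisor=1024))  # 1.00G, although 2**30 is about 1.07 billion
```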

@craffel
Collaborator

craffel commented Jun 12, 2024

Very cool, thank you for doing this. Since "token" is sort of an overloaded term, and since I'm suggesting you also report the decompressed text-only (non-json-overhead) byte count, perhaps we can call this something about "size statistics" instead of tokens.
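A rough sketch of the byte-count idea, under the same assumptions as before: this is not code from the PR, the `"text"` key is assumed, and `size_stats` is a hypothetical name. It compares the UTF-8 size of the text alone with the size of the full decompressed JSON lines.

```python
import gzip
import json


def size_stats(path):
    """Return (text_bytes, json_bytes) for one .jsonl.gz shard."""
    text_bytes = 0
    json_bytes = 0
    with gzip.open(path, "rb") as f:
        for raw in f:
            if not raw.strip():
                continue  # skip blank lines
            # Decompressed size of the whole JSON line, newline included.
            json_bytes += len(raw)
            record = json.loads(raw)
            # Assumption: the document text lives under "text".
            text_bytes += len(record.get("text", "").encode("utf-8"))
    return text_bytes, json_bytes
```

The gap between the two numbers is the JSON overhead (keys, metadata, escaping) that a text-only count leaves out.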

The PR also allows one to point the script at a single dolma shard.
@blester125 blester125 merged commit de0dad9 into main Jun 12, 2024
2 checks passed
@blester125 blester125 deleted the tool/counter branch June 12, 2024 20:19