Add Dolma Counter script #85

blester125 · 2024-06-10T22:51:09Z

This PR adds a new dolma processor that counts the number of (whitespace-delimited) tokens in a data source.

You can point it at a directory and it will find all the `.jsonl.gz` files in it or its subdirectories.

It can be run from the root dir via `python -m licensed_pile.count` or from anywhere with `count-tokens-dolma`.

One weird aspect is that it reports counts with prefixes like giga or tera "tokens" instead of a billion, a trillion, etc.
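For readers who want a feel for what the count involves, here is a minimal standard-library sketch. It is not the PR's dolma processor: the function name `count_whitespace_tokens` is hypothetical, and it assumes each JSON line stores the document body under a `"text"` key.

```python
import gzip
import json
from pathlib import Path


def count_whitespace_tokens(root):
    """Sum whitespace-delimited tokens over every .jsonl.gz file under root."""
    total = 0
    # rglob walks root and all of its subdirectories.
    for shard in sorted(Path(root).rglob("*.jsonl.gz")):
        with gzip.open(shard, "rt", encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue  # skip blank lines
                doc = json.loads(line)
                # Assumption: the document text lives under the "text" key.
                total += len(doc.get("text", "").split())
    return total
```

`str.split()` with no arguments splits on any run of whitespace, so repeated spaces and newlines do not inflate the count.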

baberabb · 2024-06-11T15:47:00Z

> One weird aspect is that it uses prefixes like giga or tera "tokens" instead of a billion, a trillion, etc.

That's just what tqdm defaults to if you use `unit_scale=True`, iirc.

blester125 · 2024-06-11T16:29:12Z

You're right, it's not really configurable in dolma tho: https://github.com/allenai/dolma/blob/64886d9db15bd99acea9e28740ae20a510875dfb/python/dolma/core/parallel.py#L268

I think it's fine to leave it as is lol, plus the `unit_divisor` defaults to 1000 (https://github.com/tqdm/tqdm?tab=readme-ov-file#parameters), so it'll give us the right values 🤷‍♀️
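To see why a divisor of 1000 makes "giga" line up with "billion", here is a small standalone sketch of SI-style scaling. It is an illustration only, not tqdm's code; `format_count` is a hypothetical helper.

```python
def format_count(num, suffix="", divisor=1000):
    """Scale num by divisor and attach an SI-style prefix (k, M, G, T, ...)."""
    for prefix in ["", "k", "M", "G", "T", "P"]:
        if abs(num) < divisor:
            return f"{num:.2f}{prefix}{suffix}"
        num /= divisor
    # Anything past "P" is reported in units of divisor**6.
    return f"{num:.2f}E{suffix}"


# With divisor=1000, 1.5 billion tokens is reported as 1.5 "giga" tokens.
print(format_count(1_500_000_000, " tokens"))        # 1.50G tokens
# With divisor=1024, "G" would mean 2**30 instead of 10**9.
print(format_count(2**30, " tokens", divisor=1024))  # 1.00G tokens
```

With the default divisor, k, M, G, and T correspond exactly to thousand, million, billion, and trillion, so the displayed numbers can be read either way.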

craffel · 2024-06-12T12:22:02Z

Very cool, thank you for doing this. Since "token" is sort of an overloaded term, and since I'm suggesting you also report the decompressed text-only (non-json-overhead) byte count, perhaps we can call this something about "size statistics" instead of tokens.
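A rough sketch of what per-document "size statistics" could look like, counting only the decompressed text and none of the JSON overhead. The helper name `size_stats` and the `"text"` key are assumptions for illustration, not the script's actual code.

```python
def size_stats(doc):
    """Return whitespace-token and UTF-8 byte counts for a document's text only."""
    # Assumption: the document text lives under the "text" key.
    text = doc.get("text", "")
    return {
        "tokens": len(text.split()),
        # Encode before measuring so multi-byte characters count in full.
        "bytes": len(text.encode("utf-8")),
    }


print(size_stats({"id": "1", "text": "héllo wörld"}))
# {'tokens': 2, 'bytes': 13}
```

The byte count comes from the encoded text rather than `len(text)`, which would count characters (11 here) instead of bytes (13).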

blester125 requested a review from craffel June 10, 2024 22:57

Include num bytes calculations to stats script

cf71f7c

The PR also allows one to point the script at a single dolma shard.

blester125 merged commit de0dad9 into main Jun 12, 2024
2 checks passed

blester125 deleted the tool/counter branch June 12, 2024 20:19
