Fix llama tokenizer #22402

ArthurZucker · 2023-03-27T16:02:12Z

What does this PR do?

Draft but:

Fixes the conversion script
update the llama default special tokens
fixed compatibility issues
cleanup llama tokeniztion code
add tests

HuggingFaceDocBuilderDev · 2023-03-27T16:18:25Z

The documentation is not available anymore as the PR was closed or merged.

ArthurZucker · 2023-03-27T16:26:38Z

cc @Narsil for visibility!

…to fix-llama-tokenizer

ArthurZucker · 2023-03-28T12:04:50Z

This will need to wait for #22341

Narsil

Let's put some tests before merging. (And a PR description)

Probably focusing on the breaking changes we're making here.

ArthurZucker · 2023-03-28T14:04:04Z

Yes, on it!

ArthurZucker · 2023-03-29T16:55:46Z

Will finish this tomorrow!

Varal7 · 2023-03-30T00:01:49Z

Hi! Does this PR the decoding part of the tokenizer? Seems like it always prefixes the output with space.

For instance, tokenizer.decode(1) returns <s>,

Waiting for huggingface/transformers#22402 to fix llama tokenizer

ArthurZucker · 2023-03-30T08:10:57Z

Yes, it does: print(f'\'{tokenizer.decode(tokenizer.encode("Hello world"), skip_special_tokens = True)}\'',) outputs `'Hello world' 😉

…to fix-llama-tokenizer

ArthurZucker · 2023-03-30T14:08:08Z

tests/test_tokenization_common.py

-        assert tokenizer_fast.clean_up_tokenization_spaces is False
+        assert tokenizer.clean_up_tokenization_spaces is False


this is such a small nit that I included it 😅

After rebasing, this test fails for me :( just reproduced on main:

> assert decoded == "[CLS] this shouldn ' t be! he ' ll go. [SEP]" E assert "[CLS] this s...'ll go. [SEP]" == "[CLS] this s... ll go. [SEP]" E - [CLS] this shouldn ' t be! he ' ll go. [SEP] E ? - - - - E + [CLS] this shouldn't be! he'll go. [SEP]

this is not pointing to the correct part of the test. If the cleanup_tokenization_spaces is indeed False, the fail can happen for cache reasons or anything else (also failed for me at some point).
Will check again

sgugger

Nice, thanks for all the fixes and for adding the tests!

Narsil

LGTM

* draft * update tokenization limma and conversion script * more udpates * initial commit * style * default pad to None * draft tokenization tests * update test * update tokenization tests * nits * update * versioning test * major fix * fix more testst * finish fixing special masks * last nit * more nits * add encode decode tests * add more * fix token type ids * style

ArthurZucker added 4 commits March 24, 2023 12:16

draft

d3b2050

update tokenization limma and conversion script

0da5717

more udpates

a2f2b19

initial commit

7cf369c

ArthurZucker mentioned this pull request Mar 27, 2023

LLaMA Implementation #21955

Closed

5 tasks

yukw777 mentioned this pull request Mar 27, 2023

LlamaTokenizer has no pad token, leading to failure during batch-tokenization #22312

Closed

4 tasks

ArthurZucker added 3 commits March 27, 2023 18:36

Merge branch 'main' of https://github.com/huggingface/transformers in…

ff70e2c

…to fix-llama-tokenizer

style

0cf62ec

default pad to None

1f1a4bf

ArthurZucker marked this pull request as ready for review March 28, 2023 12:04

Narsil reviewed Mar 28, 2023

View reviewed changes

draft tokenization tests

7e3ea59

sgugger mentioned this pull request Mar 29, 2023

stopping_criteria not working with llama #22436

Closed

4 tasks

gante mentioned this pull request Mar 29, 2023

Llama-13B gives nonsensical output past 1024 tokens #22433

Closed

4 tasks

Varal7 added a commit to Varal7/lama.vim that referenced this pull request Mar 30, 2023

Monkey patch

5b02aff

Waiting for huggingface/transformers#22402 to fix llama tokenizer

ArthurZucker added 8 commits March 30, 2023 08:12

Merge branch 'main' of https://github.com/huggingface/transformers in…

33dfe79

…to fix-llama-tokenizer

update test

ea0c834

update tokenization tests

3bd23e6

nits

8edf9b1

update

a334056

versioning test

5b9be49

major fix

1b6532a

fix more testst

d988805

ArthurZucker added 7 commits March 30, 2023 13:26

finish fixing special masks

6bdf1e0

last nit

1ab5d7c

more nits

512f77f

add encode decode tests

4c3c2e2

add more

3175977

fix token type ids

1871030

style

fd60fb6

ArthurZucker requested a review from sgugger March 30, 2023 14:07

ArthurZucker commented Mar 30, 2023

View reviewed changes

sgugger approved these changes Mar 30, 2023

View reviewed changes

ArthurZucker requested a review from Narsil March 30, 2023 15:13

ArthurZucker mentioned this pull request Apr 3, 2023

Update convert_llama_weights_to_hf.py #22525

Merged

5 tasks

Narsil approved these changes Apr 3, 2023

View reviewed changes

gante mentioned this pull request Apr 3, 2023

Llama Tokenizer uses incorrect indices for PAD #22520

Closed

4 tasks

sgugger merged commit c0f99b4 into huggingface:main Apr 3, 2023

Qubitium mentioned this pull request Apr 7, 2023

What are the eos_token_id and bos_token_id tloen/alpaca-lora#279

Open

ArthurZucker mentioned this pull request Apr 24, 2023

RecursionError: maximum recursion depth exceeded while getting the str of an object. #22762

Closed

2 tasks

mnoukhov mentioned this pull request May 8, 2023

StackLlama, which special tokens to use with tokenizer huggingface/trl#348

Closed

vpegasus mentioned this pull request May 9, 2023

merged RL model performed baddly huggingface/trl#347

Closed

mnoukhov mentioned this pull request May 12, 2023

Llama 7B fails for Human Eval bigcode-project/bigcode-evaluation-harness#74

Closed

jayelm mentioned this pull request Jun 16, 2023

Unable to reproduce LLaMA-7B results when training from scratch jayelm/gisting#9

Closed

nkcsjxd mentioned this pull request Aug 10, 2023

RecursionError: maximum recursion depth exceeded SCIR-HI/Huatuo-Llama-Med-Chinese#86

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix llama tokenizer #22402

Fix llama tokenizer #22402

ArthurZucker commented Mar 27, 2023

HuggingFaceDocBuilderDev commented Mar 27, 2023 •

edited

Loading

ArthurZucker commented Mar 27, 2023

ArthurZucker commented Mar 28, 2023

Narsil left a comment •

edited

Loading

ArthurZucker commented Mar 28, 2023

ArthurZucker commented Mar 29, 2023

Varal7 commented Mar 30, 2023

ArthurZucker commented Mar 30, 2023

ArthurZucker Mar 30, 2023

NielsRogge Apr 3, 2023

ArthurZucker Apr 3, 2023

sgugger left a comment

Narsil left a comment

		assert tokenizer_fast.clean_up_tokenization_spaces is False
		assert tokenizer.clean_up_tokenization_spaces is False

Fix llama tokenizer #22402

Fix llama tokenizer #22402

Conversation

ArthurZucker commented Mar 27, 2023

What does this PR do?

HuggingFaceDocBuilderDev commented Mar 27, 2023 • edited Loading

ArthurZucker commented Mar 27, 2023

ArthurZucker commented Mar 28, 2023

Narsil left a comment • edited Loading

Choose a reason for hiding this comment

ArthurZucker commented Mar 28, 2023

ArthurZucker commented Mar 29, 2023

Varal7 commented Mar 30, 2023

ArthurZucker commented Mar 30, 2023

ArthurZucker Mar 30, 2023

Choose a reason for hiding this comment

NielsRogge Apr 3, 2023

Choose a reason for hiding this comment

ArthurZucker Apr 3, 2023

Choose a reason for hiding this comment

sgugger left a comment

Choose a reason for hiding this comment

Narsil left a comment

Choose a reason for hiding this comment

HuggingFaceDocBuilderDev commented Mar 27, 2023 •

edited

Loading

Narsil left a comment •

edited

Loading