CodeLlamaTokenizerFast encodes eos_token into separate tokens in multiprocessing mode #1343
Comments
Hey! Thanks for opening an issue here. I'll see if this is related to the conversion or the fast tokenizer code.
Transferred it here as this is not related to the …

Example script:

In [4]: from transformers import (
...: PreTrainedTokenizerFast,
...: AutoTokenizer,
...: )
...: from datasets import Dataset
...:
...: # Load a dataset with random text
...: dataset = Dataset.from_dict({"text": ["random text"] * 100000})
...: # Load the fast tokenizer
...: tokenizer: PreTrainedTokenizerFast = AutoTokenizer.from_pretrained(
...: "codellama/CodeLlama-7b-Instruct-hf", use_fast=True
...: )
...: assert isinstance(tokenizer, PreTrainedTokenizerFast)
...:
...: # Define a wrapper function that returns the tokenize function
...: def get_tokenize(tokenizer: PreTrainedTokenizerFast):
...: def tokenize(example: dict[str, list[str]]) -> dict[str, list[list[int]]]:
...: text_list = [f"{text}</s>" for text in example["text"]]
...: input_ids = tokenizer(text_list, add_special_tokens=False)["input_ids"]
...: return {"input_ids": input_ids}
...:
...: return tokenize
...:
...:     # Apply the wrapper function to tokenize the dataset and use 1 process
...: tokenize = get_tokenize(tokenizer)
...: dataset = dataset.map(tokenize, batched=True, num_proc=1)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:01<00:00, 92.35ba/s]
In [5]: from transformers import (
...: PreTrainedTokenizerFast,
...: AutoTokenizer,
...: )
...: from datasets import Dataset
...:
...: # Load a dataset with random text
...: dataset = Dataset.from_dict({"text": ["random text"] * 100000})
...: # Load the fast tokenizer
...: tokenizer: PreTrainedTokenizerFast = AutoTokenizer.from_pretrained(
...: "codellama/CodeLlama-7b-Instruct-hf", use_fast=True
...: )
...: assert isinstance(tokenizer, PreTrainedTokenizerFast)
...:
...: # Define a wrapper function that returns the tokenize function
...: def get_tokenize(tokenizer: PreTrainedTokenizerFast):
...: def tokenize(example: dict[str, list[str]]) -> dict[str, list[list[int]]]:
...: text_list = [f"{text}</s>" for text in example["text"]]
...: input_ids = tokenizer(text_list, add_special_tokens=False)["input_ids"]
...: return {"input_ids": input_ids}
...:
...: return tokenize
...:
...: # Apply the wrapper function to tokenize the dataset and use 2 processes
...: tokenize = get_tokenize(tokenizer)
...: dataset = dataset.map(tokenize, batched=True, num_proc=2)
#0: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:00<00:00, 51.24ba/s]
#1: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:00<00:00, 50.57ba/s]
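Neither run's progress bars show the difference by themselves; a quick look at the mapped output makes it visible. This check is not part of the original session, only a sketch using the objects defined above:

```python
# Quick check (not part of the original session): look at the tail of the
# first mapped example to see whether "</s>" survived as a single token.
last_ids = dataset[0]["input_ids"][-3:]
print(last_ids, tokenizer.convert_ids_to_tokens(last_ids))
# Per the report, the num_proc=2 run ends with [829, 29879, 29958] -> ['</', 's', '>']
# instead of the single eos id (tokenizer.eos_token_id).
```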
Got it. Thanks Arthur.
TL;DR: @UniverseFly, I can confirm that following the changes below resolves the above issue.

Just ran into this bug with another model that also uses the Llama fast tokenizer and remembered one of the issues with Mistral wrt the Llama tokenizer. @younesbelkada describes it in huggingface/transformers#26498 (comment), and @lewtun's fix is at https://huggingface.co/mistralai/Mistral-7B-v0.1/discussions/26/files.
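Separately from the linked fix, one way to sidestep the splitting entirely is to not embed the literal "</s>" string in the text and instead append the eos id after tokenization. This is a minimal sketch of that pattern, not the fix referenced above:

```python
# Sketch of a workaround (not the referenced fix): avoid putting the literal
# "</s>" string into the text and append the eos id after tokenization instead.
def get_tokenize(tokenizer):
    def tokenize(example):
        input_ids = tokenizer(example["text"], add_special_tokens=False)["input_ids"]
        # Append the eos token id to each sequence explicitly.
        input_ids = [ids + [tokenizer.eos_token_id] for ids in input_ids]
        return {"input_ids": input_ids}
    return tokenize
```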
Thanks @sparverius! I’ll close this issue then.
System Info
transformers version: 4.33.1

Who can help?
@ArthurZucker, @younesbelkada

Information
Tasks
examples folder (such as GLUE/SQuAD, ...)

Reproduction
Here is the minimal reproducible example on my machine. There are several things to note here:
- The problem goes away with use_fast=False.
- The problem goes away with num_proc=1.
- The problem goes away after removing the get_tokenize wrapper function and using tokenizer as a global variable.

Note: It may seem strange to define the get_tokenize wrapper just for demonstration purposes, but my actual use case is more complex and get_tokenize makes the code more structured.

This code triggers AssertionError: [4036, 1426, 829, 29879, 29958], meaning that the eos token </s> is separated into 3 tokens with ids [829, 29879, 29958]. They are mapped to ['</', 's', '>'] respectively.
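The assertion itself is not shown in the example script transferred above; a hypothetical reconstruction of a check that would produce this error message could look like the following (the original assert is assumed, not quoted):

```python
# Hypothetical reconstruction of the failing check (the original assert is
# not shown above): every tokenized example should end with the single eos id.
for ids in dataset["input_ids"]:
    assert ids[-1] == tokenizer.eos_token_id, ids
# Per the report, with num_proc=2 this raises
# AssertionError: [4036, 1426, 829, 29879, 29958].
```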
Expected behavior

The assertion should pass, i.e., the </s> token should be recognized as a single token.