
[SFTTrainer] Introducing DataCollatorForCompletionOnlyLM #445

Merged: 6 commits into main on Jun 20, 2023

Conversation

younesbelkada (Contributor) commented Jun 16, 2023

What does this PR do?

Fixes: #426

This PR introduces the DataCollatorForCompletionOnlyLM data collator, which masks out all prompt tokens that come before the completion, similarly to what is done here: https://github.com/databrickslabs/dolly/blob/master/training/trainer.py#L48-L77

The goal of this data collator is to find where the response template tokens are located in the sequence and mask out all tokens before them, so that the loss is computed only on the completion.
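For intuition, here is a minimal sketch of the effect on the labels (toy token ids; illustrative only, not the collator's actual implementation):

import torch

# Toy example: [prompt ids] + [response template ids] + [completion ids]
input_ids = torch.tensor([11246, 4738, 2420, 198, 44386, 18261, 25, 198, 7, 8, 9])
response_end = 8  # index just past the response template tokens (toy value)

labels = input_ids.clone()
labels[:response_end] = -100  # -100 is ignored by PyTorch's cross-entropy loss
print(labels)
# tensor([-100, -100, -100, -100, -100, -100, -100, -100, 7, 8, 9])
# Loss is now computed only on the completion tokens [7, 8, 9].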

Currently the API looks as follows:

Handy reproducible snippet
from datasets import load_dataset
from trl import SFTTrainer
from trl.trainer import DataCollatorForCompletionOnlyLM
import transformers

dataset = load_dataset("tatsu-lab/alpaca", split="train")

model = transformers.AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125m")
tokenizer = transformers.AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125m")
tokenizer.pad_token = tokenizer.eos_token

def formatting_prompts_func(examples):
    output_text = []
    for i in range(len(examples["instruction"])):
        instruction = examples["instruction"][i]
        input_text = examples["input"][i]
        response = examples["output"][i]

        if len(input_text) >= 2:
            text = f'''Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
            
            ### Instruction:
            {instruction}
            
            ### Input:
            {input_text}
            
            ### Response:
            {response}
            '''
        else:
            text = f'''Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
            
            ### Instruction:
            {instruction}
            
            ### Response:
            {response}
            '''
        output_text.append(text)

    return output_text

response_template = "### Response:\n"
data_collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer, mlm=False)

trainer = SFTTrainer(
    model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    formatting_func=formatting_prompts_func,
    data_collator=data_collator,
    max_seq_length=1024,
)

trainer.train()

Currently, for some reason the data collator cannot find the response template tokens, because of the issue I describe in the snippet below:

import transformers

model = transformers.AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125m")
tokenizer = transformers.AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125m")
tokenizer.pad_token = tokenizer.eos_token

print(tokenizer("### Response:"))
>>> {'input_ids': [21017, 18261, 25], 'attention_mask': [1, 1, 1]}
print(tokenizer("some random text\n ### Response:"))
>>> {'input_ids': [11246, 4738, 2420, 198, 44386, 18261, 25], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

As you can see, the first token of ### Response: (21017) is replaced by 44386.

EDIT: the issue appeared to be quite straightforward: one needs to change the response template to " ### Response:" (with a leading space) instead of "### Response:", since the tokenizer encodes " ###" differently from "###".
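To make the fix concrete (same GPT-Neo tokenizer as in the snippet above; the ids follow from the outputs shown):

print(tokenizer(" ### Response:"))
>>> {'input_ids': [44386, 18261, 25], 'attention_mask': [1, 1, 1]}

With the leading space, the template tokenizes the same way it does mid-sequence, so the collator can find it.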

cc @vwxyzjn

HuggingFaceDocBuilderDev commented Jun 16, 2023

The documentation is not available anymore as the PR was closed or merged.

vwxyzjn (Contributor) commented Jun 16, 2023

Nice PR @younesbelkada!! There are two issues.

Prompt leading spaces

First, the response token issue can be resolved by removing the leading spaces from the prompts. Note that the \ in text = f'''\ is important.

        if len(input_text) >= 2:
            text = f'''\
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input_text}

### Response:
{response}
            '''

instead of

        if len(input_text) >= 2:
            text = f'''Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
            
            ### Instruction:
            {instruction}
            
            ### Input:
            {input_text}
            
            ### Response:
            {response}
            '''

Identifying the location of the response tokens

The reference implementation from dolly seems incorrect: given the formats that we have, its implementation would instead match ### Instruction:\n.


This is because it breaks as soon as the first token matches: '### Response:\n' is encoded as [21017, 18261, 25, 198] and '### Instruction:\n' as [21017, 46486, 25, 198], so since both start with 21017, the match fires on the instruction header first.

To resolve the issue, 864948f ensures all four tokens match.
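For illustration, full-sequence matching might look like the following sketch (hypothetical helper, not the exact code in 864948f):

def find_response_start(input_ids, template_ids):
    # Return the index just past the response template, or None if absent.
    # Matches the whole template token sequence instead of only its first
    # token, so '### Instruction:\n' (which shares the first token 21017)
    # is no longer a false positive for '### Response:\n'.
    n = len(template_ids)
    for i in range(len(input_ids) - n + 1):
        if list(input_ids[i : i + n]) == list(template_ids):
            return i + n
    return None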

I gave it a quick run, but wandb is not recording anything... https://wandb.ai/costa-huang/huggingface/runs/dcld1hg6/overview?workspace=user-costa-huang. Am I missing some configuration, like log_with="wandb"?

lvwerra (Member) commented Jun 19, 2023

Small nit: I would add it to the main init so one can import it via from trl import DataCollator....

younesbelkada marked this pull request as ready for review June 19, 2023 15:52
younesbelkada (Contributor, Author) commented Jun 19, 2023

Thanks a lot @vwxyzjn for digging deeper into that!
I made some small changes on top of your commit and added some tests to make sure the collator doesn't get broken by future commits.
I suggest we add a script that reproduces Stanford Alpaca in the examples/ folder (cf. #439) and, once we build that example, document how to use the collator in the documentation section. How does that sound?

younesbelkada requested review from vwxyzjn and lvwerra June 19, 2023 15:56
vwxyzjn (Contributor) commented Jun 20, 2023


The changes LGTM. It would be great to add some docs and potentially have a Stanford Alpaca example; if we have the bandwidth, we can probably run some tracked experiments and make the tracked metrics and HF models available.

younesbelkada (Contributor, Author) left a comment:

Thanks, makes sense! I was thinking maybe we can merge this PR and do a follow-up PR to add the Stanford Alpaca reproduction, as @Lyken17 was interested in diving into it (#439).

younesbelkada merged commit 7705daa into main Jun 20, 2023
younesbelkada deleted the add-alpaca-dc branch June 20, 2023 15:51
BramVanroy (Contributor) commented Jun 21, 2023

Hello

Is there an example somewhere of how to use this new collator? I see in the source code that it inherits from DataCollatorForLanguageModeling, but that has mlm set to True by default and an mlm probability of 0.15. So to use the data collator for completion-only training, should we initialize it like so, crucially disabling mlm?

DataCollatorForCompletionOnlyLM("### Response:\n", tokenizer, mlm=False)

EDIT: I see now that the new collator overrides the torch_call method, so MLM masking is never applied. But I don't think that is intuitive for the user, because self.mlm = True. Maybe DataCollatorForCompletionOnlyLM could also pass mlm=False to the init of super? That would make things clearer.

younesbelkada (Contributor, Author):

Yes, you are right: we should probably set the default mlm to False and leave the option to change it to True for superusers. Do you want to open a PR for that? The changes would be very minimal.
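Something like this minimal sketch, perhaps (signature inferred from the usage above, not the actual trl code):

from transformers import DataCollatorForLanguageModeling

class DataCollatorForCompletionOnlyLM(DataCollatorForLanguageModeling):
    def __init__(self, response_template, *args, mlm: bool = False, **kwargs):
        # Default mlm to False so that self.mlm reflects what the collator
        # actually does (causal-LM label masking, no random token masking).
        super().__init__(*args, mlm=mlm, **kwargs)
        self.response_template = response_template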

younesbelkada (Contributor, Author):

Regarding the documentation: we're currently working on reproducing Stanford Alpaca using this collator, but for now you should just create the data collator in your main script and pass it via the data_collator argument of SFTTrainer's init.

BramVanroy (Contributor):

I can do a PR tomorrow.

Another issue that I encountered: in some cases the response template is not present because the text is too long and the tokenizer truncates it. What should happen in those cases? Maybe a preprocessing function should filter those cases out beforehand?

younesbelkada (Contributor, Author):

Yeah, I imagine we could indeed have a preprocessing function inside SFTTrainer that takes care of that. It would be really great if you could add that to the PR as well. Otherwise, happy to do it!
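As a rough sketch, such a filter could look like this (hypothetical helper, assuming a dataset with a pre-formatted "text" column; not part of trl):

def response_survives_truncation(example, tokenizer, response_template, max_seq_length=1024):
    # Keep only examples whose response template is still present after truncation.
    input_ids = tokenizer(example["text"], truncation=True, max_length=max_seq_length)["input_ids"]
    template_ids = tokenizer(response_template, add_special_tokens=False)["input_ids"]
    n = len(template_ids)
    return any(input_ids[i : i + n] == template_ids for i in range(len(input_ids) - n + 1))

# dataset = dataset.filter(lambda ex: response_survives_truncation(ex, tokenizer, " ### Response:\n"))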

MatousAc commented Oct 11, 2023

At the very top, we are warned to add a space to our response template, otherwise the tokenizer will not produce the same tokens. I found this to be slightly insufficient, as I also had to prefix my response template with <s>. Only when I included the leading separator both in the argument to the data collator AND in my training prompts (with no space before <s>) was the collator able to find the right token sequence and mask properly.

Hope this helps someone else.
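In code, that workaround might look roughly like the following (hypothetical template and model; the template string must match the training prompts verbatim):

from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")  # assumed Llama-style tokenizer with an <s> BOS token

# Include the leading separator in BOTH the training prompts and the
# collator argument, with no space before <s>:
response_template = "<s> ### Response:\n"  # hypothetical template
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer, mlm=False)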

Successfully merging this pull request may close: How to Instruction Tune with SFTTrainer? (#426)