How to pre-train BART model #4151

Closed
omerarshad opened this issue May 5, 2020 · 21 comments
Assignees: patrickvonplaten
Labels: Ex: LM (Finetuning) · Ex: LM (Pretraining) · wontfix

Comments

@omerarshad

How can I pre-train a BART model in an unsupervised manner? Is there an example?

@patrickvonplaten patrickvonplaten self-assigned this May 5, 2020
@patrickvonplaten patrickvonplaten added Ex: LM (Pretraining) Related to language modeling pre-training Ex: LM (Finetuning) Related to language modeling fine-tuning labels May 5, 2020
@patrickvonplaten
Contributor

We still need to provide a good docstring/notebook for this. It's on our ToDo-List. :-)

Or @sshleifer - is there already something for Bart?

@sshleifer
Contributor

Nothing yet, would be good to add!

@stale

stale bot commented Jul 17, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jul 17, 2020
@shamanez
Contributor

I have seen the same issue in fairseq BART.

@stale stale bot removed the wontfix label Jul 23, 2020
@cahya-wirawan
Contributor

Hi, any news about BART pre-training?

@zy329jy

zy329jy commented Jul 29, 2020

Can anyone tell me how to pre-train BART on my own dataset? I am quite confused.
Thank you so much!

@patrickvonplaten
Contributor

Maybe this comment can help: #5096 (comment)

@stale

stale bot commented Oct 3, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Oct 3, 2020
@stale stale bot closed this as completed Oct 10, 2020
@dhruvramani

Any news on this please?

@cahya-wirawan
Contributor

Not so far; it would be great to have it. Thanks.

@myechona

myechona commented Mar 12, 2021

My co-worker and I wrote a demo based on the RoBERTa pre-training demo.

# encoding=utf-8

from transformers import (
    BartConfig, BartForCausalLM, BartTokenizerFast,
    Seq2SeqTrainingArguments, Seq2SeqTrainer,
)

import torch
from torch.utils.data import random_split


# ## Initiating the model and tokenizer
configuration = BartConfig(
    vocab_size=52000,
    max_position_embeddings=258,
    d_model=256,
    encoder_layers=3,
    decoder_layers=3,
    encoder_attention_heads=4,
    decoder_attention_heads=4,
    decoder_ffn_dim=1024,
    encoder_ffn_dim=1024,
)
# BartForCausalLM uses only the decoder as a causal language model;
# use BartForConditionalGeneration if you want the full encoder-decoder.
model = BartForCausalLM(configuration)
tokenizer = BartTokenizerFast.from_pretrained(
    "./dic", max_len=256,
    additional_special_tokens=['[CH]', '[OTHER]', '[VAR]', '[NUM]'],
)


# ### Data preparation (HTTP-request corpus, one example per line)
data = []
with open("../data/sample.txt") as f1:
    for src in f1:
        data.append({"seq2seq": {"input": src.strip()}})
print(f"total size of data is {len(data)}")


# splitting the dataset into train and validation sets
split = 0.2
n_eval = int(split * len(data))
train_dataset, eval_dataset = random_split(data, lengths=[len(data) - n_eval, n_eval])


# defining a collator function for preparing batches on the fly
def data_collator(features: list):
    inputs = [f["seq2seq"]["input"] for f in features]
    batch = tokenizer.prepare_seq2seq_batch(src_texts=inputs, max_length=256, padding="max_length")
    # labels are a plain copy of the inputs, i.e. a reconstruction/LM objective
    # rather than BART's denoising (text-infilling) objective
    batch["labels"] = batch["input_ids"].copy()
    for k in batch:
        batch[k] = torch.tensor(batch[k])
    return batch


# sanity-check one batch
batch_out = data_collator(eval_dataset)
print(batch_out)
print(batch_out["input_ids"].shape, batch_out["labels"].shape, batch_out["attention_mask"].shape)


# defining training-related arguments
args = Seq2SeqTrainingArguments(
    output_dir="clm-checkpoints",
    do_train=True,
    do_eval=True,
    evaluation_strategy="epoch",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=5e-5,
    num_train_epochs=1,
    logging_dir="./logs",
)


# defining the trainer using 🤗 Transformers
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)


# ## Training time
trainer.train()
# It will take hours to train this model on this dataset


# evaluate and save the model
trainer.evaluate(eval_dataset=eval_dataset)
trainer.save_model("clm-checkpoints")
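
As a follow-up, here is a minimal sketch of how the saved checkpoint could be loaded back and sampled from. It is not part of the original demo: the prompt text and generation settings are made up, and it assumes the tokenizer files are still under ./dic as above.

# Minimal sketch (assumptions noted above): reload the saved checkpoint and
# sample a continuation from the causal-LM head.
from transformers import BartForCausalLM, BartTokenizerFast

model = BartForCausalLM.from_pretrained("clm-checkpoints")
tokenizer = BartTokenizerFast.from_pretrained("./dic", model_max_length=256)

inputs = tokenizer("an example prompt", return_tensors="pt")  # hypothetical prompt
generated = model.generate(inputs["input_ids"], max_length=64, do_sample=True)
print(tokenizer.decode(generated[0], skip_special_tokens=True))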

@banditelol

[quoted demo code from above omitted]

Thanks for the code example. I am also planning to pre-train from scratch, and I have a couple of questions about the code:

  • I noticed that you use a pretrained BART tokenizer; how can I pre-train it for a different language?
  • How much compute did you use for your implementation?

@myechona

myechona commented Apr 6, 2021

[quoted demo code from above omitted]

Thanks for the code example. I am also planning to pre-train from scratch, and I have a couple of questions about the code:

  • I noticed that you use a pretrained BART tokenizer; how can I pre-train it for a different language?
  • How much compute did you use for your implementation?

For the first question, you can train your own tokenizer like this:

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
paths = ['./data/corpus.txt']
tokenizer.train(files=paths, vocab_size=15000, min_frequency=6, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

tokenizer.save_model("./data/dic/")

For the other question, I trained it on a GPU with 12 GB of memory, but it could likely be done with less; you can also adjust the parameters to fit your server environment.
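
For reference, here is a minimal sketch of how the files written by save_model (vocab.json and merges.txt in ./data/dic/) could be loaded back, following the same pattern as the RoBERTa-from-scratch tutorial. The path and maximum length are assumptions taken from the snippets above, not something posted in the original answer.

# Minimal sketch: load the trained byte-level BPE files into a BartTokenizerFast.
from transformers import BartTokenizerFast

tokenizer = BartTokenizerFast.from_pretrained("./data/dic", model_max_length=256)

# Note: BART tokenizers normally expect <s>, </s>, <unk>, <pad> and <mask> as
# special tokens, so it may be worth training the BPE with those instead of the
# BERT-style [UNK]/[CLS]/[SEP]/[PAD]/[MASK] tokens used above.
print(tokenizer("an example sentence").input_ids)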

@Martine307

@myechona Thanks for your code. I have a question about it: BART pre-training includes tasks like text infilling and sentence permutation, so should "input_ids" hold the masked (corrupted) sentence and "labels" the original sentence?
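
For context, BART's denoising pre-training corrupts the input (for example by replacing contiguous spans with a single mask token) while the labels stay the original text, whereas the collator in the demo copies the inputs verbatim. Below is a minimal sketch of a span-masking (text-infilling) collator; the masking ratio, the Poisson span lengths, and the reuse of the demo's tokenizer and feature format are assumptions, not the reference BART implementation, and it presumes an encoder-decoder model such as BartForConditionalGeneration so that inputs and labels can differ.

import numpy as np
import torch

# `tokenizer` is the BartTokenizerFast instance from the demo above.
mask_prob = 0.3        # rough fraction of tokens to corrupt (assumption)
poisson_lambda = 3.0   # the BART paper samples span lengths from Poisson(3)
max_length = 256

def infilling_collator(features: list):
    texts = [f["seq2seq"]["input"] for f in features]
    enc = tokenizer(texts, max_length=max_length, padding="max_length",
                    truncation=True, return_tensors="pt")

    labels = enc["input_ids"].clone()                  # labels = original sentence
    labels[labels == tokenizer.pad_token_id] = -100    # ignore padding in the loss

    corrupted = []
    for ids in enc["input_ids"].tolist():
        toks = [t for t in ids if t != tokenizer.pad_token_id]
        out, i = [], 0
        while i < len(toks):
            # keep the <s>/</s> boundary tokens; otherwise start a masked span at random
            if 0 < i < len(toks) - 1 and np.random.rand() < mask_prob / poisson_lambda:
                span = max(1, np.random.poisson(poisson_lambda))
                out.append(tokenizer.mask_token_id)    # whole span -> one <mask>
                i += span
            else:
                out.append(toks[i])
                i += 1
        out = out[:max_length] + [tokenizer.pad_token_id] * max(0, max_length - len(out))
        corrupted.append(out)

    input_ids = torch.tensor(corrupted)
    return {
        "input_ids": input_ids,                                       # corrupted sentence
        "attention_mask": (input_ids != tokenizer.pad_token_id).long(),
        "labels": labels,                                             # original sentence
    }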

@prajdabre

If anyone wants to train their own MBART model, feel free to use this:
https://github.com/prajdabre/yanmtt

Contributions are welcome!

@thomas-li-sjtu

[quoted demo code from above omitted]

Thanks for your code, it really helps.

@jbmaxwell

I'm most interested in sentence infilling, which this script doesn't really seem to address (though my understanding was that BART training generally involves masking and permutation). Is there an additional step I need to add for the infilling functionality?

@sajastu

sajastu commented Apr 6, 2022

We still need to provide a good docstring/notebook for this. It's on our ToDo-List. :-)

Or @sshleifer - is there already something for Bart?

Hi, any update on this? @vanpelt

@jbmaxwell

I actually decided to jump over to T5 and use the run_t5_mlm_flax.py script. It seems to be working so far, though it's very new, so it's missing some conveniences... it sounds like that stuff is underway!

@sajastu

sajastu commented Apr 6, 2022

I actually decided to jump over to T5 and use the run_t5_mlm_flax.py script. It seems to be working so far, though it's very new, so it's missing some conveniences... it sounds like that stuff is underway!

Great, I was initially looking at those scripts to get some ideas for a pre-training script, but I thought the Hugging Face team might have come up with a resource for this by now. Apparently, it's still underway! :)

@PiotrNawrot

We've released nanoT5, which reproduces T5 (a model similar to BART) pre-training in PyTorch (not Flax).

You can take a look!

Any suggestions are more than welcome.
