
Can I train a BART model from scratch with transformers? #5096

Closed
ScottishFold007 opened this issue Jun 18, 2020 · 21 comments

Comments

@ScottishFold007
Contributor

Can I train a BART model from scratch with transformers?

@patrickvonplaten
Contributor

Yes

@ScottishFold007
Contributor Author

> Yes

That's awesome! Can you show some code for how to do it? I'd be grateful!

@patrickvonplaten
Contributor

patrickvonplaten commented Jun 18, 2020

From the paper (https://arxiv.org/pdf/1910.13461.pdf), you can see that BART is trained to denoise input sequences that can be corrupted in almost any possible way.

One way could be for BartForConditionalGeneration:

from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig

tok = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration(BartConfig())

# <mask> marks the corrupted span the model has to reconstruct
input_string = "My dog is <mask> </s>"
decoder_input_string = "<s> My dog is cute"
labels_string = "My dog is cute </s>"

input_ids = tok(input_string, add_special_tokens=False, return_tensors="pt").input_ids
decoder_input_ids = tok(decoder_input_string, add_special_tokens=False, return_tensors="pt").input_ids
labels = tok(labels_string, add_special_tokens=False, return_tensors="pt").input_ids

# with labels provided, the first element of the output is the language modeling loss
loss = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids, labels=labels)[0]
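
To actually update the randomly initialized weights from this loss, a minimal training-step sketch (the optimizer and learning rate below are illustrative, not part of the original answer) could be:

import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # hypothetical choice of optimizer/lr

optimizer.zero_grad()
loss.backward()   # backpropagate the denoising loss from the snippet above
optimizer.step()  # update the randomly initialized parameters

In practice this step would sit inside a loop over a DataLoader of (noised input, original target) pairs.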

@patrickvonplaten
Contributor

Pinging @sshleifer to make sure I did not forget anything

@ScottishFold007
Contributor Author

ScottishFold007 commented Jun 18, 2020

> Pinging @sshleifer to make sure I did not forget anything

Actually, I was going to ask how to train a model from zero to one. For example, I want to train a Chinese BART model.
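
One possible starting point for a new language (a sketch only; the corpus file and vocabulary size below are made up) is to train a fresh tokenizer on your own corpus and initialize a random-weight BART whose embeddings match it:

from transformers import BartTokenizerFast, BartForConditionalGeneration, BartConfig

def line_iterator(path):
    # one paragraph of raw text per line; "chinese_corpus.txt" is a placeholder
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.strip()

base_tok = BartTokenizerFast.from_pretrained("facebook/bart-large")
new_tok = base_tok.train_new_from_iterator(line_iterator("chinese_corpus.txt"), vocab_size=50000)

config = BartConfig(vocab_size=new_tok.vocab_size)  # embeddings sized for the new vocabulary
model = BartForConditionalGeneration(config)        # random weights, ready for pretraining

The denoising objective itself then stays the same as in the examples in this thread.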

@tomhosking
Contributor

tomhosking commented Sep 2, 2020

Here's a working example for this, including batching:

from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig

tok = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration(BartConfig())

input_batch = ["My dog is <mask></s>", "It loves to play in the <mask></s>"]
decoder_input_batch = ["<s>My dog is cute", "<s>It loves to play in the park"]
labels_batch = ["My dog is cute</s>", "It loves to play in the park</s>"]

input_ids = tok.batch_encode_plus(input_batch, add_special_tokens=False, return_tensors="pt", padding=True).input_ids
decoder_input_ids = tok.batch_encode_plus(decoder_input_batch, add_special_tokens=False, return_tensors="pt", padding=True).input_ids
labels = tok.batch_encode_plus(labels_batch, add_special_tokens=False, return_tensors="pt", padding=True).input_ids

loss = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids, labels=labels)[0]

>>> tensor(10.9981, device='cuda:0', grad_fn=<NllLossBackward>)
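
One caveat with this padded batch (an observation on top of the example, not part of the original reply): the pad positions in labels also contribute to the loss unless they are masked out. Replacing them with -100, the ignore index of the underlying cross-entropy loss, excludes them:

labels[labels == tok.pad_token_id] = -100  # positions set to -100 are ignored when computing the loss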

@ScottishFold007
Contributor Author

> Here's a working example for this, including batching: […]

If I have a text document with one paragraph per line, how do I adapt the data input code for it? Thanks!
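
One possible sketch (the file name is a placeholder, and masking a single word is only a toy stand-in for BART's span infilling discussed further down in this thread):

import random
from transformers import BartTokenizer

tok = BartTokenizer.from_pretrained("facebook/bart-large")

def make_example(line):
    # returns (noised input, decoder input, labels) strings for one paragraph
    words = line.split()
    words[random.randrange(len(words))] = "<mask>"
    return " ".join(words) + "</s>", "<s>" + line, line + "</s>"

with open("corpus.txt", encoding="utf-8") as f:  # hypothetical file, one paragraph per line
    lines = [l.strip() for l in f if l.strip()]

input_batch, decoder_input_batch, labels_batch = map(list, zip(*[make_example(l) for l in lines]))

input_ids = tok(input_batch, add_special_tokens=False, return_tensors="pt", padding=True).input_ids
# decoder_input_batch and labels_batch can be encoded the same way, as in the batched example above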

@swethmandava
Contributor

swethmandava commented Dec 17, 2020

@tomhosking the paper indicates that it uses both sentence permutation (loss is propagated from all tokens instead of only masked tokens) and text infilling (only one mask token for multiple consecutive masked tokens). Would this be a correct input?

input_batch = ["<s>It is <mask> retriever. My dog is <mask></s>", "<s>There <mask> in SF. It loves to play in the <mask></s>"]
decoder_input_batch = ["</s><s>My dog is cute. It is a golden retriever", "</s><s>It loves to play in the park. There are many parks in SF."]
labels_batch = ["<s>My dog is cute. It is a golden retriever</s>", "<s>It loves to play in the park. There are many parks in SF.</s>"]

(Note: decoder_input_batch starts with </s><s> due to shift_tokens_right #7961)
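
A sketch of what that shift does (the helper below is illustrative; transformers ships its own shift_tokens_right): the decoder inputs are just the labels shifted one position to the right with the decoder start token prepended, which for the BART checkpoints is </s>, hence the </s><s> prefix:

import torch

def shift_right(labels: torch.Tensor, decoder_start_token_id: int) -> torch.Tensor:
    # labels: (batch, seq_len) token ids; drop the last label token and prepend the start token
    shifted = labels.new_zeros(labels.shape)
    shifted[:, 1:] = labels[:, :-1].clone()
    shifted[:, 0] = decoder_start_token_id
    return shifted

decoder_input_ids = shift_right(labels, tok.eos_token_id)  # </s> is BART's decoder start token

The library version additionally replaces any -100 positions in the shifted labels with the pad id.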

@jonatasgrosman
Contributor

Sorry for the intrusion, but I think your values are almost correct, @swethmandava, except for the absence of masking:

input_batch = ["<s>It <mask> retriever. My <mask> cute </s>", ... ]
decoder_input_batch = ["</s><s>My dog is cute. It is a golden retriever", ...]
labels_batch = ["<s>My dog is cute. It is a golden retriever</s>", ...]

BTW: this </s> token at the beginning of the decoder's input is kind of weird to me, but it's inherited from the original fairseq code. If you want to train the model from scratch with random weights, I think you can go without it... or maybe this trick is important for convergence, we never know 😁

@HuipengXu

Will masking only 15% of the encoder input cause some kind of leakage, so that the language model in the decoder cannot learn correctly?

@prajdabre

If anyone wants to train their own MBART model, feel free to use this:
https://github.com/prajdabre/yanmtt

Contributions are welcome!

@jbmaxwell

> Sorry for the intrusion, but I think your values are almost correct, @swethmandava, except for the absence of masking […]

I have a non-natural language dataset where I haven't actually been including <s> and </s> since they don't add any value (and need to be removed later anyway). To work with that, should I insert a pad token at the start of the decoder_input representation (and truncate to max_length)?
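
For what it's worth, a sketch of that setup (the vocabulary size and token ids below are placeholders for a custom, non-natural-language tokenizer): the decoder start token is configurable, so it can simply be set to the pad id rather than inserted by hand:

from transformers import BartConfig, BartForConditionalGeneration

config = BartConfig(
    vocab_size=1024,           # hypothetical vocabulary for the non-NL dataset
    pad_token_id=0,            # whatever id your own tokenizer uses for padding
    decoder_start_token_id=0,  # decoder inputs then begin with the pad id
)
model = BartForConditionalGeneration(config)

Recent versions of the model build decoder_input_ids from labels with exactly this start id when they are not passed explicitly, and truncation to max_length is handled in the tokenizer as usual.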

@Haiming94

> From the paper (https://arxiv.org/pdf/1910.13461.pdf), you can see that BART is trained to denoise input sequences that can be corrupted in almost any possible way. […]

Hi, do you have a script to build the training dataset for BART pretraining? Thanks!

@BramVanroy
Collaborator

@patrickvonplaten @sshleifer Did anyone ever get around to creating a notebook/script for BART pretraining? (In a linked issue you mentioned it was on the to-do list.)

The core difficulty is having a canonical implementation of the data preprocessing (BART is more than just token masking, I believe: e.g., span masking, sentence shuffling). But a full pretraining pipeline, here or in fairseq, is also sorely missing.
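
For readers landing here, a rough sketch of those two corruptions (text infilling with Poisson-length spans collapsed into a single <mask>, plus sentence permutation), written on plain strings rather than at the token level as fairseq does; the 30% mask budget and lambda = 3 follow the paper, everything else is simplified:

import random
import numpy as np

def permute_sentences(text):
    # crude sentence split on "."; the real implementation works on full-stop tokens
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    random.shuffle(sentences)
    return ". ".join(sentences) + "."

def text_infill(text, mask_ratio=0.3, poisson_lambda=3.0):
    # replace whole spans of words with a single <mask> until ~mask_ratio of the words are gone
    words = text.split()
    budget = int(round(mask_ratio * len(words)))
    while budget > 0 and len(words) > 1:
        span = max(1, min(np.random.poisson(poisson_lambda), budget, len(words) - 1))
        start = random.randrange(len(words) - span + 1)
        words[start:start + span] = ["<mask>"]  # the whole span becomes one mask token
        budget -= span
    return " ".join(words)

noised = text_infill(permute_sentences("My dog is cute. It is a golden retriever."))

This is only an approximation of the fairseq preprocessing (no 0-length span insertion, no subword/whole-word bookkeeping), which is exactly the part a canonical PyTorch example would need to get right.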

@patrickvonplaten
Contributor

Sadly not :-/ We now have one for Flax in #18297 - could you maybe try to copy the preprocessing logic into a PyTorch version?

@BramVanroy
Collaborator

@patrickvonplaten I've been porting the fairseq implementation to a PyTorch dataloader format. I found that the Flax implementation in HF lacks adding noise for 0-length spans and diverges slightly in a few other places, so it was more straightforward to start from the fairseq implementation. I am now testing the data processing especially carefully to get it as close as possible to fairseq's (although it is my belief that there's a bug in their code).

I would like to add a full PyTorch example for DLM training of BART in the coming days/weeks, but I could use some code reviews along the way to feel more comfortable. Would that be possible?

@patrickvonplaten
Contributor

Sure, happy to take a look!

@prajdabre

Hi

I remember posting this a year ago but I've written an entire toolkit for this purpose. Feel free to use it. https://github.com/prajdabre/yanmtt

I've also created a simple notebook for the same (scroll to the pretraining part): https://colab.research.google.com/drive/1ovlA_h0ggblawqR-yCgRs3uRjxFJ8K0l?usp=sharing

@BramVanroy
Collaborator

Hi Raj, thank you for this. I had come across it, but your script seems to have a lot of additional things going on, which makes it hard to extract the basics. I also found that you implement word/span masking but not the other corruptions, like adding noise or randomly swapping a masked token for a random token, so it's not completely like the original implementation (but correct me if I'm wrong!).

I think your library can be very useful as a separate library, thanks! In addition, I'll try to add a PR to transformers with a succinct example that can be used with the Trainer, with data processing close to the fairseq implementation.

@prajdabre

Hi,

My focus was more on mBART and mT5, which look only at span masking and reordering. I'm not sure if token replacement will have that big of an impact, but it can easily be implemented in one line. To my understanding, span masking is responsible for the majority of the gains. The notebook contains a more watered-down version of the masking method in my toolkit. You could take that version and build on top of it easily.
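
For completeness, one way that token replacement could be sketched on a batch of ids (the rate is illustrative, and special tokens are not excluded here, which a real implementation would want to do):

import torch

def random_replace(input_ids, vocab_size, rate=0.1):
    # replace ~rate of the positions with uniformly sampled vocabulary ids
    noise = torch.randint_like(input_ids, 0, vocab_size)
    keep = torch.rand_like(input_ids, dtype=torch.float) >= rate
    return torch.where(keep, input_ids, noise)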

@CountingMstar

Hey guys, I would like to know how to pretrain a BART model from scratch. Does anyone know about this? BART, Pegasus, or other text summarization models are okay for me.
