
Add BART DLM PyTorch pretraining example #18904

Closed
wants to merge 5 commits into from
Conversation

BramVanroy (Collaborator) commented on Sep 6, 2022

Implements a pretraining example for BART (denoising language model). The main focus is on getting the data denoising as close to the original fairseq implementation as possible, but at the dataloader level instead of the dataset level.

Heavily inspired by the fairseq implementation and the Flax implementation (see HF (Flax), fairseq, and current implementation below). Looking for some feedback. Please see Questions/Uncertainties.
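For orientation, here is a minimal sketch of what such a dataloader-level denoising collator could look like. The class name, method names, and defaults are illustrative assumptions, not the exact code in this PR:

# Minimal sketch, not the PR's actual implementation.
from dataclasses import dataclass

import torch
from transformers import PreTrainedTokenizerBase


@dataclass
class DataCollatorForBartDenoising:  # hypothetical name
    tokenizer: PreTrainedTokenizerBase
    mask_ratio: float = 0.3
    poisson_lambda: float = 3.5
    permute_sentence_ratio: float = 1.0

    def __call__(self, examples):
        # Assumes already-tokenized samples of equal length, so no padding is needed
        input_ids = torch.stack([torch.as_tensor(e["input_ids"]) for e in examples])
        labels = input_ids.clone()  # the uncorrupted sequence is the target

        # Corruption happens on every iteration, so the same sample is noised
        # differently each epoch (unlike a cached Dataset.map pipeline)
        input_ids = self.permute_sentences(input_ids)
        input_ids = self.mask_spans(input_ids)

        return {
            "input_ids": input_ids,
            "labels": labels,
            "attention_mask": (input_ids != self.tokenizer.pad_token_id).long(),
        }

    def permute_sentences(self, input_ids):
        return input_ids  # placeholder; see the sketches further down

    def mask_spans(self, input_ids):
        return input_ids  # placeholder; Poisson-length span masking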

Some notes

Default values

The defaults are set to the given BART args. This differs from the Flax defaults in one respect, namely poisson_lambda, which is now set to 3.5 instead of 3.0.
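As an aside, poisson_lambda controls the span lengths sampled for text infilling. A minimal numpy-based sketch (in the style of the Flax example; not the PR's exact code):

import numpy as np

poisson_lambda = 3.5  # BART/fairseq default; the Flax example used 3.0

# Span lengths for text infilling are drawn from Poisson(lambda);
# in fairseq, zero-length spans become pure insertion noise
span_lengths = np.random.poisson(lam=poisson_lambda, size=(8,))
print(span_lengths)  # e.g. [4 2 0 5 3 3 1 6]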

HF (Flax), fairseq, and current implementation

There are some differences in implementation between fairseq, the HF FLAX example, and this PyTorch implementation.

  • argwhere in the Flax example (in this position) is not the same as what happens in fairseq. In fairseq we
    explicitly check that the previous token was not a "full stop" (padding token), whereas in HF we only check whether
    the current token is a full stop. In the current example I also explicitly check that the next token is not a full
    stop, to account for padding. (In practice that should be a non-issue, since all batches/samples should have the
    same sequence length and there should not be any padding.) See the first sketch after this list.
  • I found that the result of sentence permutation was not consistent in terms of where the separating pad token ended
    up (bug report), so I have reimplemented that method so that sentences in a sequence are still separated by a
    padding token, even after permutation (also illustrated in the first sketch below).
  • In HF Flax, the token_mask is restricted to non-special and non-padding tokens. In fairseq, by default, only the
    first and last tokens are excluded and all others can be masked. The HF approach seems sensible, so I follow it.
    get_special_tokens_mask already includes the padding token, so there is no need to add that separately (see the
    second sketch below).
  • The Flax example does not include the methods that add further noise; I have ported those as well.
  • However, I did not adapt add_insertion_noise to work well with padded sequences, so the inserted noise may occur
    ANYWHERE (see the third sketch below). It is unclear whether this is intended behavior.
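To make the first two bullets concrete, here is a rough sketch of the boundary detection and pad-preserving permutation. The function names and tensor operations are illustrative assumptions, not the PR's code, and they operate on a single 1D sequence for simplicity:

import torch


def sentence_end_indices(input_ids: torch.Tensor, full_stop_id: int) -> torch.Tensor:
    # A position ends a sentence when the current token is a "full stop"
    # (here: the padding token used as separator) and the next token is not,
    # so runs of padding do not count as multiple boundaries
    is_full_stop = input_ids == full_stop_id
    next_is_full_stop = torch.roll(is_full_stop, shifts=-1, dims=0)
    next_is_full_stop[-1] = False
    return torch.nonzero(is_full_stop & ~next_is_full_stop, as_tuple=False).squeeze(-1)


def permute_sentences(input_ids: torch.Tensor, full_stop_id: int) -> torch.Tensor:
    # Shuffle sentences while keeping the separating token attached to each
    # sentence, so sentences stay separated after permutation
    ends = sentence_end_indices(input_ids, full_stop_id)
    sentences, start = [], 0
    for end in ends.tolist():
        sentences.append(input_ids[start : end + 1])  # include the separator
        start = end + 1
    if start < input_ids.numel():  # trailing tokens without a final separator
        sentences.append(input_ids[start:])
    order = torch.randperm(len(sentences))
    return torch.cat([sentences[i] for i in order])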
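A sketch of restricting maskable positions with get_special_tokens_mask, as described in the third bullet (illustrative, not the PR's exact code):

import torch
from transformers import BartTokenizerFast

tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-base")
batch = tokenizer(["Hello world. Another sentence."], return_tensors="pt")
input_ids = batch["input_ids"]

# With already_has_special_tokens=True, special tokens (including padding)
# are marked with 1; only positions marked 0 are candidates for masking/noising
special_mask = torch.tensor(
    [
        tokenizer.get_special_tokens_mask(ids, already_has_special_tokens=True)
        for ids in input_ids.tolist()
    ],
    dtype=torch.bool,
)
token_mask = ~special_mask  # True where masking is allowed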
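Finally, a simplified sketch of insertion noise at unrestricted positions, as per the last bullet. Unlike fairseq's add_insertion_noise, this version only inserts mask tokens (fairseq also replaces a fraction of the insertions with random tokens); it is an illustration, not the ported code:

import torch


def add_insertion_noise(input_ids: torch.Tensor, p: float, mask_token_id: int) -> torch.Tensor:
    # Insert p * len(input_ids) mask tokens at random positions; positions are
    # unrestricted, so noise can land on or around special tokens such as EOS
    num_insertions = int(p * input_ids.numel())
    if num_insertions == 0:
        return input_ids
    total_len = input_ids.numel() + num_insertions
    result = torch.full((total_len,), mask_token_id, dtype=input_ids.dtype)
    keep = torch.ones(total_len, dtype=torch.bool)
    keep[torch.randperm(total_len)[:num_insertions]] = False
    result[keep] = input_ids
    return result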

Alternatively, we could implement all this processing at the dataset level and use Dataset.map (a rough sketch follows the lists below). This has some advantages:

  • more true to fairseq implementation (sample level rather than batch level);
  • cached.

... and disadvantages:

  • potentially slower (not batched), although we could integrate a batched approach; as discussed above, that would be
    less true to the original fairseq implementation of add_insertion_noise;
  • every sample is always processed the same way, so in small datasets that are seen multiple times by the model, the
    same sample will always be corrupted identically. With a dataloader, that is not the case, because the processing
    occurs on every iteration rather than once before training.
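For reference, a minimal sketch of the dataset-level alternative. The denoise_batch function and the file path are hypothetical placeholders; only the Dataset.map call reflects the real datasets API:

from datasets import load_dataset

# Placeholder data files; the real example would use the script's data args
raw_datasets = load_dataset("text", data_files={"train": "train.txt"})


def denoise_batch(examples):
    # Hypothetical batched denoising: tokenize, permute sentences, mask spans.
    # Key point: this runs once and is cached, so every epoch sees the exact
    # same corrupted version of each sample.
    return examples


processed = raw_datasets["train"].map(
    denoise_batch,
    batched=True,
    num_proc=4,  # parallelize over CPU workers
)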

Questions/Uncertainties

  • Do the padding tokens still serve a purpose after permutation (teaching the model to detect sentence boundaries)? They can get masked and noised.
  • It seems that add_insertion_noise can insert noise anywhere (also in fairseq), which means that it can also overwrite special
    tokens and that sequences don't necessarily end with an EOS token. Is that a problem?
  • I have now added auxiliary scripts for config/tokenizer creation when pre-training. Should I remove those? In the Flax example, these steps are described inline but without a given script, so we could also just do that.
  • I have explicitly added (hashed) fingerprints because I have previously run into issues when using spaCy with Dataset.map (every time you load a spaCy model it has a different hash, so the processing happens again every time). I don't see a better way, but feel free to share ideas; see the sketch after this list. Maybe someone from the datasets team can chime in, too.
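To illustrate the fingerprint workaround from the last point: a sketch under the assumption that a deterministic string can stand in for the non-deterministic spaCy hash. The split_sentences function and the model name are placeholders, not the PR's code:

import spacy
from datasets import Dataset
from datasets.fingerprint import Hasher

nlp = spacy.load("en_core_web_sm")  # any pipeline with sentence segmentation


def split_sentences(examples):
    # Hypothetical spaCy-based sentence splitting; what matters here is the
    # fingerprint handling, not the body of this function
    docs = nlp.pipe(examples["text"])
    examples["sentences"] = [[sent.text for sent in doc.sents] for doc in docs]
    return examples


dataset = Dataset.from_dict({"text": ["First sentence. Second sentence."]})

# spaCy models hash differently on every load, which breaks datasets' caching.
# Passing an explicit, deterministic new_fingerprint avoids re-processing.
fingerprint = Hasher.hash("split_sentences-en_core_web_sm-v1")
dataset = dataset.map(split_sentences, batched=True, new_fingerprint=fingerprint)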

Before submitting

Who can review?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

BramVanroy closed this by deleting the head repository on Sep 9, 2022
sanchit-gandhi (Contributor) commented:

Hey @BramVanroy! Thanks for making a start on this PR. In general, we aim to mirror the original repo's functionality as closely as possible, and in this case porting from fairseq is the way to go! So it's great to see your comments regarding consistency with fairseq, and yes to all of them! If these changes are indeed required, we'll need to update the Flax example accordingly.

We can batch and parallelize the preprocessing with datasets.map by passing the batched and num_proc args. To pre-process samples on a specified number of CPU workers concurrently:

dataset = dataset.map(map_fn, batched=True, num_proc=data_args.preprocessing_num_workers)

I think this is the way to go to keep the dataset processing as close to fairseq as possible.

Adding auxiliary scripts for config/tokenizer creation is a great idea - all for it! Makes it far easier to reproduce and run the example :-)
