[WIP] DataCollatorForTextInfilling #12370
Conversation
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
It's still on my agenda to brush this up.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
This is a wonderful effort. Any update on this? Also, if you could add a TF call, that would be great.
@salrowili Sadly, I didn't find time for it. I'm also not sure whether this still fits with the library; there may have been updates to the data collators in the meantime. I'm still interested in working on this, but realistically I won't have time unless I need it for an ongoing project. Would you be up for a collaboration?
@ionicsolutions Thanks for replying. What about BartForConditionalGeneration? Is it enough to train BART from scratch as in this example: https://github.com/huggingface/transformers/blob/main/examples/flax/language-modeling/run_mlm_flax.py#L241 ? However, as you can see, it uses FlaxDataCollatorForLanguageModeling, which I am not sure implements the text infilling task.
@salrowili I'm also interested in infilling generation and was wondering if you've made any progress? I see your last post was three weeks ago, so I'm wondering if maybe you found an alternative approach?
@jbmaxwell I tried out the BART implementations in Flax, XLA with TPU, and Keras BART @ https://github.com/cosmoquester/transformers-bart-pretrain . Keras BART was my best model among those, and hence why I was looking for text infilling. I also think the implementation of BART is not optimal in the Hugging Face library, especially for BART large. I am also working with fairseq and torch XLA now, and I think this will be the best among all the variants I tried. I suggest you ask Google for TPU access at https://sites.research.google/trc/ and try fairseq XLA with BART, but fix the dynamic shape by using a pre-defined input shape as in my fork https://github.com/salrowili/fairseq . You can look at the latest commits to see what changes I made. With a TPUv3-8, BART will get a speed of ~100k wps, but you need to keep the log interval at 10 and num_bucket=5. I run BART on my 3090 and it gives me a speed of 30k wps. 100k wps translates to ~20k steps/day, which is slow compared to BERT with TF (~125k steps/day) with a batch size of 256 and a max. seq. length of 512. That means it will take you around one month to finish 500k steps with BART (:
I hadn't seen this before—thanks for the link!
What does this PR do?
A DataCollator for the BART "Text Infilling" pre-training task.
The implementation borrows ideas from fairseq's more complex DenoisingDataset.
Fixes #5428
(Addresses #5096)
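As a rough sketch of what the collator does: BART's text infilling objective samples span lengths from a Poisson distribution (λ=3 in the paper) and replaces each sampled span with a single mask token, with the uncorrupted sequence serving as the labels for the denoising loss. Below is a minimal, hypothetical illustration of that corruption step for a single sequence; the function name and parameters (`text_infilling`, `mask_ratio`, `poisson_lambda`) are assumptions for illustration, not this PR's actual API.

```python
import random

import torch


def text_infilling(input_ids, mask_token_id, mask_ratio=0.15, poisson_lambda=3.0):
    """Corrupt one 1-D sequence by replacing sampled spans with a single mask token.

    Hypothetical sketch of BART-style text infilling; not the API proposed in this PR.
    """
    num_to_mask = max(1, round(input_ids.size(0) * mask_ratio))

    # Sample span lengths from Poisson(poisson_lambda) until the mask budget is met.
    lengths = []
    while sum(lengths) < num_to_mask:
        lengths.append(int(torch.poisson(torch.tensor(poisson_lambda)).item()))

    masked = input_ids.tolist()
    for length in lengths:
        if length == 0:
            # A length-0 span corresponds to inserting a mask token.
            masked.insert(random.randrange(len(masked) + 1), mask_token_id)
        else:
            start = random.randrange(max(1, len(masked) - length))
            # The whole span is replaced by a single mask token, so the model
            # must also learn how many tokens are missing.
            masked[start:start + length] = [mask_token_id]

    # Labels for the denoising loss would be the original, uncorrupted input_ids.
    return torch.tensor(masked)


# Example usage (50264 is the <mask> id for facebook/bart-base):
# corrupted = text_infilling(torch.arange(10, 30), mask_token_id=50264)
```

A real collator would additionally apply this batch-wise with padding and attention-mask handling, which the sketch above omits.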
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.