
Tips for training the base model from scratch on a smaller amount of data #11

Closed
VictorAtPL opened this issue Aug 4, 2022 · 9 comments


VictorAtPL commented Aug 4, 2022

Hello @gwkrsrch ,

I am very excited about this model and the e2e approach it implements.

For my master's thesis, I'd like to run an experiment to compare your method of generating synthetic documents with mine. I am only interested in evaluating the model on the Document Information Extraction downstream task with the CORD dataset and my proprietary one (let's call it PolCORD).

I'd like to train the Donut model on the (Pseudo) Text Reading Task with:
1/ naver-clova-ix/synthdog-en; synthdog-id; synthdog-pl (total 1.5M examples)
2/ my-method-en, my-method-id, my-method-pl (total 1.2M examples)

Could you give me a hand and share your experience:

  1. How can I generate/prepare a corpus for Indonesian and Polish in the same way as you prepared the ones here: https://github.com/clovaai/donut/tree/master/synthdog/resources/corpus
  2. If I am going to train the model on 1.2-1.5M examples instead of 13M, do you have any gut feeling about whether, and to what values, I should downsize the model defined here: https://huggingface.co/naver-clova-ix/donut-base/blob/main/config.json?
  3. How many examples were you able to fit onto a single A100 GPU card? I have the 40GB version and I'm going to use 16 of them.
gwkrsrch (Collaborator) commented:

Hi @VictorAtPL

For (1), as explained in Section 2.3 and Appendix A.2 (https://arxiv.org/abs/2111.15664), we sampled words and phrases from Wikipedia. The following links would be helpful to you.

To process the dump files, you may consider using WikiExtractor or other relevant tools/scripts.
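For anyone following along, here is a rough sketch (not from the authors) of flattening a WikiExtractor run into a plain-text corpus file. It assumes WikiExtractor's `--json` output layout (one JSON object per line with a "text" field); the dump name, output paths, and the one-paragraph-per-line corpus format are my assumptions, loosely modeled on the files under synthdog/resources/corpus:

```python
# First extract plain text from the dump, e.g.:
#   python -m wikiextractor.WikiExtractor plwiki-latest-pages-articles.xml.bz2 \
#       --json -o extracted_pl
import glob
import json

with open("corpus/plwiki.txt", "w", encoding="utf-8") as out:
    # WikiExtractor writes shards like extracted_pl/AA/wiki_00, one JSON article per line.
    for path in sorted(glob.glob("extracted_pl/*/wiki_*")):
        with open(path, encoding="utf-8") as f:
            for line in f:
                article = json.loads(line)
                # Keep one non-empty paragraph per corpus line.
                for paragraph in article["text"].split("\n"):
                    paragraph = paragraph.strip()
                    if paragraph:
                        out.write(paragraph + "\n")
```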

For (2) and (3), you may need to adjust the input_size of the model architecture to fit your hardware and data budget.
Appendix A.6 (https://arxiv.org/abs/2111.15664) will also be helpful to you.
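Not an official recipe, but as a concrete illustration of downsizing, something along these lines should work with the DonutConfig/DonutModel classes in this repo. All values below are illustrative guesses, not settings recommended by the authors:

```python
from donut import DonutConfig, DonutModel

# Illustrative, downsized configuration (donut-base uses input_size=[2560, 1920]
# and max_length=1536). input_size must stay compatible with the Swin window size.
config = DonutConfig(
    input_size=[1280, 960],
    align_long_axis=False,
    window_size=10,
    encoder_layer=[2, 2, 14, 2],
    decoder_layer=4,
    max_length=768,
)
model = DonutModel(config)
print(sum(p.numel() for p in model.parameters()) / 1e6, "M parameters")
```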

I hope this is useful to you :)

gwkrsrch (Collaborator) commented Aug 12, 2022

One more general tip: to train a model for a new language, you may need to change some code related to the token vocabulary/tokenizer. For example, see this block. This depends on the alphabet of the target language.
+) Or, just adding some new tokens for the target language may be enough.
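A minimal sketch of the "just add some new tokens" route, using the Hugging Face port of Donut (the attribute layout of DonutModel in this repo differs slightly); the Polish characters are only an example:

```python
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

# Example: characters reported missing from the vocabulary (Polish here; adjust
# to your target language). Tokens already in the vocab are skipped.
new_tokens = ["ą", "ć", "ę", "ł", "ń", "ó", "ś", "ź", "ż"]
num_added = processor.tokenizer.add_tokens(new_tokens)
if num_added > 0:
    # New embedding rows are randomly initialized and still need (pre)training.
    model.decoder.resize_token_embeddings(len(processor.tokenizer))
```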


Vadkoz commented Feb 15, 2023

Hello!
I faced a problem with letters that do not exist in the model's tokenizer. They are mostly Greek/Arabic/Cyrillic/math symbols etc. that sometimes appear in various wiki articles. There are also some useful letters, like the Lithuanian diacritics.
So, what do I need to do? What is the best way to handle missing letters? Should I add only the useful letters to the tokenizer, skip all the others, and accept some misrecognition of those letters? Should I add all of these letters to the tokenizer? Or should I skip all of them?

Is it a problem that the Asian mBART was not trained during the text-only phase (with the text encoder) on the letters I want to add?

If I need to add tokens, is it enough to just do something like this?
`tokenizer.add_tokens(new_tokens)`
Or do I need some additional steps, maybe training the mBART (with the text encoder) on these letters first?

VictorAtPL (Author) commented:

> I faced a problem with letters that do not exist in the model's tokenizer.

+1. If anyone knows how to train a model and tokenizer like asian-bart for other languages in a way that can easily replace the current Donut decoder, please share this knowledge with us.


Vadkoz commented Feb 15, 2023

@VictorAtPL Did you try adding tokens with `tokenizer.add_tokens(new_tokens)`? If yes, does it work properly?

VictorAtPL (Author) commented:

@Vadkoz I haven't tried it yet. I think the proper approach is to use the Wikipedia corpora of the languages you care about most and retrain the whole decoder on that corpus.

I'm not sure what kind of tokens I should add: just letters, sub-words, or the most common words? I'd rather leave it up to the tokenizer to determine how tokens should be derived from, e.g., a Polish Wikipedia corpus.


balabis commented Feb 20, 2023

> I faced a problem with letters that do not exist in the model's tokenizer. [...]

Hey @Vadkoz, have you figured it out yet? I'm also looking into how to use Donut with different languages.

VictorAtPL (Author) commented:

@balabis It looks like we need to retrain the tokenizer: https://huggingface.co/course/chapter6/2

And then train (M)BART from scratch on corpora of one or more languages using, e.g.:

  1. https://github.com/prajdabre/yanmtt - MBart - unofficial but PyTorch
  2. https://github.com/ayaka14732/bart-base-jax - Bart - unofficial, JAX
  3. fairseq library - BART Pretraining Script facebookresearch/fairseq#1899 (comment) - Bart - official by facebook
  4. How to pre-train BART model huggingface/transformers#4151 (comment) - Bart - huggingface, custom training loop
  5. https://github.com/duongna21/transformers/blob/a5914fa94fd6172f3336e4d05270b138d288e47b/examples/flax/language-modeling/README.md#bart-denoising-language-modeling - Bart - huggingface (JAX)

Some of the links are for BART, i.e., for a single language only. I guess the mBART tokenizer must then be used to prepare training and inference examples with language tokens and in the appropriate format.
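A minimal sketch of the tokenizer-retraining step from that course chapter, applied to Donut's tokenizer. The corpus path and vocab size are placeholders, and the resulting tokenizer of course no longer matches the pretrained decoder embeddings (hence the from-scratch (M)BART training above):

```python
from transformers import AutoTokenizer

def corpus_iterator(path="plwiki.txt", batch_size=1000):
    """Yield batches of lines from a plain-text corpus file."""
    batch = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            batch.append(line.strip())
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch

# Donut's tokenizer (XLM-R/sentencepiece-based) is a fast tokenizer, so it
# supports train_new_from_iterator.
old_tokenizer = AutoTokenizer.from_pretrained("naver-clova-ix/donut-base")
new_tokenizer = old_tokenizer.train_new_from_iterator(corpus_iterator(), vocab_size=32000)
new_tokenizer.save_pretrained("./donut-tokenizer-pl")
```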

Another very interesting thing is this quote:

> We made asian-bart using mBart by embedding layer pruning.

https://github.com/hyunwoongko/asian-bart

Maybe the asian-bart that is used as the decoder in Donut isn't trained from scratch, but is a fine-tuned mBART-25 or mBART-50 model with a reduced vocab size?
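For what it's worth, a rough sketch of what "embedding layer pruning" could look like on mBART-25. The keep_ids collection, the corpus path, and the whole procedure are my guesses at the technique, not the actual asian-bart recipe, and the sentencepiece tokenizer would also have to be rebuilt so that token ids match the pruned rows:

```python
import torch
from transformers import MBartForConditionalGeneration, MBartTokenizer

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")
tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-cc25")

# Collect the vocabulary ids that actually occur in the target-language corpora,
# plus all special/language tokens (the corpus path is illustrative).
keep_ids = set(tokenizer.all_special_ids)
with open("plwiki.txt", encoding="utf-8") as f:
    for line in f:
        keep_ids.update(tokenizer(line.strip())["input_ids"])
keep_ids = sorted(keep_ids)

# Slice the shared embedding matrix down to the kept rows.
old_emb = model.get_input_embeddings().weight.data          # (250027, 1024)
new_emb = torch.nn.Embedding(len(keep_ids), old_emb.shape[1])
new_emb.weight.data = old_emb[keep_ids].clone()
model.set_input_embeddings(new_emb)
model.final_logits_bias = model.final_logits_bias[:, keep_ids]
model.config.vocab_size = len(keep_ids)
model.tie_weights()  # re-tie lm_head to the pruned embedding

# NOTE: the tokenizer must be rebuilt so that its token -> id mapping matches
# the order of keep_ids; otherwise the model and tokenizer will disagree.
```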

I think the first thing I will try is yanmtt, but I am not ready to run this experiment yet.

PiotrNawrot commented:

We've released nanoT5, which reproduces T5 (a model similar to BART) pre-training in PyTorch (not Flax).

You can take a look!

Any suggestions are more than welcome.
