"Fast" version #3

Open
Lundez opened this issue Mar 23, 2021 · 7 comments

Comments

@Lundez

Lundez commented Mar 23, 2021

Hi @stefan-it, thanks for the awesome job of providing all these embeddings.

I'm wondering how you trained them and whether I could perhaps create a "fast" version of the Swedish ones myself?
I need a somewhat smaller model to speed up inference 😄

@codemaster-22

Hi @Lundez, how do we get a fast version? Did you figure this out? @stefan-it, can you please help with this as soon as possible?

@Lundez
Author

Lundez commented Jun 21, 2021

I never got a response and didn't get around to training it myself.
I think I saw the hidden size of the "fast" models mentioned somewhere, and I believe Stefan has mentioned he used wiki + opus + opensubtitles as training data.
That should get you started. If you complete a small model, please share!

(P.S. For me it was enough to quantize the model after training my NER tagger.)
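
A rough sketch of what I mean by that, assuming PyTorch's dynamic quantization applied to a trained Flair SequenceTagger (the model path and layer choices are just examples):

import torch
from flair.data import Sentence
from flair.models import SequenceTagger

# load the trained NER tagger (path is only an example)
tagger = SequenceTagger.load('resources/taggers/ner/final-model.pt')

# dynamically quantize the Linear and LSTM layers to int8 weights
quantized_tagger = torch.quantization.quantize_dynamic(
    tagger, {torch.nn.Linear, torch.nn.LSTM}, dtype=torch.qint8
)

# inference works as before, just with a smaller and usually faster model
sentence = Sentence('Jag bor i Stockholm .')
quantized_tagger.predict(sentence)
print(sentence.to_tagged_string())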

@stefan-it
Member

Hi @codemaster-22 and @Lundez,

unfortunately, I have no plans to re-train fast models.

You're right: if you want fast models, you need to train them from scratch with a decent training corpus on your own. However, if you want to try smaller models, you could use e.g. the distilled version of multilingual BERT provided by Hugging Face: https://huggingface.co/distilbert-base-multilingual-cased.
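
For example, a minimal sketch of plugging that distilled model into Flair as word embeddings (assuming a recent Flair version that provides TransformerWordEmbeddings):

from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings

# use the distilled multilingual BERT as smaller, faster word embeddings
embeddings = TransformerWordEmbeddings('distilbert-base-multilingual-cased')

# embed an example sentence
sentence = Sentence('Jag bor i Stockholm .')
embeddings.embed(sentence)

for token in sentence:
    print(token, token.embedding.shape)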

@stefan-it
Member

@Lundez if you want to train a fast version, you just need to use a smaller hidden-state size: "fast" usually refers to models with a hidden size of 1024 instead of 2048.

You can use this example for orientation:

https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_9_TRAINING_LM_EMBEDDINGS.md#training-the-language-model

@stefan-it
Member

Here are the scripts (forward and backward lm training) that I've used for training e.g. the Swedish Flair Embeddings:

Forward lm:

from flair.data import Dictionary
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

from pathlib import Path

# are you training a forward or backward LM?
is_forward_lm = True 

# load the default character dictionary
#dictionary: Dictionary = Dictionary.load('chars')
dictionary: Dictionary = Dictionary.load_from_file('dictionary.pkl')

# get your corpus, process forward and at the character level
corpus = TextCorpus(Path('./corpus'),
                    dictionary,
                    is_forward_lm,
                    character_level=True)

# instantiate your language model, set hidden size and number of layers
language_model = LanguageModel(dictionary,
                               is_forward_lm,
                               hidden_size=2048,
                               dropout=0.1,
                               nlayers=1)

# train your language model
trainer = LanguageModelTrainer(language_model, corpus)

trainer.train('resources/taggers/language_model_forward',
              sequence_length=250,
              mini_batch_size=50,
              max_epochs=1,
              checkpoint=True)

Backward lm:

from flair.data import Dictionary
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

from pathlib import Path

# are you training a forward or backward LM?
is_forward_lm = False 

# load the default character dictionary
#dictionary: Dictionary = Dictionary.load('chars')
dictionary: Dictionary = Dictionary.load_from_file('dictionary.pkl')

# get your corpus, process backward and at the character level
corpus = TextCorpus(Path('./corpus'),
                    dictionary,
                    is_forward_lm,
                    character_level=True)

# instantiate your language model, set hidden size and number of layers
language_model = LanguageModel(dictionary,
                               is_forward_lm,
                               hidden_size=2048,
                               dropout=0.1,
                               nlayers=1)

# train your language model
trainer = LanguageModelTrainer(language_model, corpus)

trainer.train('resources/taggers/language_model_backward',
              sequence_length=250,
              mini_batch_size=50,
              max_epochs=1,
              checkpoint=True)

The dictionary.pkl was created with this script:

import collections
import pickle
import sys

from flair.data import Dictionary

# build a character dictionary from a raw text corpus (argv[1])
# and pickle its mappings to an output file (argv[2])
char_dictionary: Dictionary = Dictionary()

# counter for character frequencies
counter = collections.Counter()

# number of corpus lines processed
processed = 0


file = sys.argv[1]

with open(file, 'r', encoding='utf-8') as f:
    tokens = 0
    for line in f:

        processed += 1            
        chars = list(line)
        tokens += len(chars)

        # Add chars to the dictionary
        counter.update(chars)

        # comment this line in to speed things up (if the corpus is too large)
        # if tokens > 50000000: break


# total number of characters in the corpus
total_count = 0
for letter, count in counter.most_common():
    total_count += count

print(total_count)
print(processed)

# add characters to the dictionary, most frequent first
cumulative_count = 0
idx = 0
for letter, count in counter.most_common():
    cumulative_count += count
    percentile = cumulative_count / total_count

    # comment this line in to use only the top X percentile of chars, otherwise filter later
    # if percentile < 0.00001: break

    char_dictionary.add_item(letter)
    idx += 1
    print('%d\t%s\t%7d\t%7d\t%f' % (idx, letter, count, cumulative_count, percentile))

print(char_dictionary.item2idx)

# write the dictionary mappings to the output file (argv[2])
output = sys.argv[2]

with open(output, 'wb') as f:
    mappings = {
        'idx2item': char_dictionary.idx2item,
        'item2idx': char_dictionary.item2idx
    }
    pickle.dump(mappings, f)
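
The script would be invoked along the lines of (filenames are placeholders):

python make_char_dictionary.py corpus.txt dictionary.pkl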

So if you want to train a faster model, just use hidden_size=1024 instead of hidden_size=2048 and you should be able to use the scripts above 🤗
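
Concretely, the only change to the scripts above would be the LanguageModel instantiation, e.g. (forward LM shown, same for the backward one):

from flair.data import Dictionary
from flair.models import LanguageModel

# same character dictionary as above
dictionary: Dictionary = Dictionary.load_from_file('dictionary.pkl')

# "fast" variant: identical setup, only the hidden size is halved
language_model = LanguageModel(dictionary,
                               is_forward_lm=True,
                               hidden_size=1024,  # instead of 2048
                               dropout=0.1,
                               nlayers=1)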

@codemaster-22

Thanks a lot @stefan-it, this is exactly the answer I was looking for!! Love you bruh, thanks for the amazing repo and quick responses!!

@codemaster-22

codemaster-22 commented Jun 26, 2021

I have a question: I have a small corpus of close to 1 million words in Hinglish. By Hinglish I mean Hindi written in the English (Latin) script, e.g. "Mein kya karu": it looks like English text but it reads as Hindi. Should I fine-tune an existing English Flair embedding model like news-X, or do I have to train from scratch? @stefan-it, @Lundez
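
In case fine-tuning is the way to go, this is roughly what I had in mind, based on the fine-tuning section of the tutorial linked above (the base model, corpus path, and training parameters are just placeholders):

from pathlib import Path

from flair.data import Dictionary
from flair.embeddings import FlairEmbeddings
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

# start from an existing English Flair LM instead of training from scratch
language_model = FlairEmbeddings('news-forward').lm

# reuse its direction and character dictionary
is_forward_lm = language_model.is_forward_lm
dictionary: Dictionary = language_model.dictionary

# my small Hinglish corpus (path is a placeholder)
corpus = TextCorpus(Path('./hinglish_corpus'),
                    dictionary,
                    is_forward_lm,
                    character_level=True)

# continue training (fine-tune) the language model on the new corpus
trainer = LanguageModelTrainer(language_model, corpus)
trainer.train('resources/taggers/language_model_hinglish_forward',
              sequence_length=100,
              mini_batch_size=100,
              max_epochs=10,
              checkpoint=True)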
