"Fast" version #3
Comments
Hi @Lundez, how do we get a fast version? Did you figure this out? And @stefan-it, could you please help with this as soon as possible?
I never got a response and didn't get started on training it myself. (P.S. for me it was enough to quantize the model after training my NER tagger.)
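(For reference, here is a minimal, untested sketch of the kind of post-training quantization I mean, using PyTorch's dynamic quantization on a Flair SequenceTagger; the 'ner' model name and the example sentence are just placeholders, and the actual speed/size gains will depend on your model:)

import torch
from flair.data import Sentence
from flair.models import SequenceTagger

# load a trained Flair tagger (any trained SequenceTagger works here)
tagger = SequenceTagger.load('ner')

# dynamically quantize the Linear and LSTM layers to int8 weights
quantized_tagger = torch.quantization.quantize_dynamic(
    tagger,
    {torch.nn.Linear, torch.nn.LSTM},
    dtype=torch.qint8,
)

# inference works as before, just with a smaller and usually faster model
sentence = Sentence('George Washington went to Washington.')
quantized_tagger.predict(sentence)
print(sentence.to_tagged_string())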
Hi @codemaster-22 and @Lundez, unfortunately I have no plans to re-train fast models. You're right: if you want fast models, you need to train them from scratch yourself with a decent training corpus. However, if you want to try smaller models, you could use e.g. the distilled version of multilingual BERT provided by Hugging Face: https://huggingface.co/distilbert-base-multilingual-cased.
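(As an illustration, a minimal sketch of how such a smaller transformer model could be plugged into Flair via TransformerWordEmbeddings; the Swedish example sentence is just a placeholder:)

from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings

# use the distilled multilingual BERT instead of a Flair LM trained from scratch
embeddings = TransformerWordEmbeddings('distilbert-base-multilingual-cased')

# embed an example sentence
sentence = Sentence('Det här är ett exempel.')
embeddings.embed(sentence)

for token in sentence:
    print(token.text, token.embedding.shape)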
@Lundez if you want to train a fast version, you just need to use a smaller hidden state size. "Fast" usually refers to models with a hidden size of 1024 instead of 2048. You can use the following example as orientation:
Here are the scripts (forward and backward LM training) that I've used for training e.g. the Swedish Flair Embeddings:

Forward LM:

from flair.data import Dictionary
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus
from pathlib import Path

# are you training a forward or backward LM?
is_forward_lm = True

# load the character dictionary
# dictionary: Dictionary = Dictionary.load('chars')
dictionary: Dictionary = Dictionary.load_from_file('dictionary.pkl')

# get your corpus, process forward and at the character level
corpus = TextCorpus(Path('./corpus'),
                    dictionary,
                    is_forward_lm,
                    character_level=True)

# instantiate your language model, set hidden size and number of layers
language_model = LanguageModel(dictionary,
                               is_forward_lm,
                               hidden_size=2048,
                               dropout=0.1,
                               nlayers=1)

# train your language model
trainer = LanguageModelTrainer(language_model, corpus)

trainer.train('resources/taggers/language_model_forward',
              sequence_length=250,
              mini_batch_size=50,
              max_epochs=1,
              checkpoint=True)

Backward LM:

from flair.data import Dictionary
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus
from pathlib import Path

# are you training a forward or backward LM?
is_forward_lm = False

# load the character dictionary
# dictionary: Dictionary = Dictionary.load('chars')
dictionary: Dictionary = Dictionary.load_from_file('dictionary.pkl')

# get your corpus, process forward and at the character level
corpus = TextCorpus(Path('./corpus'),
                    dictionary,
                    is_forward_lm,
                    character_level=True)

# instantiate your language model, set hidden size and number of layers
language_model = LanguageModel(dictionary,
                               is_forward_lm,
                               hidden_size=2048,
                               dropout=0.1,
                               nlayers=1)

# train your language model
trainer = LanguageModelTrainer(language_model, corpus)

trainer.train('resources/taggers/language_model_backward',
              sequence_length=250,
              mini_batch_size=50,
              max_epochs=1,
              checkpoint=True)

The character dictionary (dictionary.pkl) that both scripts load was created with the following script:

import sys
from flair.data import Dictionary

char_dictionary: Dictionary = Dictionary()

# counter object
import collections
counter = collections.Counter()

processed = 0

# first argument: path to the training corpus text file
file = sys.argv[1]

with open(file, 'r', encoding='utf-8') as f:
    tokens = 0
    for line in f:
        processed += 1
        chars = list(line)
        tokens += len(chars)

        # add chars to the dictionary
        counter.update(chars)

        # comment this line in to speed things up (if the corpus is too large)
        # if tokens > 50000000: break

total_count = 0
for letter, count in counter.most_common():
    total_count += count

print(total_count)
print(processed)

sum = 0
idx = 0
for letter, count in counter.most_common():
    sum += count
    percentile = (sum / total_count)

    # comment this line in to use only the top X percentile of chars, otherwise filter later
    # if percentile < 0.00001: break

    char_dictionary.add_item(letter)
    idx += 1
    print('%d\t%s\t%7d\t%7d\t%f' % (idx, letter, count, sum, percentile))

print(char_dictionary.item2idx)

# second argument: path of the output pickle file, e.g. dictionary.pkl
import pickle
output = sys.argv[2]
with open(output, 'wb') as f:
    mappings = {
        'idx2item': char_dictionary.idx2item,
        'item2idx': char_dictionary.item2idx
    }
    pickle.dump(mappings, f)

So if you want to train a faster model, just use hidden_size=1024 instead of 2048.
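(To make that concrete, an untested sketch of the only lines that would differ for a "fast" model, plus how the resulting checkpoint could then be used as a Flair embedding; it assumes the same dictionary and is_forward_lm variables as in the scripts above, and that the trainer writes its best checkpoint as best-lm.pt in the given directory:)

from flair.embeddings import FlairEmbeddings
from flair.models import LanguageModel

# same setup as above, but with the smaller hidden size used by "fast" models
language_model = LanguageModel(dictionary,
                               is_forward_lm,
                               hidden_size=1024,
                               dropout=0.1,
                               nlayers=1)

# after training, load the checkpoint like any other Flair embedding
fast_forward_embeddings = FlairEmbeddings('resources/taggers/language_model_forward/best-lm.pt')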
Thanks a lot @stefan-it, this is the answer I was looking for! Love you bruh for the amazing repo and the quick responses!
I have a question: I have a small corpus of close to 1 million words, and it is Hinglish. By Hinglish I mean the Hindi language written in English (Latin) script, e.g. "Mein kya karu": you can see it is written as English text, but it sounds Hindi. So should I fine-tune an existing English Flair embedding model like news-X, or do I have to train from scratch? @stefan-it @Lundez
Hi @stefan-it thanks for the awesome job of providing all these embeddings.
I'm wondering how you trained them and if I could perhaps create a "fast" version of the Swedish ones myself?
I'm in need of a slightly smaller model size to speed up inference 😄