Spanish LM #80
@iamyihwa I would also like to train a language model in the near future. Could you please give some information about the size of your training, validation and test data sets? What hardware did you use and how long did the training run? Many thanks, Stefan |
@stefan-it
So these were the values I used. I haven't tried any others, nor do I have any baseline to compare against. |
So ppl seems to mean perplexity. According to this language model, the perplexity is around 30 when the model finished. Does this mean a perplexity of 3 is an incredibly good one? Sorry, I am a beginner in this field and don't have a good sense of baselines for a language model. |
What GPU did you use? I tried training with 4 GB of GPU RAM but this was not successful. |
@stefan-it I am not sure which one was used. I used an AWS machine, but now it seems the GPU isn't working well (nvidia-smi gives nothing), and things have slowed down. |
Hello Yihwa, a perplexity of 30 is normal for a word-level language model, but our LMs are at the character level, where you normally see much lower perplexity values - on our corpora, we typically see between 2 and 3 as perplexity. However, you cannot directly compare perplexity values unless you compute them both over (a) exactly the same holdout data and (b) using exactly the same character dictionary. Nevertheless, on our datasets 3.8 would be quite high, so perhaps you can do more training until you get below 2. One big problem is probably the size of your corpus, which is really small. We always use corpora of around 1 billion words, chopped into 100 chunks for processing. Maybe you can find a much larger dataset, such as a full Spanish Wikipedia dump? I also like to use Web crawls and movie subtitles (click on 'es' to get the Spanish portion), but that depends on your application domain. Then, let it run for 1-2 weeks to get perplexity down. |
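In case it helps, here is a minimal sketch of how a large raw-text dump could be chopped into training splits plus small validation and test files before training. The file names, the train/ + valid.txt + test.txt layout, and the 100-way split are assumptions based on the description above, not a definitive recipe - check the flair language model training tutorial for the exact folder structure it expects.
import os

def write_lines(path, lines):
    with open(path, 'w', encoding='utf-8') as f:
        f.writelines(lines)

def split_corpus(big_file='es_wiki.txt', out_dir='corpus', n_splits=100):
    # for a corpus of ~1 billion words you would stream instead of reading
    # everything into memory; this is only an illustration
    os.makedirs(os.path.join(out_dir, 'train'), exist_ok=True)
    with open(big_file, encoding='utf-8') as f:
        lines = f.readlines()
    # hold out a small slice each for validation and test
    holdout = max(1, len(lines) // 200)
    write_lines(os.path.join(out_dir, 'valid.txt'), lines[:holdout])
    write_lines(os.path.join(out_dir, 'test.txt'), lines[holdout:2 * holdout])
    train_lines = lines[2 * holdout:]
    chunk = len(train_lines) // n_splits + 1
    for i in range(n_splits):
        write_lines(os.path.join(out_dir, 'train', 'train_split_{}'.format(i)),
                    train_lines[i * chunk:(i + 1) * chunk])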
@alanakbik Thanks a lot for insights and also ideas of getting more data! I will try them today! :-) |
We find 2048 to be a bit better for downstream tasks, but 1024 would train a lot faster and is almost as good. So either is good! |
thanks @alanakbik ! I will then try it with 1024! :-) |
@stefan-it I am using 12 GB of GPU RAM (p2.xlarge machine from AWS). (61 GB was the total memory) |
@alanakbik I have trained the LM for a few days now. Look at the two screen captures below, taken at different points during training: I don't see much point in training the LM further, since it seems to oscillate. Or is there any other parameter that I should have adjusted? |
This looks good. It is normal that the perplexity fluctuates from split to split, but normally there is still a very slow downward trend that continues for a long while and makes a big difference. Over the days, perplexity will slowly decrease. Best to let it run for a few days / weeks until the learning rate anneals. After one annealing step, the learning rate should change to 5.0. Otherwise everything looks good - now it just takes time to train! |
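For context, the annealing described here amounts to roughly the following schedule: start at 20.0 and multiply by a fixed factor when the validation loss stops improving. This is only a sketch of the behavior, not flair's actual implementation; the anneal factor and patience values are assumptions, chosen so that 20.0 drops to 5.0 as mentioned above.
learning_rate = 20.0
anneal_factor = 0.25      # assumed; one annealing step takes 20.0 down to 5.0
patience = 10             # assumed number of splits without improvement
best_loss = float('inf')
bad_splits = 0

def maybe_anneal(valid_loss):
    # call after each split's validation pass
    global learning_rate, best_loss, bad_splits
    if valid_loss < best_loss:
        best_loss, bad_splits = valid_loss, 0
    else:
        bad_splits += 1
        if bad_splits > patience:
            learning_rate *= anneal_factor
            bad_splits = 0
    return learning_rate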
Thanks @alanakbik for the suggestions! |
@alanakbik I have trained a language model (Forward and Backward) over the weekend.
|
@alanakbik did you have to make any language-specific adjustments? e.g. German sentences tend to be longer than English sentences, so does the sequence length parameter or the number of hidden neurons need to be increased? (Spanish sentences seem to be slightly longer than English sentences as well.) |
We basically used the same parameters across languages, but sometimes used a specific character dictionary if the language uses uncommon characters. To check if your language model is good, you can use the language model to generate some text. If the text is close to natural language, then the model is generally good. You can use this code to generate text:
import torch
from flair.models import LanguageModel
dataset = 'path/to/your/language/model/best-lm.pt'
# load your language model
state = torch.load(dataset, map_location={'cuda:0': 'cpu'})
model: LanguageModel = LanguageModel(state['dictionary'],
                                     state['is_forward_lm'],
                                     state['hidden_size'],
                                     state['nlayers'],
                                     state['embedding_size'],
                                     state['nout'],
                                     state['dropout'])
model.load_state_dict(state['state_dict'])
model.eval()
idx2item = model.dictionary.idx2item
# initial hidden state
hidden = model.init_hidden(1)
input = torch.rand(1, 1).mul(len(idx2item)).long()
# generate text character by character
characters = []
number_of_characters_to_generate = 2000
for i in range(number_of_characters_to_generate):
    prediction, rnn_output, hidden = model.forward(input, hidden)
    word_weights = prediction.squeeze().data.div(1.0).exp().cpu()
    word_idx = torch.multinomial(word_weights, 1)[0]
    input.data.fill_(word_idx)
    word = idx2item[word_idx].decode('UTF-8')
    characters.append(word)
    if i % 100 == 0:
        print('| Generated {}/{} chars'.format(i, number_of_characters_to_generate))

if 'backward' in dataset:
    characters.reverse()
# print generated text
print(''.join(characters))
Does this work? Could you paste the text generated from the model? |
Hi @alanakbik Sorry for the late reply! I was away for one week. The language model was also training for one week. I just tested the language model using your script.
|
I have replaced the line
with
and the error goes away. |
These are some example results:
|
Hello @iamyihwa this looks good! The drop may not seem like much, but it tends to make a lot of difference. A perplexity of around 2.75 is good - we get that for our language models as well. I notice the learning rate has not yet annealed - when that happens you typically get better perplexity. So you could try training even longer, but you can also probably already use the LM now. Also, the generated text looks ok (though my Spanish is a bit shaky). You could now try applying it to Spanish NER. Do you have a backward language model trained as well? You get best results with both a forward + backward language model, as well as classic word embeddings for Spanish. |
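A rough sketch of such a stack for Spanish NER, using the flair API of that era (CharLMEmbeddings was the character-LM embedding class at the time; the word embedding name and the LM paths are placeholders, so adjust them to your setup and flair version):
from flair.embeddings import WordEmbeddings, CharLMEmbeddings, StackedEmbeddings

# classic word embeddings plus the trained forward and backward character LMs
embedding_types = [
    WordEmbeddings('es'),  # placeholder: any Spanish word embeddings you have available
    CharLMEmbeddings('path/to/spanish-forward/best-lm.pt'),
    CharLMEmbeddings('path/to/spanish-backward/best-lm.pt'),
]
embeddings = StackedEmbeddings(embeddings=embedding_types)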
@alanakbik I doubted it would get better, because within one week the perplexity only dropped by 0.1 (and I thought this was too slow and it had already hit bottom). Yes, I used the FastText Spanish Embedding + Forward LM (character) + Backward LM (character). Yesterday I ran the Spanish NER model with 100 epochs. I got an F1 of 84.77 for validation and 86.85 for test. I expected the accuracy to go up, since you got around 88.x for the German model. I share the details for you.
If I look at the loss.txt file, for the last 20 epochs there is no increase in the F1 value, so I guess the NER model has gotten as good as it can.
These are the parameters I used for training:
|
@alanakbik I have three main questions:
|
Hello @iamyihwa thanks for sharing the results!
model = LanguageModel.load_language_model('your/saved/model.pt')
# make sure to use the same dictionary from the saved model
dictionary = model.dictionary
# forward must match the direction of the loaded model (True for a forward LM)
forward = True
corpus = Corpus('path/to/your/corpus', dictionary, forward, character_level=True)
# pass corpus and your pre-trained language model to the trainer
trainer = LanguageModelTrainer(model, corpus)
# train with your favorite parameters
last_learning_rate = 5
trainer.train('resources/taggers/language_model', learning_rate=last_learning_rate)
We never switched corpora, but we continued training a model on the same corpus a few times. In this case, you should pass the last learning rate (after annealing) to the trainer. This means that if the trainer annealed the learning rate to 5 at some point, you should restart the trainer with this learning rate. However, we never tried to use another corpus on a pre-trained language model. I think this would work, but in this case, I would begin with the same learning rate as usual, i.e. 20. The first few epochs will probably exhibit strange behavior until training stabilizes for the new domain. If you try this, please share your experience - we'd be happy to learn how this works! |
Hello @iamyihwa, a few more ideas to increase the F1 for Spanish:
trainer.train('resources/taggers/es-ner',
learning_rate=0.1,
mini_batch_size=32,
max_epochs=150,
patience=3,
)
There is an increasing number of works finding that GloVe is somewhat better for NER; for instance, our own experiments and the work by @Borchmann on Polish NER found nearly a full point of difference. @Borchmann trained his own GloVe embeddings on Polish CommonCrawl. You could try something similar - perhaps Spanish Wikipedia would also work well. |
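If you do train your own GloVe vectors (e.g. on a Spanish Wikipedia or CommonCrawl dump), one possible route into flair is to convert the GloVe text file into a gensim keyed-vectors file and point WordEmbeddings at that path. This is a hedged sketch: the file names are made up, and the custom-embedding loading path may differ between flair versions, so check the documentation of your version.
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
from flair.embeddings import WordEmbeddings

# convert GloVe's text format to word2vec format, then save as a gensim file
glove2word2vec('glove.es.300d.txt', 'glove.es.300d.word2vec.txt')
vectors = KeyedVectors.load_word2vec_format('glove.es.300d.word2vec.txt')
vectors.save('glove.es.300d.gensim')

# load the custom vectors as flair word embeddings (path-based loading assumed)
glove_embeddings = WordEmbeddings('glove.es.300d.gensim')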
@alanakbik Thanks a lot for suggesting different things to try out!
Yes, you are right! I didn't know that column was the learning rate! These days I have been working on some other project, so I had been training the LM with a large hidden layer (n = 2048) in the background. It indeed takes a long time to train this large network! I also like the idea of training with another input, such as newspaper text.
I didn't know that different embeddings can influence downstream tasks! Thanks for sharing the insights and for the help, @alanakbik! I will share the results with you as soon as they come out! |
@alanakbik Language Model (with number of hidden neurons = 2048, Forward and Backward, both of which reached a perplexity of about 2.50 at the end). I couldn't use the parameter set that you suggested, because with a bigger mini-batch size I ran into a memory error.
trainer.train('resources/taggers/es-ner-long-glove', learning_rate=0.1, mini_batch_size=14, max_epochs=150, patience=4)
147 (19:23:14) 0.287881 2 0.000000 DEV 1195 acc: 97.74% p: 85.83% r: 86.01% FB1: 85.92 TEST 811 acc: 98.43% p: 87.36% r: 87.81% FB1: 87.58
This one also got down to learning rate 0 towards the end.
What I haven't tested is:
Parameters used for the language model:
tail result from loss.txt of the language model:
Strangely, here although I have trained for more than 2 weeks, the learning rate is 20.00 at the end and didn't decrease (in the case of NER it decreased towards the end to 0).
(2) Training with FastText embedding
(3) Changes in parameters: when I decrease the mini-batch size, should I perhaps increase patience?
The machine I used for testing was an AWS p2.xlarge with 12 GB of GPU memory. |
Just a quick heads up: we've just pushed an update into master that includes pre-trained FastText embeddings for Spanish. You can load them with:
embeddings = WordEmbeddings('es')
For now this is only available through the master branch, but we're planning another version release (0.3.2) in a few days - then they'll also be available if you install from pip. |
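A quick usage sketch of the new embeddings (the example sentence is made up; this assumes a flair installation from master as described above):
from flair.data import Sentence
from flair.embeddings import WordEmbeddings

# load the pre-trained Spanish FastText embeddings
embeddings = WordEmbeddings('es')

# embed an example sentence and inspect the per-token vectors
sentence = Sentence('El modelo funciona bien .')
embeddings.embed(sentence)
for token in sentence:
    print(token, token.embedding.size())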
Thanks @alanakbik! |
@alanakbik Hi! I have seen lots of languages added to Flair! I would also like to make a contribution and pay back the help that I received, by adding the Spanish language model and the Spanish NER model, which seem to give good results! I guess one of the first steps is providing the numbers; for this I would like to test again with the test set and validation set, so that these numbers can also be added. (However, I only see this at the end after training, and I haven't found a way to test with the test set after the model is complete. I only saw the 'predict' function within the model file, but not an evaluate function for either the validation or test set.) |
Hi @iamyihwa - that would be great, thank you! If you can send us the models we can include them in the next release. For instance, can you put them on AWS and send us the link? Could you also let us know on which corpus you trained the LM? |
@alanakbik you're welcome! Thanks for you guys' help! Sure, I will do that! I have trained it with wiki dumps. Should I upload only the language model? Or also the NER model? |
@alanakbik The character-based Spanish LM embedding is uploaded here. I think in the end I used the one with the larger hidden size, but I am attaching both of them. language_model_es_foward, language_model_es_backward: hidden_size = 1024 |
Hi @alanakbik, I just wanted to quickly check if everything was okay with the process of getting the file and adding it to flair. Could you access the file? Or would it have been better to put it on GitHub? |
Hello @iamyihwa thanks for following up - we downloaded the models but haven't added them to Flair yet. I'll open a PR to do this! Thanks for sharing the models - are they working well for you in downstream tasks? |
Hi @alanakbik You're welcome! I was wondering if there was any problem. I have done Spanish NER with this, as well as a sentiment classification task and other text classification tasks. For Spanish NER, it could outperform the state of the art (back when I checked in 2018, it was 85.77 by Yang or 85.75 by Lample). For the sentiment classifier and other text classification tasks, it was also giving reasonably good results, but a bit below the state of the art. |
I would be thankful if you could let me know how your result shows the cross-validation result; mine is only like this, and as you see there is no split part
|
Can someone please tell me what es-X-fast is? @iamyihwa, have you done knowledge distillation, or is it trained with a smaller hidden_size? |
Hello,
I just trained a Spanish LM.
I wonder if it is a good enough one.
What are the ways you use to test whether it is a good enough LM?
For example, what do you get for loss in the English model? What does ppl stand for?
This is what I got for the very last split.
Split 10 - (08:27:57)
(08:29:14)
| split 10 / 9 | 100/ 555 batches | ms/batch 11655.58 | loss 1.37 | ppl 3.95
| split 10 / 9 | 200/ 555 batches | ms/batch 11570.46 | loss 1.36 | ppl 3.89
| split 10 / 9 | 300/ 555 batches | ms/batch 11550.08 | loss 1.35 | ppl 3.88
| split 10 / 9 | 400/ 555 batches | ms/batch 11563.46 | loss 1.35 | ppl 3.86
| split 10 / 9 | 500/ 555 batches | ms/batch 11523.42 | loss 1.35 | ppl 3.86
training done! (10:16:09)
best loss so far 1.26
| end of split 1 / 9 | epoch 0 | time: 7542.77s | valid loss 1.26 | valid ppl 3.52 | learning rate 20.00
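On the question above about what ppl stands for: it is perplexity, and for a character-level LM trained with cross-entropy loss it is simply exp(loss), which lines up with the numbers in this log. A quick check:
import math

# 'ppl' is perplexity = exp(cross-entropy loss); the small differences from the
# log come from the loss being printed with only two decimals
for loss in (1.37, 1.35, 1.26):
    print('loss {:.2f} -> ppl {:.2f}'.format(loss, math.exp(loss)))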