Spanish LM #80

Closed
iamyihwa opened this issue Aug 20, 2018 · 37 comments
Labels: language model, new language

Comments

@iamyihwa

Hello,
I just trained a Spanish LM.
I wonder if it is good enough.
How do you test whether an LM is good enough?
For example, what loss do you get for the English model? And what does ppl stand for?

This is what I got for the very last split.

Split 10 - (08:27:57)
(08:29:14)
| split 10 / 9 | 100/ 555 batches | ms/batch 11655.58 | loss 1.37 | ppl 3.95
| split 10 / 9 | 200/ 555 batches | ms/batch 11570.46 | loss 1.36 | ppl 3.89
| split 10 / 9 | 300/ 555 batches | ms/batch 11550.08 | loss 1.35 | ppl 3.88
| split 10 / 9 | 400/ 555 batches | ms/batch 11563.46 | loss 1.35 | ppl 3.86
| split 10 / 9 | 500/ 555 batches | ms/batch 11523.42 | loss 1.35 | ppl 3.86
training done! (10:16:09)
best loss so far 1.26

| end of split 1 / 9 | epoch 0 | time: 7542.77s | valid loss 1.26 | valid ppl 3.52 | learning rate 20.00

@stefan-it
Member

@iamyihwa I would also like to train a language model in the near future. Could you please give some information about the size of your training, validation and test data sets? What hardware did you use and how long did the training run?

Many thanks,

Stefan

@iamyihwa
Author

iamyihwa commented Aug 20, 2018

@stefan-it
I used the suggested values from here:

The parameters in this script are very small. We got good results with a hidden size of 1024 or 2048, a sequence length of 250, and a mini-batch size of 100. Depending on your resources, you can try training large models, but beware that you need a very powerful GPU and a lot of time to train a model (we train for > 1 week).

So the values I used were: hidden size of 1024, sequence length of 250, mini-batch size of 100.

I haven't tried any other values, and I don't have a baseline to compare against.
If anyone could provide a baseline, I would appreciate it!
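
For reference, training with these values would look roughly like the following with flair's language model trainer (a minimal sketch; the corpus path and output directory are assumptions, and class names differ slightly between flair versions):

from flair.data import Dictionary
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

# default character dictionary shipped with flair
dictionary = Dictionary.load('chars')

# forward character-level language model with the values above
language_model = LanguageModel(dictionary, is_forward_lm=True, hidden_size=1024, nlayers=1)

# the corpus folder is assumed to contain a train/ directory of splits plus valid.txt and test.txt
corpus = TextCorpus('corpus/es', dictionary, language_model.is_forward_lm, character_level=True)

trainer = LanguageModelTrainer(language_model, corpus)
trainer.train('resources/taggers/language_model_es_forward',
              sequence_length=250,
              mini_batch_size=100,
              max_epochs=2000)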

@iamyihwa
Author

iamyihwa commented Aug 20, 2018

So ppl seems to stand for perplexity.

According to this language model, the perplexity is around 30 when training finishes.
Also, Wikipedia says: "A low perplexity indicates the probability distribution is good at predicting the sample."

Does this mean a perplexity of 3 is incredibly good? Sorry, I am a beginner in this field and don't have a good sense of language model baselines.
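
For what it's worth, the ppl column appears to be just the exponential of the reported cross-entropy loss, so the two numbers carry the same information; a quick check in Python:

import math

loss = 1.35                  # character-level cross-entropy from the log above
perplexity = math.exp(loss)  # ppl = exp(loss)
print(round(perplexity, 2))  # 3.86, matching the ppl column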

@stefan-it
Member

What GPU did you use? I tried training with 4 GB of GPU RAM but this was not successful.

@iamyihwa
Author

@stefan-it I am not sure which one was used. I used an AWS machine, but now it seems the GPU isn't working properly (nvidia-smi returns nothing) and things have slowed down.
I talked with my colleagues, and one of them told me it might be due to the corpus size.
I used the files from here: spa_wikipedia_2011_30K-sentences.txt
It is probably much smaller than the 1-billion-word datasets?
Which dataset did you use for your language model?

@alanakbik
Collaborator

Hello Yihwa,

a perplexity of 30 is normal for a word-level language model, but our LMs are at the character-level, where you normally see much lower perplexity values - on our corpora, we typically see between 2 and 3 as perplexity. However, you cannot directly compare perplexity values unless you compute them both over (a) exactly the same holdout data and (b) using exactly the same character dictionary. Nevertheless, on our datasets 3.8 would be quite high, so perhaps you can do more training until you get below 2.

One big problem is probably the size of your corpus, which is really small. We always use corpora of around 1 billion words, chopped into 100 chunks for processing. Maybe you can find a much larger dataset, such as a full Spanish Wikipedia dump? I also like to use Web crawls and movie subtitles (click on 'es' to get the Spanish portion), but that depends on your application domain. Then, let it run for 1-2 weeks to get perplexity down.
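
As a side note, the corpus folder the language model trainer reads is expected to contain a train/ directory with many split files plus a valid.txt and a test.txt. A rough sketch of producing that layout from a single raw-text file (file names and the split count are assumptions; for a billion-word corpus you would stream instead of reading everything into memory):

import os

os.makedirs('corpus/es/train', exist_ok=True)

# read the raw corpus (path is an assumption)
with open('es_wiki_raw.txt', encoding='utf-8') as f:
    lines = f.readlines()

n_splits = 100
chunk = len(lines) // (n_splits + 2)  # reserve two chunks for validation and test

# write the training splits
for i in range(n_splits):
    with open(f'corpus/es/train/train_split_{i + 1}', 'w', encoding='utf-8') as out:
        out.writelines(lines[i * chunk:(i + 1) * chunk])

# remaining lines become the validation and test sets
with open('corpus/es/valid.txt', 'w', encoding='utf-8') as out:
    out.writelines(lines[n_splits * chunk:(n_splits + 1) * chunk])
with open('corpus/es/test.txt', 'w', encoding='utf-8') as out:
    out.writelines(lines[(n_splits + 1) * chunk:])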

@iamyihwa
Author

@alanakbik
Thanks for the feedback!
Would you suggest any other changes to get the ppl lower, e.g. increasing the hidden size from 1024 to 2048? The validation loss seems to be slightly lower than the training loss at the end of training.

Thanks a lot for insights and also ideas of getting more data! I will try them today! :-)

@alanakbik
Collaborator

We find 2048 to be a bit better for downstream tasks, but 1024 would train a lot faster and is almost as good. So either is good!

@iamyihwa
Author

thanks @alanakbik ! I will then try it with 1024! :-)

@iamyihwa
Author

iamyihwa commented Aug 29, 2018

@stefan-it I am using 12 GB of GPU RAM (p2.xlarge machine from AWS; 61 GB of total system memory).
Language model training works in my case. However, when I train the NER task, it fails.

@iamyihwa
Author

iamyihwa commented Aug 30, 2018

@alanakbik I have trained the LM for a few days now.
However, it seems to fluctuate instead of decreasing.

Look at the two screen captures below, taken at different points during training
(notice how little the loss and ppl change):
Splits: 33, 34
[screenshot]
Split: 43, 44
[screenshot]

I don't see much point in training the LM further, since it just seems to fluctuate.
I see that there is already a ReduceLROnPlateau scheduler that changes the learning rate
(although the lr is always displayed as 20.00, I guess it is working).

Or are there other parameters I should have adjusted?

These were the parameters used:
[screenshot of the parameters used]

@alanakbik
Collaborator

This looks good. It is normal that the perplexity fluctuates from split to split, but normally there is still a very slow downward trend that continues for a long while and makes a big difference. Over the days, perplexity will slowly decrease.

Best let it run for a few days / weeks until the learning rate anneals. After one annealing step, the learning rate should be changed to 5.0. The patience parameter controls how long it waits until it anneals. So if it is taking too long for you, maybe you could reduce the patience to 50 or 25. Normally, you see some improvements after the first annealing step.

Otherwise everything looks good, now it just takes time to train!
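
For intuition, the annealing mechanism referred to here behaves like PyTorch's ReduceLROnPlateau scheduler: after a number of evaluations without improvement (the patience), the learning rate is multiplied by a factor, e.g. 20.0 dropping to 5.0. A rough sketch of the mechanics (not flair's exact code; the factor and patience values are illustrative):

from torch import nn, optim
from torch.optim.lr_scheduler import ReduceLROnPlateau

# toy model just to show the scheduler mechanics
model = nn.Linear(8, 8)
optimizer = optim.SGD(model.parameters(), lr=20.0)

# factor 0.25 anneals 20.0 -> 5.0 once the metric stops improving for `patience` evaluations
scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.25, patience=2)

# feed it a plateauing validation loss; the lr drops once the plateau exceeds the patience
for valid_loss in [1.08, 1.07, 1.08, 1.08, 1.08]:
    scheduler.step(valid_loss)
    print(optimizer.param_groups[0]['lr'])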

@iamyihwa
Author

iamyihwa commented Aug 30, 2018

Thanks @alanakbik for the suggestions!
I will try with patience = 50
(I guess this patience value should be adjusted depending on the number of files one has? Or is patience counted in epochs in the standard sense, rather than in splits?)

@iamyihwa
Author

iamyihwa commented Sep 3, 2018

@alanakbik I have trained a language model (Forward and Backward) over the weekend.
Here are the results for the last 60 splits of the forward model.

| end of split 59 / 85 | epoch 0 | time: 1609.79s | valid loss 1.08 | valid ppl 2.96 | learning rate 20.00
| end of split 60 / 85 | epoch 0 | time: 2473.32s | valid loss 1.08 | valid ppl 2.94 | learning rate 20.00
| end of split 61 / 85 | epoch 0 | time: 2156.08s | valid loss 1.07 | valid ppl 2.92 | learning rate 20.00
| end of split 62 / 85 | epoch 0 | time: 2491.95s | valid loss 1.07 | valid ppl 2.91 | learning rate 20.00
| end of split 63 / 85 | epoch 0 | time: 1627.14s | valid loss 1.08 | valid ppl 2.93 | learning rate 20.00
| end of split 64 / 85 | epoch 0 | time: 1627.93s | valid loss 1.08 | valid ppl 2.95 | learning rate 20.00
| end of split 65 / 85 | epoch 0 | time: 2191.05s | valid loss 1.09 | valid ppl 2.97 | learning rate 20.00
| end of split 66 / 85 | epoch 0 | time: 1822.37s | valid loss 1.08 | valid ppl 2.93 | learning rate 20.00
| end of split 67 / 85 | epoch 0 | time: 2166.26s | valid loss 1.07 | valid ppl 2.91 | learning rate 20.00
| end of split 68 / 85 | epoch 0 | time: 2189.52s | valid loss 1.08 | valid ppl 2.94 | learning rate 20.00
| end of split 69 / 85 | epoch 0 | time: 1536.36s | valid loss 1.07 | valid ppl 2.90 | learning rate 20.00
| end of split 70 / 85 | epoch 0 | time: 1819.07s | valid loss 1.10 | valid ppl 3.01 | learning rate 20.00
| end of split 71 / 85 | epoch 0 | time: 2158.34s | valid loss 1.06 | valid ppl 2.90 | learning rate 20.00
| end of split 72 / 85 | epoch 0 | time: 1537.54s | valid loss 1.07 | valid ppl 2.92 | learning rate 20.00
| end of split 73 / 85 | epoch 0 | time: 1999.21s | valid loss 1.07 | valid ppl 2.91 | learning rate 20.00
| end of split 74 / 85 | epoch 0 | time: 2496.14s | valid loss 1.07 | valid ppl 2.91 | learning rate 20.00
| end of split 75 / 85 | epoch 0 | time: 2190.20s | valid loss 1.07 | valid ppl 2.92 | learning rate 20.00
| end of split 76 / 85 | epoch 0 | time: 2470.55s | valid loss 1.10 | valid ppl 2.99 | learning rate 20.00
| end of split 77 / 85 | epoch 0 | time: 1815.53s | valid loss 1.10 | valid ppl 3.01 | learning rate 20.00
| end of split 78 / 85 | epoch 0 | time: 2491.52s | valid loss 1.06 | valid ppl 2.88 | learning rate 20.00
| end of split 79 / 85 | epoch 0 | time: 2488.02s | valid loss 1.06 | valid ppl 2.89 | learning rate 20.00
| end of split 80 / 85 | epoch 0 | time: 1537.74s | valid loss 1.07 | valid ppl 2.91 | learning rate 20.00
| end of split 81 / 85 | epoch 0 | time: 1998.90s | valid loss 1.07 | valid ppl 2.91 | learning rate 20.00
| end of split 82 / 85 | epoch 0 | time: 2490.48s | valid loss 1.12 | valid ppl 3.07 | learning rate 20.00
| end of split 83 / 85 | epoch 0 | time: 1628.33s | valid loss 1.08 | valid ppl 2.94 | learning rate 20.00
| end of split 84 / 85 | epoch 0 | time: 1806.39s | valid loss 1.07 | valid ppl 2.90 | learning rate 20.00
| end of split 85 / 85 | epoch 0 | time: 2162.61s | valid loss 1.07 | valid ppl 2.90 | learning rate 20.00
| end of split 1 / 85 | epoch 0 | time: 1822.39s | valid loss 1.10 | valid ppl 2.99 | learning rate 20.00
| end of split 2 / 85 | epoch 0 | time: 2190.93s | valid loss 1.08 | valid ppl 2.94 | learning rate 20.00
| end of split 3 / 85 | epoch 0 | time: 1541.73s | valid loss 1.06 | valid ppl 2.90 | learning rate 20.00
| end of split 4 / 85 | epoch 0 | time: 2336.08s | valid loss 1.06 | valid ppl 2.88 | learning rate 20.00
| end of split 5 / 85 | epoch 0 | time: 2488.37s | valid loss 1.11 | valid ppl 3.04 | learning rate 20.00
| end of split 6 / 85 | epoch 0 | time: 1622.11s | valid loss 1.06 | valid ppl 2.89 | learning rate 20.00
| end of split 7 / 85 | epoch 0 | time: 2155.81s | valid loss 1.06 | valid ppl 2.88 | learning rate 20.00
| end of split 8 / 85 | epoch 0 | time: 2478.12s | valid loss 1.06 | valid ppl 2.87 | learning rate 20.00
| end of split 9 / 85 | epoch 0 | time: 2475.01s | valid loss 1.06 | valid ppl 2.89 | learning rate 20.00
| end of split 10 / 85 | epoch 0 | time: 1984.90s | valid loss 1.06 | valid ppl 2.89 | learning rate 20.00
| end of split 11 / 85 | epoch 0 | time: 2161.92s | valid loss 1.06 | valid ppl 2.90 | learning rate 20.00
| end of split 12 / 85 | epoch 0 | time: 1619.97s | valid loss 1.06 | valid ppl 2.88 | learning rate 20.00
| end of split 13 / 85 | epoch 0 | time: 1530.37s | valid loss 1.05 | valid ppl 2.86 | learning rate 20.00
| end of split 14 / 85 | epoch 0 | time: 2149.72s | valid loss 1.05 | valid ppl 2.85 | learning rate 20.00
| end of split 15 / 85 | epoch 0 | time: 2322.07s | valid loss 1.05 | valid ppl 2.87 | learning rate 20.00
| end of split 16 / 85 | epoch 0 | time: 2161.40s | valid loss 1.05 | valid ppl 2.85 | learning rate 20.00
| end of split 17 / 85 | epoch 0 | time: 2339.15s | valid loss 1.05 | valid ppl 2.85 | learning rate 20.00
| end of split 18 / 85 | epoch 0 | time: 1998.39s | valid loss 1.05 | valid ppl 2.86 | learning rate 20.00
| end of split 19 / 85 | epoch 0 | time: 1622.60s | valid loss 1.05 | valid ppl 2.86 | learning rate 20.00
| end of split 20 / 85 | epoch 0 | time: 1994.35s | valid loss 1.05 | valid ppl 2.87 | learning rate 20.00
| end of split 21 / 85 | epoch 0 | time: 2001.34s | valid loss 1.05 | valid ppl 2.86 | learning rate 20.00
| end of split 22 / 85 | epoch 0 | time: 2166.28s | valid loss 1.05 | valid ppl 2.85 | learning rate 20.00
| end of split 23 / 85 | epoch 0 | time: 2331.76s | valid loss 1.05 | valid ppl 2.87 | learning rate 20.00
| end of split 24 / 85 | epoch 0 | time: 2179.95s | valid loss 1.06 | valid ppl 2.88 | learning rate 20.00
| end of split 25 / 85 | epoch 0 | time: 1819.15s | valid loss 1.08 | valid ppl 2.96 | learning rate 20.00
| end of split 26 / 85 | epoch 0 | time: 2186.40s | valid loss 1.06 | valid ppl 2.88 | learning rate 20.00
| end of split 27 / 85 | epoch 0 | time: 2338.25s | valid loss 1.05 | valid ppl 2.85 | learning rate 20.00
| end of split 28 / 85 | epoch 0 | time: 1816.05s | valid loss 1.05 | valid ppl 2.86 | learning rate 20.00
| end of split 29 / 85 | epoch 0 | time: 1628.93s | valid loss 1.07 | valid ppl 2.91 | learning rate 20.00
| end of split 30 / 85 | epoch 0 | time: 2493.63s | valid loss 1.08 | valid ppl 2.93 | learning rate 20.00
| end of split 31 / 85 | epoch 0 | time: 2488.12s | valid loss 1.10 | valid ppl 3.00 | learning rate 20.00
| end of split 32 / 85 | epoch 0 | time: 2477.08s | valid loss 1.05 | valid ppl 2.84 | learning rate 20.00
| end of split 33 / 85 | epoch 0 | time: 1537.85s | valid loss 1.05 | valid ppl 2.85 | learning rate 20.00


I set the patience to 50.

I wonder if it is going in the right direction.
My impression is that it is fluctuating a lot.

Depending on the split, the validation perplexity is sometimes lower than the training perplexity.

I also wonder why the perplexity tends to be consistently lower or higher for particular splits
(see here: for split 119, the loss and perplexity tend to be lower than for split 118).

[screenshot]

@iamyihwa
Author

iamyihwa commented Sep 3, 2018

@alanakbik did you have to make any language-specific adjustments? E.g. German sentences tend to be longer than English sentences, so did your sequence length parameter or number of hidden neurons need to be increased? (Spanish sentences also seem to be slightly longer than English ones.)

@alanakbik
Collaborator

We basically used the same parameters across languages, but sometimes used a specific character dictionary if the language uses uncommon characters.
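
For a language with uncommon characters, a custom character dictionary can be built from the corpus itself, roughly like this (a sketch; the corpus path and the cutoff of 300 characters are assumptions):

from collections import Counter
from flair.data import Dictionary

# count every character occurring in the raw corpus
char_counts = Counter()
with open('corpus/es/train/train_split_1', encoding='utf-8') as f:
    for line in f:
        char_counts.update(line)

# keep the most frequent characters in a flair Dictionary
char_dictionary = Dictionary()
for character, count in char_counts.most_common(300):
    char_dictionary.add_item(character)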

To check if your language model is good, you can use the language model to generate some text. If the text is close to natural language, then the model is generally good. You can use this code to generate text:

import torch
from flair.models import LanguageModel

dataset = 'path/to/your/language/model/best-lm.pt'

# load your language model
state = torch.load(dataset, map_location={'cuda:0': 'cpu'})
model: LanguageModel = LanguageModel(
    state['dictionary'],
    state['is_forward_lm'],
    state['hidden_size'],
    state['nlayers'],
    state['embedding_size'],
    state['nout'],
    state['dropout'],
)
model.load_state_dict(state['state_dict'])
model.eval()

idx2item = model.dictionary.idx2item


# initial hidden state
hidden = model.init_hidden(1)
input = torch.rand(1, 1).mul(len(idx2item)).long()

# generate text character by character
characters = []
number_of_characters_to_generate = 2000
for i in range(number_of_characters_to_generate):
    prediction, rnn_output, hidden = model.forward(input, hidden)
    word_weights = prediction.squeeze().data.div(1.0).exp().cpu()
    word_idx = torch.multinomial(word_weights, 1)[0]
    input.data.fill_(word_idx)
    word = idx2item[word_idx].decode('UTF-8')
    characters.append(word)

    if i % 100 == 0:
        print('| Generated {}/{} chars'.format(i, number_of_characters_to_generate))

if 'backward' in dataset:
    characters.reverse()

# print generated text
print(''.join(characters))

Does this work? Could you paste the text generated from the model?

@iamyihwa
Author

iamyihwa commented Sep 10, 2018

Hi @alanakbik, sorry for the late reply! I was away for one week, and the language model was also training for one week.
It got slightly better over the week, but not impressively so.
Perplexity dropped from about 2.86 to 2.78 (the top 10 lines are from 1 week ago, the bottom 10 lines are from now).
[screenshot]

I just tested the language model using your script.
However, I get a PyTorch error; any ideas?

[ec2-user@ip-172-31-28-32 flair]$ python language_generation.py
Traceback (most recent call last):
  File "language_generation.py", line 29, in <module>
    prediction, rnn_output, hidden = model.forward(input, hidden)
  File "/home/ec2-user/ner-flair/flair/flair/models/language_model.py", line 66, in forward
    encoded = self.encoder(input)
  File "/home/ec2-user/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ec2-user/anaconda3/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 108, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/home/ec2-user/anaconda3/lib/python3.6/site-packages/torch/nn/functional.py", line 1076, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected object of type torch.cuda.LongTensor but found type torch.LongTensor for argument #3 'index'

@iamyihwa
Author

I replaced the line

input = torch.rand(1, 1).mul(len(idx2item)).long()

with

input = torch.rand(1, 1).mul(len(idx2item)).long().cuda()

and the error goes away.
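
A slightly more general fix is to put the input tensor on whatever device the loaded model lives on, so the same script works with or without a GPU (a small sketch):

import torch

# match the device of the loaded model's parameters
device = next(model.parameters()).device
input = torch.rand(1, 1).mul(len(idx2item)).long().to(device)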

@iamyihwa
Author

These are some example results:

ror, muchas veces repermando a observar su unidad y se consuma y trataba de más difícil ver 4,900 revistas basadas en elementos visuales.
Gata descubrió su evagración pues nuento activo canónico, teinocuarte, los isaías y el apestino y posibles comandos fotografías de justicia literaria.
Para la construcción de las playas naturales, en 2009 tiene clase de politeas nivel siinúmenos producidos para modernizar la construcción de estando transmitidos tras la tecnología eléctrica.
Sin embargo, esto fue creentemente agresivo, ya que en su estatuarizastria, Custodio ejerció caso familiar a escritores españoles como Valentín García Obelino (Córdoba), Gregorio Ferentito y la Victyrina.
El 27 de enero de 1991 fue calificado por esto contribuían a crear las dos primeras producciones de corinte, o como encima de las Teorías, transformado en el único invento de carros.
Ante un trastorno de la tendencia puramente de planta real, la aladoso ha alcanzado en toda ese sistema y a ser desulada la ferrería.
La radioelíptica es el desaderado productor del endopartido situado en Áceros.
Son tres o. en una asamblea con una ralpa de un elemento lerta de coral para umbelas.
Es un dioses, que contiene una área de gran conformación a 40 cm de longitud total.
Es una auvalecida de origen urbano aunque el último árbol va por delante del río Duero, durante más de 100 años, es decir, representa algunos de los dan aparentemente la escuela de mesa de vibori baja (lista) o barrio de "Nebreta", en el que se refugia la San José de La Dihaimub con forma de metal.
Es uno de los principales agronomistas que se aceptan con la celebración del espacio literario más exclusivo de la península, que ninguna familia es hecha como un lugar de duelo.
En general, el río queda bajo los barcos exteriores este complejo tolerante a cumbre a su vez de reposo en la intensidad del templo con los valles de Plínico.
3 meses, registrando una presión determinada; la consumos es proporcional en su provincia internacional.
En e

@alanakbik
Collaborator

Hello @iamyihwa this looks good! The drop may not seem like much, but it tends to make a lot of difference. A perplexity of around 2.75 is good; we get that for our language models as well. I notice the learning rate has not yet annealed - when that happens you typically get better perplexity. So you could try training even longer, but you can probably already use the LM now.

Also, the generated text looks ok (though my Spanish is a bit shaky).

You could now try applying it to Spanish NER.

Do you have a backward language model trained as well? You get best results with both a forward + backward language model, as well as classic word embeddings for Spanish.

@iamyihwa
Author

iamyihwa commented Sep 11, 2018

@alanakbik I was doubting it would get better, because within one week the perplexity only dropped by 0.1 (and I thought this was too slow and had already bottomed out).
Is there a way to continue training the LM from the point where I left off?

Yes I used the FastText Spanish Embedding + Forward LM (character) + Backward LM (character)

I ran yesterday spanish NER model with 100 epochs. I got F1 for validation 84.77 and for test 86.85.
As far as I know, this beats the state-of-the-art result. (The highest test score I had seen before was 85.77, by (Yang et al. 2016) Multi-Task Cross-Lingual Sequence Tagging from Scratch Link, but nothing above 86. I guess one mostly looks at the test F1 alone, right?)
Congratulations to you!!

I expected the accuracy to go up further, since you got around 88.x for the German model.
Do you think there is something else I could try to increase the F1 further? I wonder if it would make a real difference in practical use, but I would like your opinion.

I share the details below.

........................ evaluating... dev...
processed 52923 tokens with 4352 phrases; found: 4319 phrases; correct: 3675.
accuracy: 97.61%; precision: 85.09%; recall: 84.44%; FB1: 84.77
LOC: precision: 80.55%; recall: 86.19%; FB1: 83.28 1054
MISC: precision: 71.92%; recall: 56.40%; FB1: 63.22 349
ORG: precision: 84.06%; recall: 84.06%; FB1: 84.06 1700
PER: precision: 94.24%; recall: 93.78%; FB1: 94.01 1216

test...
processed 51533 tokens with 3559 phrases; found: 3575 phrases; correct: 3098.
accuracy: 98.35%; precision: 86.66%; recall: 87.05%; FB1: 86.85
LOC: precision: 88.42%; recall: 84.50%; FB1: 86.42 1036
MISC: precision: 74.63%; recall: 58.82%; FB1: 65.79 268
ORG: precision: 83.70%; recall: 90.57%; FB1: 87.00 1515
PER: precision: 94.44%; recall: 97.14%; FB1: 95.77 756

If I look at the loss.txt file, there is no increase in the F1 value over the last 20 epochs, so I guess the NER model has gotten as good as it can.

[ec2-user@ip-172-31-28-32 es-ner]$ tail -20 loss.txt
80 (19:54:01) 0.427294 2 0.000000 DEV 1263 acc: 97.61% p: 85.09% r: 84.44% FB1: 84.77 TEST 848 acc: 98.35% p: 86.66% r: 87.05% FB1: 86.85
81 (19:58:51) 0.447506 0 0.000000 DEV 1263 acc: 97.61% p: 85.09% r: 84.44% FB1: 84.77 TEST 848 acc: 98.35% p: 86.66% r: 87.05% FB1: 86.85
82 (20:03:50) 0.448811 1 0.000000 DEV 1263 acc: 97.61% p: 85.09% r: 84.44% FB1: 84.77 TEST 848 acc: 98.35% p: 86.66% r: 87.05% FB1: 86.85
83 (20:08:52) 0.460885 2 0.000000 DEV 1263 acc: 97.61% p: 85.09% r: 84.44% FB1: 84.77 TEST 848 acc: 98.35% p: 86.66% r: 87.05% FB1: 86.85
84 (20:13:52) 0.460377 0 0.000000 DEV 1263 acc: 97.61% p: 85.09% r: 84.44% FB1: 84.77 TEST 848 acc: 98.35% p: 86.66% r: 87.05% FB1: 86.85
85 (20:18:52) 0.461925 1 0.000000 DEV 1263 acc: 97.61% p: 85.09% r: 84.44% FB1: 84.77 TEST 848 acc: 98.35% p: 86.66% r: 87.05% FB1: 86.85
86 (20:23:51) 0.463220 2 0.000000 DEV 1263 acc: 97.61% p: 85.09% r: 84.44% FB1: 84.77 TEST 848 acc: 98.35% p: 86.66% r: 87.05% FB1: 86.85
87 (20:28:55) 0.443868 0 0.000000 DEV 1263 acc: 97.61% p: 85.09% r: 84.44% FB1: 84.77 TEST 848 acc: 98.35% p: 86.66% r: 87.05% FB1: 86.85
88 (20:33:55) 0.442564 1 0.000000 DEV 1263 acc: 97.61% p: 85.09% r: 84.44% FB1: 84.77 TEST 848 acc: 98.35% p: 86.66% r: 87.05% FB1: 86.85
89 (20:38:53) 0.455681 2 0.000000 DEV 1263 acc: 97.61% p: 85.09% r: 84.44% FB1: 84.77 TEST 848 acc: 98.35% p: 86.66% r: 87.05% FB1: 86.85
90 (20:43:57) 0.454987 0 0.000000 DEV 1263 acc: 97.61% p: 85.09% r: 84.44% FB1: 84.77 TEST 848 acc: 98.35% p: 86.66% r: 87.05% FB1: 86.85
91 (20:48:50) 0.455849 1 0.000000 DEV 1263 acc: 97.61% p: 85.09% r: 84.44% FB1: 84.77 TEST 848 acc: 98.35% p: 86.66% r: 87.05% FB1: 86.85
92 (20:53:48) 0.448138 2 0.000000 DEV 1263 acc: 97.61% p: 85.09% r: 84.44% FB1: 84.77 TEST 848 acc: 98.35% p: 86.66% r: 87.05% FB1: 86.85
93 (20:58:53) 0.439628 0 0.000000 DEV 1263 acc: 97.61% p: 85.09% r: 84.44% FB1: 84.77 TEST 848 acc: 98.35% p: 86.66% r: 87.05% FB1: 86.85
94 (21:03:54) 0.463588 1 0.000000 DEV 1263 acc: 97.61% p: 85.09% r: 84.44% FB1: 84.77 TEST 848 acc: 98.35% p: 86.66% r: 87.05% FB1: 86.85
95 (21:08:58) 0.456307 2 0.000000 DEV 1263 acc: 97.61% p: 85.09% r: 84.44% FB1: 84.77 TEST 848 acc: 98.35% p: 86.66% r: 87.05% FB1: 86.85
96 (21:13:56) 0.466916 0 0.000000 DEV 1263 acc: 97.61% p: 85.09% r: 84.44% FB1: 84.77 TEST 848 acc: 98.35% p: 86.66% r: 87.05% FB1: 86.85
97 (21:18:59) 0.461347 1 0.000000 DEV 1263 acc: 97.61% p: 85.09% r: 84.44% FB1: 84.77 TEST 848 acc: 98.35% p: 86.66% r: 87.05% FB1: 86.85
98 (21:24:02) 0.473770 2 0.000000 DEV 1263 acc: 97.61% p: 85.09% r: 84.44% FB1: 84.77 TEST 848 acc: 98.35% p: 86.66% r: 87.05% FB1: 86.85
99 (21:29:05) 0.447555 0 0.000000 DEV 1263 acc: 97.61% p: 85.09% r: 84.44% FB1: 84.77 TEST 848 acc: 98.35% p: 86.66% r: 87.05% FB1: 86.85

These are the parameters I used for training:

tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type,
                                        use_crf=True)


from flair.trainers.sequence_tagger_trainer import SequenceTaggerTrainer

trainer: SequenceTaggerTrainer = SequenceTaggerTrainer(tagger, corpus, test_mode=False)

trainer.train('resources/taggers/es-ner', learning_rate=0.1, mini_batch_size=16, max_epochs=100)
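
For context, the embeddings object passed to the tagger above was presumably a stack of the FastText word embeddings plus the forward and backward character LMs, along these lines (a sketch; the file paths are assumptions, and the class name for character-LM embeddings differs between flair versions - older releases call it CharLMEmbeddings, later ones FlairEmbeddings):

from flair.embeddings import WordEmbeddings, CharLMEmbeddings, StackedEmbeddings

embeddings = StackedEmbeddings([
    WordEmbeddings('path/to/spanish-fasttext.gensim'),                            # classic word embeddings
    CharLMEmbeddings('resources/taggers/language_model_es_forward/best-lm.pt'),   # forward character LM
    CharLMEmbeddings('resources/taggers/language_model_es_backward/best-lm.pt'),  # backward character LM
])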

@iamyihwa
Author

@alanakbik
I have trained an LM with a larger number of hidden neurons (n = 2048), and I could reduce the perplexity below 2.60 (before, with n = 1024, I got around 2.78).
I saw that you might have used newspaper data (news_forward, news_backward) for your language model, and you previously suggested training with different sources of data (e.g. movie scripts).

I have three main questions:

  1. Do you see different performance in the trained model depending on the dataset used to build the language model?
    I can imagine movie-script data might perform better on conversation-like data, since it is more conversational than Wikipedia data.
    I have only used Wikipedia data here, but I wonder whether newspaper data would give better NER performance.
    In that case, should I use it on top of the Wikipedia data, since the Wikipedia data is very large?
    Or use only newspaper data?

  2. What does it mean to have a perplexity of 2.8 vs. 2.6? Does it make a big difference in the performance of models built on top of the language model?

  3. Is there a way to continue training on top of an existing language model or existing models?
    In that case, I guess one could train on top of an already-trained model, e.g. continue training a language model trained on Wikipedia with newspaper data.

@alanakbik
Collaborator

Hello @iamyihwa thanks for sharing the results!

  1. Yes, the dataset used to train the LM influences the performance on the downstream task. On CoNLL-03, for instance, there is about a 0.5-1 F1 point difference depending on whether we use the 'news' LM for English or the 'mixed' LM. Generally, the more in-domain the data is, the better. I think Wikipedia is probably already very close to newspaper data, but if you train an LM only on Spanish news, it should get even better.

  2. We found it really hard to correlate the perplexity value with downstream task performance. Generally, larger LMs with lower perplexity are better, but we never know by how much until we run the experiment.

  3. Yes, it is possible to continue training an existing language model. You can do this by loading your saved language model and passing this model to the language model trainer, e.g.:

model = LanguageModel.load_language_model('your/saved/model.pt')
# make sure to use the same dictionary from the saved model
dictionary = model.dictionary

# the corpus direction (forward/backward) must match the saved model
corpus = Corpus('path/to/your/corpus', dictionary, model.is_forward_lm, character_level=True)

# pass corpus and your pre-trained language model to the trainer
trainer = LanguageModelTrainer(model, corpus)

# train with your favorite parameters; resume with the last (annealed) learning rate
last_learning_rate = 5
trainer.train('resources/taggers/language_model', learning_rate=last_learning_rate)

We never switched corpora, but we continued training a model on the same corpus a few times. In this case, you should pass the last learning rate (after annealing) to the trainer. This means that if the trainer annealed the learning rate to 5 at some point, you should restart the trainer with this learning rate.

However, we never tried to use another corpus on a pre-trained language model. I think this would work, but in this case I would begin with the same learning rate as usual, i.e. 20. The first few epochs will probably exhibit strange behavior until training stabilizes for the new domain.

If you try this, please share your experience - we'd be happy to learn how this works!

@alanakbik
Collaborator

Hello @iamyihwa,

a few more ideas to increase the F1 for Spanish:

  • You could try a higher patience value, such as 3 or 4 for the trainer. It could be that the learning rate anneals too fast since in your log output it is always 0 towards the end. You could also increase the mini_batch_size to 32. You can increase patience like this:
trainer.train('resources/taggers/es-ner',
              learning_rate=0.1,
              mini_batch_size=32,
              max_epochs=150,
              patience=3)
  • Another idea to try to increase F1 score is to use GloVe embeddings instead of FastText embeddings.

An increasing number of works have found that GloVe is somehow better for NER; for instance, our own experiments and the work by @Borchmann on Polish NER found nearly a full point of difference. @Borchmann trained his own GloVe embeddings on Polish CommonCrawl. You could try something similar - perhaps Spanish Wikipedia would also work well.
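
If you go the custom-GloVe route, flair's WordEmbeddings can load custom vectors from a gensim file passed by path, so GloVe's text format needs a conversion step first, roughly like this (file names are assumptions):

from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
from flair.embeddings import WordEmbeddings

# convert the GloVe text file to word2vec format, then save in gensim's native format
glove2word2vec('spanish_glove_vectors.txt', 'spanish_glove_word2vec.txt')
vectors = KeyedVectors.load_word2vec_format('spanish_glove_word2vec.txt')
vectors.save('spanish_glove.gensim')

# pass the path of the saved gensim file to flair
glove_embedding = WordEmbeddings('spanish_glove.gensim')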

@iamyihwa
Author

@alanakbik Thanks a lot for suggesting different things to try out!

You could try a higher patience value, such as 3 or 4 for the trainer. It could be that the learning rate anneals too fast since in your log output it is always 0 towards the end. You could also increase the mini_batch_size to 32. You can increase patience like this:

Yes, you are right! I didn't know that column was the learning rate!
So although it kept training, since the learning rate was 0, effectively nothing was happening. Important to know! :-) Thank you!
The default patience value seems to be 2; yes, I will increase it to 3 or 4 as you suggested.
(I saw that for the language model you used 50. Is that because training a language model is a much bigger task, and you want to make sure it explores more before reducing the learning rate?)

These days I have been working on another project, so I have been training the LM with a large hidden layer (n = 2048) in the background. It indeed takes a long time to train such a large network!
I saw that the perplexity went down to about 2.51, but then it went up again.
I guess best-lm.pt saves the model with the lowest loss (perplexity) so far. Does that mean that once it reached about 2.51, I can just use that? Or should I wait until the values in loss.txt are more stable?
I guess if the loss.txt values stay low for some time, the model is in a flatter region of parameter space rather than a narrow valley.

As for the idea of training with another source such as newspaper text: since training a large language model takes a very long time, I don't know whether I should pursue it.

Another idea to try to increase F1 score is to use GloVe embeddings instead of FastText embeddings.

I didn't know that different embeddings could influence the downstream tasks! Thanks for sharing the insight!
I just found a GloVe embedding for Spanish. I will try it out as soon as the language model gets a bit more stable!

Thanks @alanakbik for the help! I will share with you as soon as results come out!

@iamyihwa
Author

iamyihwa commented Oct 2, 2018

@alanakbik
So the most recent result I have for Spanish NER:
F1 Validation: 85.92, Test: 87.58 (with the CoNLL-2002 dataset)
Definitely better!! Thanks @alanakbik for the suggestions and help!

Language model (number of hidden neurons = 2048; both the forward and backward models reached a perplexity of about 2.50 at the end)
using GloVe embeddings

I couldn't use the parameter set you suggested because with a bigger mini-batch size I ran into a memory error.
So these are the parameters I used:
hidden_size=128

trainer.train('resources/taggers/es-ner-long-glove', learning_rate=0.1, mini_batch_size=14, max_epochs=150, patience = 4 )

147 (19:23:14) 0.287881 2 0.000000 DEV 1195 acc: 97.74% p: 85.83% r: 86.01% FB1: 85.92 TEST 811 acc: 98.43% p: 87.36% r: 87.81% FB1: 87.58
148 (19:25:37) 0.273636 3 0.000000 DEV 1195 acc: 97.74% p: 85.83% r: 86.01% FB1: 85.92 TEST 811 acc: 98.43% p: 87.36% r: 87.81% FB1: 87.58
149 (19:27:56) 0.290003 4 0.000000 DEV 1195 acc: 97.74% p: 85.83% r: 86.01% FB1: 85.92 TEST 811 acc: 98.43% p: 87.36% r: 87.81% FB1: 87.58

This run also ended up with learning rate 0 towards the end.

What I haven't tested yet:
(1) Continuing to train the language model with newspaper text.

Parameters used for the language model:

trainer.train('resources/taggers/language_model_es_forward',
              sequence_length=250,
              mini_batch_size=100,
              max_epochs=2000,
              patience=50)

tail of loss.txt from the language model:
[ec2-user@ip-172-31-28-32 language_model_es_forward_long]$ tail -20 loss.txt
| end of split 44 / 85 | epoch 0 | time: 7387.21s | valid loss 0.91 | valid ppl 2.49 | learning rate 20.00
| end of split 45 / 85 | epoch 0 | time: 8475.62s | valid loss 0.94 | valid ppl 2.57 | learning rate 20.00
| end of split 46 / 85 | epoch 0 | time: 7506.35s | valid loss 0.92 | valid ppl 2.51 | learning rate 20.00
| end of split 47 / 85 | epoch 0 | time: 8469.27s | valid loss 0.92 | valid ppl 2.51 | learning rate 20.00
| end of split 48 / 85 | epoch 0 | time: 7399.37s | valid loss 0.92 | valid ppl 2.50 | learning rate 20.00
| end of split 49 / 85 | epoch 0 | time: 8483.34s | valid loss 0.92 | valid ppl 2.50 | learning rate 20.00
| end of split 50 / 85 | epoch 0 | time: 5588.10s | valid loss 0.92 | valid ppl 2.52 | learning rate 20.00
| end of split 51 / 85 | epoch 0 | time: 7475.75s | valid loss 0.92 | valid ppl 2.51 | learning rate 20.00
| end of split 52 / 85 | epoch 0 | time: 5580.07s | valid loss 0.92 | valid ppl 2.51 | learning rate 20.00
| end of split 53 / 85 | epoch 0 | time: 8493.31s | valid loss 0.92 | valid ppl 2.50 | learning rate 20.00
| end of split 54 / 85 | epoch 0 | time: 6853.97s | valid loss 0.95 | valid ppl 2.60 | learning rate 20.00
| end of split 55 / 85 | epoch 0 | time: 7415.46s | valid loss 0.91 | valid ppl 2.50 | learning rate 20.00
| end of split 56 / 85 | epoch 0 | time: 7424.63s | valid loss 0.91 | valid ppl 2.49 | learning rate 20.00
| end of split 57 / 85 | epoch 0 | time: 6244.67s | valid loss 0.92 | valid ppl 2.51 | learning rate 20.00
| end of split 58 / 85 | epoch 0 | time: 8478.29s | valid loss 0.93 | valid ppl 2.54 | learning rate 20.00
| end of split 59 / 85 | epoch 0 | time: 6256.29s | valid loss 0.92 | valid ppl 2.50 | learning rate 20.00
| end of split 60 / 85 | epoch 0 | time: 8477.44s | valid loss 0.92 | valid ppl 2.50 | learning rate 20.00
| end of split 61 / 85 | epoch 0 | time: 6886.76s | valid loss 0.92 | valid ppl 2.51 | learning rate 20.00
| end of split 62 / 85 | epoch 0 | time: 6859.34s | valid loss 0.92 | valid ppl 2.50 | learning rate 20.00
| end of split 63 / 85 | epoch 0 | time: 5319.80s | valid loss 0.91 | valid ppl 2.48 | learning rate 20.00

Strangely, although I have trained for more than 2 weeks here, the learning rate is still 20.00 at the end and hasn't decreased (in the NER case it decreased to 0 towards the end).
Which learning rate should I set if I continue training this language model in this case? 20?

(2) Training with the FastText embedding:
Since this doesn't require many changes or much time, it has been running since this morning.

(3) Changes in parameters:
I run into a memory error when I use the parameters you suggested above:

trainer.train('resources/taggers/es-ner',
              learning_rate=0.1,
              mini_batch_size=32,
              max_epochs=150,
              patience=3)

When I decrease the mini-batch size, should I perhaps increase the patience?

The machine I used for testing was an AWS p2.xlarge with 12 GB of GPU memory.
Did you use much more memory? Or did I miss something, given that I got a memory error when running with mini_batch_size = 32? (I could not go above 16; it had to be slightly below that.)

@tabergma added the 'language model' label on Oct 4, 2018
@alanakbik
Collaborator

Just a quick heads up: we've just pushed an update into master that includes pre-trained FastText embeddings for Spanish. You can load them with:

embeddings = WordEmbeddings('es')

For now they're only available through the master branch, but we're planning another release (0.3.2) in a few days - then they'll also be available if you install from pip.

@iamyihwa
Author

Thanks @alanakbik!
Great! Thank you for letting me know!
I will use it!
Thanks a lot for all the effort and support, guys!

@tabergma added the 'new language' label on Dec 12, 2018
@iamyihwa
Author

iamyihwa commented Dec 13, 2018

@alanakbik Hi! I have seen lots of languages added to Flair! I would also like to make a contribution and pay back the help I received by adding the Spanish language model and the Spanish NER, which seem to give good results!
What are the steps necessary to add those?

I guess one of the first steps is providing the numbers. For this I would like to evaluate again on the test and validation sets, so that these numbers can also be added. (However, I only see these at the end of training, and I haven't found a way to evaluate on the test set after the model is complete; I only saw the 'predict' function in the model file, but no evaluate function for the validation / test sets.)
Also, to which branch should I add them?

@alanakbik
Collaborator

Hi @iamyihwa - that would be great, thank you!

If you can send us the models we can include them in the next release. For instance, can you put them on AWS and send us the link?

Could you also let us know on which corpus you trained the LM?

@iamyihwa
Author

@alanakbik you're welcome! Thanks to you guys for the help!

Sure ! I will do!

I trained it on Wikipedia dumps.
I used the following link, which doesn't seem to work now (it seems there is now an updated version).
Then I used this tool to extract it.

Should I upload only the language model, or also the NER model?

@iamyihwa
Author

@alanakbik The character-based Spanish LM embeddings are uploaded here.
Note that there are two different versions of the forward and backward models, because I changed the number of hidden neurons.

I think in the end I used the one with the larger hidden size, but I am attaching both.

language_model_es_foward, language_model_es_backward : hidden_size = 1024
language_model_es_foward_long, language_model_es_backward_long: hidden_size=2048

@iamyihwa
Author

Hi @alanakbik, I just wanted to quickly check whether everything was okay with the process of getting the files and adding them to Flair. Could you access the files? Or would it have been better to put them on GitHub?

@alanakbik
Collaborator

Hello @iamyihwa thanks for following up - we downloaded the models but haven't added them to Flair yet. I'll open a PR to do this!

Thanks for sharing the models - are they working well for you in downstream tasks?

@iamyihwa
Author

iamyihwa commented Jan 16, 2019

Hi @alanakbik, you're welcome! I was wondering if there was any problem.
Please let me know if I can help in anyway.

I have done Spanish NER with this, as well as a sentiment classification task and other text classification tasks.

For Spanish NER, it outperformed the state of the art (back when I checked in 2018, that was 85.77 by Yang or 85.75 by Lample).
F1 87.58 (test) 85.92 (validation)

For sentiment classification and other text classification tasks, it also gave reasonably good results, but a bit below the state of the art.

@myeghaneh

myeghaneh commented Apr 14, 2021

I would be thankful if you could let me know how your results show the cross-validation (split) results; mine only look like this, and as you can see there is no split part:


...

2021-03-15 07:22:19,632 epoch 64 - iter 5/50 - loss 3.98211269 - samples/sec: 2.91 - lr: 0.000195
2021-03-15 07:23:09,832 epoch 64 - iter 10/50 - loss 3.57232342 - samples/sec: 3.19 - lr: 0.000195
2021-03-15 07:24:00,433 epoch 64 - iter 15/50 - loss 3.58684916 - samples/sec: 3.16 - lr: 0.000195
2021-03-15 07:24:53,792 epoch 64 - iter 20/50 - loss 3.54252130 - samples/sec: 3.00 - lr: 0.000195
2021-03-15 07:25:44,368 epoch 64 - iter 25/50 - loss 3.48069695 - samples/sec: 3.16 - lr: 0.000195
2021-03-15 07:26:36,043 epoch 64 - iter 30/50 - loss 3.39033549 - samples/sec: 3.10 - lr: 0.000195
2021-03-15 07:27:31,455 epoch 64 - iter 35/50 - loss 3.38885815 - samples/sec: 2.89 - lr: 0.000195
2021-03-15 07:28:19,718 epoch 64 - iter 40/50 - loss 3.34705150 - samples/sec: 3.32 - lr: 0.000195
2021-03-15 07:29:10,731 epoch 64 - iter 45/50 - loss 3.36659287 - samples/sec: 3.14 - lr: 0.000195
2021-03-15 07:29:58,395 epoch 64 - iter 50/50 - loss 3.43718199 - samples/sec: 3.36 - lr: 0.000195
2021-03-15 07:29:58,396 ----------------------------------------------------------------------------------------------------
2021-03-15 07:29:58,397 EPOCH 64 done: loss 3.4372 - lr 0.0001953
2021-03-15 07:30:36,466 DEV : loss 3.8003571033477783 - score 0.6686
2021-03-15 07:30:36,482 BAD EPOCHS (no improvement): 4

2021-03-15 07:29:58,397 EPOCH 64 done: loss 3.4372 - lr 0.0001953
2021-03-15 07:30:36,466 DEV : loss 3.8003571033477783 - score 0.6686
2021-03-15 07:30:36,482 BAD EPOCHS (no improvement): 4
2021-03-15 07:30:36,485 ----------------------------------------------------------------------------------------------------
2021-03-15 07:30:36,487 ----------------------------------------------------------------------------------------------------
2021-03-15 07:30:36,488 learning rate too small - quitting training!
2021-03-15 07:30:36,490 ----------------------------------------------------------------------------------------------------
2021-03-15 07:30:40,566 ----------------------------------------------------------------------------------------------------
2021-03-15 07:30:40,567 Testing using best model ...
2021-03-15 07:30:40,569 loading file resources\taggers\example-BIOBert\best-model.pt
2021-03-15 07:32:12,096 0.7161	0.7291	0.7225
2021-03-15 07:32:12,097 
Results:
- F1-score (micro) 0.7225
- F1-score (macro) 0.7141

By class:
C          tp: 314 - fp: 147 - fn: 143 - precision: 0.6811 - recall: 0.6871 - f1-score: 0.6841
P          tp: 609 - fp: 219 - fn: 200 - precision: 0.7355 - recall: 0.7528 - f1-score: 0.7440

@codemaster-22

codemaster-22 commented Jun 19, 2021

Can someone please tell me what es-X-fast is? @iamyihwa, did you do knowledge distillation, or is it trained with a smaller hidden_size?
