Skip to content

Latest commit

 

History

History
28 lines (21 loc) · 1.78 KB

mlm.md

File metadata and controls

28 lines (21 loc) · 1.78 KB

mlm task-type

back to main README

The task-type mlm can be used for Masked Language Modeling. In MaChAmp, we follow the strategy of the original BERT paper (Devlin et al, 2019), except that we do not use the next sentence prediction. The input data for this task should be in raw txt format, like this (this is a sample from Wikipedia):

Alvestêdetocht
De Alvestêdetocht is in reedriderstocht fan sawat 200 kilometer by de alve stêden fan Fryslân lâns. 
Ljouwert, de haadstêd fan Westerlauwersk Fryslân, is fan âlds it start- en finishplak. 
De tocht bringt de dielnimmers fan Ljouwert nei Snits, Drylts, Sleat, Starum, Hylpen, Warkum, Boalsert, Harns, Frjentsjer en Dokkum en dan wer werom nei Ljouwert, dêr't op de Bonke de einstreek leit. 
Hûnderttûzenen taskôgers wiene by de lêste tochten de kjeld treast en moedigen oan 'e rûte de reedriders oan.

The only supported evaluation metric for this task is perplexity. Because in multi-task learning we pick the model with the highest sum of all metrics and most metrics have a range 0-1, we invert the perplexity by doing: 1-1/ppl. This can easily be edited in the bottom of machamp/models/machamp_model.py, and we welcome suggestions on how to do this more correct/principled.

This task type has a special handling of the data, it uses only a portion of the data each epoch. Because MLM is more prone to overfitting and data is cheap, we divide the dataset by the number of epochs, and thus ensure that we see each instance only once. If you do not have enough data with this strategy we recommend that you just multiply it before training. You can disable this by setting "split_mlm": false, in the dataset configuration (at the dataset level, not the task level).