Improving Language Models by Jointly Modelling Language as Distributions over Characters and Bigrams
More information on this project can be found on my website and in the exposé.
This is a WIP and will not accept PRs for now.
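The bigram targets from the title can be read as overlapping character n-grams: at every step the model scores the next character (unigram) and the next two characters (bigram) jointly. A minimal sketch of the n-gram extraction (the function name `char_ngrams` is illustrative, not from this repo):

```python
def char_ngrams(text, n):
    """Return the overlapping character n-grams of `text`.

    For n=1 this is just the characters; for n=2 each position i
    yields text[i:i+2], so a bigram head predicts two characters
    per step instead of one.
    """
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("wiki", 1))  # ['w', 'i', 'k', 'i']
print(char_ngrams("wiki", 2))  # ['wi', 'ik', 'ki']
```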
Model | Embedding Size | Hidden Size | BPTT | Batch Size | Epochs | Layers | Dataset | LR | NGram | Test PPL | Test BPC |
---|---|---|---|---|---|---|---|---|---|---|---|
CL LSTM | 128 | 128 | 35 | 50 | 30 | 2 | Wikitext-2 | 20 (1/4 decay) | 1 | 3.76 | 1.91 |
N-Gram CL LSTM | 128 | 128 | 35 | 50 | 30 | 2 | Wikitext-2 | 20 (1/4 decay) | 1 | 3.72 | 1.89 |
N-Gram CL LSTM | 128 | 128 | 35 | 50 | 30 | 2 | Wikitext-2 | 20 (1/4 decay) | 2 | 11.72 | 8.12 |
N-Gram CL LSTM | 128 | 128 | 35 | 50 | 30 | 2 | Wikitext-2 | 20 (1/4 decay) | 2 | 1.96 (only unigrams) | 0.47 (only unigrams) |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
N-Gram CL LSTM | 512 | 512 | 200 | 50 | 34 | 2 | Wikitext-103 | 20 (1/4 decay) | 2 | 7.96 | 2.98 |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
N-Gram CL LSTM | 400 | 1840 | 200 | 128 | 23 | 3 | enwiki-8 | 10 (1/4 decay) | 2 | 1.63 | 0.69 |
Paper LSTM 1 2 | 400 | 1840 | 200 | 128 | 50 | 3 | enwiki-8 | 0.001 (1/10 decay) | 1 | - | 1.232 |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
N-Gram CL LSTM | 200 | 1000 | 150 | 128 | 500 | 3 | ptb | 4 (1/4 decay) | 2 | 8.01 | 3.00 |
N-Gram CL LSTM | 200 | 1000 | 150 | 128 | 500 | 3 | ptb | 4 (1/4 decay) | 2 | 1.60 (only unigrams) | 0.68 (only unigrams) (no optimizer) |
N-Gram CL LSTM | 200 | 1000 | 150 | 128 | 69 | 3 | ptb | 0.002 | | 1.56 (only unigrams) | 0.64 (only unigrams) |
Paper LSTM 1 2 | 400 | 1000 | 150 | 128 | 500 | 3 | ptb | 0.001 (1/10 decay) | 1 | - | 1.232 |
N-Gram CL LSTM | 256 | 1024 | 250 | 100 | 500 | 2 | obw | 0.001 (1/10 decay) | 1 | | |
N-Gram CL LSTM | 256 | 1024 | 250 | 100 | 500 | 2 | obw | 0.001 (1/10 decay) | 2 | | |
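For a character-level model, the two test metrics are linked: bits-per-character is the base-2 log of the per-character perplexity, so BPC = log2(PPL) (most rows above follow this, e.g. PPL 3.76 ↔ BPC 1.91). A quick converter:

```python
import math

def bpc_from_ppl(ppl):
    # bits-per-character = log2 of per-character perplexity
    return math.log2(ppl)

def ppl_from_bpc(bpc):
    # inverse: perplexity = 2 ** bits-per-character
    return 2.0 ** bpc

# e.g. the CL LSTM row: PPL 3.76 -> BPC ~1.91
print(round(bpc_from_ppl(3.76), 2))  # 1.91
```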
Investigate the dictionary size, the model size, as well as memory usage and training time for different n-grams and UNK thresholds.
Hyperparameters:
- hidden size: 128
- emsize: 128
- nlayers: 2
- batch size: 50
- lr: 20
Dataset | UNK | Ngram | sec/epoch | dictionary size | GPU memory usage | Params |
---|---|---|---|---|---|---|
Wiki-2 | 0 | 1 | 43 | 1156 | 2553 MiB | 561284 |
Wiki-2 | 5 | 2 | 67 | 2754 | 2817 MiB | 971713 |
Wiki-2 | 20 | 2 | 53 | 1751 | 2761 MiB | 714199 |
Wiki-2 | 40 | 2 | 48 | 1444 | 2763 MiB | 635300 |
Wiki-2 | 40 | 3 | 143 | 9827 | 3301 MiB | 2789721 |
Wiki-2 | 40 | 4 | 412 | 34124 | 4723 MiB | 9034060 |
Wiki-2 | 40 | 5 | 782 | 72745 | 6765 MiB | 18959657 |
Wiki-2 | 100 | 10 | 965 | 90250 | 8635 MiB | 23458442 |
ptb | 20 | 2 | 31 | 704 | 4893 MiB | 21669504 |
enwik8 | 20 | 2 | 4493 | | 13 GiB | 120M |
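The table shows the dictionary (and with it the model) growing quickly with the n-gram order, and the UNK threshold pruning it back. A minimal sketch of such a thresholded dictionary build, assuming the repo counts overlapping character n-grams (the helper `build_dictionary` is illustrative, not the repo's actual code):

```python
from collections import Counter

def build_dictionary(text, n, unk_threshold):
    """Count overlapping character n-grams and keep those seen more
    than `unk_threshold` times; rarer n-grams map to <UNK>."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    vocab = {g for g, c in counts.items() if c > unk_threshold}
    vocab.add("<UNK>")
    return vocab

corpus = "the quick brown fox jumps over the lazy dog " * 100
# The bigram dictionary is larger than the unigram one; raising the
# threshold shrinks it, mirroring the UNK column above.
print(len(build_dictionary(corpus, 1, 0)))
print(len(build_dictionary(corpus, 2, 0)))
```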
- Remove Corpus from generate.py
- Make CPP extension work
- Build flair language model
- Compare epoch time between the bigram and unigram models