Switch KenLM to trie based language model #1236
Comments
@kdavis-mozilla what would be the benefit of switching to a trie-based language model?
@dbanka Concretely, our current language model is 1.5 GB, and we've made a trie-based model that basically reproduces its quality at 66 MB.
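To see why a trie can shrink a model so much, note that n-grams sharing a prefix can share storage. A minimal sketch in plain Python (this is only an illustration of prefix sharing, not KenLM's actual on-disk trie format):

```python
# Minimal sketch of prefix sharing in a trie of n-grams.
# NOT KenLM's data structure; it only illustrates why shared
# prefixes reduce storage compared to flat n-gram lists.

def build_trie(ngrams):
    root = {}
    for ngram in ngrams:
        node = root
        for word in ngram:
            node = node.setdefault(word, {})
    return root

def count_nodes(trie):
    return sum(1 + count_nodes(child) for child in trie.values())

ngrams = [
    ("the", "cat", "sat"),
    ("the", "cat", "ran"),
    ("the", "dog", "sat"),
]

flat_words = sum(len(n) for n in ngrams)      # 9 word slots stored flat
trie_nodes = count_nodes(build_trie(ngrams))  # 6 nodes: "the" and "the cat" shared

print(flat_words, trie_nodes)  # → 9 6
```

Real n-gram corpora repeat prefixes far more heavily than this toy list, which is where most of the 1.5 GB → 66 MB reduction comes from (together with quantization).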
Revert "Fixes #1236 (Switch KenLM to trie based language model)"
This reverts commit e34c52f.
Reopening since we reverted the fixes.
The code snippet below builds a pruned, quantized 5-gram language model that is significantly better than the "quick-fix" language model. The corpus used is described in section … With little to no optimisation or hyper-parameter tuning we get a … Note that this code was written in a Jupyter notebook and uses the `!` shell-escape syntax to run `lmplz` and `build_binary`.

```python
import gzip
import io
import os
from urllib import request

# Grab corpus.
url = 'http://www.openslr.org/resources/11/librispeech-lm-norm.txt.gz'
data_upper = '/tmp/upper.txt.gz'
request.urlretrieve(url, data_upper)

# Convert to lowercase and cleanup.
data_lower = '/tmp/lower.txt'
with open(data_lower, 'w', encoding='utf-8') as lower:
    with io.TextIOWrapper(io.BufferedReader(gzip.open(data_upper)), encoding='utf8') as upper:
        for line in upper:
            lower.write(line.lower())
os.remove(data_upper)

# Build pruned LM.
lm_path = '/tmp/lm.arpa'
!lmplz --order 5 \
       --temp_prefix /tmp/ \
       --memory 50% \
       --text {data_lower} \
       --arpa {lm_path} \
       --prune 0 0 0 1

# Quantize and produce trie binary.
binary_path = '/tmp/lm.binary'
!build_binary -a 255 \
              -q 8 \
              trie \
              {lm_path} \
              {binary_path}
os.remove(lm_path)
```

Example output:
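Between the two commands, `lmplz` writes the intermediate model in the standard ARPA text format, which `build_binary` then converts to the trie binary. A minimal sketch of that format and how to read it (the entries below are toy values, not real LibriSpeech output):

```python
# Minimal sketch of the standard ARPA n-gram format that lmplz emits.
# Toy values only, not real LibriSpeech output.
arpa_text = """\
\\data\\
ngram 1=3
ngram 2=2

\\1-grams:
-1.0\t<s>
-2.0\thello\t-0.5
-2.3\tworld\t-0.4

\\2-grams:
-0.7\t<s> hello
-0.9\thello world

\\end\\
"""

def parse_arpa(text):
    """Parse ARPA text into {order: {ngram: (log10 prob, backoff)}}."""
    models = {}
    order = None
    for line in text.splitlines():
        line = line.strip()
        if line.endswith('-grams:'):
            order = int(line[1])
            models[order] = {}
        elif line == '\\end\\':
            order = None
        elif order is not None and line and not line.startswith('\\'):
            parts = line.split('\t')
            ngram = parts[1]
            backoff = float(parts[2]) if len(parts) > 2 else 0.0
            models[order][ngram] = (float(parts[0]), backoff)
    return models

model = parse_arpa(arpa_text)
print(model[2]['hello world'])  # → (-0.9, 0.0)
```

Pruning (`--prune 0 0 0 1`) drops singleton 4- and 5-grams from this table, and quantization (`-q 8 -a 255`) stores the probabilities and backoffs at reduced precision in the binary, which is what keeps the trie file small.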
Are there going to be tools to extend the new language model with custom corpus data or individual phrases?
@pvanickova You'll be able to use all the features of KenLM to extend the language model.
@pvanickova You can do that, following …
@lissyx perfect, thanks - so basically rebuilding the language model from scratch using the librivox corpus + my own corpus
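That rebuild amounts to concatenating the corpora into a single lowercase text file and rerunning the `lmplz`/`build_binary` steps above on it. A minimal sketch of the merge step (the file names and sample lines are placeholders, not paths from this thread):

```python
import os
import tempfile

# Placeholder corpora: in practice these would be the lowercased
# LibriSpeech text and a file of your own domain phrases.
workdir = tempfile.mkdtemp()
librispeech = os.path.join(workdir, 'librispeech-lower.txt')
custom = os.path.join(workdir, 'custom-phrases.txt')
merged = os.path.join(workdir, 'merged.txt')

with open(librispeech, 'w', encoding='utf-8') as f:
    f.write('the quick brown fox\n')
with open(custom, 'w', encoding='utf-8') as f:
    f.write('My Product Name Seven\n')

# Merge, lowercasing the custom phrases so they match the corpus.
with open(merged, 'w', encoding='utf-8') as out:
    for path in (librispeech, custom):
        with open(path, encoding='utf-8') as src:
            for line in src:
                out.write(line.lower())

content = open(merged, encoding='utf-8').read()
print(content)
```

The merged file would then be passed to `lmplz` via `--text`, with the rest of the pipeline unchanged.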
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.