Make it compatible with both Python 2.x and 3.x. (The usage examples
below are written to run unchanged under either.)
This is an extremely simple Python wrapper for SRILM:
http://www.speech.sri.com/projects/srilm/
Install SRILM
- `vim Makefile`
  - `SRILM = $(PWD)` # or the absolute path to your SRILM directory
  - `MACHINE_TYPE := i686-m64` # match the output of `uname -i`
- `vim common/Makefile.machine.MACHINE_TYPE`
  - `NO_TCL = 1` -> `NO_TCL = X`
  - `GAWK = /usr/bin/awk` -> `GAWK = /usr/bin/gawk` # match the output of `which gawk`
  - compile with `-fPIC` (needed so the Cython module can link against
    the static SRILM libraries):
    - `ADDITIONAL_CFLAGS = -fopenmp` -> `ADDITIONAL_CFLAGS = -fopenmp -fPIC`
    - `ADDITIONAL_CXXFLAGS = -fopenmp` -> `ADDITIONAL_CXXFLAGS = -fopenmp -fPIC`
- run `make World`, then `make test`
- `export PATH=$SRILM/bin:$SRILM/bin/i686-m64:$PATH` # where $SRILM is your SRILM directory
- `which ngram-count` # should find the freshly built binary
Basically it lets you load a SRILM-format ngram model into memory, and
then query it directly from Python.
Right now this is extremely bare-bones, just enough to do what I
needed, no fancy infrastructure at all. Feel free to send patches
though if you extend it!
Requirements:
- SRILM
- Cython
Installation:
- Edit setup.py so that it can find your SRILM build files.
- To install in your Python environment, use:
    python setup.py install
  To just build the interface module:
    python setup.py build_ext --inplace
  which will produce srilm.so, which can be placed on your
  PYTHONPATH and accessed as 'import srilm'.
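  A quick smoke test of the build (a sketch; it assumes srilm.so is on
  your PYTHONPATH or in the current directory):
    import srilm
    print(srilm.__file__)  # shows which build was picked up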
Usage:
from srilm import LM
# Use lower=True if you passed -lower to ngram-count. lower=False is
# default.
lm = LM("path/to/model/from/ngram-count", lower=True)
# Compute log10(P(brown | the quick))
#
# Note that the context tokens are in *reverse* order, as per SRILM's
# internal convention. I can't decide if this is a bug or not. If you
# have a model of order N, and you pass more than (N-1) words, then
# the first (N-1) entries in the list will be used. (I.e., the most
# recent (N-1) context words.)
lm.logprob_strings("brown", ["quick", "the"])
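# If you keep your context in natural left-to-right order, you can just
# reverse it at the call site (a sketch; `natural_context` is only an
# illustrative name):
natural_context = ["the", "quick"]
lm.logprob_strings("brown", natural_context[::-1])  # same query as above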
# We can also compute the probability of a sentence (this is just
# a convenience wrapper):
# log10 P(The | <s>)
# + log10 P(quick | <s> The)
# + log10 P(brown | <s> The quick)
lm.total_logprob_strings(["The", "quick", "brown"])
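# That convenience wrapper is roughly equivalent to summing the
# individual queries by hand (a sketch; it assumes logprob_strings
# accepts "<s>" as an ordinary context token):
words = ["The", "quick", "brown"]
total = 0.0
for i, w in enumerate(words):
    total += lm.logprob_strings(w, list(reversed(["<s>"] + words[:i])))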
# Internally, SRILM interns tokens to integers. You can convert back
# and forth using the .vocab attribute on an LM object:
idx = lm.vocab.intern("brown")
print(idx)
assert lm.vocab.extern(idx) == "brown"
# .extern() returns None if an idx is unused for some reason.
# There's a variant of .logprob_strings that takes these directly,
# which is probably not really any faster, but sometimes is more
# convenient if you're working with interned tokens anyway:
lm.logprob(lm.vocab.intern("brown"),
           [lm.vocab.intern("quick"),
            lm.vocab.intern("the"),
            ])
# There are some "magic" tokens that don't actually represent anything
# in the input stream, like <s> and <unk>. You can detect them like:
assert lm.vocab.is_non_word(lm.vocab.intern("<s>"))
assert not lm.vocab.is_non_word(lm.vocab.intern("brown"))
# Sometimes it's handy to have two models use the same indices for the
# same words, i.e., share a vocab table. This can be done like:
lm2 = LM("other/model", vocab=lm.vocab)
# This gives the index of the highest vocabulary word, useful for
# iterating over the whole vocabulary. Unlike the Python convention
# for describing ranges, this is the *inclusive* maximum:
lm.vocab.max_interned()
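# For example, to walk the whole vocabulary (a sketch; per the note
# above, .extern() can return None for unused indices, so skip those):
for i in range(lm.vocab.max_interned() + 1):
    word = lm.vocab.extern(i)
    if word is not None:
        print(word)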
# And finally, let's put it together with an example of how to find
# the max-probability continuation:
# argmax_w P(w | the quick)
# by querying each word in the vocabulary in turn:
context = [lm.vocab.intern(w) for w in ["quick", "the"]]
best_idx = None
best_logprob = -1e100
# Don't forget the +1, because Python and SRILM disagree about how
# ranges should work...
for i in range(lm.vocab.max_interned() + 1):
    logprob = lm.logprob(i, context)
    if logprob > best_logprob:
        best_idx = i
        best_logprob = logprob
best_word = lm.vocab.extern(best_idx)
print("Max prob continuation: %s (%s)" % (best_word, best_logprob))