Author: Karl Stratos (stratos@cs.columbia.edu)
Release version: 1.0
Requirements: python (2.7), numpy, scipy, sparsesvd, Matlab
This program is an implementation of canonical correlation analysis (CCA) in
the context of deriving word embeddings. A theoretical justification of this
implementation is provided in:
A spectral algorithm for learning class-based n-gram models of natural language
Karl Stratos, Do-kyum Kim, Michael Collins, and Daniel Hsu.
In Proceedings of UAI (2014).
v------------------------------------------------------------------------------v
| Setup |
^------------------------------------------------------------------------------^
First, make sure your machine has all the required programs listed above. Also,
to be able to run Matlab, you need to edit the call_matlab function in
src/call_matlab.py so that the matlab variable is set to the path of the Matlab
binary on your machine. For example, for me it's:
matlab = '/Applications/MATLAB_R2013b.app/bin/matlab'
The easiest way to check that everything is set up correctly is to run debug.py:
$ python debug.py
v------------------------------------------------------------------------------v
| Preparing input data |
^------------------------------------------------------------------------------^
We assume a raw (but properly tokenized) text corpus as an input. There is no
restriction such as 'one sentence per line'---we don't need sentence boundaries.
But sentence boundaries can be incorporated as special tokens. For example,
there is a toy corpus input/example/example.corpus:
the dog saw the cat
the dog barked
the cat meowed
You can put boundary markers, as in:
_START_ the dog saw the cat _END_
_START_ the dog barked _END_
_START_ the cat meowed _END_
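If your corpus does happen to be one sentence per line, adding such markers is a
one-line transformation. A minimal sketch (add_boundaries is a hypothetical
helper, not part of this release):

```python
# Hypothetical helper: wrap each sentence (one per line) with
# _START_ / _END_ boundary tokens, as in the example above.
def add_boundaries(lines, start="_START_", end="_END_"):
    return ["{0} {1} {2}".format(start, line.strip(), end) for line in lines]

for line in add_boundaries(["the dog saw the cat", "the dog barked"]):
    print(line)
```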
v------------------------------------------------------------------------------v
| Step 1: Deriving statistics |
^------------------------------------------------------------------------------^
In step 1, we extract co-occurrence statistics. For example, running:
python cca.py --corpus input/example/example.corpus --cutoff 1
will create a directory input/example/example.cutoff1.window3/ that contains
statistics of example.corpus. The command line arguments for step 1 are
the following:
--corpus CORPUS count words from this corpus
--cutoff CUTOFF cut off words appearing <= this number
--vocab VOCAB size of the vocabulary
--window WINDOW size of the sliding window
--want WANT want words in this file
--rewrite rewrite the (processed) corpus, not statistics
In particular, you can decide the context (window)---the default is 3, i.e.,
previous/next words. You can control the size of the vocabulary by discarding
rare words (cutoff) or using only a restricted set of vocabulary (vocab).
Rare words are all replaced by a special token "<?>".
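To make the window and cutoff semantics concrete, here is a simplified sketch of
the kind of counting step 1 performs (this is an illustration, not the actual
code in cca.py):

```python
from collections import Counter

def cooccurrence(tokens, window=3, cutoff=1, rare="<?>"):
    # Replace words appearing <= cutoff times with the rare token.
    freq = Counter(tokens)
    tokens = [w if freq[w] > cutoff else rare for w in tokens]
    # Count (word, context) pairs inside the sliding window; window=3
    # means the previous and next word of each position.
    half = (window - 1) // 2
    pairs = Counter()
    for i, w in enumerate(tokens):
        for j in range(max(0, i - half), min(len(tokens), i + half + 1)):
            if j != i:
                pairs[(w, tokens[j])] += 1
    return pairs
```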
v------------------------------------------------------------------------------v
| Step 2: Deriving embeddings Ur |
^------------------------------------------------------------------------------^
In step 2, we run Matlab to perform SVD on the statistics from step 1. Running:
python cca.py --stat input/example/example.cutoff1.window3/ --m 2 --kappa 2
will create a directory output/example.cutoff1.window3.m2.kappa2.matlab.out/
that contains the word embedding file Ur:
4 the -2.3410244894135657e-01 -9.7221193337649348e-01
3 <?> -8.6218169891930729e-01 -5.0659916901690338e-01
2 dog -9.3955297838817597e-01 3.4240356423657153e-01
2 cat -9.6347323867084655e-01 2.6780462722871301e-01
where the format of each line is <frequency>, <word>, <val_1>, <val_2>, ...,
<val_m>. Also, the rows are ordered in decreasing frequency.
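Parsing this format downstream is straightforward. A minimal reader (a sketch,
not part of this release):

```python
def read_embeddings(lines):
    """Parse lines of the form '<frequency> <word> <val_1> ... <val_m>'."""
    embeddings = {}
    for line in lines:
        fields = line.split()
        embeddings[fields[1]] = [float(v) for v in fields[2:]]
    return embeddings

# Usage: read_embeddings(open("output/.../Ur"))
```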
The command line arguments for step 2 are the following:
--stat STAT directory containing statistics
--m M number of dimensions
--kappa KAPPA smoothing parameter
--quiet quiet mode
--no_matlab do not call matlab - use python sparsesvd
In particular, m is the dimensionality of CCA, and kappa is a "pseudocount".
The value of kappa needs to be tuned for the given corpus. Try experimenting
with 50, 100, 200, ... (or if your data is huge like Google Ngram, 1000, 2000,
...) until the performance on your problem stops improving. Matlab's SVD is
very fast, so you can try many parameter values with ease.
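The computation behind step 2 is, roughly, a rank-m SVD of the co-occurrence
counts after scaling by kappa-smoothed marginal counts. The sketch below
illustrates that shape of computation using scipy in place of Matlab; it is a
simplification under assumed scaling, not the code this package runs:

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import svds

def cca_embeddings(B, kappa, m):
    # B: (num words x num contexts) co-occurrence count matrix.
    # Smooth the marginal counts with the pseudocount kappa, scale the
    # rows and columns accordingly, then take a rank-m SVD.
    w_counts = np.asarray(B.sum(axis=1)).ravel() + kappa
    c_counts = np.asarray(B.sum(axis=0)).ravel() + kappa
    scaled = B / np.sqrt(w_counts)[:, None] / np.sqrt(c_counts)[None, :]
    U, s, Vt = svds(csc_matrix(scaled), k=m)
    return U  # each row is an m-dimensional word embedding

B = np.array([[4., 1.], [1., 3.], [0., 2.]])  # toy counts
emb = cca_embeddings(B, kappa=2, m=1)
```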
v------------------------------------------------------------------------------v
| Optional post processing |
^------------------------------------------------------------------------------^
Depending on your problem, it might be a good idea to use only the top subspace
of your word embeddings. You can derive lower dimensional embeddings via
principal component analysis (PCA), e.g.:
python src/pca.py --embedding_file output/example.cutoff1.window3.m2.kappa2.matlab.out/Ur --pca_dim 1
Now you have a file Ur.pca1 that looks like:
4 the 0.906265637029
3 <?> 0.20812022154
2 dog -0.585143449361
2 cat -0.529242409207
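The PCA step amounts to centering the embedding matrix and projecting it onto
its top principal directions. A rough numpy sketch (an illustration, not the
actual src/pca.py):

```python
import numpy as np

def pca(X, dim):
    # Center the rows, then project onto the top `dim` principal directions.
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc.dot(Vt[:dim].T)

# Toy m=2 embeddings (one row per word) reduced to a single dimension,
# analogous to producing Ur.pca1 from Ur.
X = np.array([[-0.23, -0.97], [-0.86, -0.51], [-0.94, 0.34], [-0.96, 0.27]])
reduced = pca(X, 1)
```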