Word2Vec in Torch
Yoon Kim
yhk255@nyu.edu
This implementation only includes the skip-gram architecture with negative sampling. See https://code.google.com/p/word2vec/ for more details.
Note: This is considerably slower than the word2vec toolkit and gensim implementations.
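For reference, skip-gram with negative sampling trains each word vector to score high (via a sigmoid of the dot product) against its observed context words and low against randomly drawn "negative" words. Below is a minimal sketch of a single update step, assuming plain Torch tensors for the word vector w, the context vector c, and a list of negative-sample vectors; it is illustrative only, not the actual code in main.lua:

    require 'torch'

    -- One skip-gram negative-sampling update for a (word, context) pair.
    -- w, c, and each entry of negs are 1-D torch.Tensors; lr is the learning rate.
    local function sgns_update(w, c, negs, lr)
      -- positive pair: push sigmoid(w . c) toward 1
      local s = 1 / (1 + math.exp(-w:dot(c)))
      local g = lr * (1 - s)
      local w_grad = c:clone():mul(g)
      c:add(g, w)                      -- c <- c + g * w
      -- negative samples: push sigmoid(w . n) toward 0
      for _, n in ipairs(negs) do
        local sn = 1 / (1 + math.exp(-w:dot(n)))
        local gn = -lr * sn
        w_grad:add(gn, n)              -- accumulate the gradient for w
        n:add(gn, w)                   -- n <- n - lr * sn * w
      end
      w:add(w_grad)                    -- apply the accumulated update to w
    end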
The input file is a text file where each line represents one sentence (see corpus.txt for an example).
Arguments are mostly self-explanatory (see main.lua for the default values):
-corpus : text file with the corpus
-window : max window size
-dim : dimensionality of word embeddings
-alpha : exponent used to smooth the unigram distribution for negative sampling (see the sketch after this list)
-table_size : unigram table size; if you have plenty of RAM, increase this to 10^8
-neg_samples : number of negative samples for each valid word-context pair
-minfreq : minimum frequency to be included in the vocab
-lr : starting learning rate
-min_lr : minimum learning rate; -lr decays linearly to this value (see the sketch below)
-epochs : number of epochs to run
-stream : whether to stream text data from disk or store it in memory (1 = stream, 0 = store in memory)
-gpu : whether to use gpu (1 = use gpu, 0 = not)
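The -alpha and -table_size arguments control the table used to draw negative samples: word counts are raised to the power alpha, and the table is filled with word ids in proportion to the smoothed counts. A rough sketch of how such a table is typically built (illustrative, not the exact code in main.lua):

    -- counts maps word ids to corpus frequencies.
    local function build_unigram_table(counts, alpha, table_size)
      local total = 0
      for _, cnt in pairs(counts) do
        total = total + cnt ^ alpha    -- smooth counts with count^alpha
      end
      local tbl, i = {}, 1
      for word, cnt in pairs(counts) do
        -- each word gets slots proportional to count^alpha / total
        local slots = math.max(math.floor(cnt ^ alpha / total * table_size), 1)
        for _ = 1, slots do
          tbl[i] = word
          i = i + 1
        end
      end
      return tbl  -- draw a negative sample with tbl[math.random(#tbl)]
    end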
For example:
CPU:
th main.lua -corpus corpus.txt -window 3 -dim 100 -minfreq 10 -stream 1 -gpu 0
GPU:
th main.lua -corpus corpus.txt -window 3 -dim 100 -minfreq 10 -stream 0 -gpu 1
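The learning-rate schedule mentioned above is a simple linear decay from -lr to -min_lr; a sketch, assuming the rate is recomputed from the fraction of training words processed so far:

    -- Linear decay from lr to min_lr over the course of training.
    local function current_lr(lr, min_lr, words_processed, total_words)
      local frac = math.min(words_processed / total_words, 1)
      return math.max(lr - (lr - min_lr) * frac, min_lr)
    end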