# Word2Vec in Torch

Yoon Kim (yhk255@nyu.edu)

Implements only the skip-gram architecture with negative sampling. See https://code.google.com/p/word2vec/ for more details.

Note: this implementation is considerably slower than the original word2vec toolkit and the gensim implementation.

The input is a text file where each line represents one sentence (see corpus.txt for an example).

Arguments are mostly self-explanatory (see main.lua for the default values):

* `-corpus` : text file containing the corpus
* `-window` : maximum window size for word-context pairs (see the first sketch below)
* `-dim` : dimensionality of the word embeddings
* `-alpha` : exponent used to smooth the unigram distribution
* `-table_size` : unigram table size for negative sampling; if you have plenty of RAM, raise this to 10^8 (see the second sketch below)
* `-neg_samples` : number of negative samples for each valid word-context pair
* `-minfreq` : minimum frequency for a word to be included in the vocabulary
* `-lr` : starting learning rate
* `-min_lr` : minimum learning rate; the learning rate decays linearly to this value
* `-epochs` : number of epochs to run
* `-stream` : whether to stream the text data from disk or load it into memory (1 = stream, 0 = load into memory)
* `-gpu` : whether to use the GPU (1 = use GPU, 0 = CPU only)

For example:

CPU:
```
th main.lua -corpus corpus.txt -window 3 -dim 100 -minfreq 10 -stream 1 -gpu 0
```

GPU:
```
th main.lua -corpus corpus.txt -window 3 -dim 100 -minfreq 10 -stream 0 -gpu 1
```
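For illustration, here is a minimal sketch (plain Lua, not the repo's actual code) of how `-window` defines the word-context pairs that skip-gram trains on. Note that the original word2vec samples an effective window size uniformly from 1 to `window` at each position; this sketch uses the fixed maximum for simplicity.

```lua
-- Hypothetical helper: enumerate (word, context) pairs for one sentence.
-- `sentence` is a table of word strings; `window` mirrors the -window flag.
local function context_pairs(sentence, window)
  local out = {}
  for i = 1, #sentence do
    for j = math.max(1, i - window), math.min(#sentence, i + window) do
      if j ~= i then
        out[#out + 1] = {sentence[i], sentence[j]}
      end
    end
  end
  return out
end

-- context_pairs({"the", "quick", "brown", "fox"}, 2) yields pairs such as
-- {"quick", "the"}, {"quick", "brown"}, and {"quick", "fox"}.
```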
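The `-alpha` and `-table_size` flags control the smoothed unigram table from which negative samples are drawn: each word occupies a share of the table proportional to count(word)^alpha, and a negative sample is a uniform lookup into the table. Below is a minimal sketch of this standard technique; the function and variable names are illustrative, not the repo's own.

```lua
-- Hypothetical helper: build the smoothed unigram table.
-- `vocab_counts` maps word -> corpus frequency; `alpha` and `table_size`
-- mirror the -alpha and -table_size flags.
local function build_unigram_table(vocab_counts, alpha, table_size)
  local total = 0
  for _, count in pairs(vocab_counts) do
    total = total + count^alpha
  end
  local t = {}
  for word, count in pairs(vocab_counts) do
    -- Each word fills a slice of the table proportional to count^alpha.
    local share = math.floor((count^alpha / total) * table_size + 0.5)
    for _ = 1, share do
      t[#t + 1] = word
    end
  end
  return t
end

-- Drawing one negative sample is then a uniform lookup:
-- local neg = unigram_table[math.random(#unigram_table)]
```

A larger `-table_size` makes the slice sizes, and hence the sampling distribution, more faithful to count^alpha at the cost of memory, which is why the README suggests 10^8 when RAM allows.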
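Once training finishes, a typical use of the embeddings is nearest-neighbor queries by cosine similarity. The repo's output format is not described above, so the tensor and lookup-table names below are assumptions; this is only a sketch of the query itself using standard Torch tensor operations.

```lua
require 'torch'

-- Hypothetical inputs: `vectors` is a (vocab_size x dim) torch.Tensor of
-- trained embeddings; `word2idx`/`idx2word` map between words and row indices.
local function similar_words(vectors, word2idx, idx2word, query, k)
  -- Normalize every row so that dot products equal cosine similarities.
  local normed = vectors:clone()
  for i = 1, normed:size(1) do
    normed[i]:div(normed[i]:norm())
  end
  local q = normed[word2idx[query]]
  local sims = normed * q                   -- (vocab_size,) cosine scores
  local sorted, order = sims:sort(1, true)  -- sort descending
  for i = 2, k + 1 do                       -- index 1 is the query word itself
    print(idx2word[order[i]], sorted[i])
  end
end
```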