FastText Training
FastText is a library for learning word embeddings based on the skip-gram model, where each word is represented as a bag of character n-grams. One of the key features of fastText word representations is their ability to produce a vector for any word, even a made-up one. Indeed, fastText word vectors are built from the vectors of the character substrings they contain. This allows building vectors even for misspelled words or concatenations of words [1].
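For illustration, the character n-grams that fastText extracts from a word can be sketched as follows. This is a simplified version: the real model also adds the whole word, wrapped in boundary markers, as an extra feature.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Extract character n-grams the way fastText does (simplified)."""
    padded = "<" + word + ">"  # boundary markers distinguish prefixes/suffixes
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(padded) - n + 1):
            ngrams.append(padded[i:i + n])
    return ngrams

# For n = 3, "where" yields: ['<wh', 'whe', 'her', 'ere', 're>']
print(char_ngrams("where", min_n=3, max_n=3))
```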
We have used the gensim implementation of FastText. As stated in the documentation, `models.fasttext` contains a fast native C implementation of fastText with Python interfaces [2]. You can use either the `.ipynb` notebook or the `.py` script to train your FastText model. According to the documentation, we have to provide the following arguments (a usage sketch follows the list):
- `sentences` (iterable of iterables, optional): The sentences iterable can be simply a list of lists of tokens, but for larger corpora, consider an iterable that streams the sentences directly from disk/network. See BrownCorpus, Text8Corpus or LineSentence in the word2vec module for such examples. See also the tutorial on data streaming in Python. If you don't supply sentences, the model is left uninitialized – use if you plan to initialize it in some other way.
- `vector_size` (int, optional): Dimensionality of the word vectors.
- `window` (int, optional): Maximum distance between the current and predicted word within a sentence.
- `min_count` (int, optional): Ignores all words with total frequency lower than this.
- `workers` (int, optional): Use these many worker threads to train the model (=faster training with multicore machines).
- `sg` ({0, 1}, optional): Training algorithm: 1 for skip-gram; otherwise CBOW.
- `hs` ({0, 1}, optional): If 1, hierarchical softmax will be used for model training. If 0, and `negative` is non-zero, negative sampling will be used.
- `negative` (int, optional): If > 0, negative sampling will be used; the value specifies how many "noise words" should be drawn (usually between 5-20). If set to 0, no negative sampling is used.
- `min_n` (int, optional): Min length of char n-grams to be used for training word representations.
- `max_n` (int, optional): Max length of char n-grams to be used for training word representations.
- `word_ngrams` (int, optional): If 1, enriches word vectors with subword (n-gram) information. If 0, this is equivalent to Word2Vec. If > 1, this parameter is ignored and subwords are used.
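A minimal sketch of how these arguments map onto the gensim API is shown below. The corpus path matches the command further down; the remaining hyperparameter values are only illustrative, not necessarily the exact settings used in this project.

```python
from gensim.models import FastText
from gensim.models.word2vec import LineSentence

# Stream tokenized sentences (one per line) from disk instead of
# loading the whole corpus into memory.
sentences = LineSentence("corpus/bounwebcorpus.txt")

model = FastText(
    sentences=sentences,
    vector_size=300,   # dimensionality of the word vectors
    window=5,          # max distance between current and predicted word
    min_count=5,       # ignore words with total frequency lower than this
    workers=4,         # number of worker threads
    sg=1,              # 1 = skip-gram, 0 = CBOW
    hs=0,              # 0 = use negative sampling (since negative > 0)
    negative=10,       # number of "noise words" drawn per positive sample
    min_n=3,           # minimum character n-gram length
    max_n=6,           # maximum character n-gram length
    word_ngrams=1,     # enrich word vectors with subword information
    epochs=5,
)

model.save("fasttext.model")

# Because vectors are built from character n-grams, even out-of-vocabulary
# (e.g. misspelled) words get a vector:
print(model.wv["rivver"][:5])
```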
You can run the `.py` script as follows:
python fasttext/fasttext.py -i "corpus/bounwebcorpus.txt" --emb 300 -ep 5 --neg 10 -o "fasttext.model"
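Purely as a hedged sketch (the actual `fasttext/fasttext.py` in this repository may parse more options and use different defaults), a script accepting these flags could wire them to the gensim API roughly like this:

```python
import argparse

from gensim.models import FastText
from gensim.models.word2vec import LineSentence


def main():
    parser = argparse.ArgumentParser(description="Train a FastText model with gensim.")
    parser.add_argument("-i", "--input", required=True,
                        help="path to the tokenized corpus (one sentence per line)")
    parser.add_argument("--emb", type=int, default=300,
                        help="dimensionality of the word vectors")
    parser.add_argument("-ep", "--epochs", type=int, default=5,
                        help="number of training epochs")
    parser.add_argument("--neg", type=int, default=10,
                        help="number of negative samples")
    parser.add_argument("-o", "--output", required=True,
                        help="path to save the trained model")
    args = parser.parse_args()

    # Stream sentences from disk rather than loading the whole corpus into memory.
    sentences = LineSentence(args.input)

    model = FastText(
        sentences=sentences,
        vector_size=args.emb,
        sg=1,                 # skip-gram, as described above (assumed default here)
        negative=args.neg,
        epochs=args.epochs,
        workers=4,            # assumed value; adjust to your machine
    )
    model.save(args.output)


if __name__ == "__main__":
    main()
```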
For different training algorithms, please refer to Word2Vec Training.