
FastText Training


FastText is a library for learning word embeddings based on the skip-gram model, where each word is represented as a bag of character n-grams. One of the key features of fastText word representation is its ability to produce vectors for any word, even made-up ones. Indeed, fastText word vectors are built from vectors of substrings of characters contained in them. This allows building vectors even for misspelled words or concatenations of words [1].
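To illustrate this subword behaviour, here is a minimal sketch (the toy sentences and the misspelled query word are made up for demonstration): it trains a tiny Gensim FastText model and then asks for the vector of a word that never occurs in the training data.

from gensim.models import FastText

# Toy corpus: a list of tokenized sentences (made up for demonstration).
sentences = [
    ["the", "quick", "brown", "fox"],
    ["the", "lazy", "dog", "sleeps"],
]

model = FastText(sentences=sentences, vector_size=50, window=3, min_count=1, epochs=10)

# "foxx" never occurs in the corpus, but a vector is still built from its character n-grams.
print("foxx" in model.wv.key_to_index)  # False: not in the vocabulary
print(model.wv["foxx"].shape)           # (50,): a vector is produced anyway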

Gensim FastText

We have used the Gensim implementation of FastText. As stated in the documentation, models.fasttext contains a fast native C implementation of fastText with Python interfaces [2]. You can use either the .ipynb notebook or the .py script to train your FastText model. According to the documentation, the following arguments can be provided (a minimal training sketch follows this list):

  • sentences (iterable of iterables, optional): The sentences iterable can be simply a list of lists of tokens, but for larger corpora, consider an iterable that streams the sentences directly from disk/network. See BrownCorpus, Text8Corpus or LineSentence in word2vec module for such examples. See also the tutorial on data streaming in Python. If you don’t supply sentences, the model is left uninitialized – use if you plan to initialize it in some other way.
  • vector_size (int, optional): Dimensionality of the word vectors.
  • window (int, optional): Maximum distance between the current and predicted word within a sentence.
  • min_count (int, optional): Ignores all words with total frequency lower than this.
  • workers (int, optional): Use this many worker threads to train the model (faster training with multicore machines).
  • sg ({0, 1}, optional): Training algorithm: 1 for skip-gram; otherwise CBOW.
  • hs ({0, 1}, optional): If 1, hierarchical softmax will be used for model training. If 0, and negative is non-zero, negative sampling will be used.
  • negative (int, optional): If > 0, negative sampling will be used; the int for negative specifies how many “noise words” should be drawn (usually between 5-20). If set to 0, no negative sampling is used.
  • min_n (int, optional): Min length of char n-grams to be used for training word representations.
  • max_n (int, optional): Max length of char n-grams to be used for training word representations.
  • word_ngrams (int, optional): If 1, enriches word vectors with subword (n-gram) information. If 0, this is equivalent to Word2Vec. If > 1, this parameter is ignored and subwords are used.
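Putting these arguments together, a minimal training sketch could look like the following. The corpus path and the vector size, epoch count, and negative-sampling count mirror the example command further below; the remaining values are illustrative rather than a definitive recipe.

from gensim.models import FastText
from gensim.models.word2vec import LineSentence

# Stream sentences from a plain-text file, one sentence per line.
sentences = LineSentence("corpus/bounwebcorpus.txt")

model = FastText(
    sentences=sentences,
    vector_size=300,  # dimensionality of the word vectors
    window=5,         # maximum distance between current and predicted word
    min_count=5,      # ignore words with total frequency lower than this
    workers=4,        # worker threads for training
    sg=1,             # 1 = skip-gram, 0 = CBOW
    hs=0,             # with hs=0 and negative > 0, negative sampling is used
    negative=10,      # number of "noise words" to draw
    min_n=3,          # minimum char n-gram length
    max_n=6,          # maximum char n-gram length
    epochs=5,
)

model.save("fasttext.model")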

You can run the .py script as follows:

python fasttext/fasttext.py -i "corpus/bounwebcorpus.txt" --emb 300 -ep 5 --neg 10 -o "fasttext.model"
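Once training finishes, the saved model can be loaded back with Gensim and queried. A minimal sketch, assuming the output file name from the command above (the query words are placeholders, not tokens known to be in the corpus):

from gensim.models import FastText

# Load the model saved by the training run above ("fasttext.model" matches the -o argument).
model = FastText.load("fasttext.model")

# Nearest neighbours of a query word (placeholder token).
print(model.wv.most_similar("kitap", topn=5))

# Vectors are produced even for out-of-vocabulary words via character n-grams.
vector = model.wv["kitaplarımız"]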

For different training algorithms, please refer to Word2Vec Training.

References:

  1. https://fasttext.cc/docs/en/faqs.html#content
  2. https://radimrehurek.com/gensim/models/fasttext.html