This repository contains the preprocessing and evaluation scripts used in the paper "Portuguese Word Embeddings: Evaluating on Word Analogies and Natural Language Tasks". The preprocessing script cleans a corpus, tokenizes it, and splits it into sentences. The evaluation scripts can be used to measure the representativeness of a word embedding model.
The paper is available at: https://arxiv.org/abs/1708.06025
Trained embedding models: http://nilc.icmc.usp.br/embeddings
Word embeddings have been found to provide meaningful representations for words in an efficient way; therefore, they have become common in Natural Language Processing systems. In this paper, we evaluated different word embedding models trained on a large Portuguese corpus, including both Brazilian and European variants. We trained 31 word embedding models using FastText, GloVe, Wang2Vec and Word2Vec. We evaluated them intrinsically on syntactic and semantic analogies and extrinsically on POS tagging and sentence semantic similarity tasks. The obtained results suggest that word analogies are not appropriate for word embedding evaluation; task-specific evaluations appear to be a better option.
pip install -r requirements.txt
Preprocess a corpus before using it to train embedding models:
python preprocessing.py <input_file.txt> <output_file.txt>
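As a rough illustration of the cleaning, tokenization, and sentence splitting described above, the following is a minimal sketch. It is not the code in preprocessing.py: the function names (`split_sentences`, `tokenize`, `preprocess`) and the specific rules (lowercasing, splitting on `.`, `!` and `?`, isolating punctuation) are assumptions made for the example, and the actual script may apply different rules.

```python
import re

# Hypothetical rules for illustration; the real preprocessing.py may differ.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
_TOKEN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")


def split_sentences(text):
    """Split text at whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def tokenize(sentence):
    """Lowercase a sentence and separate words from punctuation marks."""
    return _TOKEN.findall(sentence.lower())


def preprocess(text):
    """Return one tokenized, space-joined sentence per list item."""
    return [" ".join(tokenize(s)) for s in split_sentences(text)]


if __name__ == "__main__":
    for line in preprocess("Olá, mundo! Este é um teste."):
        print(line)
```

Writing one tokenized sentence per line is a common input layout for embedding trainers, which is why the sketch returns sentences as space-joined strings.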
Sentence Similarity
python evaluate.py <embedding_model.txt> --lang <br|eu>
The --lang parameter selects the Portuguese variant to evaluate on:
Brazilian Portuguese: br
European Portuguese: eu
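To give an intuition for how embeddings can be used to score sentence similarity, here is a minimal sketch of one simple baseline: average the word vectors of each sentence and compare the averages with cosine similarity. This is an assumption made for illustration and not necessarily what evaluate.py computes; the function names and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sentence_vector(tokens, embeddings, dim):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


def sentence_similarity(tokens_1, tokens_2, embeddings, dim):
    """Cosine similarity between the averaged vectors of two tokenized sentences."""
    return cosine(sentence_vector(tokens_1, embeddings, dim),
                  sentence_vector(tokens_2, embeddings, dim))


# Toy 2-dimensional vectors, invented purely for this example.
toy = {"gato": [1.0, 0.0], "felino": [0.9, 0.1], "carro": [0.0, 1.0]}

if __name__ == "__main__":
    print(sentence_similarity(["gato"], ["felino"], toy, 2))
    print(sentence_similarity(["gato"], ["carro"], toy, 2))
```

Out-of-vocabulary tokens are skipped rather than treated as errors, so the score depends only on the words the model knows.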
The POS tagging evaluator is not available yet. To evaluate a model on POS tagging, please use the source code from nlpnet.
The analogy evaluation method is similar to the one developed by nlx-group.
python analogies.py -m <embedding_model.txt> -t <testset.txt>
Brazilian Portuguese
Only syntactic analogies
python analogies.py -m <embedding_model.txt> -t analogies/testset/LX-4WAnalogiesBr_syntactic.txt
Only semantic analogies
python analogies.py -m <embedding_model.txt> -t analogies/testset/LX-4WAnalogiesBr_semantic.txt
All analogies
python analogies.py -m <embedding_model.txt> -t analogies/testset/LX-4WAnalogiesBr.txt
European Portuguese
Only syntactic analogies
python analogies.py -m <embedding_model.txt> -t analogies/testset/LX-4WAnalogies_syntactic.txt
Only semantic analogies
python analogies.py -m <embedding_model.txt> -t analogies/testset/LX-4WAnalogies_semantic.txt
All analogies
python analogies.py -m <embedding_model.txt> -t analogies/testset/LX-4WAnalogies.txt
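For readers unfamiliar with analogy evaluation, the sketch below shows the commonly used vector-offset approach (often called 3CosAdd): for a question "a is to b as c is to ?", pick the word whose vector is closest to b - a + c, excluding the three question words. It is not the code in analogies.py. The function names are hypothetical, and the test-set layout (four words per line, with section headers starting with ":") is an assumption based on the widely used analogy file format.

```python
import math


def _unit(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def solve_analogy(a, b, c, embeddings):
    """Return the word closest to b - a + c, excluding a, b and c."""
    unit = {word: _unit(vec) for word, vec in embeddings.items()}
    target = _unit([y - x + z for x, y, z in zip(unit[a], unit[b], unit[c])])
    best_word, best_score = None, -math.inf
    for word, vec in unit.items():
        if word in (a, b, c):
            continue
        score = sum(p * q for p, q in zip(target, vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word


def analogy_accuracy(lines, embeddings):
    """Score 'a b c d' lines; skip ':' headers and out-of-vocabulary questions."""
    correct = total = 0
    for line in lines:
        parts = line.split()
        if len(parts) != 4 or line.startswith(":"):
            continue
        if not all(word in embeddings for word in parts):
            continue
        a, b, c, expected = parts
        total += 1
        correct += solve_analogy(a, b, c, embeddings) == expected
    return correct / total if total else 0.0


# Toy 2-dimensional vectors, invented purely for this example.
toy = {
    "homem": [1.0, 0.0], "mulher": [1.0, 1.0],
    "rei": [3.0, 0.0], "rainha": [3.0, 1.0], "carro": [-1.0, 0.0],
}
```

The sketch re-normalizes the vocabulary on every question to stay short; with a real model it would be worth normalizing once and reusing the result.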