procrustes is an open-source framework for inducing cross-lingual word embeddings, (weakly) supervised by small seeding dictionaries (500-2K pairs). The framework is inspired by vecmap but improves on specific aspects of the implementation, namely speed-ups via vectorization, self-learning restricted to mutual nearest neighbors for higher-quality expansion, and optional preprocessing via iterative normalization. The framework does not support fully unsupervised mappings, since these have been shown to collapse BLI performance for distant language pairs [1]. Instead, we opt for expanding seeding supervision, sourced for instance from plexy or MUSE, via self-learning.
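At its core, the induction step solves the classical orthogonal Procrustes problem the framework is named after: given matrices X and Y stacking the source and target embeddings of the current dictionary pairs, the orthogonal map W minimizing ||X @ W - Y||_F is obtained from the SVD of X.T @ Y. A minimal NumPy sketch (the function name is ours, not the framework's API):

```python
import numpy as np

def procrustes(X, Y):
    """Orthogonal W minimizing ||X @ W - Y||_F (Schönemann, 1966).

    X, Y: (n, d) arrays holding source/target embeddings of n dictionary pairs.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt
```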
- Transparent: Clean implementation of self-learning-based induction of cross-lingual embedding spaces
- Quality: Self-learning extends the seeding dictionary only with mutual nearest neighbors
- Fast: All computation-heavy parts of the pipeline are vectorized and accelerated
- Feature-rich: Evaluate, expand the seeding dictionary, and map embeddings in a single execution
Since the framework does not include a fully unsupervised mode, seeding supervision is required to induce a mapping. To that end, two sources are available:
- MUSE provides 110 uncased ground-truth bilingual dictionaries
- plexy constructs a bilingual lexicon by querying PanLex, a panlingual lexicon comprising 2,500 dictionaries of 5,700 languages
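Either way, the expected input is a plain-text dictionary with one translation pair per line, split by --dico_delimiter (tab by default). A hypothetical Korean-Esperanto seed file could look like this:

```
고양이	kato
물	akvo
책	libro
```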
See Vulić et al. (EMNLP 2019) [1] for a detailed walkthrough of the self-learning-based pipeline:
- Load embeddings and seeding dictionary for source and target language
- Optional: apply iterative normalization to boost BLI performance
- Self-learning - for each iteration do (see the sketch after this list):
- Induce orthogonal mapping with current dictionary
- Expand current dictionary with unique mutual nearest neighbors
- Evaluation (MRR, HITS@{1,5,10}), dictionary extraction, and mapping of the full embeddings
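The following is a condensed, illustrative NumPy sketch of that loop; names and signatures are ours, and the real implementation is additionally accelerated and restricted to a vocabulary cutoff:

```python
import numpy as np

def iter_norm(X, rounds=5):
    # Iterative normalization [2]: alternate unit-length scaling and
    # mean centering to make both embedding spaces more comparable.
    for _ in range(rounds):
        X = X / np.linalg.norm(X, axis=1, keepdims=True)
        X = X - X.mean(axis=0)
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def procrustes(X, Y):
    # Orthogonal map W minimizing ||X @ W - Y||_F.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def mutual_nearest_neighbors(S, T):
    # Pairs (i, j) where source row i and target row j are each other's
    # nearest neighbor under cosine similarity (rows assumed unit length).
    sims = S @ T.T
    fwd = sims.argmax(axis=1)  # best target for every source word
    bwd = sims.argmax(axis=0)  # best source for every target word
    return {(i, int(fwd[i])) for i in range(len(fwd)) if bwd[fwd[i]] == i}

def self_learn(S, T, seed_pairs, iterations=20, normalize=True):
    # Alternate between inducing a mapping on the current dictionary and
    # expanding the dictionary with unique mutual nearest neighbors.
    if normalize:  # optional preprocessing, cf. --iter_norm
        S, T = iter_norm(S), iter_norm(T)
    pairs = set(seed_pairs)
    for _ in range(iterations):
        src, trg = map(list, zip(*pairs))
        W = procrustes(S[src], T[trg])
        pairs |= mutual_nearest_neighbors(S @ W, T)
    return S @ W, sorted(pairs)
```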
- Python 3
- NumPy
- Numba
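All three are available via pip, for example:

```
pip install numpy numba
```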
```
(Weakly) supervised bilingual lexicon induction

positional arguments:
  PATH                      Path to training dictionary
  PATH                      Path to source embeddings, stored word2vec style
  PATH                      Path to target embeddings, stored word2vec style

optional arguments:
  --src_output PATH         Path to store mapped source embeddings (default: None)
  --trg_output PATH         Path to store mapped target embeddings (default: None)
  --vocab_limit N           Limit vocabularies to top N entries, -1 for all (default: -1)
  --dico_delimiter STR      Delimiter between dictionary terms (default: tab)
  --eval_dico PATH          Path to evaluation dictionary (default: None)
  --write_dico PATH         Write inferred dictionary to given path (default: None)
  --self_learning N         Number of self-learning iterations (default: 20)
  --iter_norm               Perform iterative normalization (default: False)
  --vocab_cutoff k [k ...]  Restrict self-learning to the k most frequent tokens (default: 20000)
  --log PATH                Store log at given path (default: debug)
```
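Putting it together, an invocation could look as follows (the script name and all file paths are placeholders, the flags are those documented above):

```
python procrustes.py dicts/ko-eo.train.txt emb/ko.vec emb/eo.vec \
    --iter_norm \
    --vocab_cutoff 500 2500 5000 \
    --eval_dico dicts/ko-eo.test.txt \
    --write_dico dicts/ko-eo.inferred \
    --src_output emb/ko.mapped.vec
```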
- See `evaluation.sh` for an example configuration to run evaluation and `ko-eo.example.txt` for an illustrative output log.
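MRR and HITS@k are standard ranking measures; a minimal sketch of how they can be computed, assuming for every query word a ranked candidate list and a gold translation set (names are illustrative, not the framework's API):

```python
def mrr_and_hits(rankings, gold, ks=(1, 5, 10)):
    # rankings: dict mapping a source word to target candidates, best first.
    # gold: dict mapping a source word to its set of correct translations.
    rr, hits = [], {k: 0 for k in ks}
    for word, candidates in rankings.items():
        # 1-based rank of the first correct candidate, None if absent
        rank = next((i + 1 for i, c in enumerate(candidates) if c in gold[word]), None)
        rr.append(1.0 / rank if rank else 0.0)
        for k in ks:
            hits[k] += int(rank is not None and rank <= k)
    n = len(rankings)
    return sum(rr) / n, {k: hits[k] / n for k in ks}
```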
- write_dico: two dictionaries are written to disk --
- the dictionary expanded by self-learning, prefixed with "SL"
- the seeding dictionary expanded anew with mutual nearest neighbors under the learned mapping
- vocab_cutoff: can be set as a list for a cutoff ramp-up, e.g. 500 2500 5000; remaining iterations use the last value (see the sketch below)
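Illustratively, the ramp-up can be read as:

```python
def cutoff_at(cutoffs, iteration):
    # --vocab_cutoff 500 2500 5000: iterations 0, 1, 2 use 500, 2500, 5000;
    # every later iteration reuses the last value (5000).
    return cutoffs[min(iteration, len(cutoffs) - 1)]
```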
- PanLex, the world's largest lexical database, establishing a panlingual dictionary covering 5,700 languages
- [1] Ivan Vulić, Goran Glavaš, Roi Reichart, Anna Korhonen: Do We Really Need Fully Unsupervised Cross-Lingual Embeddings?, EMNLP 2019
- [2] Mozhi Zhang, Keyulu Xu, Ken-ichi Kawarabayashi, Stefanie Jegelka, Jordan Boyd-Graber: Are Girls Neko or Shōjo? Cross-Lingual Alignment of Non-Isomorphic Embeddings with Iterative Normalization, ACL 2019
Author: Fabian David Schmidt
Affiliation: University of Mannheim
E-Mail: fabian.david.schmidt@hotmail.de