procrustes is an open-source framework for inducing cross-lingual word embeddings, (weakly) supervised by small seeding dictionaries (500-2K pairs). The framework is inspired by vecmap but improves on specific aspects of the implementation, namely speed-ups via vectorization, self-learning restricted to mutual nearest neighbors for higher-quality expansion, and optional preprocessing via iterative normalization. The framework does not support fully unsupervised mappings, since these have been shown to collapse BLI performance for distant language pairs [1]. Instead, we opt for expanding seeding supervision, sourced for instance from plexy or MUSE, via self-learning.
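At its core, the induction step solves the classical orthogonal Procrustes problem the framework is named after: given matrices X and Y stacking the source and target embeddings of the current dictionary pairs, the orthogonal map W minimizing ||X @ W - Y||_F is obtained from the SVD of X.T @ Y. A minimal NumPy sketch (the function name is ours, not the framework's API):

```python
import numpy as np

def procrustes(X, Y):
    """Orthogonal W minimizing ||X @ W - Y||_F (Schönemann, 1966).

    X, Y: (n, d) arrays holding source/target embeddings of n dictionary pairs.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt
```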
- Transparent: Clean implementation of self-learning-based induction of cross-lingual embedding spaces
- Quality: Self-learning extends the seeding dictionary only with mutual nearest neighbors
- Fast: All computation-heavy parts of the pipeline are vectorized and accelerated
- Feature-rich: Evaluate, expand the seeding dictionary, and map embeddings in a single execution
Since the framework does not include a fully unsupervised mode, seeding supervision is required to induce a mapping. To that end, two sources are available:
- MUSE provides 110 uncased ground-truth bilingual dictionaries
- plexy constructs a bilingual lexicon by querying PanLex, a panlingual lexicon comprising 2,500 dictionaries of 5,700 languages
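Either way, the expected input is a plain-text dictionary with one translation pair per line, split by --dico_delimiter (tab by default). A hypothetical Korean-Esperanto seed file could look like this:

```
고양이	kato
물	akvo
책	libro
```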
See Vulić et al. (EMNLP 2019) [1] for a detailed walkthrough of the self-learning-based pipeline:
- Load embeddings and seeding dictionary for source and target language
- Optional: apply iterative normalization to boost BLI performance
- Self-learning - for each iteration do (see the sketch after this list):
- Induce orthogonal mapping with current dictionary
- Expand current dictionary with unique mutual nearest neighbors
- Evaluation (MRR, HITS@{1,5,10}), dictionary extraction, and mapping of the full embeddings
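The following is a condensed, illustrative NumPy sketch of that loop; names and signatures are ours, and the real implementation is additionally accelerated and restricted to a vocabulary cutoff:

```python
import numpy as np

def iter_norm(X, rounds=5):
    # Iterative normalization [2]: alternate unit-length scaling and
    # mean centering to make both embedding spaces more comparable.
    for _ in range(rounds):
        X = X / np.linalg.norm(X, axis=1, keepdims=True)
        X = X - X.mean(axis=0)
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def procrustes(X, Y):
    # Orthogonal map W minimizing ||X @ W - Y||_F.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def mutual_nearest_neighbors(S, T):
    # Pairs (i, j) where source row i and target row j are each other's
    # nearest neighbor under cosine similarity (rows assumed unit length).
    sims = S @ T.T
    fwd = sims.argmax(axis=1)  # best target for every source word
    bwd = sims.argmax(axis=0)  # best source for every target word
    return {(i, int(fwd[i])) for i in range(len(fwd)) if bwd[fwd[i]] == i}

def self_learn(S, T, seed_pairs, iterations=20, normalize=True):
    # Alternate between inducing a mapping on the current dictionary and
    # expanding the dictionary with unique mutual nearest neighbors.
    if normalize:  # optional preprocessing, cf. --iter_norm
        S, T = iter_norm(S), iter_norm(T)
    pairs = set(seed_pairs)
    for _ in range(iterations):
        src, trg = map(list, zip(*pairs))
        W = procrustes(S[src], T[trg])
        pairs |= mutual_nearest_neighbors(S @ W, T)
    return S @ W, sorted(pairs)
```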
- Python 3
- NumPy
- Numba
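All three are available via pip, for example:

```
pip install numpy numba
```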
```
(Weakly) supervised bilingual lexicon induction

positional arguments:
  PATH                      Path to training dictionary
  PATH                      Path to source embeddings, stored word2vec style
  PATH                      Path to target embeddings, stored word2vec style

optional arguments:
  --src_output PATH         Path to store mapped source embeddings (default: None)
  --trg_output PATH         Path to store mapped target embeddings (default: None)
  --vocab_limit N           Limit vocabularies to top N entries, -1 for all (default: -1)
  --dico_delimiter STR      Delimiter between dictionary terms (default: tab)
  --eval_dico PATH          Path to evaluation dictionary (default: None)
  --write_dico PATH         Write inferred dictionary to given path (default: None)
  --self_learning N         Number of self-learning iterations (default: 20)
  --iter_norm               Perform iterative normalization (default: False)
  --vocab_cutoff k [k ...]  Restrict self-learning to the k most frequent tokens (default: 20000)
  --log PATH                Store log at given path (default: debug)
```
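Putting it together, an invocation could look as follows (the script name and all file paths are placeholders, the flags are those documented above):

```
python procrustes.py dicts/ko-eo.train.txt emb/ko.vec emb/eo.vec \
    --iter_norm \
    --vocab_cutoff 500 2500 5000 \
    --eval_dico dicts/ko-eo.test.txt \
    --write_dico dicts/ko-eo.inferred \
    --src_output emb/ko.mapped.vec
```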
- See `evaluation.sh` for an example configuration to run evaluation and `ko-eo.example.txt` for an illustrative output log.
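MRR and HITS@k are standard ranking measures; a minimal sketch of how they can be computed, assuming for every query word a ranked candidate list and a gold translation set (names are illustrative, not the framework's API):

```python
def mrr_and_hits(rankings, gold, ks=(1, 5, 10)):
    # rankings: dict mapping a source word to target candidates, best first.
    # gold: dict mapping a source word to its set of correct translations.
    rr, hits = [], {k: 0 for k in ks}
    for word, candidates in rankings.items():
        # 1-based rank of the first correct candidate, None if absent
        rank = next((i + 1 for i, c in enumerate(candidates) if c in gold[word]), None)
        rr.append(1.0 / rank if rank else 0.0)
        for k in ks:
            hits[k] += int(rank is not None and rank <= k)
    n = len(rankings)
    return sum(rr) / n, {k: hits[k] / n for k in ks}
```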
- write_dico: two dictionaries are written to disk --
- the dictionary expanded by self-learning, prefixed with "SL"
- the seeding dictionary expanded anew with mutual nearest neighbors under the learned mapping
- vocab_cutoff: can be set as a list for a cutoff ramp-up, e.g. 500 2500 5000; remaining iterations use the last value (see the sketch below)
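Illustratively, the ramp-up can be read as:

```python
def cutoff_at(cutoffs, iteration):
    # --vocab_cutoff 500 2500 5000: iterations 0, 1, 2 use 500, 2500, 5000;
    # every later iteration reuses the last value (5000).
    return cutoffs[min(iteration, len(cutoffs) - 1)]
```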
- PanLex, the world's largest lexical database, establishing a panlingual dictionary covering 5,700 languages
- [1] Ivan Vulić, Goran Glavaš, Roi Reichart, Anna Korhonen: Do We Really Need Fully Unsupervised Cross-Lingual Embeddings?, EMNLP 2019
- [2] Mozhi Zhang, Keyulu Xu, Ken-ichi Kawarabayashi, Stefanie Jegelka, Jordan Boyd-Graber: Are Girls Neko or Shōjo? Cross-Lingual Alignment of Non-Isomorphic Embeddings with Iterative Normalization, ACL 2019
Author: Fabian David Schmidt
Affiliation: University of Mannheim
E-Mail: fabian.david.schmidt@hotmail.de