DynaDict is a simple tool to induce a source-to-target n-gram phrase table from lists of phrases and cross-lingual word embeddings. Candidate phrases are first pre-ranked by the cosine similarity of their (averaged) phrase word embeddings and then re-ranked with DynaMax-Jaccard in both directions. The resulting dictionary consists of the inferred mutual nearest neighbors among source and target phrases.
- Train word embeddings with word2vec or fastText for source and target language corpora
- Map embeddings to a joint space with Procrustes or vecmap (see the sketch after this list)
- Extract n-grams of source and target language corpora
- Infer n-gram phrase table(s) with DynaDict
- Optional: iteratively resolve multiple phrase tables
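For step 2, here is a minimal sketch of supervised Procrustes mapping, assuming a seed dictionary given as row-aligned matrices `X_seed` (source) and `Y_seed` (target) of word vectors; these names are illustrative, and vecmap provides a more complete pipeline:

```python
import numpy as np

def procrustes(X_seed, Y_seed):
    # Orthogonal Procrustes: the orthogonal W minimizing ||XW - Y||_F
    # is W = U V^T, where U S V^T is the SVD of X^T Y.
    U, _, Vt = np.linalg.svd(X_seed.T @ Y_seed)
    return U @ Vt

# Map all source embeddings into the target space:
# src_mapped = src_embeddings @ procrustes(X_seed, Y_seed)
```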
DynaDict breaks step 4 of the walkthrough down into the following sub-steps:
- Pre-ranking: For each phrase, pre-rank the top 5,000 candidates of the respective target language by cosine similarity of (averaged) phrase embeddings
- Re-ranking: Re-rank the top 5,000 candidates with the fuzzy Jaccard similarity of DynaMax-Jaccard
- Candidate resolution: Perform steps 1 and 2 in both directions and add mutual nearest neighbors to the dictionary (see the sketch below)
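A minimal NumPy sketch of these three steps; the function names and in-memory layout are illustrative, not DynaDict's actual interfaces. `dynamax_jaccard` follows Algorithm 1 of Zhelezniak et al. (2019), with fuzzy memberships clipped at zero:

```python
import numpy as np

def dynamax_jaccard(X, Y, eps=1e-9):
    # X: (n, d) and Y: (m, d) word vectors of the two phrases.
    U = np.vstack([X, Y])                           # dynamic universe of both phrases
    a = np.clip((X @ U.T).max(axis=0), 0.0, None)   # max-pooled fuzzy memberships
    b = np.clip((Y @ U.T).max(axis=0), 0.0, None)
    return np.minimum(a, b).sum() / (np.maximum(a, b).sum() + eps)

def nearest_neighbors(qry_avg, cand_avg, qry_words, cand_words, k=5000):
    # qry_avg/cand_avg: L2-normalized averaged phrase embeddings, (N, d)/(M, d);
    # qry_words/cand_words: lists of per-phrase word-vector matrices (n_i, d).
    # Step 1: pre-rank top-k candidates by cosine similarity;
    # step 2: re-rank them with DynaMax-Jaccard.
    topk = np.argsort(-(qry_avg @ cand_avg.T), axis=1)[:, :k]
    nn = np.empty(len(qry_avg), dtype=np.int64)
    for i, cands in enumerate(topk):
        scores = [dynamax_jaccard(qry_words[i], cand_words[j]) for j in cands]
        nn[i] = cands[int(np.argmax(scores))]
    return nn

def mutual_nn(src2trg, trg2src):
    # Step 3: keep a pair (i, j) only if both directions agree.
    return [(i, j) for i, j in enumerate(src2trg) if trg2src[j] == i]
```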
DynaDict can jointly infer dictionaries over any mix of n-grams. In practice, we first infer a joint dictionary for the top {50,100,100}K {uni,bi,tri}-grams, respectively, and subsequently repeat the algorithm for {uni,bi,tri}-grams stand-alone. The resulting four dictionaries are then resolved by sequentially adding candidate pairs from the stand-alone n-gram dictionaries to the joint dictionary, keeping only pairs for which neither phrase is yet included (see the sketch below).
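A sketch of this resolution step, assuming each dictionary is a list of (source, target) phrase pairs; the function name is illustrative, and `merge_dico.sh` drives the actual merging:

```python
def resolve(joint, *ngram_dicts):
    # Append a pair from a later dictionary only if neither of its phrases
    # already occurs in the dictionary built so far.
    merged = list(joint)
    seen_src = {s for s, _ in joint}
    seen_trg = {t for _, t in joint}
    for dico in ngram_dicts:
        for src, trg in dico:
            if src not in seen_src and trg not in seen_trg:
                merged.append((src, trg))
                seen_src.add(src)
                seen_trg.add(trg)
    return merged
```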
```
n-gram Dictionary Induction using DynaMax-Jaccard

positional arguments:
  PATH        Path to mapped source embeddings, stored word2vec style
  PATH        Path to mapped target embeddings, stored word2vec style
  PATH        Path to source n-grams
  PATH        Path to target n-grams
  PATH        Path to store inferred dictionary
  N           top-k candidates to retrieve during pre-ranking

optional arguments:
  -h, --help  show this help message and exit
  --src_unigram_counts PATH
              Path to source unigram counts for smooth inverse frequency
              weighting (default: None)
  --trg_unigram_counts PATH
              Path to target unigram counts for smooth inverse frequency
              weighting (default: None)
```
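The unigram-count options enable smooth inverse frequency (SIF) weighting (Arora et al., 2017) during pre-ranking: each word vector is weighted by a/(a + p(w)) before averaging. A minimal sketch, assuming the usual constant a ≈ 1e-3 and omitting full SIF's common-component removal (both are assumptions about the variant used here):

```python
import numpy as np

def sif_embedding(tokens, word_vecs, unigram_prob, a=1e-3):
    # Weighted average of word vectors: rare words get weights close to 1,
    # frequent words are strongly down-weighted (a ~ 1e-3 per Arora et al.).
    weighted = [word_vecs[t] * (a / (a + unigram_prob.get(t, 0.0)))
                for t in tokens if t in word_vecs]
    return np.mean(weighted, axis=0)
```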
- Usage: See `induce_dico.sh` and `merge_dico.sh` for guidance. `merge_dico.sh` enables iteratively merging in additional dictionaries, only adding candidates for which none of the phrases were included in the prior iteration(s).
- Input format: Phrases should be in a UTF-8 encoded file with one phrase per line and the tokens of a phrase whitespace-separated. If pre-ranking shall use SIF, uni-grams and their probabilities should be tab-delimited (cf. `./samples/input/en.phrases.txt`); an illustrative example follows.
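For illustration only (the contents below are made up; `./samples/input/en.phrases.txt` is the authoritative sample), a plain phrase file looks like:

```
new york
machine translation
cross-lingual word embeddings
```

and, for SIF pre-ranking, uni-grams with tab-delimited probabilities:

```
the	0.0451
new	0.0023
york	0.0008
```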
This code is written in Python 3. The requirements are listed in requirements.txt.
```
pip3 install -r requirements.txt
```
In particular, DynaDict requires NumPy and Numba to perform fast retrieval. To that end, DynaDict includes a JIT-compiled version of DynaMax-Jaccard to enable large-scale parallelization.
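Conceptually, the kernel looks like the sketch below; the explicit loops and the flat candidate packing are illustrative assumptions, not DynaDict's actual internals:

```python
import numpy as np
from numba import njit, prange

@njit(cache=True)
def dynamax_jaccard_jit(X, Y):
    # Same fuzzy Jaccard as above; the explicit loops compile to machine code.
    U = np.concatenate((X, Y))
    a = np.zeros(U.shape[0])   # zero-initialization doubles as clipping at 0
    b = np.zeros(U.shape[0])
    for j in range(U.shape[0]):
        for i in range(X.shape[0]):
            s = np.dot(X[i], U[j])
            if s > a[j]:
                a[j] = s
        for i in range(Y.shape[0]):
            s = np.dot(Y[i], U[j])
            if s > b[j]:
                b[j] = s
    num, den = 0.0, 0.0
    for j in range(U.shape[0]):
        num += min(a[j], b[j])
        den += max(a[j], b[j])
    return num / (den + 1e-9)

@njit(parallel=True)
def score_candidates(X, cands_flat, offsets):
    # Score one source phrase against all its pre-ranked candidates in
    # parallel; candidates are packed into one flat matrix with offsets.
    n = offsets.shape[0] - 1
    scores = np.empty(n)
    for c in prange(n):
        scores[c] = dynamax_jaccard_jit(X, cands_flat[offsets[c]:offsets[c + 1]])
    return scores
```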
- Why DynaMax-Jaccard?: DynaMax-Jaccard is a non-parametric word embedding aggregator that performs strongly in cross-lingual retrieval scenarios; see SEAGLE for a comparative evaluation
- Why Pre-Ranking?: DynaMax-Jaccard requires many expensive operations (dynamic vector creation, multiple pooling operations) that become prohibitive when scoring all phrase pairs (quadratic complexity); cheap cosine pre-ranking restricts them to a small candidate set
- Why Numba?: Numba is a JIT compiler that makes it straightforward to parallelize non-vectorized operations
- Why Iterative Candidate Resolution?: Iterative resolution of dictionaries balances dictionary quality and size; mutual nearest neighbors are more regularized for larger candidate sets when inferring a joint dictionary
- Vitalii Zhelezniak, Aleksandar Savkov, April Shen, Francesco Moramarco, Jack Flann, and Nils Y. Hammerla, Don't Settle for Average, Go for the Max: Fuzzy Sets and Max-Pooled Word Vectors, ICLR 2019.
- Fabian David Schmidt, Markus Dietsche, Simone Paolo Ponzetto, and Goran Glavaš, SEAGLE: A Platform for Comparative Evaluation of Semantic Encoders for Information Retrieval, EMNLP 2019.
- Sanjeev Arora, Yingyu Liang, and Tengyu Ma, A Simple but Tough-to-Beat Baseline for Sentence Embeddings, ICLR 2017.
Author: Fabian David Schmidt
Affiliation: University of Mannheim
E-Mail: fabian.david.schmidt@hotmail.de