Cross-Lingual Word Embedding Alignment using Procrustes Method

This repository implements the alignment of English and Hindi word embeddings with the Procrustes method. It covers data preparation, embedding alignment, and evaluation of translation accuracy.

Introduction

This project implements and evaluates a supervised system for cross-lingual word embedding alignment. The Procrustes method learns an orthogonal mapping from the English embedding space into the Hindi one, so the similarity structure among the English word vectors is preserved after mapping.

Data Preparation

Training Monolingual Embeddings:
- English: Train FastText embeddings on 10,000 English Wikipedia articles or use pre-trained FastText embeddings.
- Hindi: Train FastText embeddings on 10,000 Hindi Wikipedia articles or use pre-trained FastText embeddings.
- Limit vocabulary to the top 100,000 most frequent words in each language.
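
As a rough sketch of the vocabulary cut-off, the snippet below parses embeddings in the fastText `.vec` text format (a header line with vocabulary size and dimension, then one word and its vector per line) and keeps only the first `max_vocab` rows. It assumes the rows are ordered by descending word frequency, so the first 100,000 rows are the most frequent words. The name `parse_vec` and its parameters are illustrative and not taken from this repository.

```python
import numpy as np

def parse_vec(lines, max_vocab=100_000):
    """Parse fastText .vec text lines, keeping the first max_vocab words.

    Assumption: rows are sorted by descending frequency, so truncating
    the file is equivalent to keeping the most frequent words.
    """
    lines = iter(lines)
    next(lines)  # skip the "<vocab_size> <dim>" header line
    words, vectors = [], []
    for line in lines:
        if len(words) >= max_vocab:
            break
        token, *values = line.rstrip().split(" ")
        words.append(token)
        vectors.append([float(v) for v in values])
    emb = np.asarray(vectors, dtype=np.float32)
    # Unit-normalise rows so that dot products equal cosine similarities.
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    emb /= np.maximum(norms, 1e-8)
    word2id = {w: i for i, w in enumerate(words)}
    return words, word2id, emb

# Usage with a file on disk (the path is a placeholder):
# with open("path/to/english.vec", encoding="utf-8") as f:
#     en_words, en_word2id, en_emb = parse_vec(f)
```
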
Bilingual Lexicon:
- Obtain a list of word translation pairs from the MUSE dataset. This lexicon is used for supervised alignment.
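
The lexicon can be turned into index pairs for training along the lines sketched below. The sketch assumes each line of the dictionary file holds one English word and one Hindi word separated by whitespace, and it drops pairs in which either word falls outside the truncated vocabularies. `lexicon_to_indices` is a hypothetical helper name.

```python
import numpy as np

def lexicon_to_indices(lines, src_word2id, tgt_word2id):
    """Convert "source target" lines into an (n, 2) array of row indices.

    Pairs with an out-of-vocabulary word on either side are skipped.
    """
    pairs = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # ignore blank or malformed lines
        src, tgt = parts
        if src in src_word2id and tgt in tgt_word2id:
            pairs.append((src_word2id[src], tgt_word2id[tgt]))
    return np.array(pairs, dtype=np.int64).reshape(-1, 2)
```

Column 0 of the result indexes the English embedding matrix and column 1 indexes the Hindi one.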

Embedding Alignment

Procrustes Alignment:
- Implement the Procrustes alignment method to learn a linear mapping between English and Hindi embeddings using the bilingual lexicon.
- Constrain the mapping to be orthogonal so that distances and angles between word vectors are preserved.
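
With the lexicon pairs stacked row-wise into matrices X (English vectors) and Y (the corresponding Hindi vectors), the orthogonal Procrustes problem has a closed-form solution: if U S V^T is the singular value decomposition of X^T Y, then W = U V^T minimises the Frobenius norm of XW - Y over all orthogonal matrices. A minimal NumPy sketch follows; the row-vector convention and the function name are choices made here for illustration.

```python
import numpy as np

def procrustes(X, Y):
    """Return the orthogonal W that minimises ||X @ W - Y||_F.

    X, Y: (n, d) arrays where row i of X should map onto row i of Y.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Usage, assuming en_emb, hi_emb and an (n, 2) index array `pairs` exist:
# W = procrustes(en_emb[pairs[:, 0]], hi_emb[pairs[:, 1]])
# en_mapped = en_emb @ W   # English vectors in the Hindi space
```

Because W is orthogonal, the mapped English vectors keep their norms and pairwise cosine similarities.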

Evaluation

Translation Accuracy:
- Perform word translation from English to Hindi using the aligned embeddings.
- Evaluate translation accuracy using the MUSE test dictionary.
- Report Precision@1 and Precision@5 metrics for the word translation task.
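
One way to compute Precision@k is nearest-neighbour retrieval by cosine similarity, counting a query as correct when any of its reference translations appears among the top k candidates. The sketch below assumes unit-normalised rows and a `gold` list of sets of Hindi row ids built from the test dictionary; the function and argument names are illustrative.

```python
import numpy as np

def precision_at_k(mapped_src, tgt_emb, gold, ks=(1, 5)):
    """Fraction of queries with a gold translation among the top-k neighbours.

    mapped_src: (n, d) query vectors already mapped into the target space.
    tgt_emb:    (V, d) target embeddings (rows unit-normalised).
    gold:       list of n sets; gold[i] holds the acceptable target ids.
    """
    scores = mapped_src @ tgt_emb.T                  # cosine similarities
    top = np.argsort(-scores, axis=1)[:, :max(ks)]   # best candidates first
    results = {}
    for k in ks:
        hits = [bool(gold[i] & set(top[i, :k].tolist()))
                for i in range(len(gold))]
        results[k] = float(np.mean(hits))
    return results
```

For full 100,000-word vocabularies the score matrix may need to be computed in batches of queries to fit in memory.
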
Cosine Similarity Analysis:
- Compute and analyze cosine similarities between word pairs to assess cross-lingual semantic similarity.
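
A small sketch of the pairwise computation, assuming the English vectors have already been mapped into the Hindi space and that row i of each input belongs to the same word pair. `pairwise_cosine` is a name introduced here for illustration.

```python
import numpy as np

def pairwise_cosine(A, B):
    """Cosine similarity between row i of A and row i of B."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return np.sum(A * B, axis=1)

# Usage, assuming en_mapped, hi_emb and an (n, 2) index array `pairs` exist:
# sims = pairwise_cosine(en_mapped[pairs[:, 0]], hi_emb[pairs[:, 1]])
# Comparing sims against the same computation on shuffled pairs gives a
# baseline for how much similarity unrelated words share.
```
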
Ablation Study:
- Conduct an ablation study to assess the impact of bilingual lexicon size on alignment quality.
- Experiment with different training dictionary sizes (e.g., 5k, 10k, 20k word pairs).
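
The ablation can be organised as a loop that refits the mapping on progressively larger slices of the training lexicon and scores each mapping on the same test pairs. The sketch below is a simplified illustration: it assumes unit-normalised embeddings, index arrays of shape (n, 2), and a single reference translation per test word, and all function names are hypothetical.

```python
import numpy as np

def fit_procrustes(X, Y):
    """Orthogonal W minimising ||X @ W - Y||_F (closed form via SVD)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def p_at_1(src_emb, tgt_emb, W, test_pairs):
    """Precision@1 with one gold target id per test query (simplification)."""
    queries = src_emb[test_pairs[:, 0]] @ W
    pred = np.argmax(queries @ tgt_emb.T, axis=1)
    return float(np.mean(pred == test_pairs[:, 1]))

def ablate(src_emb, tgt_emb, train_pairs, test_pairs,
           sizes=(5000, 10000, 20000)):
    """Refit the mapping on the first n training pairs for each n in sizes."""
    results = {}
    for n in sizes:
        subset = train_pairs[:n]
        W = fit_procrustes(src_emb[subset[:, 0]], tgt_emb[subset[:, 1]])
        results[n] = p_at_1(src_emb, tgt_emb, W, test_pairs)
    return results
```

Keeping the test pairs fixed across sizes makes the reported numbers directly comparable.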

Unsupervised Approach

Unsupervised Alignment:
- Implement an unsupervised alignment method such as adversarial training followed by Procrustes refinement, with Cross-Domain Similarity Local Scaling (CSLS) as the similarity criterion for retrieving translations, as described in the MUSE paper.
- Compare the performance of the unsupervised method with the supervised Procrustes method.
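
CSLS itself is a similarity measure rather than a training procedure: it penalises "hub" vectors that are close to many points by subtracting each vector's mean similarity to its K nearest neighbours in the other language, CSLS(x, y) = 2 cos(x, y) - r_T(x) - r_S(y). The sketch below computes it for all source-target combinations, assuming unit-normalised rows and mapped source vectors; K = 10 follows the MUSE paper, while the function name is illustrative.

```python
import numpy as np

def csls_scores(mapped_src, tgt_emb, k=10):
    """CSLS matrix: 2*cos(x, y) - r_T(x) - r_S(y) for every x, y.

    r_T(x): mean cosine of x to its k nearest target vectors.
    r_S(y): mean cosine of y to its k nearest mapped source vectors.
    """
    cos = mapped_src @ tgt_emb.T               # (n_src, n_tgt)
    k_t = min(k, cos.shape[1])
    k_s = min(k, cos.shape[0])
    r_src = np.sort(cos, axis=1)[:, -k_t:].mean(axis=1)
    r_tgt = np.sort(cos, axis=0)[-k_s:, :].mean(axis=0)
    return 2 * cos - r_src[:, None] - r_tgt[None, :]
```

Taking the argmax over each row of the returned matrix gives CSLS-based translations, which can replace plain cosine retrieval when comparing the supervised and unsupervised mappings.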

Resources

- MUSE dataset and pre-trained embeddings
- FastText
- Procrustes alignment method, described in "Word Translation Without Parallel Data" by Conneau et al. (2017)

License

This project is licensed under the MIT License - see the LICENSE file for details.

praj-pawar/aligning-cross-lingual-embeddings
