Cross-lingual word embeddings are crucial for various multilingual NLP tasks. This project focuses on aligning monolingual word embeddings from English and Hindi to create a shared cross-lingual embedding space.

praj-pawar/aligning-cross-lingual-embeddings

Cross-Lingual Word Embedding Alignment using Procrustes Method

This repository provides an implementation for aligning cross-lingual word embeddings between English and Hindi using the Procrustes method. The project includes data preparation, embedding alignment, and evaluation of translation accuracy.

Table of Contents

  1. Introduction
  2. Data Preparation
  3. Embedding Alignment
  4. Evaluation
  5. Unsupervised Approach
  6. Resources
  7. License

Introduction

This project implements and evaluates a supervised cross-lingual word embedding alignment system. The Procrustes method maps English word vectors into the Hindi embedding space while preserving semantic similarities. An unsupervised approach is included for comparison.

Data Preparation

  • Training Monolingual Embeddings:

    • English: Train FastText embeddings on 10,000 English Wikipedia articles or use pre-trained FastText embeddings.
    • Hindi: Train FastText embeddings on 10,000 Hindi Wikipedia articles or use pre-trained FastText embeddings.
    • Limit vocabulary to the top 100,000 most frequent words in each language.
  • Bilingual Lexicon:

    • Obtain a list of word translation pairs from the MUSE dataset. This lexicon is used for supervised alignment.
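The data preparation steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual loader: it assumes embeddings are stored in the fastText text format (a `count dim` header line, then one `word v1 ... vd` line per word, most frequent words first) and that the lexicon has one whitespace-separated word pair per line. The names `load_vec` and `load_lexicon` are hypothetical, and the tiny in-memory inputs stand in for real files.

```python
import io
import numpy as np

def load_vec(lines, max_vocab=100_000):
    """Read fastText-style text vectors, keeping at most max_vocab words.

    Assumes the input is sorted by descending frequency, so truncating
    keeps the most frequent words.
    """
    it = iter(lines)
    _, dim = map(int, next(it).split())  # header: "<count> <dim>"
    words, vectors = [], []
    for line in it:
        if len(words) >= max_vocab:
            break
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        words.append(parts[0])
        vectors.append([float(x) for x in parts[1:]])
    return words, np.asarray(vectors, dtype=np.float32)

def load_lexicon(lines, src_index, tgt_index):
    """Return an (n, 2) array of (source_row, target_row) pairs.

    Pairs with a word outside either vocabulary are dropped.
    """
    pairs = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue
        src, tgt = parts
        if src in src_index and tgt in tgt_index:
            pairs.append((src_index[src], tgt_index[tgt]))
    return np.array(pairs, dtype=np.int64).reshape(-1, 2)

# Toy data in place of real embedding and dictionary files.
en_text = "3 2\nthe 0.1 0.2\ncat 0.3 0.4\ndog 0.5 0.6\n"
hi_text = "2 2\nबिल्ली 0.3 0.5\nकुत्ता 0.6 0.5\n"
lex_text = "cat बिल्ली\ndog कुत्ता\nbird चिड़िया\n"

en_words, en_vecs = load_vec(io.StringIO(en_text))
hi_words, hi_vecs = load_vec(io.StringIO(hi_text))
en_index = {w: i for i, w in enumerate(en_words)}
hi_index = {w: i for i, w in enumerate(hi_words)}
pairs = load_lexicon(io.StringIO(lex_text), en_index, hi_index)
print(en_vecs.shape, pairs.tolist())
```

With real data, the `io.StringIO` objects would be replaced by files opened with `encoding="utf-8"`.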

Embedding Alignment

  • Procrustes Alignment:
    • Implement the Procrustes alignment method to learn a linear mapping between English and Hindi embeddings using the bilingual lexicon.
    • Constrain the mapping to be orthogonal so that it preserves distances and angles between word vectors.
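For reference, the orthogonal Procrustes problem has a closed-form solution through the singular value decomposition. The sketch below assumes `X` and `Y` hold the English and Hindi vectors of the lexicon pairs as rows in matching order. The function name `procrustes` and the synthetic data are illustrative only.

```python
import numpy as np

def procrustes(X, Y):
    """Return the orthogonal W minimizing ||X @ W - Y||_F.

    X, Y: (n_pairs, dim) arrays whose i-th rows are a translation pair.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Synthetic check: build Y by rotating X with a known orthogonal Q,
# then confirm the rotation is recovered.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
Y = X @ Q

W = procrustes(X, Y)
print(np.allclose(W.T @ W, np.eye(4)))  # orthogonality of the learned map
print(np.allclose(X @ W, Y))            # mapped source matches the target
```

Because `W` is orthogonal, `X @ W` has the same pairwise dot products as `X`, which is why the mapping leaves the monolingual geometry intact.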

Evaluation

  • Translation Accuracy:

    • Perform word translation from English to Hindi using the aligned embeddings.
    • Evaluate translation accuracy using the MUSE test dictionary.
    • Report Precision@1 and Precision@5 metrics for the word translation task.
  • Cosine Similarity Analysis:

    • Compute and analyze cosine similarities between aligned English-Hindi word pairs to assess cross-lingual semantic similarity.
  • Ablation Study:

    • Conduct an ablation study to assess the impact of bilingual lexicon size on alignment quality.
    • Experiment with different training dictionary sizes (e.g., 5k, 10k, 20k word pairs).
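One way to compute Precision@k is nearest-neighbor retrieval by cosine similarity in the aligned space. The sketch below is an assumption about how the evaluation could be set up, not the repository's code: `precision_at_k` is a hypothetical helper, and the test dictionary is represented as a mapping from a source row to the set of acceptable target rows, since a source word may have several valid translations.

```python
import numpy as np

def normalize(M):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return M / np.linalg.norm(M, axis=1, keepdims=True)

def precision_at_k(mapped_src, tgt, test_dict, k=1):
    """Fraction of source words with a correct translation among the k nearest targets.

    mapped_src: (n_src, dim) source vectors after alignment.
    tgt:        (n_tgt, dim) target vectors.
    test_dict:  {source_row: set of acceptable target rows}.
    """
    src_ids = sorted(test_dict)
    sims = normalize(mapped_src[src_ids]) @ normalize(tgt).T
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = sum(bool(test_dict[s] & set(row.tolist()))
               for s, row in zip(src_ids, topk))
    return hits / len(src_ids)

# Toy vectors: the second source word is closest to the wrong target.
tgt = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
mapped_src = np.array([[0.9, 0.1], [0.8, 0.6], [-1.0, 0.1]])
test_dict = {0: {0}, 1: {1}, 2: {2}}

p_at_1 = precision_at_k(mapped_src, tgt, test_dict, k=1)
p_at_2 = precision_at_k(mapped_src, tgt, test_dict, k=2)
print(p_at_1, p_at_2)
```

For the ablation study, the same function could be called after fitting the mapping on the first 5k, 10k, and 20k lexicon pairs in turn, keeping the test dictionary fixed.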

Unsupervised Approach

  • Unsupervised Alignment:
    • Implement an unsupervised alignment method such as adversarial training followed by Procrustes refinement, using Cross-Domain Similarity Local Scaling (CSLS) as the retrieval criterion, as described in the MUSE paper.
    • Compare the performance of the unsupervised method with the supervised Procrustes method.
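CSLS itself is a similarity score rather than a training procedure: it discounts the cosine similarity of a pair by how close each word is, on average, to its nearest neighbors in the other language, which reduces the influence of "hub" vectors. The sketch below shows only this scoring step under the usual definition, CSLS(x, y) = 2 cos(x, y) - r_T(x) - r_S(y). The adversarial training stage is not shown, and `csls_scores` is a hypothetical name.

```python
import numpy as np

def normalize(M):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return M / np.linalg.norm(M, axis=1, keepdims=True)

def csls_scores(mapped_src, tgt, k=10):
    """Return the (n_src, n_tgt) matrix of CSLS scores.

    r_src[i]: mean cosine of source i to its k nearest targets.
    r_tgt[j]: mean cosine of target j to its k nearest mapped sources.
    k is clipped when a vocabulary has fewer than k words.
    """
    cos = normalize(mapped_src) @ normalize(tgt).T
    k_tgt = min(k, cos.shape[1])
    k_src = min(k, cos.shape[0])
    r_src = np.sort(cos, axis=1)[:, -k_tgt:].mean(axis=1)
    r_tgt = np.sort(cos, axis=0)[-k_src:, :].mean(axis=0)
    return 2 * cos - r_src[:, None] - r_tgt[None, :]

# Toy example: two orthogonal unit vectors on each side.
src = np.eye(2)
tgt = np.eye(2)
scores = csls_scores(src, tgt, k=1)
best = scores.argmax(axis=1)  # retrieved target row for each source row
print(scores)
print(best)
```

Replacing plain cosine with these scores in the retrieval step keeps the rest of the evaluation unchanged, so the supervised and unsupervised mappings can be compared on the same test dictionary.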

Resources

License

This project is licensed under the MIT License - see the LICENSE file for details.
