This project is inspired by Stanford's CS224N NMT Project
Dataset used in this project: News Commentary v14
This is primarily a learning project to familiarize myself with PyTorch, machine translation, and NLP model training.
To investigate how different setups of the recurrent layer affect the final performance, I compared the training efficiency and effectiveness of different types of RNN layers for the encoder, changing one feature at a time while keeping all other parameters fixed (a sketch of these encoder variants follows the list below):
- RNN types
  - GRU
  - LSTM
- Activation functions on the output layer
  - Tanh
  - ReLU
  - LeakyReLU
- Number of layers
  - single layer
  - double layer
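As a rough sketch of how these encoder variants can be swapped in PyTorch (the function and argument names here are illustrative assumptions, not the actual code in nmt_model.py):

```python
import torch
import torch.nn as nn

def build_encoder_rnn(rnn_type: str, embed_size: int = 512, hidden_size: int = 512,
                      num_layers: int = 1, dropout: float = 0.25) -> nn.Module:
    """Build the encoder's recurrent layer for one experiment group.
    Only rnn_type and num_layers vary between experiments; everything else is fixed."""
    rnn_cls = {"gru": nn.GRU, "lstm": nn.LSTM}[rnn_type.lower()]
    return rnn_cls(
        input_size=embed_size,
        hidden_size=hidden_size,
        num_layers=num_layers,
        bidirectional=True,                          # bidirectional encoder
        dropout=dropout if num_layers > 1 else 0.0,  # inter-layer dropout only applies when stacked
    )

# The output-layer activation is swapped the same way:
output_activation = {"tanh": nn.Tanh(), "relu": nn.ReLU(), "leaky_relu": nn.LeakyReLU()}["tanh"]

# Example: the double-layer bidirectional LSTM variant
encoder_rnn = build_encoder_rnn("lstm", num_layers=2)
src_embedded = torch.randn(20, 32, 512)         # (src_len, batch, embed_size)
enc_hiddens, _ = encoder_rnn(src_embedded)      # enc_hiddens: (src_len, batch, 2 * hidden_size)
```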
```
_/
├─ utils.py             # utilities
├─ vocab.py             # generate vocab
├─ model_embeddings.py  # embedding layer
├─ nmt_model.py         # NMT model definition
└─ run.py               # training and testing
```
- source: 相反,这意味着合作的基础应当是共同的长期战略利益,而不是共同的价值观。
- target: Instead, it means that cooperation must be anchored not in shared values, but in shared long-term strategic interests.
- translation: On the contrary, that means cooperation should be a common long-term strategic interests, rather than shared values.

- source: 但这个问题其实很简单: 谁来承受这些用以降低预算赤字的紧缩措施的冲击。
- target: But the issue is actually simple: Who will bear the brunt of measures to reduce the budget deficit?
- translation: But the question is simple: Who is to bear the impact of austerity measures to reduce budget deficits?

- source: 上述合作对打击恐怖主义、贩卖人口和移民可能发挥至关重要的作用。
- target: Such cooperation is essential to combat terrorism, human trafficking, and migration.
- translation: Such cooperation is essential to fighting terrorism, trafficking, and migration.

- source: 与此同时, 政治危机妨碍着政府追求艰难的改革。
- target: At the same time, political crisis is impeding the government’s pursuit of difficult reforms.
- translation: Meanwhile, political crises hamper the government’s pursuit of difficult reforms.
Preprocessing Colab notebook
- using jieba to separate Chinese words with spaces
- Input: training data of Chinese and English
- Output: a vocab file containing the mapping from (sub)words to ids for both Chinese and English; a limited-size vocab is selected using SentencePiece (essentially byte pair encoding of character n-grams) to cover around 99.95% of the training data (see the preprocessing sketch below)
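A minimal sketch of this preprocessing pipeline, assuming plain-text training files (the file names and vocab size below are placeholders, not the actual values used by vocab.py):

```python
import jieba
import sentencepiece as spm

# 1. Segment Chinese text so that words are separated by spaces.
with open("train.zh", encoding="utf-8") as fin, \
     open("train.zh.seg", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(" ".join(jieba.cut(line.strip())) + "\n")

# 2. Learn a limited-size subword vocabulary (byte pair encoding over the
#    segmented text) covering ~99.95% of the training characters.
spm.SentencePieceTrainer.train(
    input="train.zh.seg",
    model_prefix="zh_bpe",
    vocab_size=21000,            # placeholder; the real vocab size may differ
    model_type="bpe",
    character_coverage=0.9995,   # ~99.95% coverage, as described above
)

# 3. Encode sentences into subword ids with the trained model.
sp = spm.SentencePieceProcessor(model_file="zh_bpe.model")
ids = sp.encode("相反 , 这 意味着 合作", out_type=int)
```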
- a Seq2Seq model with attention (the architecture image is from the book Dive into Deep Learning)
- Encoder
  - a recurrent layer (the component varied across the experiments above)
- Decoder
  - LSTMCell (hidden_size=512)
- Attention
  - multiplicative attention (see the sketch below)
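A minimal sketch of multiplicative (Luong-style) attention between the decoder state and the encoder hidden states (variable names are assumptions, not the exact ones in nmt_model.py):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiplicativeAttention(nn.Module):
    """score(h_dec, h_enc) = h_dec^T · W · h_enc"""

    def __init__(self, dec_hidden: int, enc_hidden: int):
        super().__init__()
        self.att_projection = nn.Linear(enc_hidden, dec_hidden, bias=False)

    def forward(self, dec_state, enc_hiddens):
        # dec_state:   (batch, dec_hidden)        current decoder hidden state
        # enc_hiddens: (batch, src_len, enc_hidden)
        proj = self.att_projection(enc_hiddens)             # (batch, src_len, dec_hidden)
        scores = torch.bmm(proj, dec_state.unsqueeze(2))    # (batch, src_len, 1)
        alpha = F.softmax(scores.squeeze(2), dim=1)         # attention weights over source positions
        context = torch.bmm(alpha.unsqueeze(1), enc_hiddens).squeeze(1)  # (batch, enc_hidden)
        return context, alpha

# Example with a bidirectional encoder (enc_hidden = 2 * hidden_size)
attn = MultiplicativeAttention(dec_hidden=512, enc_hidden=1024)
context, alpha = attn(torch.randn(32, 512), torch.randn(32, 20, 1024))
```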
Training Colab notebook
- Hyperparameters:
- Embedding Size & Hidden Size: 512
- Dropout Rate: 0.25
- Starting Learning Rate: 5e-4
- Batch Size: 32
- Beam Size for Beam Search: 10
- NOTE: the BLEU score calculated here is based on the test set, so it can only be used to compare the relative effectiveness of models trained on this data
- Dataset: the data is split randomly into a training set (~260,000), a validation set (~20,000), and a test set (~20,000); the splits are the same for each experiment group
- Max Number of Iterations: 50000
- NOTE: I tried a vanilla RNN (nn.RNN) in various configurations, but its BLEU score turned out to be extremely low (the absence of residual connections might be the issue; see the sketch after this list), so I decided not to include it in the comparison until the issue is resolved
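For context, residual connections add each RNN layer's input to its output so gradients can bypass the recurrence; a hedged sketch of what that could look like around stacked nn.RNN layers (not part of the current code):

```python
import torch
import torch.nn as nn

class ResidualRNNEncoder(nn.Module):
    """Stack of vanilla RNN layers with residual (skip) connections.

    Residual connections help gradients flow through plain recurrent stacks;
    their absence is one suspected reason the nn.RNN baseline trained poorly.
    """

    def __init__(self, hidden_size: int = 512, num_layers: int = 2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.RNN(hidden_size, hidden_size) for _ in range(num_layers)
        )

    def forward(self, x):                # x: (src_len, batch, hidden_size)
        for rnn in self.layers:
            out, _ = rnn(x)
            x = x + out                  # residual connection around each layer
        return x
```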
Best model: a bidirectional 2-layer LSTM with Tanh, 1024 embed_size & hidden_size, trained for 11517.19 sec (44000 iterations), BLEU score 17.95
| | Training Time (sec) | BLEU Score on Test Set | Training Perplexities | Validation Perplexities |
|---|---|---|---|---|
| Best Model | 11517.19 | 17.95 | | |
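The BLEU scores above are corpus-level scores computed on the test set. A minimal sketch of such a computation with NLTK's corpus_bleu (the evaluation script actually used may differ in tokenization and smoothing):

```python
from nltk.translate.bleu_score import corpus_bleu

# Each hypothesis is a token list; each entry of `references` holds one or
# more reference token lists for the corresponding hypothesis.
hypotheses = [
    "such cooperation is essential to fighting terrorism , trafficking , and migration".split(),
]
references = [
    ["such cooperation is essential to combat terrorism , human trafficking , and migration".split()],
]

bleu = corpus_bleu(references, hypotheses)
print(f"Corpus BLEU: {bleu * 100:.2f}")
```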
- LSTM tends to have better performance than GRU (it has an extra set of parameters)
- Tanh tends to be better since less information is lost
- Making the LSTM deeper (more layers) could improve performance, but it costs more time to train
- Surprisingly, the training times for A, B, and D are roughly the same
  - this may be because the dataset is not large enough, or because the cloud service I used for training does not perform consistently
- source: 全球目击组织(Global Witness)的报告记录, 光是2015年就有16个国家的185人被杀。
- target: A Global Witness report documented 185 killings across 16 countries in 2015 alone.
- translation: According to the Global eye, the World Health Organization reported that 185 people were killed in 2015.
- problems:
- Information Loss: 16 countries
- Unknown Proper Noun: Global Witness
- source: 大自然给了足以满足每个人需要的东西, 但无法满足每个人的贪婪。
- target: Nature provides enough for everyone’s needs, but not for everyone’s greed.
- translation: Nature provides enough to satisfy everyone.
- problems:
- Huge Information Loss
- source: 我衷心希望全球经济危机和巴拉克·奥巴马当选总统能对新冷战的荒唐理念进行正确的评估。
- target: It is my hope that the global economic crisis and Barack Obama’s presidency will put the farcical idea of a new Cold War into proper perspective.
- translation: I do hope that the global economic crisis and President Barack Obama will be corrected for a new Cold War.
- problems:
- Agent And Recipient Of The Action Swapped
- Failed To Translate Complex Sentence
- source: 人们纷纷猜测欧元区将崩溃。
- target: Speculation about a possible breakup was widespread.
- translation: The eurozone would collapse.
- problems:
- Significant Information Loss
- Dataset
- The dataset is fairly small, and the model is not trained thoroughly on all of the data
- Being a native Chinese speaker, I still could not understand what some of the source sentences were saying
- The target sentences are not informationally complete; they themselves need context to be understood (e.g. the target sentence in the last "Bad Examples" entry)
- Even for humans, some of the source sentences are too hard to translate
- Model Architecture
- CNN & Transformer
- character-based model
- Make the model even larger & deeper (... I need GPUs)
- Tricks that might help
- Add a proper-noun dictionary to translate unknown proper nouns word-by-word (or phrase-by-phrase), as sketched below
- Initialize the (sub)word embeddings with pretrained embeddings
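A hypothetical illustration of the proper-noun dictionary trick as a post-processing pass over the decoded output (the dictionary, function name, and `<unk>` handling are all made up for illustration):

```python
# Hypothetical post-processing: patch known proper nouns into the output
# using a small Chinese-to-English phrase dictionary.
PROPER_NOUNS = {
    "全球目击组织": "Global Witness",   # the unknown proper noun from the bad example above
}

def fix_proper_nouns(source: str, translation: str) -> str:
    """If a known proper noun appears in the source but its translation is
    missing from the output, substitute it for an <unk> token or append it."""
    for zh, en in PROPER_NOUNS.items():
        if zh in source and en not in translation:
            if "<unk>" in translation:
                translation = translation.replace("<unk>", en, 1)
            else:
                translation += f" ({en})"
    return translation
```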
- Download the dataset you desire, and change every "./zh_en_data" in run.sh to the path where your data is stored
- To run locally on a CPU (mostly for a sanity check; a CPU cannot realistically train the model)
  - set up the environment using conda/miniconda: `conda env create --file local env.yml`
- To run on a GPU
  - set up the environment and follow the training process in the Colab notebook
If you have any questions or trouble running the code, feel free to contact me via email.