This is the repository for the paper "Lost in the Source Language: How Large Language Models Evaluate the Quality of Machine Translation", accepted to the Findings of ACL 2024. We provide our source code, data, and results for easy reproduction.
- Python >= 3.8.0
- PyTorch >= 2.1.2
- langchain >= 0.1.0
- langchain-core >= 0.1.9
- pandas
- openai
- vllm (optional)
We recommend using vllm to accelerate inference.
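A typical setup, assuming a fresh environment and the version pins listed above (the exact CUDA-compatible torch build may differ on your machine):

```bash
# Core dependencies (versions follow the list above)
pip install "torch>=2.1.2" "langchain>=0.1.0" "langchain-core>=0.1.9" pandas openai

# Optional: vllm to accelerate inference
pip install vllm
```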
Coarse-grained Score Prediction (GEMBA)
We use GEMBA's source code to predict scores. The results of our experiments are in the gemba_results folder. We compute the correlations between the metric scores and human scores using mt-metrics-eval.
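If mt-metrics-eval is not installed yet, a rough setup sketch (following the steps described in that project's own README; check it for the authoritative instructions) is:

```bash
# Install mt-metrics-eval and fetch the WMT human judgements it evaluates against
git clone https://github.com/google-research/mt-metrics-eval.git
cd mt-metrics-eval
pip install .
python3 -m mt_metrics_eval.mtme --download  # downloads the evaluation data
```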
Fine-grained Error Detection (AutoMQM)
We implement AutoMQM for fine-grained error detection. For example, if you want to use GPT-3.5 in the S-R-T mode, simply run automqm.py as follows:
```bash
python automqm.py --model-name gpt-3.5-turbo-0613 --lang-pair en-de --prefix gpt3.5-turbo_ref_stratified_wmt22_ende_3200 --example-selector stratified --has-source --has-reference --prompt-path prompts/prompt_ref_sample.json
```
To evaluate the output of AutoMQM, run evaluate.py with the corresponding subcommand (e.g., sf1_mf1, mcc), or simply use the test_all subcommand. To convert the results to MQM scores, use the save_scores subcommand of evaluate.py.
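For example (the --prefix argument here mirrors the prefix used above and is only illustrative; run python evaluate.py --help for the actual arguments):

```bash
# Compute all evaluation metrics at once
python evaluate.py test_all --prefix gpt3.5-turbo_ref_stratified_wmt22_ende_3200

# Convert the detected errors into segment-level MQM scores
python evaluate.py save_scores --prefix gpt3.5-turbo_ref_stratified_wmt22_ende_3200
```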
The processed training data, derived from the WMT21 MQM data, is in the data folder. The output format is similar to that of InstructScore. To fine-tune the Llama-2 model, simply run finetune_llama2.sh. Don't forget to configure parameters such as $MODEL_PATH_OR_NAME first.
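A minimal sketch of a run, assuming the script reads the model path from the environment (it may instead need to be set by editing the script directly):

```bash
# Example base model; replace with your local path or Hugging Face model name
export MODEL_PATH_OR_NAME=meta-llama/Llama-2-7b-hf

bash finetune_llama2.sh
```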
After training, use inference.py to generate answers for the test set. Finally, use postprocess_inference.py to compute MQM scores from the answers.
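A sketch of these two post-training steps (the argument names below are illustrative assumptions; check each script's --help for the real interface):

```bash
# Generate answers for the test set with the fine-tuned model
python inference.py --model-path <path_to_finetuned_model> --output-path outputs/testset_answers.json

# Compute MQM scores from the generated answers
python postprocess_inference.py --input-path outputs/testset_answers.json
```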