Implementation of our paper "Scaling Back-Translation with Domain Text Generation for Sign Language Gloss Translation", accepted at EACL 2023.
08/05/2023: Updated the pseudo gloss-text pair data in .data/DA_paralle_sample/Zh, generated by PGen and mT5, with larger scale and higher quality (this data was left out of the last version; sorry).
Sign language gloss translation aims to translate sign glosses into spoken language texts, which is challenging due to the scarcity of labeled gloss-text parallel data. Back translation (BT), which generates pseudo-parallel data by translating in-domain spoken language texts into sign glosses, has been applied to alleviate the data scarcity problem. However, the lack of large-scale, high-quality in-domain spoken language text data limits the effect of BT. In this paper, to overcome this limitation, we propose a Prompt-based domain text Generation (PGen) approach to produce large-scale in-domain spoken language text data. Specifically, PGen randomly concatenates sentences from the original in-domain spoken language text data as prompts to induce a pre-trained language model (i.e., GPT-2) to generate spoken language texts in a similar style. Experimental results on three benchmarks of sign language gloss translation in varied languages demonstrate that BT with spoken language texts generated by PGen significantly outperforms the compared methods. In addition, as the scale of spoken language texts generated by PGen increases, the BT technique achieves further improvements, demonstrating the effectiveness of our approach. We release the code and data to facilitate future research in this field.
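As a rough illustration, the prompt-construction step of PGen (randomly concatenating sentences from the original in-domain corpus) can be sketched in a few lines of Python. The function name, the number of sentences per prompt, and the toy corpus below are illustrative, not the paper's exact configuration:

```python
import random

def build_prompts(sentences, num_prompts, sents_per_prompt=3, seed=0):
    """Build PGen-style prompts by randomly sampling and concatenating
    sentences from the original in-domain corpus. `sents_per_prompt`
    is an assumed hyperparameter, not the paper's exact setting."""
    rng = random.Random(seed)
    prompts = []
    for _ in range(num_prompts):
        picked = rng.sample(sentences, k=min(sents_per_prompt, len(sentences)))
        prompts.append(" ".join(picked))
    return prompts

# Toy weather-domain corpus (Phoenix2014T-style content, made up here).
corpus = [
    "the weather tomorrow will be cloudy",
    "rain is expected in the north",
    "temperatures reach ten degrees",
    "the wind blows from the west",
]

prompts = build_prompts(corpus, num_prompts=2)
for p in prompts:
    print(p)
```

Each resulting prompt would then be fed to the fine-tuned GPT-2 to induce generation of further text in the same style.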
We conduct both intrinsic and extrinsic evaluations for the proposed PGen approach.
A) The word frequencies of the four monolingual corpora obtained by different methods.
Figure 2: The word frequency distribution on different types of monolingual corpora. The X-axis represents different words, while the Y-axis represents word frequency.
B) The performance of the gloss-to-text translation task when scaling the monolingual data from our PGen and the retrieval approach.
Figure 3: The translation performance of back-translation when scaling the monolingual data from our PGen and the retrieval approach. The red dashed line denotes the baseline model without back-translation.
C) The performance of gloss-to-text translation on Phoenix2014T, CSL-daily and ASLG-PC12.
This code is based on transformers for PGen and fairseq for gloss-to-text translation. You can follow their pages for installation instructions.
You can refer to transformers-language-modeling to fine-tune a pre-trained GPT-2. Alternatively, you can run our script bash finetuning_GPT.sh
for a quick start.
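If you prefer to call the transformers language-modeling example (run_clm.py) directly, a typical fine-tuning command looks like the following; all paths and hyperparameters here are illustrative and should be adapted to your setup:

```shell
# Fine-tune GPT-2 on the in-domain monolingual text (illustrative paths).
python run_clm.py \
  --model_name_or_path gpt2 \
  --train_file data/in_domain_train.txt \
  --validation_file data/in_domain_valid.txt \
  --do_train --do_eval \
  --per_device_train_batch_size 8 \
  --num_train_epochs 3 \
  --output_dir checkpoints/gpt2-in-domain
```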
You can follow the documentation of OpenAI GPT-2 to generate the sentences. We also provide a script, bash generate_in_domain_sentences.sh
, for this step.
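For reference, generation with the transformers text-generation example script (run_generation.py) can be invoked roughly as follows, seeding the fine-tuned model with concatenated in-domain sentences as the prompt; paths and hyperparameters are illustrative:

```shell
# Sample in-domain sentences from the fine-tuned GPT-2
# (the prompt here is the first three corpus sentences joined together).
python run_generation.py \
  --model_type gpt2 \
  --model_name_or_path checkpoints/gpt2-in-domain \
  --prompt "$(head -n 3 data/in_domain_train.txt | tr '\n' ' ')" \
  --length 128 \
  --num_return_sequences 10
```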
Here, we release large-scale in-domain monolingual texts for the SLT tasks (Phoenix2014T, CSL-daily and ASLG-PC12).
You can find them in data/PGen_monolingual/*
for a quick start when applying scaling BT to sign language translation.
You can follow fairseq or mT5 to train the translation models (e.g., text-to-gloss or gloss-to-text).
We provide baseline scripts in ./G2T/[ASL | DSL | CSL]
, which also work for T2G by swapping the source and target language tags.
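As a reference point, a standard fairseq pipeline for gloss-to-text training might look like the following; the language tags, paths, and hyperparameters are illustrative baselines, not our exact settings:

```shell
# Binarize the gloss-text parallel data (illustrative paths and tags).
fairseq-preprocess --source-lang gloss --target-lang de \
  --trainpref data/train --validpref data/valid --testpref data/test \
  --destdir data-bin/g2t

# Train a standard Transformer with common NMT hyperparameters.
fairseq-train data-bin/g2t \
  --arch transformer --share-decoder-input-output-embed \
  --optimizer adam --adam-betas '(0.9, 0.98)' --lr 5e-4 \
  --lr-scheduler inverse_sqrt --warmup-updates 4000 \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --max-tokens 4096 --save-dir checkpoints/g2t
```

For text-to-gloss (T2G), swap the --source-lang and --target-lang values so the model translates in the opposite direction.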