This repository contains guidelines and tools for SMT system development researched within the project "EKT63 Eesti masintõlke kvaliteedi parendamine keeleteadmiste abil" of the program "Riiklik programm "Eesti keeletehnoloogia (2011-2017)" ".

tilde-nlp/et-mt-tools

Estonian Machine Translation Tools

This repository contains linguistic processing tools and guidelines for training statistical machine translation (SMT) systems, developed within the project "EKT63 Eesti masintõlke kvaliteedi parendamine keeleteadmiste abil" of the program Riiklik programm "Eesti keeletehnoloogia (2011-2017)".

During the project, multiple new technologies were investigated to improve SMT system quality when translating from and into Estonian. The scientific activity reports are available as the project's deliverables. This repository lists the publicly available components developed within the project.

The tools have been developed for the Moses SMT system, the Nematus NMT system (for NMT system training) and the AmuNMT decoder (for translation with the NMT models trained with Nematus).

Project period of 2015

The repository contains the following tools from the project's period of 2015:

Further information regarding each of the tools (including execution instructions) can be acquired by following the links above.

SMT system development scenarios for the experiments with compound splitting are described here.

SMT system development scenarios for the experiments with phrase table triangulation are described here.

Project period of 2016

The repository contains the following tools and resources from the project's period of 2016:

  • A modified fork of the AmuNMT decoder that allows extracting word alignment matrices and provides better handling of unknown words. The improvements to the AmuNMT decoder have been described here.
  • Tools for flat word alignment extraction from word alignment matrices that can be used for NMT system integration in SMT system workflows for document translation, smart handling of formatting tags and visualisation of alignments. More details can be found here.
  • A set of regular expression-based pre-processing, normalisation, tokenisation, and post-processing rules that improve word alignment of non-translatable identifiers and numerals for all languages. The rules include special cases for Estonian; see Improvements to Word Alignment and Tokenisation of Estonian Texts.
  • A sample configuration template, nematus-nmt-config-template.md, that describes the Nematus NMT system training set-up used for the English-Estonian-English and Russian-Estonian-Russian systems within the project.
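
To illustrate what flat word alignment extraction from a word alignment matrix involves, the following is a minimal sketch, not the implementation shipped in this repository. It assumes the matrix is given as one row per target token with one attention weight per source token, and it uses a simple argmax heuristic. The function names `flat_alignment` and `to_moses_format` and the example weights are hypothetical; the output uses the `source-target` index pair notation common in Moses workflows.

```python
from typing import List, Tuple


def flat_alignment(matrix: List[List[float]]) -> List[Tuple[int, int]]:
    """Return one (source_index, target_index) pair per target token.

    Assumed layout: matrix[t][s] is the weight that target token t
    places on source token s. Each target token is linked to the
    source token with the highest weight (ties go to the first one).
    """
    pairs = []
    for t, row in enumerate(matrix):
        s = max(range(len(row)), key=row.__getitem__)
        pairs.append((s, t))
    return pairs


def to_moses_format(pairs: List[Tuple[int, int]]) -> str:
    """Serialise pairs as space-separated 'source-target' index pairs."""
    return " ".join(f"{s}-{t}" for s, t in pairs)


# Hypothetical 3x3 matrix: three target tokens over three source tokens.
matrix = [
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.6, 0.2],
]
print(to_moses_format(flat_alignment(matrix)))  # 0-0 2-1 1-2
```

A hard alignment of this kind is what downstream steps such as formatting tag placement or alignment visualisation typically consume. The actual tools may use a different matrix layout or a more elaborate selection strategy than argmax.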

Project period of 2017

The repository contains the following tools and resources from the project's period of 2017:

  • The Tilde Tõlge Android app, which can translate text selected in any other application that allows text selection.
