This is a legacy repository for the "language identification for many (many) languages" project, developed for the software project course "NLP projects for low-resource languages".
The final technical report is available at http://www.coli.uni-saarland.de/courses/cl4lrl-swp/data/SugaliPoster.pdf
Given a string of text in an arbitrary language, can we train a system to recognize which language the text is written in? The project draws on data from the Universal Declaration of Human Rights, Wikipedia, ODIN, and some portions of the data available from Omniglot. The resulting system covers well over 1000 languages.
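To make the task concrete, here is a minimal sketch of one common approach to language identification: comparing character n-gram profiles by cosine similarity. It is an illustration only and is not taken from this repository's code. The function names, the language codes, and the three toy training sentences are all assumptions made for the example; a real system would train on far more text per language.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count character n-grams in lowercased text, padded with spaces."""
    padded = " " * (n - 1) + text.lower() + " " * (n - 1)
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(samples):
    """Build one n-gram profile per language from (language, text) pairs."""
    profiles = {}
    for lang, text in samples:
        profiles.setdefault(lang, Counter()).update(char_ngrams(text))
    return profiles


def identify(text, profiles):
    """Return the language whose profile is most similar to the text."""
    query = char_ngrams(text)
    return max(profiles, key=lambda lang: cosine(query, profiles[lang]))


# Hypothetical toy training data: one sentence per language.
TOY_SAMPLES = [
    ("eng", "All human beings are born free and equal in dignity and rights."),
    ("deu", "Alle Menschen sind frei und gleich an Würde und Rechten geboren."),
    ("fra", "Tous les êtres humains naissent libres et égaux en dignité et en droits."),
]

profiles = train(TOY_SAMPLES)
print(identify("the rights of human beings", profiles))
```

With so little training text the profiles are very sparse, so this sketch only behaves sensibly on inputs that share vocabulary with the toy sentences.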
As a spin-off, we also produced the SeedLing corpus, with data from over 1000 languages. The corpus is freely available in the SeedLing GitHub repository. The reference paper for the corpus is at https://www.aclweb.org/anthology/W14-2211/
- Susanne Fertmann
- Guy Emerson
- Liling Tan
- Alexis Palmer
- Michaela Regneri
If you need to refer to the poster or the code, feel free to cite:
@misc{sugali,
author = {Susanne Fertmann and Guy Emerson and Liling Tan},
title = {Language Identification for Low-Resource Languages},
year = {2014},
url = "https://github.com/alvations/sugali/",
institution = {Saarland University, Germany},
note = "Technical Report for NLP projects for low-resource languages. Saarland, Germany"
}