D1Doris/MaxentLanguageIdentification

MaxentLanguageIdentification

The language identification module as used in: http://www.clips.ua.ac.be/sites/default/files/coco.pdf

If you use this model, please cite the paper as follows:

@inproceedings {hoogeveen2011,
title = {CorpusCollie - A Web Corpus Mining Tool for Resource-Scarce Languages},
booktitle = {Proceedings of Conference on Human Language Technology for Development},
year = {2011},
pages = {44-49},
publisher = {Bibliotheca Alexandrina},
organization = {Bibliotheca Alexandrina},
address = {Alexandria, Egypt},
attachments = {http://www.clips.ua.ac.be/sites/default/files/coco.pdf},
author = {Hoogeveen, Doris and De Pauw, Guy}
}

The script can be used to perform language identification using a maxent classifier and a set of language models.

USAGE: language_identification.py <inputfile> <maxent_dir>

<inputfile> is the file whose language you want to identify. It should be a plain text file encoded in UTF-8.
<maxent_dir> is the path to the directory that contains maxent.exe.
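
The two arguments can be checked before the script is run. The following is a minimal sketch, not part of this repository: the helper name check_arguments is hypothetical, and it only tests the two requirements stated above (a UTF-8 input file and a directory containing maxent.exe).

```python
from pathlib import Path


def check_arguments(input_file, maxent_dir):
    """Return a list of problems with the two arguments (empty if none).

    Hypothetical helper for illustration; not part of the repository.
    """
    problems = []
    input_path = Path(input_file)
    if not input_path.is_file():
        problems.append(f"input file not found: {input_path}")
    else:
        try:
            # Decoding the whole file is the simplest way to confirm it is UTF-8.
            input_path.read_bytes().decode("utf-8")
        except UnicodeDecodeError as err:
            problems.append(f"input file is not valid UTF-8: {err}")
    # The README states that this directory contains maxent.exe.
    if not (Path(maxent_dir) / "maxent.exe").is_file():
        problems.append(f"maxent.exe not found in: {maxent_dir}")
    return problems
```

An empty list means both arguments look usable; otherwise each entry describes one problem.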

The script outputs the ISO name of the guessed language.

Example: language_identification.py inputfile /home/hoogeveen/maxent/
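
To call the script from other Python code, it can be run as a subprocess. This is a sketch under several assumptions that the README does not confirm: that the script is in the current directory, that an interpreter named "python" can run it, and that the guessed language is the only text written to standard output. The names build_command and identify_language are hypothetical.

```python
import subprocess


def build_command(input_file, maxent_dir,
                  script="language_identification.py", interpreter="python"):
    """Build the argument list matching the USAGE line above.

    The script location and interpreter name are assumptions; adjust them
    to match your installation.
    """
    return [interpreter, script, str(input_file), str(maxent_dir)]


def identify_language(input_file, maxent_dir, **kwargs):
    """Run the script and return whatever it printed, stripped of whitespace.

    Assumes the guessed language is the only thing written to stdout.
    Raises subprocess.CalledProcessError if the script exits with an error.
    """
    result = subprocess.run(
        build_command(input_file, maxent_dir, **kwargs),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

With the example above, identify_language("inputfile", "/home/hoogeveen/maxent/") would run the same command and return its output as a string.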

Prerequisite: Wine must be installed to run the classifier (maxent.exe).
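
Whether Wine is installed can be checked up front. The sketch below assumes Wine is exposed as an executable named "wine" on PATH, which depends on the installation; the helper name wine_available is hypothetical.

```python
import shutil


def wine_available(executable="wine"):
    """Return True if the given executable can be found on PATH.

    The default name "wine" is an assumption about the local installation.
    """
    return shutil.which(executable) is not None
```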

The test text in 'inputfile' is a Dutch story by Toon Tellegen, taken from http://www.dbnl.org/tekst/tell003lang01_01/tell003lang01_01_0001.php
I do not own the rights to this text in any way.
