Language detection for Android: Given a string of text, identify what language the text is written in.
This project is a fork of an excellent Java language detection library (language-detection) written by Nakatani Shuyo. The original git version control history and commit messages are retained in this project.
See also: CLD2 - Compact Language Detection 2
There are no plans to update this project, but feel free to fork the repo and have at it.
I've made two significant changes to the original code:
-
Speed enhancements. As an alternative to using JSON-based text files for storing language profiles, a Python script is used to convert language profiles into Java code that can be bundled with an app. With the resulting performance improvement, language detection is fast enough to run acceptably on Android devices.
-
Additional language profiles:
- Aragonese
- Asturian
- Basque
- Belarusian
- Breton
- Catalan
- Galician
- Haitian
- Icelandic
- Irish
- Malay
- Maltese
- Occitan
- Serbian (Cyrillic alphabet)
- Welsh
- Yiddish
git clone git@github.com:rmtheis/language-detection.git
See the original project on Google Code.
Set up the language profile list in DetectorFactory.java
.
To generate a language profile, download a Wikipedia abstract file to use as a training data set.
For example, click anwiki
and download anwiki-20121227-abstract.xml
to
language-detection/abstracts/
and do:
cd language-detection
mkdir abstracts/profiles
java -jar lib/langdetect.jar --genprofile -d language-detection/abstracts an
python scripts/genprofile.py -i abstracts/profiles/an > AN.java
Maven repository:
<repository>
<id>nitin.public.maven.repository.release</id>
<name>Nitin's Public Release Repository</name>
<url>https://raw.github.com/nitinverma/public.maven.repository/master/releases/</url>
<snapshots>
<enabled>false</enabled>
</snapshots>
</repository>
Maven dependency:
<dependency>
<groupId>com.cybozu.labs</groupId>
<artifactId>langdetect</artifactId>
<version>${com.cybozu.labs.version}</version>
</dependency>
The hard-coded n-grams hashmap language profiles in this project may lead to memory problems during the build of an Android project containing the language profiles.
To work around this, change the default max heap size setting for dx to
defaultMx="-Xmx2048M"
in android-sdk/platform-tools/dx
(or dx.bat
on Windows).
Building with ant may give better results than Eclipse.